* [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-17 13:57 ` Oded Gabbay
  0 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-17 13:57 UTC (permalink / raw)
  To: David Airlie, Jerome Glisse, Alex Deucher, Andrew Morton
  Cc: John Bridgman, Joerg Roedel, Andrew Lewycky,
	Christian König, Michel Dänzer, Ben Goz,
	Alexey Skidanov, Evgeny Pinchuk, linux-kernel, dri-devel,
	linux-mm

Forgot to cc mailing list on cover letter. Sorry.

As a continuation of the existing discussion, here is a v2 patch series
restructured with a cleaner history and without the totally different early
versions of the code.

Instead of 83 patches, there are now a total of 25 patches: 5 of them are
modifications to the radeon driver, 18 of them contain only amdkfd code, and
the remaining two touch mm_struct and the MAINTAINERS/CREDITS files. No code is
removed or modified between patches; code is only added.

The driver was renamed from radeon_kfd to amdkfd and moved to reside under
drm/radeon/amdkfd. This move was done to emphasize the fact that this driver is
an AMD-only driver at this point. Having said that, we do foresee a generic HSA
framework being implemented in the future, and in that case we will adjust
amdkfd to work within that framework.

As the amdkfd driver should support multiple AMD gfx drivers, we want to keep
it as a separate driver from radeon. Therefore, the amdkfd code is contained in
its own folder. The amdkfd folder was put under the radeon folder because the
only AMD gfx driver in the Linux kernel at this point is the radeon driver.
Having said that, we will probably need to move it (maybe to be directly under
drm) after we integrate with additional AMD gfx drivers.

For people who like to review using git, the v2 patch set is located at:
http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2

Written by Oded Gabbay <oded.gabbay@amd.com>

Original Cover Letter:

This patch set implements a Heterogeneous System Architecture (HSA) driver for
radeon-family GPUs.
HSA allows different processor types (CPUs, DSPs, GPUs, etc.) to share system
resources more effectively via HW features including shared pageable memory,
userspace-accessible work queues, and platform-level atomics. In addition to the
memory protection mechanisms in GPUVM and IOMMUv2, the Sea Islands family of
GPUs also performs HW-level validation of commands passed in through the queues
(aka rings).

The code in this patch set is intended to serve both as a sample driver for 
other HSA-compatible hardware devices and as a production driver for 
radeon-family processors. The code is architected to support multiple CPUs each 
with connected GPUs, although the current implementation focuses on a single 
Kaveri/Berlin APU, and works alongside the existing radeon kernel graphics 
driver (kgd).
AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
functionality between HSA compute and regular gfx/compute (memory, interrupts,
registers), while other functionality has been added specifically for HSA
compute (a HW scheduler for virtualized compute rings). All shared hardware is
owned by the radeon graphics driver, and an interface between kfd and kgd allows
the kfd to make use of those shared resources, while HSA-specific functionality
is managed directly by kfd by submitting packets into an HSA-specific command
queue (the "HIQ").

During kfd module initialization a char device node (/dev/kfd) is created
(surviving until module exit), with ioctls for queue creation & management, and
data structures are initialized for managing HSA device topology.
The rest of the initialization is driven by calls from the radeon kgd at the
following points (a simplified sketch follows the list):

- radeon_init (kfd_init)
- radeon_exit (kfd_fini)
- radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
- radeon_driver_unload_kms (kfd_device_fini)
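
To give a feel for what that looks like, here is a minimal, hedged sketch of
the /dev/kfd character-device setup using the standard kernel chardev APIs. It
is a simplified illustration only, not the code from kfd_module.c /
kfd_chardev.c, and error unwinding is omitted.

#include <linux/device.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/module.h>

static long kfd_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
	/* queue create/destroy/update and the other ioctls get dispatched here */
	return -EINVAL;
}

static const struct file_operations kfd_fops = {
	.owner = THIS_MODULE,
	.unlocked_ioctl = kfd_ioctl,
};

static int kfd_chardev_major;
static struct class *kfd_class;

/* Called from kfd_init(), which the list above shows radeon_init() invoking. */
static int kfd_chardev_init(void)
{
	kfd_chardev_major = register_chrdev(0, "kfd", &kfd_fops);
	if (kfd_chardev_major < 0)
		return kfd_chardev_major;

	kfd_class = class_create(THIS_MODULE, "kfd");
	if (IS_ERR(kfd_class))
		return PTR_ERR(kfd_class);

	/* Creates the /dev/kfd node that survives until module exit. */
	device_create(kfd_class, NULL, MKDEV(kfd_chardev_major, 0), NULL, "kfd");
	return 0;
}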

During the probe and init processing, per-device data structures are established
which connect to the associated graphics kernel driver. This information is
exposed to userspace via sysfs, along with a version number allowing userspace
to determine if a topology change has occurred while it was reading from sysfs.
The interface between kfd and kgd also allows the kfd to request buffer
management services from kgd, and allows kgd to route interrupt requests to kfd
code, since the interrupt block is shared between the regular graphics/compute
and HSA compute subsystems in the GPU.

The kfd code works with an open source usermode library ("libhsakmt") which is 
in the final stages of IP review and should be published in a separate repo over 
the next few days.
The code operates in one of three modes, selectable via the sched_policy module
parameter (a minimal declaration sketch follows the list):

- sched_policy=0 uses a hardware scheduler running in the MEC block within CP, 
and allows oversubscription (more queues than HW slots)
- sched_policy=1 also uses HW scheduling but does not allow oversubscription, so 
create_queue requests fail when we run out of HW slots
- sched_policy=2 does not use HW scheduling, so the driver manually assigns 
queues to HW slots by programming registers
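
For reference, this is roughly how such a module parameter is declared; the
variable name and the three values follow the description above, but the exact
code, default handling and permissions in the patch set may differ.

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Default matches the current code: HW scheduling without oversubscription. */
static int sched_policy = 1;
module_param(sched_policy, int, 0444);
MODULE_PARM_DESC(sched_policy,
	"Queue scheduling policy (0 = HW scheduling with oversubscription, "
	"1 = HW scheduling without oversubscription, "
	"2 = no HW scheduling, driver assigns queues to HW slots)");

With a declaration like this, passing sched_policy=2 on the module command line
selects the debug/bringup mode discussed below.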

The "no HW scheduling" option is for debug & new hardware bringup only, so it
has less test coverage than the other options. The default in the current code
is "HW scheduling without oversubscription", since that is where we have the
most test coverage, but we expect to change the default to "HW scheduling with
oversubscription" after further testing. This effectively removes the HW limit
on the number of work queues available to applications.

Programs running on the GPU are associated with an address space through the 
VMID field, which is translated to a unique PASID at access time via a set of 16 
VMID-to-PASID mapping registers. The available VMIDs (currently 16) are 
partitioned (under control of the radeon kgd) between current gfx/compute and 
HSA compute, with each getting 8 in the current code. The VMID-to-PASID mapping 
registers are updated by the HW scheduler when used, and by driver code if HW 
scheduling is not being used.
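
To make the 8/8 split concrete, here is a small hedged sketch of the
driver-side mapping update used when HW scheduling is off. The partition sizes
follow the text above, but the choice of the upper eight VMIDs for HSA is an
assumption for this example, and the register access is a placeholder helper
rather than a real radeon/amdkfd symbol.

#define KFD_VMID_START	8	/* VMIDs 0-7 stay with regular gfx/compute */
#define KFD_VMID_COUNT	8	/* VMIDs 8-15 are handed to HSA compute    */

/* Placeholder for the actual VMID<n>-to-PASID mapping register write. */
void write_vmid_pasid_mapping(unsigned int vmid, unsigned int pasid);

/* Only relevant for sched_policy=2, where the HW scheduler is not in charge. */
static void bind_pasid_to_hsa_vmid(unsigned int hsa_slot, unsigned int pasid)
{
	unsigned int vmid = KFD_VMID_START + hsa_slot;	/* 8..15 */

	write_vmid_pasid_mapping(vmid, pasid);
}
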
The Sea Islands compute queues use a new "doorbell" mechanism instead of the 
earlier kernel-managed write pointer registers. Doorbells use a separate BAR 
dedicated for this purpose, and pages within the doorbell aperture are mapped to 
userspace (each page mapped to only one user address space). Writes to the 
doorbell aperture are intercepted by GPU hardware, allowing userspace code to 
safely manage work queues (rings) without requiring a kernel call for every ring 
update.
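
As an illustration of the userspace side of this, here is a hedged sketch: it
assumes the doorbell page is obtained by mmap()ing the kfd file descriptor at
an offset reported back by queue creation, and it deliberately uses no real
ioctl or structure names from include/uapi/linux/kfd_ioctl.h.

#include <stdint.h>
#include <sys/mman.h>

#define DOORBELL_PAGE_SIZE 4096	/* doorbell pages are page-sized */

int ring_doorbell_example(int kfd_fd, off_t doorbell_offset, uint32_t new_wptr)
{
	/* Each doorbell page is mapped into exactly one user address space. */
	volatile uint32_t *doorbell = mmap(NULL, DOORBELL_PAGE_SIZE, PROT_WRITE,
					   MAP_SHARED, kfd_fd, doorbell_offset);
	if (doorbell == MAP_FAILED)
		return -1;

	/* The write is intercepted by GPU HW; no syscall per ring update. */
	*doorbell = new_wptr;

	/* A real client would keep the mapping for the queue's lifetime. */
	munmap((void *)doorbell, DOORBELL_PAGE_SIZE);
	return 0;
}
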
The first step for an application process is to open the kfd device. Calls to open
create a kfd "process" structure only for the first thread of the process. 
Subsequent open calls are checked to see if they are from processes using the 
same mm_struct and, if so, don't do anything. The kfd per-process data lives as 
long as the mm_struct exists. Each mm_struct is associated with a unique PASID, 
allowing the IOMMUv2 to make userspace process memory accessible to the GPU.
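
A hedged sketch of that open() behaviour follows; kfd_process,
find_process_by_mm(), create_process() and the lock are stand-ins for whatever
kfd_process.c actually uses, and locking/error handling are simplified.

static int kfd_open(struct inode *inode, struct file *filp)
{
	struct kfd_process *p;

	mutex_lock(&kfd_processes_mutex);

	/* Reuse the per-process structure if this mm_struct was seen before. */
	p = find_process_by_mm(current->mm);
	if (!p)
		p = create_process(current);	/* allocates the PASID, tied to the mm */

	mutex_unlock(&kfd_processes_mutex);

	return IS_ERR(p) ? PTR_ERR(p) : 0;
}
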
The next step is for the application to collect topology information via sysfs. This
gives userspace enough information to be able to identify specific nodes 
(processors) in subsequent queue management calls. Application processes can 
create queues on multiple processors, and processors support queues from 
multiple processes.
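
A hedged userspace sketch of the topology step; the sysfs path used here is an
assumption based on the description above (the authoritative layout is whatever
kfd_topology.c creates), and only the version/generation check is shown.

#include <stdio.h>

static int read_topology_generation(unsigned int *gen)
{
	/* Assumed path -- verify against the driver's actual sysfs layout. */
	FILE *f = fopen("/sys/class/kfd/kfd/topology/generation_id", "r");

	if (!f)
		return -1;
	if (fscanf(f, "%u", gen) != 1) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

A client would read the per-node properties, then re-read this value and retry
if it changed, which is the "topology change while reading sysfs" check
mentioned above.
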
At this point the application can create work queues in userspace memory and 
pass them through the usermode library to kfd to have them mapped onto HW queue 
slots so that commands written to the queues can be executed by the GPU. Queue 
operations specify a processor node, and so the bulk of this code is 
device-specific.
Written by John Bridgman <John.Bridgman@amd.com>


Alexey Skidanov (1):
   amdkfd: Implement the Get Process Aperture IOCTL

Andrew Lewycky (3):
   amdkfd: Add basic modules to amdkfd
   amdkfd: Add interrupt handling module
   amdkfd: Implement the Set Memory Policy IOCTL

Ben Goz (8):
   amdkfd: Add queue module
   amdkfd: Add mqd_manager module
   amdkfd: Add kernel queue module
   amdkfd: Add module parameter of scheduling policy
   amdkfd: Add packet manager module
   amdkfd: Add process queue manager module
   amdkfd: Add device queue manager module
   amdkfd: Implement the create/destroy/update queue IOCTLs

Evgeny Pinchuk (3):
   amdkfd: Add topology module to amdkfd
   amdkfd: Implement the Get Clock Counters IOCTL
   amdkfd: Implement the PMC Acquire/Release IOCTLs

Oded Gabbay (10):
   mm: Add kfd_process pointer to mm_struct
   drm/radeon: reduce number of free VMIDs and pipes in KV
   drm/radeon/cik: Don't touch int of pipes 1-7
   drm/radeon: Report doorbell configuration to amdkfd
   drm/radeon: adding synchronization for GRBM GFX
   drm/radeon: Add radeon <--> amdkfd interface
   Update MAINTAINERS and CREDITS files with amdkfd info
   amdkfd: Add IOCTL set definitions of amdkfd
   amdkfd: Add amdkfd skeleton driver
   amdkfd: Add binding/unbinding calls to amd_iommu driver

  CREDITS                                            |    7 +
  MAINTAINERS                                        |   10 +
  drivers/gpu/drm/radeon/Kconfig                     |    2 +
  drivers/gpu/drm/radeon/Makefile                    |    3 +
  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 ++++++++++++++++++++
  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
  drivers/gpu/drm/radeon/cik.c                       |  154 +--
  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
  drivers/gpu/drm/radeon/radeon.h                    |    9 +
  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
  include/linux/mm_types.h                           |   14 +
  include/uapi/linux/kfd_ioctl.h                     |  133 +++
  43 files changed, 9226 insertions(+), 95 deletions(-)
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
  create mode 100644 include/uapi/linux/kfd_ioctl.h

-- 
1.9.1


^ permalink raw reply	[flat|nested] 148+ messages in thread


* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-17 13:57 ` Oded Gabbay
@ 2014-07-20 17:46   ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-20 17:46 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman,
	Joerg Roedel, Andrew Lewycky, Christian König,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> Forgot to cc mailing list on cover letter. Sorry.
> 
> As a continuation to the existing discussion, here is a v2 patch series
> restructured with a cleaner history and no totally-different-early-versions
> of the code.
> 
> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> are modifications to radeon driver and 18 of them include only amdkfd code.
> There is no code going away or even modified between patches, only added.
> 
> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> is an AMD-only driver at this point. Having said that, we do foresee a
> generic hsa framework being implemented in the future and in that case, we
> will adjust amdkfd to work within that framework.
> 
> As the amdkfd driver should support multiple AMD gfx drivers, we want to
> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> contained in its own folder. The amdkfd folder was put under the radeon
> folder because the only AMD gfx driver in the Linux kernel at this point
> is the radeon driver. Having said that, we will probably need to move it
> (maybe to be directly under drm) after we integrate with additional AMD gfx
> drivers.
> 
> For people who like to review using git, the v2 patch set is located at:
> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> 
> Written by Oded Gabbayh <oded.gabbay@amd.com>

Some quick comments before I finish going over all the patches. There are many
things that need more documentation, especially as right now there is no
userspace I can go look at.

There are a few show stoppers; the biggest one is GPU memory pinning, which is
a big no. That would need serious arguments for any hope of convincing me on
that side.

It might be better to add a drivers/gpu/drm/amd directory and add common
stuff there.

Given that this is not intended to be the final HSA API, AFAICT it would be far
better to avoid the whole kfd module and add ioctls to radeon. This would avoid
crazy communication between radeon and kfd.

The whole aperture business needs some serious explanation. Especially as you
want to use userspace addresses, there is nothing to prevent a userspace
program from allocating things at the addresses you reserve for LDS, scratch,
... The only sane way would be to move those LDS and scratch apertures inside
the virtual address range reserved for the kernel (see kernel memory map).

The whole business of locking performance counters for exclusive per-process
access is a big NO. Which leads me to the questionable usefulness of the
userspace command ring. I only see issues with that. First and foremost, I
would need to see solid figures showing that a kernel ioctl or syscall has an
overhead that is measurable in any meaningful way against a simple function
call. I know the userspace command ring is a big marketing feature that pleases
ignorant userspace programmers. But really, this only brings issues and
absolutely no upside AFAICT.

So I would rather see a very simple ioctl that writes the doorbell, and which
might do more than that in the ring/queue overcommit case, where it would first
have to wait for a free ring/queue before scheduling the work. This would also
allow a sane implementation of things like performance counters that could be
acquired by the kernel for the duration of a job submitted by userspace. While
still not optimal, this would be better than userspace locking.
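
Something along these lines, purely to illustrate the suggestion; every name
and structure below is invented for this example and none of it comes from the
patch set.

struct ring_doorbell_args {
	__u32 queue_id;
	__u32 write_ptr;
};

static long ioctl_ring_doorbell(struct kfd_process *p,
				struct ring_doorbell_args *args)
{
	struct queue *q = lookup_queue(p, args->queue_id);	/* invented helper */

	if (!q)
		return -EINVAL;

	/* Overcommit: make sure the queue currently owns a HW slot. */
	if (!queue_is_mapped(q)) {
		long r = wait_for_hw_slot_and_map(q);		/* invented helper */
		if (r)
			return r;
	}

	/*
	 * Resources such as performance counters could be acquired by the
	 * kernel here for the duration of the job being submitted.
	 */
	write_doorbell(q, args->write_ptr);			/* invented helper */
	return 0;
}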


I might have more thoughts once I am done with all the patches.

Cheers,
Jérôme

> 
> Original Cover Letter:
> 
> This patch set implements a Heterogeneous System Architecture (HSA) driver
> for radeon-family GPUs.
> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
> system resources more effectively via HW features including shared pageable
> memory, userspace-accessible work queues, and platform-level atomics. In
> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
> Islands family of GPUs also performs HW-level validation of commands passed
> in through the queues (aka rings).
> 
> The code in this patch set is intended to serve both as a sample driver for
> other HSA-compatible hardware devices and as a production driver for
> radeon-family processors. The code is architected to support multiple CPUs
> each with connected GPUs, although the current implementation focuses on a
> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
> graphics driver (kgd).
> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
> functionality between HSA compute and regular gfx/compute (memory,
> interrupts, registers), while other functionality has been added
> specifically for HSA compute  (hw scheduler for virtualized compute rings).
> All shared hardware is owned by the radeon graphics driver, and an interface
> between kfd and kgd allows the kfd to make use of those shared resources,
> while HSA-specific functionality is managed directly by kfd by submitting
> packets into an HSA-specific command queue (the "HIQ").
> 
> During kfd module initialization a char device node (/dev/kfd) is created
> (surviving until module exit), with ioctls for queue creation & management,
> and data structures are initialized for managing HSA device topology.
> The rest of the initialization is driven by calls from the radeon kgd at the
> following points :
> 
> - radeon_init (kfd_init)
> - radeon_exit (kfd_fini)
> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> - radeon_driver_unload_kms (kfd_device_fini)
> 
> During the probe and init processing per-device data structures are
> established which connect to the associated graphics kernel driver. This
> information is exposed to userspace via sysfs, along with a version number
> allowing userspace to determine if a topology change has occurred while it
> was reading from sysfs.
> The interface between kfd and kgd also allows the kfd to request buffer
> management services from kgd, and allows kgd to route interrupt requests to
> kfd code since the interrupt block is shared between regular
> graphics/compute and HSA compute subsystems in the GPU.
> 
> The kfd code works with an open source usermode library ("libhsakmt") which
> is in the final stages of IP review and should be published in a separate
> repo over the next few days.
> The code operates in one of three modes, selectable via the sched_policy
> module parameter :
> 
> - sched_policy=0 uses a hardware scheduler running in the MEC block within
> CP, and allows oversubscription (more queues than HW slots)
> - sched_policy=1 also uses HW scheduling but does not allow
> oversubscription, so create_queue requests fail when we run out of HW slots
> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
> queues to HW slots by programming registers
> 
> The "no HW scheduling" option is for debug & new hardware bringup only, so
> has less test coverage than the other options. Default in the current code
> is "HW scheduling without oversubscription" since that is where we have the
> most test coverage but we expect to change the default to "HW scheduling
> with oversubscription" after further testing. This effectively removes the
> HW limit on the number of work queues available to applications.
> 
> Programs running on the GPU are associated with an address space through the
> VMID field, which is translated to a unique PASID at access time via a set
> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
> are partitioned (under control of the radeon kgd) between current
> gfx/compute and HSA compute, with each getting 8 in the current code. The
> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
> and by driver code if HW scheduling is not being used.
> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
> dedicated for this purpose, and pages within the doorbell aperture are
> mapped to userspace (each page mapped to only one user address space).
> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
> userspace code to safely manage work queues (rings) without requiring a
> kernel call for every ring update.
> First step for an application process is to open the kfd device. Calls to
> open create a kfd "process" structure only for the first thread of the
> process. Subsequent open calls are checked to see if they are from processes
> using the same mm_struct and, if so, don't do anything. The kfd per-process
> data lives as long as the mm_struct exists. Each mm_struct is associated
> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
> accessible to the GPU.
> Next step is for the application to collect topology information via sysfs.
> This gives userspace enough information to be able to identify specific
> nodes (processors) in subsequent queue management calls. Application
> processes can create queues on multiple processors, and processors support
> queues from multiple processes.
> At this point the application can create work queues in userspace memory and
> pass them through the usermode library to kfd to have them mapped onto HW
> queue slots so that commands written to the queues can be executed by the
> GPU. Queue operations specify a processor node, and so the bulk of this code
> is device-specific.
> Written by John Bridgman <John.Bridgman@amd.com>
> 
> 
> Alexey Skidanov (1):
>   amdkfd: Implement the Get Process Aperture IOCTL
> 
> Andrew Lewycky (3):
>   amdkfd: Add basic modules to amdkfd
>   amdkfd: Add interrupt handling module
>   amdkfd: Implement the Set Memory Policy IOCTL
> 
> Ben Goz (8):
>   amdkfd: Add queue module
>   amdkfd: Add mqd_manager module
>   amdkfd: Add kernel queue module
>   amdkfd: Add module parameter of scheduling policy
>   amdkfd: Add packet manager module
>   amdkfd: Add process queue manager module
>   amdkfd: Add device queue manager module
>   amdkfd: Implement the create/destroy/update queue IOCTLs
> 
> Evgeny Pinchuk (3):
>   amdkfd: Add topology module to amdkfd
>   amdkfd: Implement the Get Clock Counters IOCTL
>   amdkfd: Implement the PMC Acquire/Release IOCTLs
> 
> Oded Gabbay (10):
>   mm: Add kfd_process pointer to mm_struct
>   drm/radeon: reduce number of free VMIDs and pipes in KV
>   drm/radeon/cik: Don't touch int of pipes 1-7
>   drm/radeon: Report doorbell configuration to amdkfd
>   drm/radeon: adding synchronization for GRBM GFX
>   drm/radeon: Add radeon <--> amdkfd interface
>   Update MAINTAINERS and CREDITS files with amdkfd info
>   amdkfd: Add IOCTL set definitions of amdkfd
>   amdkfd: Add amdkfd skeleton driver
>   amdkfd: Add binding/unbinding calls to amd_iommu driver
> 
>  CREDITS                                            |    7 +
>  MAINTAINERS                                        |   10 +
>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
>  drivers/gpu/drm/radeon/Makefile                    |    3 +
>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 ++++++++++++++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>  include/linux/mm_types.h                           |   14 +
>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
>  43 files changed, 9226 insertions(+), 95 deletions(-)
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>  create mode 100644 include/uapi/linux/kfd_ioctl.h
> 
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-20 17:46   ` Jerome Glisse
  0 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-20 17:46 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman,
	Joerg Roedel, Andrew Lewycky, Christian König,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> Forgot to cc mailing list on cover letter. Sorry.
> 
> As a continuation to the existing discussion, here is a v2 patch series
> restructured with a cleaner history and no totally-different-early-versions
> of the code.
> 
> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> are modifications to radeon driver and 18 of them include only amdkfd code.
> There is no code going away or even modified between patches, only added.
> 
> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> is an AMD-only driver at this point. Having said that, we do foresee a
> generic hsa framework being implemented in the future and in that case, we
> will adjust amdkfd to work within that framework.
> 
> As the amdkfd driver should support multiple AMD gfx drivers, we want to
> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> contained in its own folder. The amdkfd folder was put under the radeon
> folder because the only AMD gfx driver in the Linux kernel at this point
> is the radeon driver. Having said that, we will probably need to move it
> (maybe to be directly under drm) after we integrate with additional AMD gfx
> drivers.
> 
> For people who like to review using git, the v2 patch set is located at:
> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> 
> Written by Oded Gabbayh <oded.gabbay@amd.com>

So, quick comments before I finish going over all the patches. There are many
things that need more documentation, especially as right now there is no
userspace I can go look at.

There are a few show stoppers. The biggest one is GPU memory pinning; that is
a big no, and it would need serious arguments for any hope of convincing me on
that side.

It might be better to add a drivers/gpu/drm/amd directory and add common
stuff there.

Given that this is not intended to be the final HSA API, AFAICT it would be
far better to avoid the whole kfd module and add ioctls to radeon instead.
This would avoid crazy communication between radeon and kfd.

The whole aperture business needs some serious explanation. Especially since
you want to use userspace addresses, there is nothing to prevent a userspace
program from allocating something at an address you reserve for LDS, scratch,
etc. The only sane way would be to move those LDS and scratch apertures inside
the virtual address range reserved for the kernel (see the kernel memory map).
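
To make the collision concrete: nothing stops a process from doing something
like the sketch below and landing on top of whatever range the driver decided
to treat as an LDS/scratch aperture (the address here is arbitrary, purely for
illustration):

#define _GNU_SOURCE
#include <sys/mman.h>

/* Purely illustrative: claim an arbitrary userspace address with MAP_FIXED.
 * If the driver silently reserved this range as an aperture, the two users
 * of the address now conflict without either of them noticing. */
void *claim_arbitrary_address(void)
{
	void *addr = (void *)0x200000000000ULL;	/* arbitrary example address */

	return mmap(addr, 1 << 20, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}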

The whole business of locking performance counters for exclusive per-process
access is a big NO. Which leads me to the questionable usefulness of the
userspace command ring. I only see issues with that. First and foremost, I
would need to see solid figures showing that a kernel ioctl or syscall has an
overhead that is measurably higher, in any meaningful way, than a simple
function call. I know the userspace command ring is a big marketing feature
that pleases ignorant userspace programmers, but really it only brings issues
and has absolutely no upside AFAICT.
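
Getting such figures is trivial; a rough userspace harness along the lines
below (a first-order sanity check, not a rigorous benchmark) would already
show whether the syscall overhead is anywhere near significant:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>

static int dummy(volatile int *x) { return ++(*x); }

static long elapsed_ns(struct timespec *a, struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) * 1000000000L + (b->tv_nsec - a->tv_nsec);
}

int main(void)
{
	struct timespec t0, t1;
	volatile int sink = 0;
	int fd = open("/dev/null", O_RDONLY);
	long i, n = 1000000;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++)
		ioctl(fd, 0xdead, 0);	/* bogus request: full syscall path, fails fast */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("ioctl:    %ld ns/call\n", elapsed_ns(&t0, &t1) / n);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < n; i++)
		dummy(&sink);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("function: %ld ns/call\n", elapsed_ns(&t0, &t1) / n);
	return 0;
}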

So I would rather see a very simple ioctl that writes the doorbell, and that
might do more than that in the ring/queue overcommit case, where it would
first have to wait for a free ring/queue before scheduling the work. This
would also allow a sane implementation of things like performance counters,
which could be acquired by the kernel for the duration of a job submitted by
userspace. While still not optimal, this would be better than userspace
locking.
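
From the userspace side, something like the sketch below would be enough; to
be clear, KFD_IOC_SUBMIT and struct kfd_submit_args are names I am making up
here for illustration, they are not part of the posted patch set:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Hypothetical "kernel rings the doorbell" submission interface. */
struct kfd_submit_args {
	uint32_t queue_id;	/* queue handle from a create-queue ioctl */
	uint32_t ring_wptr;	/* new write pointer the kernel writes out */
};

#define KFD_IOC_SUBMIT	_IOW('K', 0x10, struct kfd_submit_args)

int submit(int kfd_fd, uint32_t queue_id, uint32_t wptr)
{
	struct kfd_submit_args args = { .queue_id = queue_id, .ring_wptr = wptr };

	/* The kernel can block here when rings/queues are overcommitted, and
	 * can acquire/release performance counters around the job if needed. */
	return ioctl(kfd_fd, KFD_IOC_SUBMIT, &args);
}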


I might have more thoughts once I am done with all the patches.

Cheers,
Jerome

> 
> Original Cover Letter:
> 
> This patch set implements a Heterogeneous System Architecture (HSA) driver
> for radeon-family GPUs.
> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
> system resources more effectively via HW features including shared pageable
> memory, userspace-accessible work queues, and platform-level atomics. In
> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
> Islands family of GPUs also performs HW-level validation of commands passed
> in through the queues (aka rings).
> 
> The code in this patch set is intended to serve both as a sample driver for
> other HSA-compatible hardware devices and as a production driver for
> radeon-family processors. The code is architected to support multiple CPUs
> each with connected GPUs, although the current implementation focuses on a
> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
> graphics driver (kgd).
> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
> functionality between HSA compute and regular gfx/compute (memory,
> interrupts, registers), while other functionality has been added
> specifically for HSA compute  (hw scheduler for virtualized compute rings).
> All shared hardware is owned by the radeon graphics driver, and an interface
> between kfd and kgd allows the kfd to make use of those shared resources,
> while HSA-specific functionality is managed directly by kfd by submitting
> packets into an HSA-specific command queue (the "HIQ").
> 
> During kfd module initialization a char device node (/dev/kfd) is created
> (surviving until module exit), with ioctls for queue creation & management,
> and data structures are initialized for managing HSA device topology.
> The rest of the initialization is driven by calls from the radeon kgd at the
> following points :
> 
> - radeon_init (kfd_init)
> - radeon_exit (kfd_fini)
> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> - radeon_driver_unload_kms (kfd_device_fini)
> 
> During the probe and init processing per-device data structures are
> established which connect to the associated graphics kernel driver. This
> information is exposed to userspace via sysfs, along with a version number
> allowing userspace to determine if a topology change has occurred while it
> was reading from sysfs.
> The interface between kfd and kgd also allows the kfd to request buffer
> management services from kgd, and allows kgd to route interrupt requests to
> kfd code since the interrupt block is shared between regular
> graphics/compute and HSA compute subsystems in the GPU.
> 
> The kfd code works with an open source usermode library ("libhsakmt") which
> is in the final stages of IP review and should be published in a separate
> repo over the next few days.
> The code operates in one of three modes, selectable via the sched_policy
> module parameter :
> 
> - sched_policy=0 uses a hardware scheduler running in the MEC block within
> CP, and allows oversubscription (more queues than HW slots)
> - sched_policy=1 also uses HW scheduling but does not allow
> oversubscription, so create_queue requests fail when we run out of HW slots
> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
> queues to HW slots by programming registers
> 
> The "no HW scheduling" option is for debug & new hardware bringup only, so
> has less test coverage than the other options. Default in the current code
> is "HW scheduling without oversubscription" since that is where we have the
> most test coverage but we expect to change the default to "HW scheduling
> with oversubscription" after further testing. This effectively removes the
> HW limit on the number of work queues available to applications.
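
(Aside, for illustration: a policy knob like this is normally exposed as a
plain module parameter; the snippet below is a generic sketch of that pattern,
not the actual amdkfd code.)

#include <linux/module.h>
#include <linux/moduleparam.h>

static int sched_policy = 1;	/* e.g. default: HW scheduling, no oversubscription */
module_param(sched_policy, int, 0444);
MODULE_PARM_DESC(sched_policy,
	"0 = HW scheduling with oversubscription, "
	"1 = HW scheduling without oversubscription, "
	"2 = no HW scheduling (driver assigns queues to HW slots)");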
> 
> Programs running on the GPU are associated with an address space through the
> VMID field, which is translated to a unique PASID at access time via a set
> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
> are partitioned (under control of the radeon kgd) between current
> gfx/compute and HSA compute, with each getting 8 in the current code. The
> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
> and by driver code if HW scheduling is not being used.
> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
> dedicated for this purpose, and pages within the doorbell aperture are
> mapped to userspace (each page mapped to only one user address space).
> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
> userspace code to safely manage work queues (rings) without requiring a
> kernel call for every ring update.
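
(Aside, for illustration: conceptually the userspace side of the doorbell
scheme boils down to mapping a doorbell page and storing the new write pointer
into it; the mmap offset and layout below are invented for this sketch, in
practice the usermode library obtains them from the driver.)

#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

#define DOORBELL_PAGE_SIZE 4096

static volatile uint32_t *map_doorbell_page(int kfd_fd, off_t doorbell_offset)
{
	void *p = mmap(NULL, DOORBELL_PAGE_SIZE, PROT_WRITE, MAP_SHARED,
		       kfd_fd, doorbell_offset);

	return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
}

static void ring_doorbell(volatile uint32_t *doorbell, uint32_t new_wptr)
{
	/* Plain store into the write-only doorbell aperture; the GPU picks up
	 * the new write pointer without any kernel involvement. */
	*doorbell = new_wptr;
}
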
> First step for an application process is to open the kfd device. Calls to
> open create a kfd "process" structure only for the first thread of the
> process. Subsequent open calls are checked to see if they are from processes
> using the same mm_struct and, if so, don't do anything. The kfd per-process
> data lives as long as the mm_struct exists. Each mm_struct is associated
> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
> accessible to the GPU.
> Next step is for the application to collect topology information via sysfs.
> This gives userspace enough information to be able to identify specific
> nodes (processors) in subsequent queue management calls. Application
> processes can create queues on multiple processors, and processors support
> queues from multiple processes.
> At this point the application can create work queues in userspace memory and
> pass them through the usermode library to kfd to have them mapped onto HW
> queue slots so that commands written to the queues can be executed by the
> GPU. Queue operations specify a processor node, and so the bulk of this code
> is device-specific.
> Written by John Bridgman <John.Bridgman@amd.com>
> 
> 
> Alexey Skidanov (1):
>   amdkfd: Implement the Get Process Aperture IOCTL
> 
> Andrew Lewycky (3):
>   amdkfd: Add basic modules to amdkfd
>   amdkfd: Add interrupt handling module
>   amdkfd: Implement the Set Memory Policy IOCTL
> 
> Ben Goz (8):
>   amdkfd: Add queue module
>   amdkfd: Add mqd_manager module
>   amdkfd: Add kernel queue module
>   amdkfd: Add module parameter of scheduling policy
>   amdkfd: Add packet manager module
>   amdkfd: Add process queue manager module
>   amdkfd: Add device queue manager module
>   amdkfd: Implement the create/destroy/update queue IOCTLs
> 
> Evgeny Pinchuk (3):
>   amdkfd: Add topology module to amdkfd
>   amdkfd: Implement the Get Clock Counters IOCTL
>   amdkfd: Implement the PMC Acquire/Release IOCTLs
> 
> Oded Gabbay (10):
>   mm: Add kfd_process pointer to mm_struct
>   drm/radeon: reduce number of free VMIDs and pipes in KV
>   drm/radeon/cik: Don't touch int of pipes 1-7
>   drm/radeon: Report doorbell configuration to amdkfd
>   drm/radeon: adding synchronization for GRBM GFX
>   drm/radeon: Add radeon <--> amdkfd interface
>   Update MAINTAINERS and CREDITS files with amdkfd info
>   amdkfd: Add IOCTL set definitions of amdkfd
>   amdkfd: Add amdkfd skeleton driver
>   amdkfd: Add binding/unbinding calls to amd_iommu driver
> 
>  CREDITS                                            |    7 +
>  MAINTAINERS                                        |   10 +
>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
>  drivers/gpu/drm/radeon/Makefile                    |    3 +
>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 ++++++++++++++++++++
>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>  include/linux/mm_types.h                           |   14 +
>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
>  43 files changed, 9226 insertions(+), 95 deletions(-)
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>  create mode 100644 include/uapi/linux/kfd_ioctl.h
> 
> -- 
> 1.9.1
> 


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-20 17:46   ` Jerome Glisse
  (?)
@ 2014-07-21  3:03     ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21  3:03 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman,
	Joerg Roedel, Andrew Lewycky, Christian König,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Sun, Jul 20, 2014 at 01:46:53PM -0400, Jerome Glisse wrote:
> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> > Forgot to cc mailing list on cover letter. Sorry.
> > 
> > As a continuation to the existing discussion, here is a v2 patch series
> > restructured with a cleaner history and no totally-different-early-versions
> > of the code.
> > 
> > Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> > are modifications to radeon driver and 18 of them include only amdkfd code.
> > There is no code going away or even modified between patches, only added.
> > 
> > The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> > drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> > is an AMD-only driver at this point. Having said that, we do foresee a
> > generic hsa framework being implemented in the future and in that case, we
> > will adjust amdkfd to work within that framework.
> > 
> > As the amdkfd driver should support multiple AMD gfx drivers, we want to
> > keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> > contained in its own folder. The amdkfd folder was put under the radeon
> > folder because the only AMD gfx driver in the Linux kernel at this point
> > is the radeon driver. Having said that, we will probably need to move it
> > (maybe to be directly under drm) after we integrate with additional AMD gfx
> > drivers.
> > 
> > For people who like to review using git, the v2 patch set is located at:
> > http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> > 
> > Written by Oded Gabbayh <oded.gabbay@amd.com>
> 
> So, quick comments before I finish going over all the patches. There are
> many things that need more documentation, especially as right now there is
> no userspace I can go look at.
> 
> There are a few show stoppers. The biggest one is GPU memory pinning; that
> is a big no, and it would need serious arguments for any hope of convincing
> me on that side.
> 
> It might be better to add a drivers/gpu/drm/amd directory and add common
> stuff there.
> 
> Given that this is not intended to be the final HSA API, AFAICT it would be
> far better to avoid the whole kfd module and add ioctls to radeon instead.
> This would avoid crazy communication between radeon and kfd.
> 
> The whole aperture business needs some serious explanation. Especially since
> you want to use userspace addresses, there is nothing to prevent a userspace
> program from allocating something at an address you reserve for LDS,
> scratch, etc. The only sane way would be to move those LDS and scratch
> apertures inside the virtual address range reserved for the kernel (see the
> kernel memory map).

So I skimmed over the IOMMUv2 specification, and while IOMMUv2 claims to obey
the user/supervisor flags of the CPU page table, it does not seem that this is
a property set in the IOMMU against a PASID (i.e. whether a given PASID is
allowed supervisor access or not). It seems the supervisor flag is part of the
PCIe TLP request, which I assume is controlled by the GPU. So how is this bit
set? How can we make sure that there is no way to abuse it?
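
For reference, the PASID binding interface the IOMMUv2 driver exposes looks
roughly like the declarations below (quoted from memory of
include/linux/amd-iommu.h, so treat the exact signatures as approximate).
Nothing in it carries a per-PASID user/supervisor attribute, which is what
prompts the question:

/* Approximate shape of the IOMMUv2 PASID API; signatures from memory and may
 * differ in detail. No per-PASID privilege/supervisor attribute anywhere. */
struct pci_dev;
struct task_struct;

int  amd_iommu_init_device(struct pci_dev *pdev, int pasids);
void amd_iommu_free_device(struct pci_dev *pdev);
int  amd_iommu_bind_pasid(struct pci_dev *pdev, int pasid,
			  struct task_struct *task);
void amd_iommu_unbind_pasid(struct pci_dev *pdev, int pasid);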

> 
> The whole business of locking performance counters for exclusive per-process
> access is a big NO. Which leads me to the questionable usefulness of the
> userspace command ring. I only see issues with that. First and foremost, I
> would need to see solid figures showing that a kernel ioctl or syscall has
> an overhead that is measurably higher, in any meaningful way, than a simple
> function call. I know the userspace command ring is a big marketing feature
> that pleases ignorant userspace programmers, but really it only brings
> issues and has absolutely no upside AFAICT.
> 
> So I would rather see a very simple ioctl that writes the doorbell, and that
> might do more than that in the ring/queue overcommit case, where it would
> first have to wait for a free ring/queue before scheduling the work. This
> would also allow a sane implementation of things like performance counters,
> which could be acquired by the kernel for the duration of a job submitted by
> userspace. While still not optimal, this would be better than userspace
> locking.
> 
> 
> I might have more thoughts once I am done with all the patches.
> 
> Cheers,
> Jérôme
> 
> > 
> > Original Cover Letter:
> > 
> > This patch set implements a Heterogeneous System Architecture (HSA) driver
> > for radeon-family GPUs.
> > HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
> > system resources more effectively via HW features including shared pageable
> > memory, userspace-accessible work queues, and platform-level atomics. In
> > addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
> > Islands family of GPUs also performs HW-level validation of commands passed
> > in through the queues (aka rings).
> > 
> > The code in this patch set is intended to serve both as a sample driver for
> > other HSA-compatible hardware devices and as a production driver for
> > radeon-family processors. The code is architected to support multiple CPUs
> > each with connected GPUs, although the current implementation focuses on a
> > single Kaveri/Berlin APU, and works alongside the existing radeon kernel
> > graphics driver (kgd).
> > AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
> > functionality between HSA compute and regular gfx/compute (memory,
> > interrupts, registers), while other functionality has been added
> > specifically for HSA compute  (hw scheduler for virtualized compute rings).
> > All shared hardware is owned by the radeon graphics driver, and an interface
> > between kfd and kgd allows the kfd to make use of those shared resources,
> > while HSA-specific functionality is managed directly by kfd by submitting
> > packets into an HSA-specific command queue (the "HIQ").
> > 
> > During kfd module initialization a char device node (/dev/kfd) is created
> > (surviving until module exit), with ioctls for queue creation & management,
> > and data structures are initialized for managing HSA device topology.
> > The rest of the initialization is driven by calls from the radeon kgd at the
> > following points :
> > 
> > - radeon_init (kfd_init)
> > - radeon_exit (kfd_fini)
> > - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> > - radeon_driver_unload_kms (kfd_device_fini)
> > 
> > During the probe and init processing per-device data structures are
> > established which connect to the associated graphics kernel driver. This
> > information is exposed to userspace via sysfs, along with a version number
> > allowing userspace to determine if a topology change has occurred while it
> > was reading from sysfs.
> > The interface between kfd and kgd also allows the kfd to request buffer
> > management services from kgd, and allows kgd to route interrupt requests to
> > kfd code since the interrupt block is shared between regular
> > graphics/compute and HSA compute subsystems in the GPU.
> > 
> > The kfd code works with an open source usermode library ("libhsakmt") which
> > is in the final stages of IP review and should be published in a separate
> > repo over the next few days.
> > The code operates in one of three modes, selectable via the sched_policy
> > module parameter :
> > 
> > - sched_policy=0 uses a hardware scheduler running in the MEC block within
> > CP, and allows oversubscription (more queues than HW slots)
> > - sched_policy=1 also uses HW scheduling but does not allow
> > oversubscription, so create_queue requests fail when we run out of HW slots
> > - sched_policy=2 does not use HW scheduling, so the driver manually assigns
> > queues to HW slots by programming registers
> > 
> > The "no HW scheduling" option is for debug & new hardware bringup only, so
> > has less test coverage than the other options. Default in the current code
> > is "HW scheduling without oversubscription" since that is where we have the
> > most test coverage but we expect to change the default to "HW scheduling
> > with oversubscription" after further testing. This effectively removes the
> > HW limit on the number of work queues available to applications.
> > 
> > Programs running on the GPU are associated with an address space through the
> > VMID field, which is translated to a unique PASID at access time via a set
> > of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
> > are partitioned (under control of the radeon kgd) between current
> > gfx/compute and HSA compute, with each getting 8 in the current code. The
> > VMID-to-PASID mapping registers are updated by the HW scheduler when used,
> > and by driver code if HW scheduling is not being used.
> > The Sea Islands compute queues use a new "doorbell" mechanism instead of the
> > earlier kernel-managed write pointer registers. Doorbells use a separate BAR
> > dedicated for this purpose, and pages within the doorbell aperture are
> > mapped to userspace (each page mapped to only one user address space).
> > Writes to the doorbell aperture are intercepted by GPU hardware, allowing
> > userspace code to safely manage work queues (rings) without requiring a
> > kernel call for every ring update.
> > First step for an application process is to open the kfd device. Calls to
> > open create a kfd "process" structure only for the first thread of the
> > process. Subsequent open calls are checked to see if they are from processes
> > using the same mm_struct and, if so, don't do anything. The kfd per-process
> > data lives as long as the mm_struct exists. Each mm_struct is associated
> > with a unique PASID, allowing the IOMMUv2 to make userspace process memory
> > accessible to the GPU.
> > Next step is for the application to collect topology information via sysfs.
> > This gives userspace enough information to be able to identify specific
> > nodes (processors) in subsequent queue management calls. Application
> > processes can create queues on multiple processors, and processors support
> > queues from multiple processes.
> > At this point the application can create work queues in userspace memory and
> > pass them through the usermode library to kfd to have them mapped onto HW
> > queue slots so that commands written to the queues can be executed by the
> > GPU. Queue operations specify a processor node, and so the bulk of this code
> > is device-specific.
> > Written by John Bridgman <John.Bridgman@amd.com>
> > 
> > 
> > Alexey Skidanov (1):
> >   amdkfd: Implement the Get Process Aperture IOCTL
> > 
> > Andrew Lewycky (3):
> >   amdkfd: Add basic modules to amdkfd
> >   amdkfd: Add interrupt handling module
> >   amdkfd: Implement the Set Memory Policy IOCTL
> > 
> > Ben Goz (8):
> >   amdkfd: Add queue module
> >   amdkfd: Add mqd_manager module
> >   amdkfd: Add kernel queue module
> >   amdkfd: Add module parameter of scheduling policy
> >   amdkfd: Add packet manager module
> >   amdkfd: Add process queue manager module
> >   amdkfd: Add device queue manager module
> >   amdkfd: Implement the create/destroy/update queue IOCTLs
> > 
> > Evgeny Pinchuk (3):
> >   amdkfd: Add topology module to amdkfd
> >   amdkfd: Implement the Get Clock Counters IOCTL
> >   amdkfd: Implement the PMC Acquire/Release IOCTLs
> > 
> > Oded Gabbay (10):
> >   mm: Add kfd_process pointer to mm_struct
> >   drm/radeon: reduce number of free VMIDs and pipes in KV
> >   drm/radeon/cik: Don't touch int of pipes 1-7
> >   drm/radeon: Report doorbell configuration to amdkfd
> >   drm/radeon: adding synchronization for GRBM GFX
> >   drm/radeon: Add radeon <--> amdkfd interface
> >   Update MAINTAINERS and CREDITS files with amdkfd info
> >   amdkfd: Add IOCTL set definitions of amdkfd
> >   amdkfd: Add amdkfd skeleton driver
> >   amdkfd: Add binding/unbinding calls to amd_iommu driver
> > 
> >  CREDITS                                            |    7 +
> >  MAINTAINERS                                        |   10 +
> >  drivers/gpu/drm/radeon/Kconfig                     |    2 +
> >  drivers/gpu/drm/radeon/Makefile                    |    3 +
> >  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
> >  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
> >  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
> >  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
> >  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
> >  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
> >  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
> >  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 ++++++++++++++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
> >  drivers/gpu/drm/radeon/cik.c                       |  154 +--
> >  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
> >  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
> >  drivers/gpu/drm/radeon/radeon.h                    |    9 +
> >  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
> >  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
> >  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
> >  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
> >  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
> >  include/linux/mm_types.h                           |   14 +
> >  include/uapi/linux/kfd_ioctl.h                     |  133 +++
> >  43 files changed, 9226 insertions(+), 95 deletions(-)
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
> >  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
> >  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
> >  create mode 100644 include/uapi/linux/kfd_ioctl.h
> > 
> > -- 
> > 1.9.1
> > 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-21  3:03     ` Jerome Glisse
  0 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21  3:03 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman,
	Joerg Roedel, Andrew Lewycky, Christian König,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Sun, Jul 20, 2014 at 01:46:53PM -0400, Jerome Glisse wrote:
> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> > Forgot to cc mailing list on cover letter. Sorry.
> > 
> > As a continuation to the existing discussion, here is a v2 patch series
> > restructured with a cleaner history and no totally-different-early-versions
> > of the code.
> > 
> > Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> > are modifications to radeon driver and 18 of them include only amdkfd code.
> > There is no code going away or even modified between patches, only added.
> > 
> > The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> > drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> > is an AMD-only driver at this point. Having said that, we do foresee a
> > generic hsa framework being implemented in the future and in that case, we
> > will adjust amdkfd to work within that framework.
> > 
> > As the amdkfd driver should support multiple AMD gfx drivers, we want to
> > keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> > contained in its own folder. The amdkfd folder was put under the radeon
> > folder because the only AMD gfx driver in the Linux kernel at this point
> > is the radeon driver. Having said that, we will probably need to move it
> > (maybe to be directly under drm) after we integrate with additional AMD gfx
> > drivers.
> > 
> > For people who like to review using git, the v2 patch set is located at:
> > http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> > 
> > Written by Oded Gabbayh <oded.gabbay@amd.com>
> 
> So, quick comments before I finish going over all the patches. There are
> many things that need more documentation, especially as right now there is
> no userspace I can go look at.
> 
> There are a few show stoppers. The biggest one is GPU memory pinning; that
> is a big no, and it would need serious arguments for any hope of convincing
> me on that side.
> 
> It might be better to add a drivers/gpu/drm/amd directory and add common
> stuff there.
> 
> Given that this is not intended to be the final HSA API, AFAICT it would be
> far better to avoid the whole kfd module and add ioctls to radeon instead.
> This would avoid crazy communication between radeon and kfd.
> 
> The whole aperture business needs some serious explanation. Especially since
> you want to use userspace addresses, there is nothing to prevent a userspace
> program from allocating something at an address you reserve for LDS,
> scratch, etc. The only sane way would be to move those LDS and scratch
> apertures inside the virtual address range reserved for the kernel (see the
> kernel memory map).

So I skimmed over the IOMMUv2 specification, and while IOMMUv2 claims to obey
the user/supervisor flags of the CPU page table, it does not seem that this is
a property set in the IOMMU against a PASID (i.e. whether a given PASID is
allowed supervisor access or not). It seems the supervisor flag is part of the
PCIe TLP request, which I assume is controlled by the GPU. So how is this bit
set? How can we make sure that there is no way to abuse it?

> 
> The whole business of locking performance counters for exclusive per-process
> access is a big NO. Which leads me to the questionable usefulness of the
> userspace command ring. I only see issues with that. First and foremost, I
> would need to see solid figures showing that a kernel ioctl or syscall has
> an overhead that is measurably higher, in any meaningful way, than a simple
> function call. I know the userspace command ring is a big marketing feature
> that pleases ignorant userspace programmers, but really it only brings
> issues and has absolutely no upside AFAICT.
> 
> So I would rather see a very simple ioctl that writes the doorbell, and that
> might do more than that in the ring/queue overcommit case, where it would
> first have to wait for a free ring/queue before scheduling the work. This
> would also allow a sane implementation of things like performance counters,
> which could be acquired by the kernel for the duration of a job submitted by
> userspace. While still not optimal, this would be better than userspace
> locking.
> 
> 
> I might have more thoughts once I am done with all the patches.
> 
> Cheers,
> Jerome
> 
> > 
> > Original Cover Letter:
> > 
> > This patch set implements a Heterogeneous System Architecture (HSA) driver
> > for radeon-family GPUs.
> > HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
> > system resources more effectively via HW features including shared pageable
> > memory, userspace-accessible work queues, and platform-level atomics. In
> > addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
> > Islands family of GPUs also performs HW-level validation of commands passed
> > in through the queues (aka rings).
> > 
> > The code in this patch set is intended to serve both as a sample driver for
> > other HSA-compatible hardware devices and as a production driver for
> > radeon-family processors. The code is architected to support multiple CPUs
> > each with connected GPUs, although the current implementation focuses on a
> > single Kaveri/Berlin APU, and works alongside the existing radeon kernel
> > graphics driver (kgd).
> > AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
> > functionality between HSA compute and regular gfx/compute (memory,
> > interrupts, registers), while other functionality has been added
> > specifically for HSA compute  (hw scheduler for virtualized compute rings).
> > All shared hardware is owned by the radeon graphics driver, and an interface
> > between kfd and kgd allows the kfd to make use of those shared resources,
> > while HSA-specific functionality is managed directly by kfd by submitting
> > packets into an HSA-specific command queue (the "HIQ").
> > 
> > During kfd module initialization a char device node (/dev/kfd) is created
> > (surviving until module exit), with ioctls for queue creation & management,
> > and data structures are initialized for managing HSA device topology.
> > The rest of the initialization is driven by calls from the radeon kgd at the
> > following points :
> > 
> > - radeon_init (kfd_init)
> > - radeon_exit (kfd_fini)
> > - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> > - radeon_driver_unload_kms (kfd_device_fini)
> > 
> > During the probe and init processing per-device data structures are
> > established which connect to the associated graphics kernel driver. This
> > information is exposed to userspace via sysfs, along with a version number
> > allowing userspace to determine if a topology change has occurred while it
> > was reading from sysfs.
> > The interface between kfd and kgd also allows the kfd to request buffer
> > management services from kgd, and allows kgd to route interrupt requests to
> > kfd code since the interrupt block is shared between regular
> > graphics/compute and HSA compute subsystems in the GPU.
> > 
> > The kfd code works with an open source usermode library ("libhsakmt") which
> > is in the final stages of IP review and should be published in a separate
> > repo over the next few days.
> > The code operates in one of three modes, selectable via the sched_policy
> > module parameter :
> > 
> > - sched_policy=0 uses a hardware scheduler running in the MEC block within
> > CP, and allows oversubscription (more queues than HW slots)
> > - sched_policy=1 also uses HW scheduling but does not allow
> > oversubscription, so create_queue requests fail when we run out of HW slots
> > - sched_policy=2 does not use HW scheduling, so the driver manually assigns
> > queues to HW slots by programming registers
> > 
> > The "no HW scheduling" option is for debug & new hardware bringup only, so
> > has less test coverage than the other options. Default in the current code
> > is "HW scheduling without oversubscription" since that is where we have the
> > most test coverage but we expect to change the default to "HW scheduling
> > with oversubscription" after further testing. This effectively removes the
> > HW limit on the number of work queues available to applications.
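
[Editorial note: as an aside for readers unfamiliar with module parameters, a
switch like sched_policy is typically declared along the lines below. This is
a sketch, not the actual amdkfd code; only the parameter name, the three
values and the default come from the description above.]

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Sketch of a module-parameter declaration; values mirror the description
 * above (0 = HWS with oversubscription, 1 = HWS without, 2 = no HWS). */
static int sched_policy = 1;	/* current default per the cover letter */
module_param(sched_policy, int, 0444);
MODULE_PARM_DESC(sched_policy,
	"Scheduling policy (0 = HWS with oversubscription, 1 = HWS without oversubscription, 2 = no HWS)");

/* Loaded e.g. with: modprobe amdkfd sched_policy=0 */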
> > 
> > Programs running on the GPU are associated with an address space through the
> > VMID field, which is translated to a unique PASID at access time via a set
> > of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
> > are partitioned (under control of the radeon kgd) between current
> > gfx/compute and HSA compute, with each getting 8 in the current code. The
> > VMID-to-PASID mapping registers are updated by the HW scheduler when used,
> > and by driver code if HW scheduling is not being used.
> > The Sea Islands compute queues use a new "doorbell" mechanism instead of the
> > earlier kernel-managed write pointer registers. Doorbells use a separate BAR
> > dedicated for this purpose, and pages within the doorbell aperture are
> > mapped to userspace (each page mapped to only one user address space).
> > Writes to the doorbell aperture are intercepted by GPU hardware, allowing
> > userspace code to safely manage work queues (rings) without requiring a
> > kernel call for every ring update.
> > First step for an application process is to open the kfd device. Calls to
> > open create a kfd "process" structure only for the first thread of the
> > process. Subsequent open calls are checked to see if they are from processes
> > using the same mm_struct and, if so, don't do anything. The kfd per-process
> > data lives as long as the mm_struct exists. Each mm_struct is associated
> > with a unique PASID, allowing the IOMMUv2 to make userspace process memory
> > accessible to the GPU.
> > Next step is for the application to collect topology information via sysfs.
> > This gives userspace enough information to be able to identify specific
> > nodes (processors) in subsequent queue management calls. Application
> > processes can create queues on multiple processors, and processors support
> > queues from multiple processes.
> > At this point the application can create work queues in userspace memory and
> > pass them through the usermode library to kfd to have them mapped onto HW
> > queue slots so that commands written to the queues can be executed by the
> > GPU. Queue operations specify a processor node, and so the bulk of this code
> > is device-specific.
> > Written by John Bridgman <John.Bridgman@amd.com>
> > 
> > 
> > Alexey Skidanov (1):
> >   amdkfd: Implement the Get Process Aperture IOCTL
> > 
> > Andrew Lewycky (3):
> >   amdkfd: Add basic modules to amdkfd
> >   amdkfd: Add interrupt handling module
> >   amdkfd: Implement the Set Memory Policy IOCTL
> > 
> > Ben Goz (8):
> >   amdkfd: Add queue module
> >   amdkfd: Add mqd_manager module
> >   amdkfd: Add kernel queue module
> >   amdkfd: Add module parameter of scheduling policy
> >   amdkfd: Add packet manager module
> >   amdkfd: Add process queue manager module
> >   amdkfd: Add device queue manager module
> >   amdkfd: Implement the create/destroy/update queue IOCTLs
> > 
> > Evgeny Pinchuk (3):
> >   amdkfd: Add topology module to amdkfd
> >   amdkfd: Implement the Get Clock Counters IOCTL
> >   amdkfd: Implement the PMC Acquire/Release IOCTLs
> > 
> > Oded Gabbay (10):
> >   mm: Add kfd_process pointer to mm_struct
> >   drm/radeon: reduce number of free VMIDs and pipes in KV
> >   drm/radeon/cik: Don't touch int of pipes 1-7
> >   drm/radeon: Report doorbell configuration to amdkfd
> >   drm/radeon: adding synchronization for GRBM GFX
> >   drm/radeon: Add radeon <--> amdkfd interface
> >   Update MAINTAINERS and CREDITS files with amdkfd info
> >   amdkfd: Add IOCTL set definitions of amdkfd
> >   amdkfd: Add amdkfd skeleton driver
> >   amdkfd: Add binding/unbinding calls to amd_iommu driver
> > 
> >  CREDITS                                            |    7 +
> >  MAINTAINERS                                        |   10 +
> >  drivers/gpu/drm/radeon/Kconfig                     |    2 +
> >  drivers/gpu/drm/radeon/Makefile                    |    3 +
> >  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
> >  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
> >  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
> >  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
> >  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
> >  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
> >  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
> >  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 ++++++++++++++++++++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
> >  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
> >  drivers/gpu/drm/radeon/cik.c                       |  154 +--
> >  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
> >  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
> >  drivers/gpu/drm/radeon/radeon.h                    |    9 +
> >  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
> >  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
> >  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
> >  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
> >  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
> >  include/linux/mm_types.h                           |   14 +
> >  include/uapi/linux/kfd_ioctl.h                     |  133 +++
> >  43 files changed, 9226 insertions(+), 95 deletions(-)
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
> >  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
> >  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
> >  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
> >  create mode 100644 include/uapi/linux/kfd_ioctl.h
> > 
> > -- 
> > 1.9.1
> > 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-20 17:46   ` Jerome Glisse
@ 2014-07-21  7:01     ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-21  7:01 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Oded Gabbay, David Airlie, Alex Deucher, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky,
	Christian König, Michel Dänzer, Ben Goz,
	Alexey Skidanov, Evgeny Pinchuk, linux-kernel, dri-devel,
	linux-mm

On Sun, Jul 20, 2014 at 01:46:53PM -0400, Jerome Glisse wrote:
> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> > Forgot to cc mailing list on cover letter. Sorry.
> > 
> > As a continuation to the existing discussion, here is a v2 patch series
> > restructured with a cleaner history and no totally-different-early-versions
> > of the code.
> > 
> > Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> > are modifications to radeon driver and 18 of them include only amdkfd code.
> > There is no code going away or even modified between patches, only added.
> > 
> > The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> > drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> > is an AMD-only driver at this point. Having said that, we do foresee a
> > generic hsa framework being implemented in the future and in that case, we
> > will adjust amdkfd to work within that framework.
> > 
> > As the amdkfd driver should support multiple AMD gfx drivers, we want to
> > keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> > contained in its own folder. The amdkfd folder was put under the radeon
> > folder because the only AMD gfx driver in the Linux kernel at this point
> > is the radeon driver. Having said that, we will probably need to move it
> > (maybe to be directly under drm) after we integrate with additional AMD gfx
> > drivers.
> > 
> > For people who like to review using git, the v2 patch set is located at:
> > http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> > 
> > Written by Oded Gabbayh <oded.gabbay@amd.com>
> 
> So, quick comments before I finish going over all the patches. There are many
> things that need more documentation, especially as right now there is
> no userspace I can go look at.
> 
> There are a few show stoppers. The biggest one is GPU memory pinning; this is
> a big no, and it would take serious arguments for any hope of convincing me on
> that side.
> 
> It might be better to add a drivers/gpu/drm/amd directory and add common
> stuff there.
> 
> Given that this is not intended to be the final HSA API, AFAICT it would be
> far better to avoid the whole kfd module and add ioctls to radeon.
> This would avoid crazy communication between radeon and kfd.
> 
> The whole aperture business needs some serious explanation. Especially as
> you want to use userspace addresses, there is nothing to prevent a userspace
> program from allocating things at the addresses you reserve for lds, scratch,
> ... The only sane way would be to move those lds and scratch apertures inside
> the virtual address range reserved for the kernel (see kernel memory map).
> 
> The whole business of locking performance counters for exclusive per-process
> access is a big NO. Which leads me to the questionable usefulness of the
> userspace command ring. I only see issues with that. First and foremost, I
> would need to see solid figures showing that a kernel ioctl or syscall has an
> overhead that is measurably higher, in any meaningful way, than a simple
> function call. I know the userspace command ring is a big marketing feature
> that pleases ignorant userspace programmers. But really, it only brings issues
> and has absolutely no upside AFAICT.
> 
> So I would rather see a very simple ioctl that writes the doorbell and might
> do more than that in the case of ring/queue overcommit, where it would first
> have to wait for a free ring/queue before scheduling stuff. This would also
> allow a sane implementation of things like performance counters, which could
> be acquired by the kernel for the duration of a job submitted by userspace.
> While still not optimal, this would be better than userspace locking.

Quick aside and mostly off the record: In i915 we plan to have the first
implementation exactly as Jerome suggests here:
- New flag at context creation for svm/seamless-gpgpu contexts.
- New ioctl in i915 for submitting stuff to the hw (through doorbell or
  whatever else we want to do). The ring in the ctx would be under the
  kernel's control. (A hypothetical sketch of such a path follows this list.)
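
[Editorial note: to make the shape of that interface concrete, here is a
userspace-side sketch. Every ioctl name and structure below is hypothetical;
this is not the i915 or amdkfd uapi, it only illustrates the "context flag +
kernel-mediated submit" model described above.]

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical uapi, for illustration only. */
#define EXAMPLE_IOCTL_CTX_CREATE _IOWR('x', 1, struct example_ctx_create)
#define EXAMPLE_IOCTL_SUBMIT     _IOW('x', 2, struct example_submit)
#define EXAMPLE_CTX_FLAG_SVM     (1u << 0)

struct example_ctx_create { uint32_t flags; uint32_t ctx_id; };
struct example_submit     { uint32_t ctx_id; uint32_t new_wptr; };

int example_submit_job(const char *node, uint32_t new_wptr)
{
	struct example_ctx_create create = { .flags = EXAMPLE_CTX_FLAG_SVM };
	struct example_submit submit;
	int fd = open(node, O_RDWR);
	int ret = -1;

	if (fd < 0)
		return -1;

	/* Create an SVM/gpgpu context; its ring stays under kernel control. */
	if (ioctl(fd, EXAMPLE_IOCTL_CTX_CREATE, &create) == 0) {
		/* Kernel-mediated submission: the kernel rings the doorbell. */
		submit.ctx_id = create.ctx_id;
		submit.new_wptr = new_wptr;
		ret = ioctl(fd, EXAMPLE_IOCTL_SUBMIT, &submit);
	}

	close(fd);
	return ret;
}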

Of course there's lots of GEM stuff we don't need at all for such
contexts, but there's still lots of shared code. IMO creating a 2nd driver
has too much interface surface and so is a maintenance hell.

And the ioctl submission gives us flexibility in case the hw doesn't quite
live up to promise (e.g. scheduling, cmd parsing, ...).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21  7:01     ` Daniel Vetter
@ 2014-07-21  9:34       ` Christian König
  -1 siblings, 0 replies; 148+ messages in thread
From: Christian König @ 2014-07-21  9:34 UTC (permalink / raw)
  To: Jerome Glisse, Oded Gabbay, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On 21.07.2014 09:01, Daniel Vetter wrote:
> On Sun, Jul 20, 2014 at 01:46:53PM -0400, Jerome Glisse wrote:
>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>> Forgot to cc mailing list on cover letter. Sorry.
>>>
>>> As a continuation to the existing discussion, here is a v2 patch series
>>> restructured with a cleaner history and no totally-different-early-versions
>>> of the code.
>>>
>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>> There is no code going away or even modified between patches, only added.
>>>
>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>> generic hsa framework being implemented in the future and in that case, we
>>> will adjust amdkfd to work within that framework.
>>>
>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>> contained in its own folder. The amdkfd folder was put under the radeon
>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>> is the radeon driver. Having said that, we will probably need to move it
>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>> drivers.
>>>
>>> For people who like to review using git, the v2 patch set is located at:
>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>
>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>> So, quick comments before I finish going over all the patches. There are many
>> things that need more documentation, especially as right now there is
>> no userspace I can go look at.
>>
>> There are a few show stoppers. The biggest one is GPU memory pinning; this is
>> a big no, and it would take serious arguments for any hope of convincing me on
>> that side.
>>
>> It might be better to add a drivers/gpu/drm/amd directory and add common
>> stuff there.
>>
>> Given that this is not intended to be the final HSA API, AFAICT it would be
>> far better to avoid the whole kfd module and add ioctls to radeon.
>> This would avoid crazy communication between radeon and kfd.
>>
>> The whole aperture business needs some serious explanation. Especially as
>> you want to use userspace addresses, there is nothing to prevent a userspace
>> program from allocating things at the addresses you reserve for lds, scratch,
>> ... The only sane way would be to move those lds and scratch apertures inside
>> the virtual address range reserved for the kernel (see kernel memory map).
>>
>> The whole business of locking performance counters for exclusive per-process
>> access is a big NO. Which leads me to the questionable usefulness of the
>> userspace command ring. I only see issues with that. First and foremost, I
>> would need to see solid figures showing that a kernel ioctl or syscall has an
>> overhead that is measurably higher, in any meaningful way, than a simple
>> function call. I know the userspace command ring is a big marketing feature
>> that pleases ignorant userspace programmers. But really, it only brings issues
>> and has absolutely no upside AFAICT.
>>
>> So I would rather see a very simple ioctl that writes the doorbell and might
>> do more than that in the case of ring/queue overcommit, where it would first
>> have to wait for a free ring/queue before scheduling stuff. This would also
>> allow a sane implementation of things like performance counters, which could
>> be acquired by the kernel for the duration of a job submitted by userspace.
>> While still not optimal, this would be better than userspace locking.
> Quick aside and mostly off the record: In i915 we plan to have the first
> implementation exactly as Jerome suggests here:
> - New flag at context creation for svm/seamless-gpgpu contexts.
> - New ioctl in i915 for submitting stuff to the hw (through doorbell or
>    whatever else we want to do). The ring in the ctx would be under the
>    kernel's control.

And looking at the existing Radeon code, that's exactly what we are 
already doing with the compute queues on CIK as well. We just use the 
existing command submission interface, because when you use an IOCTL 
anyway, it's no longer beneficial to do all the complex scheduling and 
other stuff directly on the hardware.

What's mostly missing in the existing module is proper support for 
accessing the CPU address space through IOMMUv2.

Christian.

> Of course there's lots of GEM stuff we don't need at all for such
> contexts, but there's still lots of shared code. IMO creating a 2nd driver
> has too much interface surface and so is a maintenance hell.
>
> And the ioctl submission gives us flexibility in case the hw doesn't quite
> live up to promise (e.g. scheduling, cmd parsing, ...).
> -Daniel


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-20 17:46   ` Jerome Glisse
@ 2014-07-21 12:36     ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 12:36 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman,
	Joerg Roedel, Andrew Lewycky, Christian König,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On 20/07/14 20:46, Jerome Glisse wrote:
> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>> Forgot to cc mailing list on cover letter. Sorry.
>>
>> As a continuation to the existing discussion, here is a v2 patch series
>> restructured with a cleaner history and no totally-different-early-versions
>> of the code.
>>
>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>> are modifications to radeon driver and 18 of them include only amdkfd code.
>> There is no code going away or even modified between patches, only added.
>>
>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>> is an AMD-only driver at this point. Having said that, we do foresee a
>> generic hsa framework being implemented in the future and in that case, we
>> will adjust amdkfd to work within that framework.
>>
>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>> contained in its own folder. The amdkfd folder was put under the radeon
>> folder because the only AMD gfx driver in the Linux kernel at this point
>> is the radeon driver. Having said that, we will probably need to move it
>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>> drivers.
>>
>> For people who like to review using git, the v2 patch set is located at:
>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>
>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>
> So, quick comments before I finish going over all the patches. There are many
> things that need more documentation, especially as right now there is
> no userspace I can go look at.
So, quick comments on some of your questions, but first of all, thanks for the
time you dedicated to reviewing the code.
>
> There are a few show stoppers. The biggest one is GPU memory pinning; this is
> a big no, and it would take serious arguments for any hope of convincing me on
> that side.
We only do GPU memory pinning for kernel objects. There are no userspace objects
that are pinned in GPU memory in our driver. If that is the case, is it
still a show stopper?

The kernel objects are:
- pipelines (4 per device)
- mqd per hiq (only 1 per device)
- mqd per userspace queue. On KV, we support up to 1K queues per process, for a
total of 512K queues. Each mqd is 151 bytes, but the allocation is done with
256-byte alignment, so the total *possible* memory is 128MB (worked out below,
after this list)
- kernel queue (only 1 per device)
- fence address for kernel queue
- runlists for the CP (1 or 2 per device)
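
[Editorial note: the arithmetic behind that worst-case figure, using only the
numbers given above: 512K queues x 256 bytes per (256-byte-aligned) mqd
= 524,288 x 256 = 134,217,728 bytes = 128 MB.]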

>
> It might be better to add a drivers/gpu/drm/amd directory and add common
> stuff there.
>
> Given that this is not intended to be the final HSA API, AFAICT it would be
> far better to avoid the whole kfd module and add ioctls to radeon.
> This would avoid crazy communication between radeon and kfd.
>
> The whole aperture business needs some serious explanation. Especially as
> you want to use userspace addresses, there is nothing to prevent a userspace
> program from allocating things at the addresses you reserve for lds, scratch,
> ... The only sane way would be to move those lds and scratch apertures inside
> the virtual address range reserved for the kernel (see kernel memory map).
>
> The whole business of locking performance counters for exclusive per-process
> access is a big NO. Which leads me to the questionable usefulness of the
> userspace command ring.
That's like saying: "Which leads me to the questionable usefulness of HSA". I
find it analogous to a situation where a network maintainer NACKs a driver for
a network card because it is slower than a different network card. It doesn't
seem reasonable that this situation would happen. He would still put both
drivers in the kernel because people want to use the H/W and its features. So,
I don't think this is a valid reason to NACK the driver.

> I only see issues with that. First and foremost, I
> would need to see solid figures showing that a kernel ioctl or syscall has an
> overhead that is measurably higher, in any meaningful way, than a simple
> function call. I know the userspace command ring is a big marketing feature
> that pleases ignorant userspace programmers. But really, it only brings issues
> and has absolutely no upside AFAICT.
Really? You think that doing a context switch to kernel space, with all its
overhead, is _not_ more expensive than just calling a function in userspace
which only puts a buffer on a ring and writes a doorbell?
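
[Editorial note: to illustrate what that userspace fast path amounts to, here
is a minimal sketch of a user-mode submission: a store into the (already
mmap()ed) ring plus a store into the mapped doorbell page. The queue layout
and all names are hypothetical; this is not the libhsakmt or kfd ABI.]

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical user-mode queue: ring buffer and doorbell both mapped
 * beforehand (ring in ordinary memory, doorbell from the doorbell BAR). */
struct user_queue {
	uint32_t *ring;			/* packet buffer, ring_size slots */
	uint32_t ring_size;		/* number of slots, power of two */
	_Atomic uint32_t wptr;		/* producer index */
	volatile uint32_t *doorbell;	/* one mapped doorbell register */
};

/* Submit one packet without entering the kernel. */
static void user_queue_submit(struct user_queue *q, uint32_t packet)
{
	uint32_t w = atomic_load_explicit(&q->wptr, memory_order_relaxed);

	q->ring[w & (q->ring_size - 1)] = packet;

	/* Make the packet visible before the doorbell write. */
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&q->wptr, w + 1, memory_order_relaxed);
	*q->doorbell = w + 1;	/* HW picks up the new write pointer */
}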
>
> So I would rather see a very simple ioctl that writes the doorbell and might
> do more than that in the case of ring/queue overcommit, where it would first
> have to wait for a free ring/queue before scheduling stuff. This would also
> allow a sane implementation of things like performance counters, which could
> be acquired by the kernel for the duration of a job submitted by userspace.
> While still not optimal, this would be better than userspace locking.
>
>
> I might have more thoughts once I am done with all the patches.
>
> Cheers,
> Jérôme
>
>>
>> Original Cover Letter:
>>
>> This patch set implements a Heterogeneous System Architecture (HSA) driver
>> for radeon-family GPUs.
>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
>> system resources more effectively via HW features including shared pageable
>> memory, userspace-accessible work queues, and platform-level atomics. In
>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
>> Islands family of GPUs also performs HW-level validation of commands passed
>> in through the queues (aka rings).
>>
>> The code in this patch set is intended to serve both as a sample driver for
>> other HSA-compatible hardware devices and as a production driver for
>> radeon-family processors. The code is architected to support multiple CPUs
>> each with connected GPUs, although the current implementation focuses on a
>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
>> graphics driver (kgd).
>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
>> functionality between HSA compute and regular gfx/compute (memory,
>> interrupts, registers), while other functionality has been added
>> specifically for HSA compute  (hw scheduler for virtualized compute rings).
>> All shared hardware is owned by the radeon graphics driver, and an interface
>> between kfd and kgd allows the kfd to make use of those shared resources,
>> while HSA-specific functionality is managed directly by kfd by submitting
>> packets into an HSA-specific command queue (the "HIQ").
>>
>> During kfd module initialization a char device node (/dev/kfd) is created
>> (surviving until module exit), with ioctls for queue creation & management,
>> and data structures are initialized for managing HSA device topology.
>> The rest of the initialization is driven by calls from the radeon kgd at the
>> following points :
>>
>> - radeon_init (kfd_init)
>> - radeon_exit (kfd_fini)
>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>> - radeon_driver_unload_kms (kfd_device_fini)
>>
>> During the probe and init processing per-device data structures are
>> established which connect to the associated graphics kernel driver. This
>> information is exposed to userspace via sysfs, along with a version number
>> allowing userspace to determine if a topology change has occurred while it
>> was reading from sysfs.
>> The interface between kfd and kgd also allows the kfd to request buffer
>> management services from kgd, and allows kgd to route interrupt requests to
>> kfd code since the interrupt block is shared between regular
>> graphics/compute and HSA compute subsystems in the GPU.
>>
>> The kfd code works with an open source usermode library ("libhsakmt") which
>> is in the final stages of IP review and should be published in a separate
>> repo over the next few days.
>> The code operates in one of three modes, selectable via the sched_policy
>> module parameter :
>>
>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
>> CP, and allows oversubscription (more queues than HW slots)
>> - sched_policy=1 also uses HW scheduling but does not allow
>> oversubscription, so create_queue requests fail when we run out of HW slots
>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
>> queues to HW slots by programming registers
>>
>> The "no HW scheduling" option is for debug & new hardware bringup only, so
>> has less test coverage than the other options. Default in the current code
>> is "HW scheduling without oversubscription" since that is where we have the
>> most test coverage but we expect to change the default to "HW scheduling
>> with oversubscription" after further testing. This effectively removes the
>> HW limit on the number of work queues available to applications.
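
(As a usage note: assuming the driver is loaded as the amdkfd module described
earlier in this thread, the policy above is selected at load time with
"modprobe amdkfd sched_policy=<0|1|2>", or with "amdkfd.sched_policy=<n>" on the
kernel command line if the driver is built in.)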
>>
>> Programs running on the GPU are associated with an address space through the
>> VMID field, which is translated to a unique PASID at access time via a set
>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
>> are partitioned (under control of the radeon kgd) between current
>> gfx/compute and HSA compute, with each getting 8 in the current code. The
>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
>> and by driver code if HW scheduling is not being used.
>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
>> dedicated for this purpose, and pages within the doorbell aperture are
>> mapped to userspace (each page mapped to only one user address space).
>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
>> userspace code to safely manage work queues (rings) without requiring a
>> kernel call for every ring update.
>> First step for an application process is to open the kfd device. Calls to
>> open create a kfd "process" structure only for the first thread of the
>> process. Subsequent open calls are checked to see if they are from processes
>> using the same mm_struct and, if so, don't do anything. The kfd per-process
>> data lives as long as the mm_struct exists. Each mm_struct is associated
>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
>> accessible to the GPU.
>> Next step is for the application to collect topology information via sysfs.
>> This gives userspace enough information to be able to identify specific
>> nodes (processors) in subsequent queue management calls. Application
>> processes can create queues on multiple processors, and processors support
>> queues from multiple processes.
>> At this point the application can create work queues in userspace memory and
>> pass them through the usermode library to kfd to have them mapped onto HW
>> queue slots so that commands written to the queues can be executed by the
>> GPU. Queue operations specify a processor node, and so the bulk of this code
>> is device-specific.
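
To illustrate the application-side flow above at the ioctl level, here is a
rough sketch; only /dev/kfd is taken from the description, while the ioctl name
and argument layout are placeholders rather than the actual definitions from
include/uapi/linux/kfd_ioctl.h in this series:

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>
    #include <unistd.h>

    /* Placeholder ioctl definition; see include/uapi/linux/kfd_ioctl.h in the
     * series for the real interface.
     */
    struct kfd_create_queue_args {
            uint64_t ring_base_address;   /* userspace ring buffer */
            uint32_t ring_size;
            uint32_t gpu_id;              /* node picked from the sysfs topology */
            uint32_t queue_id;            /* returned by the driver */
    };
    #define KFD_IOC_CREATE_QUEUE _IOWR('K', 0x01, struct kfd_create_queue_args)

    int create_user_queue(uint32_t gpu_id, void *ring, uint32_t ring_size,
                          uint32_t *queue_id)
    {
            int fd = open("/dev/kfd", O_RDWR);  /* char device from module init */
            struct kfd_create_queue_args args = {
                    .ring_base_address = (uint64_t)(uintptr_t)ring,
                    .ring_size = ring_size,
                    .gpu_id = gpu_id,
            };

            if (fd < 0)
                    return -1;
            if (ioctl(fd, KFD_IOC_CREATE_QUEUE, &args) < 0) {
                    close(fd);
                    return -1;
            }
            *queue_id = args.queue_id;
            return fd;  /* keep open; per-process state lives as long as the mm */
    }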
>> Written by John Bridgman <John.Bridgman@amd.com>
>>
>>
>> Alexey Skidanov (1):
>>    amdkfd: Implement the Get Process Aperture IOCTL
>>
>> Andrew Lewycky (3):
>>    amdkfd: Add basic modules to amdkfd
>>    amdkfd: Add interrupt handling module
>>    amdkfd: Implement the Set Memory Policy IOCTL
>>
>> Ben Goz (8):
>>    amdkfd: Add queue module
>>    amdkfd: Add mqd_manager module
>>    amdkfd: Add kernel queue module
>>    amdkfd: Add module parameter of scheduling policy
>>    amdkfd: Add packet manager module
>>    amdkfd: Add process queue manager module
>>    amdkfd: Add device queue manager module
>>    amdkfd: Implement the create/destroy/update queue IOCTLs
>>
>> Evgeny Pinchuk (3):
>>    amdkfd: Add topology module to amdkfd
>>    amdkfd: Implement the Get Clock Counters IOCTL
>>    amdkfd: Implement the PMC Acquire/Release IOCTLs
>>
>> Oded Gabbay (10):
>>    mm: Add kfd_process pointer to mm_struct
>>    drm/radeon: reduce number of free VMIDs and pipes in KV
>>    drm/radeon/cik: Don't touch int of pipes 1-7
>>    drm/radeon: Report doorbell configuration to amdkfd
>>    drm/radeon: adding synchronization for GRBM GFX
>>    drm/radeon: Add radeon <--> amdkfd interface
>>    Update MAINTAINERS and CREDITS files with amdkfd info
>>    amdkfd: Add IOCTL set definitions of amdkfd
>>    amdkfd: Add amdkfd skeleton driver
>>    amdkfd: Add binding/unbinding calls to amd_iommu driver
>>
>>   CREDITS                                            |    7 +
>>   MAINTAINERS                                        |   10 +
>>   drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>   drivers/gpu/drm/radeon/Makefile                    |    3 +
>>   drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>   drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>   drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>   drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>   drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>   .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 ++++++++++++++++++++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>   drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>   drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>   drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>   drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>   drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>   drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>   drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>   drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>   drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>   drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>   include/linux/mm_types.h                           |   14 +
>>   include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>   43 files changed, 9226 insertions(+), 95 deletions(-)
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>   create mode 100644 include/uapi/linux/kfd_ioctl.h
>>
>> --
>> 1.9.1
>>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 12:36     ` Oded Gabbay
  (?)
@ 2014-07-21 13:39       ` Christian König
  -1 siblings, 0 replies; 148+ messages in thread
From: Christian König @ 2014-07-21 13:39 UTC (permalink / raw)
  To: Oded Gabbay, Jerome Glisse
  Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman,
	Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz,
	Alexey Skidanov, Evgeny Pinchuk, linux-kernel, dri-devel,
	linux-mm

On 21.07.2014 14:36, Oded Gabbay wrote:
> On 20/07/14 20:46, Jerome Glisse wrote:
>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>> Forgot to cc mailing list on cover letter. Sorry.
>>>
>>> As a continuation to the existing discussion, here is a v2 patch series
>>> restructured with a cleaner history and no 
>>> totally-different-early-versions
>>> of the code.
>>>
>>> Instead of 83 patches, there are now a total of 25 patches, where 5 
>>> of them
>>> are modifications to radeon driver and 18 of them include only 
>>> amdkfd code.
>>> There is no code going away or even modified between patches, only 
>>> added.
>>>
>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside 
>>> under
>>> drm/radeon/amdkfd. This move was done to emphasize the fact that 
>>> this driver
>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>> generic hsa framework being implemented in the future and in that 
>>> case, we
>>> will adjust amdkfd to work within that framework.
>>>
>>> As the amdkfd driver should support multiple AMD gfx drivers, we 
>>> want to
>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>> contained in its own folder. The amdkfd folder was put under the radeon
>>> folder because the only AMD gfx driver in the Linux kernel at this 
>>> point
>>> is the radeon driver. Having said that, we will probably need to 
>>> move it
>>> (maybe to be directly under drm) after we integrate with additional 
>>> AMD gfx
>>> drivers.
>>>
>>> For people who like to review using git, the v2 patch set is located 
>>> at:
>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>
>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>
>> So, quick comments before I finish going over all the patches. There are
>> many things that need more documentation, especially as right now there is
>> no userspace I can go look at.
> Some quick comments on some of your questions, but first of all, thanks
> for the time you dedicated to reviewing the code.
>>
>> There are a few show stoppers. The biggest one is gpu memory pinning; this
>> is a big no, and it would need serious arguments for any hope of convincing
>> me on that side.
> We only do gpu memory pinning for kernel objects. There are no
> userspace objects that are pinned in gpu memory in our driver. If
> that is the case, is it still a show stopper?
>
> The kernel objects are:
> - pipelines (4 per device)
> - mqd per hiq (only 1 per device)
> - mqd per userspace queue. On KV, we support up to 1K queues per 
> process, for a total of 512K queues. Each mqd is 151 bytes, but the 
> allocation is done in 256 alignment. So total *possible* memory is 128MB
> - kernel queue (only 1 per device)
> - fence address for kernel queue
> - runlists for the CP (1 or 2 per device)

The main questions here are whether pinning the memory is avoidable at all,
and whether the memory is pinned at driver load, on request from
userspace, or by anything else.

As far as I can see, only the "mqd per userspace queue" might be a bit
questionable; everything else sounds reasonable.
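
To make the question concrete: the distinction I am asking about is whether the
mqd is allocated and pinned only when a userspace queue is actually created
(and released again on destroy), or reserved up front at driver load. A minimal
C model of the first option (names made up, not taken from the patch set):

    #include <stdlib.h>

    #define MQD_ALLOC_SIZE 256  /* 151-byte mqd rounded up to 256-byte alignment */

    struct mqd {
            unsigned char data[MQD_ALLOC_SIZE];
    };

    struct queue {
            struct mqd *mqd;  /* backing object, pinned only while the queue exists */
    };

    static int create_queue(struct queue *q)
    {
            q->mqd = calloc(1, sizeof(*q->mqd));  /* stands in for alloc + GPU pin */
            return q->mqd ? 0 : -1;
    }

    static void destroy_queue(struct queue *q)
    {
            free(q->mqd);                         /* stands in for GPU unpin + free */
            q->mqd = NULL;
    }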

Christian.

>>
>> It might be better to add a drivers/gpu/drm/amd directory and add common
>> stuff there.
>>
>> Given that this is not intended to be the final HSA API, AFAICT, I would
>> say it is far better to avoid the whole kfd module and add ioctls to
>> radeon.
>> This would avoid crazy communication between radeon and kfd.
>>
>> The whole aperture business needs some serious explanation. Especially as
>> you want to use userspace addresses, there is nothing to prevent a userspace
>> program from allocating things at the addresses you reserve for lds, scratch,
>> ... The only sane way would be to move those lds and scratch apertures inside
>> the virtual address range reserved for the kernel (see kernel memory map).
>>
>> The whole business of locking performance counters for exclusive
>> per-process access is a big NO. Which leads me to the questionable
>> usefulness of the userspace command ring.
> That's like saying: "Which leads me to the questionable usefulness of
> HSA". I find it analogous to a situation where a network maintainer
> NACKs a driver for a network card because it is slower than a different
> network card. It doesn't seem reasonable that this situation would happen.
> He would still put both drivers in the kernel because people want
> to use the H/W and its features. So, I don't think this is a valid
> reason to NACK the driver.
>
>> I only see issues with that. First and foremost, I would need to see solid
>> figures showing that a kernel ioctl or syscall has an overhead that is
>> measurably higher, in any meaningful way, than a simple function call. I
>> know the userspace command ring is a big marketing feature that pleases
>> ignorant userspace programmers. But really this only brings issues and
>> absolutely no upside AFAICT.
> Really? You think that doing a context switch to kernel space, with
> all its overhead, is _not_ more expensive than just calling a function
> in userspace which only puts a buffer on a ring and writes a doorbell?
>>
>> So I would rather see a very simple ioctl that writes the doorbell and
>> might do more than that in the case of ring/queue overcommit, where it
>> would first have to wait for a free ring/queue to schedule stuff. This
>> would also allow a sane implementation of things like performance counters
>> that could be acquired by the kernel for the duration of a job submitted
>> by userspace. While still not optimal, this would be better than userspace
>> locking.
>>
>>
>> I might have more thoughts once I am done with all the patches.
>>
>> Cheers,
>> Jérôme
>>
>>>
>>> Original Cover Letter:
>>>
>>> This patch set implements a Heterogeneous System Architecture (HSA) 
>>> driver
>>> for radeon-family GPUs.
>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
>>> system resources more effectively via HW features including shared 
>>> pageable
>>> memory, userspace-accessible work queues, and platform-level 
>>> atomics. In
>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, 
>>> the Sea
>>> Islands family of GPUs also performs HW-level validation of commands 
>>> passed
>>> in through the queues (aka rings).
>>>
>>> The code in this patch set is intended to serve both as a sample 
>>> driver for
>>> other HSA-compatible hardware devices and as a production driver for
>>> radeon-family processors. The code is architected to support 
>>> multiple CPUs
>>> each with connected GPUs, although the current implementation 
>>> focuses on a
>>> single Kaveri/Berlin APU, and works alongside the existing radeon 
>>> kernel
>>> graphics driver (kgd).
>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some 
>>> hardware
>>> functionality between HSA compute and regular gfx/compute (memory,
>>> interrupts, registers), while other functionality has been added
>>> specifically for HSA compute  (hw scheduler for virtualized compute 
>>> rings).
>>> All shared hardware is owned by the radeon graphics driver, and an 
>>> interface
>>> between kfd and kgd allows the kfd to make use of those shared 
>>> resources,
>>> while HSA-specific functionality is managed directly by kfd by 
>>> submitting
>>> packets into an HSA-specific command queue (the "HIQ").
>>>
>>> During kfd module initialization a char device node (/dev/kfd) is 
>>> created
>>> (surviving until module exit), with ioctls for queue creation & 
>>> management,
>>> and data structures are initialized for managing HSA device topology.
>>> The rest of the initialization is driven by calls from the radeon 
>>> kgd at the
>>> following points :
>>>
>>> - radeon_init (kfd_init)
>>> - radeon_exit (kfd_fini)
>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>>> - radeon_driver_unload_kms (kfd_device_fini)
>>>
>>> During the probe and init processing per-device data structures are
>>> established which connect to the associated graphics kernel driver. 
>>> This
>>> information is exposed to userspace via sysfs, along with a version 
>>> number
>>> allowing userspace to determine if a topology change has occurred 
>>> while it
>>> was reading from sysfs.
>>> The interface between kfd and kgd also allows the kfd to request buffer
>>> management services from kgd, and allows kgd to route interrupt 
>>> requests to
>>> kfd code since the interrupt block is shared between regular
>>> graphics/compute and HSA compute subsystems in the GPU.
>>>
>>> The kfd code works with an open source usermode library 
>>> ("libhsakmt") which
>>> is in the final stages of IP review and should be published in a 
>>> separate
>>> repo over the next few days.
>>> The code operates in one of three modes, selectable via the 
>>> sched_policy
>>> module parameter :
>>>
>>> - sched_policy=0 uses a hardware scheduler running in the MEC block 
>>> within
>>> CP, and allows oversubscription (more queues than HW slots)
>>> - sched_policy=1 also uses HW scheduling but does not allow
>>> oversubscription, so create_queue requests fail when we run out of 
>>> HW slots
>>> - sched_policy=2 does not use HW scheduling, so the driver manually 
>>> assigns
>>> queues to HW slots by programming registers
>>>
>>> The "no HW scheduling" option is for debug & new hardware bringup 
>>> only, so
>>> has less test coverage than the other options. Default in the 
>>> current code
>>> is "HW scheduling without oversubscription" since that is where we 
>>> have the
>>> most test coverage but we expect to change the default to "HW 
>>> scheduling
>>> with oversubscription" after further testing. This effectively 
>>> removes the
>>> HW limit on the number of work queues available to applications.
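
As a rough illustration of how such a policy selector is typically wired up
in a kernel module (the symbol names below are a sketch based on the
description above, not necessarily what the amdkfd patches use):

#include <linux/module.h>
#include <linux/moduleparam.h>

#define SCHED_POLICY_HWS             0 /* HW scheduler, oversubscription allowed */
#define SCHED_POLICY_HWS_NO_OVERSUB  1 /* HW scheduler, no oversubscription      */
#define SCHED_POLICY_NO_HWS          2 /* driver programs queue registers itself */

/* Default matches the cover letter: HW scheduling without oversubscription. */
static int sched_policy = SCHED_POLICY_HWS_NO_OVERSUB;
module_param(sched_policy, int, 0444);
MODULE_PARM_DESC(sched_policy,
		 "Scheduling policy (0 = HWS with oversubscription, "
		 "1 = HWS without oversubscription, 2 = no HWS, debug/bringup only)");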
>>>
>>> Programs running on the GPU are associated with an address space 
>>> through the
>>> VMID field, which is translated to a unique PASID at access time via 
>>> a set
>>> of 16 VMID-to-PASID mapping registers. The available VMIDs 
>>> (currently 16)
>>> are partitioned (under control of the radeon kgd) between current
>>> gfx/compute and HSA compute, with each getting 8 in the current 
>>> code. The
>>> VMID-to-PASID mapping registers are updated by the HW scheduler when 
>>> used,
>>> and by driver code if HW scheduling is not being used.
>>> The Sea Islands compute queues use a new "doorbell" mechanism 
>>> instead of the
>>> earlier kernel-managed write pointer registers. Doorbells use a 
>>> separate BAR
>>> dedicated for this purpose, and pages within the doorbell aperture are
>>> mapped to userspace (each page mapped to only one user address space).
>>> Writes to the doorbell aperture are intercepted by GPU hardware, 
>>> allowing
>>> userspace code to safely manage work queues (rings) without requiring a
>>> kernel call for every ring update.
>>> First step for an application process is to open the kfd device. 
>>> Calls to
>>> open create a kfd "process" structure only for the first thread of the
>>> process. Subsequent open calls are checked to see if they are from 
>>> processes
>>> using the same mm_struct and, if so, don't do anything. The kfd 
>>> per-process
>>> data lives as long as the mm_struct exists. Each mm_struct is 
>>> associated
>>> with a unique PASID, allowing the IOMMUv2 to make userspace process 
>>> memory
>>> accessible to the GPU.
>>> Next step is for the application to collect topology information via 
>>> sysfs.
>>> This gives userspace enough information to be able to identify specific
>>> nodes (processors) in subsequent queue management calls. Application
>>> processes can create queues on multiple processors, and processors 
>>> support
>>> queues from multiple processes.
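
A sketch of how userspace could use that version number to detect a topology
change racing with its sysfs reads; the sysfs path here is a placeholder,
the real location is whatever the topology module exposes.

#include <stdio.h>

/* Placeholder path for the topology generation counter. */
#define TOPOLOGY_GEN "/sys/class/kfd/topology/generation_id"

static long read_generation(void)
{
	FILE *f = fopen(TOPOLOGY_GEN, "r");
	long gen = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &gen) != 1)
		gen = -1;
	fclose(f);
	return gen;
}

int main(void)
{
	long before = read_generation();

	/* ... enumerate nodes and their properties from sysfs here ... */

	if (before < 0 || before != read_generation())
		fprintf(stderr, "topology changed while reading, retry\n");
	return 0;
}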
>>> At this point the application can create work queues in userspace 
>>> memory and
>>> pass them through the usermode library to kfd to have them mapped 
>>> onto HW
>>> queue slots so that commands written to the queues can be executed 
>>> by the
>>> GPU. Queue operations specify a processor node, and so the bulk of 
>>> this code
>>> is device-specific.
>>> Written by John Bridgman <John.Bridgman@amd.com>
>>>
>>>
>>> Alexey Skidanov (1):
>>>    amdkfd: Implement the Get Process Aperture IOCTL
>>>
>>> Andrew Lewycky (3):
>>>    amdkfd: Add basic modules to amdkfd
>>>    amdkfd: Add interrupt handling module
>>>    amdkfd: Implement the Set Memory Policy IOCTL
>>>
>>> Ben Goz (8):
>>>    amdkfd: Add queue module
>>>    amdkfd: Add mqd_manager module
>>>    amdkfd: Add kernel queue module
>>>    amdkfd: Add module parameter of scheduling policy
>>>    amdkfd: Add packet manager module
>>>    amdkfd: Add process queue manager module
>>>    amdkfd: Add device queue manager module
>>>    amdkfd: Implement the create/destroy/update queue IOCTLs
>>>
>>> Evgeny Pinchuk (3):
>>>    amdkfd: Add topology module to amdkfd
>>>    amdkfd: Implement the Get Clock Counters IOCTL
>>>    amdkfd: Implement the PMC Acquire/Release IOCTLs
>>>
>>> Oded Gabbay (10):
>>>    mm: Add kfd_process pointer to mm_struct
>>>    drm/radeon: reduce number of free VMIDs and pipes in KV
>>>    drm/radeon/cik: Don't touch int of pipes 1-7
>>>    drm/radeon: Report doorbell configuration to amdkfd
>>>    drm/radeon: adding synchronization for GRBM GFX
>>>    drm/radeon: Add radeon <--> amdkfd interface
>>>    Update MAINTAINERS and CREDITS files with amdkfd info
>>>    amdkfd: Add IOCTL set definitions of amdkfd
>>>    amdkfd: Add amdkfd skeleton driver
>>>    amdkfd: Add binding/unbinding calls to amd_iommu driver
>>>
>>>   CREDITS                                            |    7 +
>>>   MAINTAINERS                                        |   10 +
>>>   drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>>   drivers/gpu/drm/radeon/Makefile                    |    3 +
>>>   drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>>   drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>>   drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>>   drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 
>>> ++++++++++++++++
>>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>>   .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 
>>> ++++++++++++++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>>   drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>>   drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>>   drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>>   drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>>   drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>>   drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>>   drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>>   drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>>   drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>>   include/linux/mm_types.h                           |   14 +
>>>   include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>>   43 files changed, 9226 insertions(+), 95 deletions(-)
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>>   create mode 100644 
>>> drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>>   create mode 100644 
>>> drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>>   create mode 100644 
>>> drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>>   create mode 100644 include/uapi/linux/kfd_ioctl.h
>>>
>>> -- 
>>> 1.9.1
>>>
>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-21 13:39       ` Christian König
  0 siblings, 0 replies; 148+ messages in thread
From: Christian König @ 2014-07-21 13:39 UTC (permalink / raw)
  To: Oded Gabbay, Jerome Glisse
  Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman,
	Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz,
	Alexey Skidanov, Evgeny Pinchuk, linux-kernel, dri-devel,
	linux-mm

On 21.07.2014 14:36, Oded Gabbay wrote:
> On 20/07/14 20:46, Jerome Glisse wrote:
>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>> Forgot to cc mailing list on cover letter. Sorry.
>>>
>>> As a continuation to the existing discussion, here is a v2 patch series
>>> restructured with a cleaner history and no 
>>> totally-different-early-versions
>>> of the code.
>>>
>>> Instead of 83 patches, there are now a total of 25 patches, where 5 
>>> of them
>>> are modifications to radeon driver and 18 of them include only 
>>> amdkfd code.
>>> There is no code going away or even modified between patches, only 
>>> added.
>>>
>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside 
>>> under
>>> drm/radeon/amdkfd. This move was done to emphasize the fact that 
>>> this driver
>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>> generic hsa framework being implemented in the future and in that 
>>> case, we
>>> will adjust amdkfd to work within that framework.
>>>
>>> As the amdkfd driver should support multiple AMD gfx drivers, we 
>>> want to
>>> keep it as a separate driver from radeon. Therefore, the amdkfd code is
>>> contained in its own folder. The amdkfd folder was put under the radeon
>>> folder because the only AMD gfx driver in the Linux kernel at this 
>>> point
>>> is the radeon driver. Having said that, we will probably need to 
>>> move it
>>> (maybe to be directly under drm) after we integrate with additional 
>>> AMD gfx
>>> drivers.
>>>
>>> For people who like to review using git, the v2 patch set is located 
>>> at:
>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>
>>> Written by Oded Gabbay <oded.gabbay@amd.com>
>>
>> So, quick comments before I finish going over all the patches. There
>> are many things that need more documentation, especially as right now
>> there is no userspace I can go look at.
> So, quick comments on some of your questions, but first of all, thanks 
> for the time you dedicated to reviewing the code.
>>
>> There are a few show stoppers; the biggest one is GPU memory pinning.
>> This is a big no, and it would need serious arguments for any hope of
>> convincing me on that side.
> We only do GPU memory pinning for kernel objects. There are no 
> userspace objects that are pinned in GPU memory in our driver. If that 
> is the case, is it still a show stopper?
>
> The kernel objects are:
> - pipelines (4 per device)
> - mqd per hiq (only 1 per device)
> - mqd per userspace queue. On KV, we support up to 1K queues per 
> process, for a total of 512K queues. Each mqd is 151 bytes, but the 
> allocation is done with 256-byte alignment. So the total *possible* 
> memory is 128MB
> - kernel queue (only 1 per device)
> - fence address for kernel queue
> - runlists for the CP (1 or 2 per device)

The main questions here are whether it is avoidable to pin down the 
memory, and whether the memory is pinned down at driver load, by request 
from userspace, or by anything else.

As far as I can see only the "mqd per userspace queue" might be a bit 
questionable, everything else sounds reasonable.

Christian.

>>
>> It might be better to add a drivers/gpu/drm/amd directory and add common
>> stuff there.
>>
>> Given that this is not intended to be the final HSA API, AFAICT, I
>> would say it is far better to avoid the whole kfd module and add
>> ioctls to radeon. This would avoid crazy communication between radeon
>> and kfd.
>>
>> The whole aperture business needs some serious explanation. Especially
>> as you want to use userspace addresses, there is nothing to prevent a
>> userspace program from allocating things at the addresses you reserve
>> for lds, scratch, ... The only sane way would be to move those lds and
>> scratch apertures inside the virtual address range reserved for the
>> kernel (see the kernel memory map).
>>
>> The whole business of locking performance counters for exclusive
>> per-process access is a big NO. Which leads me to the questionable
>> usefulness of the userspace command ring.
> That's like saying: "Which leads me to the questionable usefulness of 
> HSA". I find it analogous to a situation where a network maintainer 
> NACKs a driver for a network card because it is slower than a different 
> network card. It doesn't seem reasonable that this would happen. He 
> would still put both drivers in the kernel, because people want to use 
> the H/W and its features. So, I don't think this is a valid reason to 
> NACK the driver.
>
>> I only see issues with that. First and foremost I would need to see
>> solid figures showing that a kernel ioctl or syscall has an overhead
>> that is measurably higher, in any meaningful way, than a simple
>> function call. I know the userspace command ring is a big marketing
>> feature that pleases ignorant userspace programmers. But really this
>> only brings issues, with absolutely no upside AFAICT.
> Really? You think that doing a context switch to kernel space, with 
> all its overhead, is _not_ more expensive than just calling a function 
> in userspace which only puts a buffer on a ring and writes a doorbell?
>>
>> So I would rather see a very simple ioctl that writes the doorbell,
>> and might do more than that in the ring/queue overcommit case, where
>> it would first have to wait for a free ring/queue before scheduling
>> the work. This would also allow a sane implementation of things like
>> performance counters that could be acquired by the kernel for the
>> duration of a job submitted by userspace. While still not optimal,
>> this would be better than userspace locking.
>>
>>
>> I might have more thoughts once I am done with all the patches.
>>
>> Cheers,
>> Jérôme
>>
>>>
>>> Original Cover Letter:
>>>
>>> This patch set implements a Heterogeneous System Architecture (HSA) 
>>> driver
>>> for radeon-family GPUs.
>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
>>> system resources more effectively via HW features including shared 
>>> pageable
>>> memory, userspace-accessible work queues, and platform-level 
>>> atomics. In
>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, 
>>> the Sea
>>> Islands family of GPUs also performs HW-level validation of commands 
>>> passed
>>> in through the queues (aka rings).
>>>
>>> The code in this patch set is intended to serve both as a sample 
>>> driver for
>>> other HSA-compatible hardware devices and as a production driver for
>>> radeon-family processors. The code is architected to support 
>>> multiple CPUs
>>> each with connected GPUs, although the current implementation 
>>> focuses on a
>>> single Kaveri/Berlin APU, and works alongside the existing radeon 
>>> kernel
>>> graphics driver (kgd).
>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some 
>>> hardware
>>> functionality between HSA compute and regular gfx/compute (memory,
>>> interrupts, registers), while other functionality has been added
>>> specifically for HSA compute  (hw scheduler for virtualized compute 
>>> rings).
>>> All shared hardware is owned by the radeon graphics driver, and an 
>>> interface
>>> between kfd and kgd allows the kfd to make use of those shared 
>>> resources,
>>> while HSA-specific functionality is managed directly by kfd by 
>>> submitting
>>> packets into an HSA-specific command queue (the "HIQ").
>>>
>>> During kfd module initialization a char device node (/dev/kfd) is 
>>> created
>>> (surviving until module exit), with ioctls for queue creation & 
>>> management,
>>> and data structures are initialized for managing HSA device topology.
>>> The rest of the initialization is driven by calls from the radeon 
>>> kgd at the
>>> following points :
>>>
>>> - radeon_init (kfd_init)
>>> - radeon_exit (kfd_fini)
>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>>> - radeon_driver_unload_kms (kfd_device_fini)
>>>
>>> During the probe and init processing per-device data structures are
>>> established which connect to the associated graphics kernel driver. 
>>> This
>>> information is exposed to userspace via sysfs, along with a version 
>>> number
>>> allowing userspace to determine if a topology change has occurred 
>>> while it
>>> was reading from sysfs.
>>> The interface between kfd and kgd also allows the kfd to request buffer
>>> management services from kgd, and allows kgd to route interrupt 
>>> requests to
>>> kfd code since the interrupt block is shared between regular
>>> graphics/compute and HSA compute subsystems in the GPU.
>>>
>>> The kfd code works with an open source usermode library 
>>> ("libhsakmt") which
>>> is in the final stages of IP review and should be published in a 
>>> separate
>>> repo over the next few days.
>>> The code operates in one of three modes, selectable via the 
>>> sched_policy
>>> module parameter :
>>>
>>> - sched_policy=0 uses a hardware scheduler running in the MEC block 
>>> within
>>> CP, and allows oversubscription (more queues than HW slots)
>>> - sched_policy=1 also uses HW scheduling but does not allow
>>> oversubscription, so create_queue requests fail when we run out of 
>>> HW slots
>>> - sched_policy=2 does not use HW scheduling, so the driver manually 
>>> assigns
>>> queues to HW slots by programming registers
>>>
>>> The "no HW scheduling" option is for debug & new hardware bringup 
>>> only, so
>>> has less test coverage than the other options. Default in the 
>>> current code
>>> is "HW scheduling without oversubscription" since that is where we 
>>> have the
>>> most test coverage but we expect to change the default to "HW 
>>> scheduling
>>> with oversubscription" after further testing. This effectively 
>>> removes the
>>> HW limit on the number of work queues available to applications.
>>>
>>> Programs running on the GPU are associated with an address space 
>>> through the
>>> VMID field, which is translated to a unique PASID at access time via 
>>> a set
>>> of 16 VMID-to-PASID mapping registers. The available VMIDs 
>>> (currently 16)
>>> are partitioned (under control of the radeon kgd) between current
>>> gfx/compute and HSA compute, with each getting 8 in the current 
>>> code. The
>>> VMID-to-PASID mapping registers are updated by the HW scheduler when 
>>> used,
>>> and by driver code if HW scheduling is not being used.
>>> The Sea Islands compute queues use a new "doorbell" mechanism 
>>> instead of the
>>> earlier kernel-managed write pointer registers. Doorbells use a 
>>> separate BAR
>>> dedicated for this purpose, and pages within the doorbell aperture are
>>> mapped to userspace (each page mapped to only one user address space).
>>> Writes to the doorbell aperture are intercepted by GPU hardware, 
>>> allowing
>>> userspace code to safely manage work queues (rings) without requiring a
>>> kernel call for every ring update.
>>> First step for an application process is to open the kfd device. 
>>> Calls to
>>> open create a kfd "process" structure only for the first thread of the
>>> process. Subsequent open calls are checked to see if they are from 
>>> processes
>>> using the same mm_struct and, if so, don't do anything. The kfd 
>>> per-process
>>> data lives as long as the mm_struct exists. Each mm_struct is 
>>> associated
>>> with a unique PASID, allowing the IOMMUv2 to make userspace process 
>>> memory
>>> accessible to the GPU.
>>> Next step is for the application to collect topology information via 
>>> sysfs.
>>> This gives userspace enough information to be able to identify specific
>>> nodes (processors) in subsequent queue management calls. Application
>>> processes can create queues on multiple processors, and processors 
>>> support
>>> queues from multiple processes.
>>> At this point the application can create work queues in userspace 
>>> memory and
>>> pass them through the usermode library to kfd to have them mapped 
>>> onto HW
>>> queue slots so that commands written to the queues can be executed 
>>> by the
>>> GPU. Queue operations specify a processor node, and so the bulk of 
>>> this code
>>> is device-specific.
>>> Written by John Bridgman <John.Bridgman@amd.com>
>>>
>>>
>>> Alexey Skidanov (1):
>>>    amdkfd: Implement the Get Process Aperture IOCTL
>>>
>>> Andrew Lewycky (3):
>>>    amdkfd: Add basic modules to amdkfd
>>>    amdkfd: Add interrupt handling module
>>>    amdkfd: Implement the Set Memory Policy IOCTL
>>>
>>> Ben Goz (8):
>>>    amdkfd: Add queue module
>>>    amdkfd: Add mqd_manager module
>>>    amdkfd: Add kernel queue module
>>>    amdkfd: Add module parameter of scheduling policy
>>>    amdkfd: Add packet manager module
>>>    amdkfd: Add process queue manager module
>>>    amdkfd: Add device queue manager module
>>>    amdkfd: Implement the create/destroy/update queue IOCTLs
>>>
>>> Evgeny Pinchuk (3):
>>>    amdkfd: Add topology module to amdkfd
>>>    amdkfd: Implement the Get Clock Counters IOCTL
>>>    amdkfd: Implement the PMC Acquire/Release IOCTLs
>>>
>>> Oded Gabbay (10):
>>>    mm: Add kfd_process pointer to mm_struct
>>>    drm/radeon: reduce number of free VMIDs and pipes in KV
>>>    drm/radeon/cik: Don't touch int of pipes 1-7
>>>    drm/radeon: Report doorbell configuration to amdkfd
>>>    drm/radeon: adding synchronization for GRBM GFX
>>>    drm/radeon: Add radeon <--> amdkfd interface
>>>    Update MAINTAINERS and CREDITS files with amdkfd info
>>>    amdkfd: Add IOCTL set definitions of amdkfd
>>>    amdkfd: Add amdkfd skeleton driver
>>>    amdkfd: Add binding/unbinding calls to amd_iommu driver
>>>
>>>   CREDITS                                            |    7 +
>>>   MAINTAINERS                                        |   10 +
>>>   drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>>   drivers/gpu/drm/radeon/Makefile                    |    3 +
>>>   drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>>   drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>>   drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>>   drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 
>>> ++++++++++++++++
>>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>>   .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 
>>> ++++++++++++++++++++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>>   drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>>   drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>>   drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>>   drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>>   drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>>   drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>>   drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>>   drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>>   drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>>   drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>>   include/linux/mm_types.h                           |   14 +
>>>   include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>>   43 files changed, 9226 insertions(+), 95 deletions(-)
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>>   create mode 100644 
>>> drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>>   create mode 100644 
>>> drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>>   create mode 100644 
>>> drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>>   create mode 100644 include/uapi/linux/kfd_ioctl.h
>>>
>>> -- 
>>> 1.9.1
>>>
>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 13:39       ` Christian König
  (?)
@ 2014-07-21 14:12         ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 14:12 UTC (permalink / raw)
  To: Christian König, Jerome Glisse
  Cc: David Airlie, Alex Deucher, Andrew Morton, John Bridgman,
	Joerg Roedel, Andrew Lewycky, Michel Dänzer, Ben Goz,
	Alexey Skidanov, Evgeny Pinchuk, linux-kernel, dri-devel,
	linux-mm

On 21/07/14 16:39, Christian König wrote:
> On 21.07.2014 14:36, Oded Gabbay wrote:
>> On 20/07/14 20:46, Jerome Glisse wrote:
>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>
>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>> restructured with a cleaner history and no totally-different-early-versions
>>>> of the code.
>>>>
>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>> There is no code going away or even modified between patches, only added.
>>>>
>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>> generic hsa framework being implemented in the future and in that case, we
>>>> will adjust amdkfd to work within that framework.
>>>>
>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>> keep it as a separate driver from radeon. Therefore, the amdkfd code is
>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>> is the radeon driver. Having said that, we will probably need to move it
>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>> drivers.
>>>>
>>>> For people who like to review using git, the v2 patch set is located at:
>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>
>>>> Written by Oded Gabbay <oded.gabbay@amd.com>
>>>
>>> So, quick comments before I finish going over all the patches. There are
>>> many things that need more documentation, especially as right now there is
>>> no userspace I can go look at.
>> So, quick comments on some of your questions, but first of all, thanks for
>> the time you dedicated to reviewing the code.
>>>
>>> There are a few show stoppers; the biggest one is GPU memory pinning. This
>>> is a big no, and it would need serious arguments for any hope of convincing
>>> me on that side.
>> We only do GPU memory pinning for kernel objects. There are no userspace
>> objects that are pinned in GPU memory in our driver. If that is the case,
>> is it still a show stopper?
>>
>> The kernel objects are:
>> - pipelines (4 per device)
>> - mqd per hiq (only 1 per device)
>> - mqd per userspace queue. On KV, we support up to 1K queues per process,
>> for a total of 512K queues. Each mqd is 151 bytes, but the allocation is
>> done with 256-byte alignment. So the total *possible* memory is 128MB
>> - kernel queue (only 1 per device)
>> - fence address for kernel queue
>> - runlists for the CP (1 or 2 per device)
>
> The main questions here are whether it is avoidable to pin down the memory,
> and whether the memory is pinned down at driver load, by request from
> userspace, or by anything else.
>
> As far as I can see only the "mqd per userspace queue" might be a bit
> questionable, everything else sounds reasonable.
>
> Christian.

Most of the pin-downs are done at device initialization.
The "mqd per userspace queue" allocation is done at userspace queue creation. 
However, as I said, it has an upper limit of 128MB on KV, and considering the 
2GB of local memory, I think it is OK.
The runlists are also allocated/freed at userspace queue creation/deletion, 
but we only have 1 or 2 runlists per device, so it is not that bad.
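
A quick sanity check of the 128MB upper bound quoted above, assuming the
512K total comes from 1K queues per process across 512 processes, and that
each 151-byte MQD is rounded up to a 256-byte allocation:

#include <stdio.h>

int main(void)
{
	const unsigned long mqd_alloc    = 256;        /* 151-byte MQD rounded up   */
	const unsigned long total_queues = 512 * 1024; /* 1K queues x 512 processes */
	unsigned long total_bytes = total_queues * mqd_alloc;

	/* 512K * 256 bytes = 134217728 bytes = 128 MiB, matching the figure
	 * above; at 151 real bytes per MQD, roughly 41% of that worst case
	 * is alignment padding. */
	printf("worst-case MQD memory: %lu MiB\n", total_bytes >> 20);
	return 0;
}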

	Oded
>
>>>
>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>> stuff there.
>>>
>>> Given that this is not intended to be the final HSA API, AFAICT, I would
>>> say it is far better to avoid the whole kfd module and add ioctls to
>>> radeon. This would avoid crazy communication between radeon and kfd.
>>>
>>> The whole aperture business needs some serious explanation. Especially as
>>> you want to use userspace addresses, there is nothing to prevent a userspace
>>> program from allocating things at the addresses you reserve for lds, scratch,
>>> ... The only sane way would be to move those lds and scratch apertures inside
>>> the virtual address range reserved for the kernel (see the kernel memory map).
>>>
>>> The whole business of locking performance counters for exclusive per-process
>>> access is a big NO. Which leads me to the questionable usefulness of the
>>> userspace command ring.
>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>> find it analogous to a situation where a network maintainer NACKs a driver
>> for a network card because it is slower than a different network card. It
>> doesn't seem reasonable that this would happen. He would still put both
>> drivers in the kernel, because people want to use the H/W and its features.
>> So, I don't think this is a valid reason to NACK the driver.
>>
>>> I only see issues with that. First and foremost I would need to see solid
>>> figures showing that a kernel ioctl or syscall has an overhead that is
>>> measurably higher, in any meaningful way, than a simple function call. I
>>> know the userspace command ring is a big marketing feature that pleases
>>> ignorant userspace programmers. But really this only brings issues, with
>>> absolutely no upside AFAICT.
>> Really? You think that doing a context switch to kernel space, with all its
>> overhead, is _not_ more expensive than just calling a function in userspace
>> which only puts a buffer on a ring and writes a doorbell?
>>>
>>> So I would rather see a very simple ioctl that writes the doorbell, and
>>> might do more than that in the ring/queue overcommit case, where it would
>>> first have to wait for a free ring/queue before scheduling the work. This
>>> would also allow a sane implementation of things like performance counters
>>> that could be acquired by the kernel for the duration of a job submitted by
>>> userspace. While still not optimal, this would be better than userspace
>>> locking.
>>>
>>>
>>> I might have more thoughts once I am done with all the patches.
>>>
>>> Cheers,
>>> Jérôme
>>>
>>>>
>>>> Original Cover Letter:
>>>>
>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver
>>>> for radeon-family GPUs.
>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
>>>> system resources more effectively via HW features including shared pageable
>>>> memory, userspace-accessible work queues, and platform-level atomics. In
>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
>>>> Islands family of GPUs also performs HW-level validation of commands passed
>>>> in through the queues (aka rings).
>>>>
>>>> The code in this patch set is intended to serve both as a sample driver for
>>>> other HSA-compatible hardware devices and as a production driver for
>>>> radeon-family processors. The code is architected to support multiple CPUs
>>>> each with connected GPUs, although the current implementation focuses on a
>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
>>>> graphics driver (kgd).
>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
>>>> functionality between HSA compute and regular gfx/compute (memory,
>>>> interrupts, registers), while other functionality has been added
>>>> specifically for HSA compute  (hw scheduler for virtualized compute rings).
>>>> All shared hardware is owned by the radeon graphics driver, and an interface
>>>> between kfd and kgd allows the kfd to make use of those shared resources,
>>>> while HSA-specific functionality is managed directly by kfd by submitting
>>>> packets into an HSA-specific command queue (the "HIQ").
>>>>
>>>> During kfd module initialization a char device node (/dev/kfd) is created
>>>> (surviving until module exit), with ioctls for queue creation & management,
>>>> and data structures are initialized for managing HSA device topology.
>>>> The rest of the initialization is driven by calls from the radeon kgd at the
>>>> following points :
>>>>
>>>> - radeon_init (kfd_init)
>>>> - radeon_exit (kfd_fini)
>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>>>> - radeon_driver_unload_kms (kfd_device_fini)
>>>>
>>>> During the probe and init processing per-device data structures are
>>>> established which connect to the associated graphics kernel driver. This
>>>> information is exposed to userspace via sysfs, along with a version number
>>>> allowing userspace to determine if a topology change has occurred while it
>>>> was reading from sysfs.
>>>> The interface between kfd and kgd also allows the kfd to request buffer
>>>> management services from kgd, and allows kgd to route interrupt requests to
>>>> kfd code since the interrupt block is shared between regular
>>>> graphics/compute and HSA compute subsystems in the GPU.
>>>>
>>>> The kfd code works with an open source usermode library ("libhsakmt") which
>>>> is in the final stages of IP review and should be published in a separate
>>>> repo over the next few days.
>>>> The code operates in one of three modes, selectable via the sched_policy
>>>> module parameter :
>>>>
>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
>>>> CP, and allows oversubscription (more queues than HW slots)
>>>> - sched_policy=1 also uses HW scheduling but does not allow
>>>> oversubscription, so create_queue requests fail when we run out of HW slots
>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
>>>> queues to HW slots by programming registers
>>>>
>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so
>>>> has less test coverage than the other options. Default in the current code
>>>> is "HW scheduling without oversubscription" since that is where we have the
>>>> most test coverage but we expect to change the default to "HW scheduling
>>>> with oversubscription" after further testing. This effectively removes the
>>>> HW limit on the number of work queues available to applications.
>>>>
>>>> Programs running on the GPU are associated with an address space through the
>>>> VMID field, which is translated to a unique PASID at access time via a set
>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
>>>> are partitioned (under control of the radeon kgd) between current
>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The
>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
>>>> and by driver code if HW scheduling is not being used.
>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
>>>> dedicated for this purpose, and pages within the doorbell aperture are
>>>> mapped to userspace (each page mapped to only one user address space).
>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
>>>> userspace code to safely manage work queues (rings) without requiring a
>>>> kernel call for every ring update.
>>>> First step for an application process is to open the kfd device. Calls to
>>>> open create a kfd "process" structure only for the first thread of the
>>>> process. Subsequent open calls are checked to see if they are from processes
>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process
>>>> data lives as long as the mm_struct exists. Each mm_struct is associated
>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
>>>> accessible to the GPU.
>>>> Next step is for the application to collect topology information via sysfs.
>>>> This gives userspace enough information to be able to identify specific
>>>> nodes (processors) in subsequent queue management calls. Application
>>>> processes can create queues on multiple processors, and processors support
>>>> queues from multiple processes.
>>>> At this point the application can create work queues in userspace memory and
>>>> pass them through the usermode library to kfd to have them mapped onto HW
>>>> queue slots so that commands written to the queues can be executed by the
>>>> GPU. Queue operations specify a processor node, and so the bulk of this code
>>>> is device-specific.
>>>> Written by John Bridgman <John.Bridgman@amd.com>
>>>>
>>>>
>>>> Alexey Skidanov (1):
>>>>    amdkfd: Implement the Get Process Aperture IOCTL
>>>>
>>>> Andrew Lewycky (3):
>>>>    amdkfd: Add basic modules to amdkfd
>>>>    amdkfd: Add interrupt handling module
>>>>    amdkfd: Implement the Set Memory Policy IOCTL
>>>>
>>>> Ben Goz (8):
>>>>    amdkfd: Add queue module
>>>>    amdkfd: Add mqd_manager module
>>>>    amdkfd: Add kernel queue module
>>>>    amdkfd: Add module parameter of scheduling policy
>>>>    amdkfd: Add packet manager module
>>>>    amdkfd: Add process queue manager module
>>>>    amdkfd: Add device queue manager module
>>>>    amdkfd: Implement the create/destroy/update queue IOCTLs
>>>>
>>>> Evgeny Pinchuk (3):
>>>>    amdkfd: Add topology module to amdkfd
>>>>    amdkfd: Implement the Get Clock Counters IOCTL
>>>>    amdkfd: Implement the PMC Acquire/Release IOCTLs
>>>>
>>>> Oded Gabbay (10):
>>>>    mm: Add kfd_process pointer to mm_struct
>>>>    drm/radeon: reduce number of free VMIDs and pipes in KV
>>>>    drm/radeon/cik: Don't touch int of pipes 1-7
>>>>    drm/radeon: Report doorbell configuration to amdkfd
>>>>    drm/radeon: adding synchronization for GRBM GFX
>>>>    drm/radeon: Add radeon <--> amdkfd interface
>>>>    Update MAINTAINERS and CREDITS files with amdkfd info
>>>>    amdkfd: Add IOCTL set definitions of amdkfd
>>>>    amdkfd: Add amdkfd skeleton driver
>>>>    amdkfd: Add binding/unbinding calls to amd_iommu driver
>>>>
>>>>   CREDITS                                            |    7 +
>>>>   MAINTAINERS                                        |   10 +
>>>>   drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>>>   drivers/gpu/drm/radeon/Makefile                    |    3 +
>>>>   drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>>>   drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>>>   drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>>>   drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>>>>   .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>>>   .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207
>>>> ++++++++++++++++++++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>>>   drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>>>   drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>>>   drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>>>   drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>>>   drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>>>   drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>>>   drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>>>   drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>>>   drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>>>   drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>>>   include/linux/mm_types.h                           |   14 +
>>>>   include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>>>   43 files changed, 9226 insertions(+), 95 deletions(-)
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>>>   create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>>>   create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>>>   create mode 100644 include/uapi/linux/kfd_ioctl.h
>>>>
>>>> --
>>>> 1.9.1
>>>>
>>
>


^ permalink raw reply	[flat|nested] 148+ messages in thread


* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 13:39       ` Christian König
  (?)
@ 2014-07-21 15:25         ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-21 15:25 UTC (permalink / raw)
  To: Christian König
  Cc: Oded Gabbay, Jerome Glisse, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König wrote:
> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >On 20/07/14 20:46, Jerome Glisse wrote:
> >>On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> >>>Forgot to cc mailing list on cover letter. Sorry.
> >>>
> >>>As a continuation to the existing discussion, here is a v2 patch series
> >>>restructured with a cleaner history and no
> >>>totally-different-early-versions
> >>>of the code.
> >>>
> >>>Instead of 83 patches, there are now a total of 25 patches, where 5 of
> >>>them
> >>>are modifications to radeon driver and 18 of them include only amdkfd
> >>>code.
> >>>There is no code going away or even modified between patches, only
> >>>added.
> >>>
> >>>The driver was renamed from radeon_kfd to amdkfd and moved to reside
> >>>under
> >>>drm/radeon/amdkfd. This move was done to emphasize the fact that this
> >>>driver
> >>>is an AMD-only driver at this point. Having said that, we do foresee a
> >>>generic hsa framework being implemented in the future and in that
> >>>case, we
> >>>will adjust amdkfd to work within that framework.
> >>>
> >>>As the amdkfd driver should support multiple AMD gfx drivers, we want
> >>>to
> >>>keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> >>>contained in its own folder. The amdkfd folder was put under the radeon
> >>>folder because the only AMD gfx driver in the Linux kernel at this
> >>>point
> >>>is the radeon driver. Having said that, we will probably need to move
> >>>it
> >>>(maybe to be directly under drm) after we integrate with additional
> >>>AMD gfx
> >>>drivers.
> >>>
> >>>For people who like to review using git, the v2 patch set is located
> >>>at:
> >>>http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> >>>
> >>>Written by Oded Gabbayh <oded.gabbay@amd.com>
> >>
> >>So, quick comments before I finish going over all patches. There are many
> >>things that need more documentation, especially as right now there is
> >>no userspace I can go look at.
> >So quick comments on some of your questions but first of all, thanks for
> >the time you dedicated to review the code.
> >>
> >>There are a few show stoppers; the biggest one is GPU memory pinning, which
> >>is a big no that would need serious arguments for any hope of convincing me
> >>on that side.
> >We only do gpu memory pinning for kernel objects. There are no userspace
> >objects that are pinned on the gpu memory in our driver. If that is the
> >case, is it still a show stopper ?
> >
> >The kernel objects are:
> >- pipelines (4 per device)
> >- mqd per hiq (only 1 per device)
> >- mqd per userspace queue. On KV, we support up to 1K queues per process,
> >for a total of 512K queues. Each mqd is 151 bytes, but the allocation is
> >done in 256 alignment. So total *possible* memory is 128MB
> >- kernel queue (only 1 per device)
> >- fence address for kernel queue
> >- runlists for the CP (1 or 2 per device)
> 
> The main questions here are whether it's avoidable to pin down the memory, and
> whether the memory is pinned down at driver load, by request from userspace or
> by anything else.
> 
> As far as I can see only the "mqd per userspace queue" might be a bit
> questionable, everything else sounds reasonable.

Aside, i915 perspective again (i.e. how we solved this): When scheduling
away from contexts we unpin them and put them into the lru. And in the
shrinker we have a last-ditch callback to switch to a default context
(since you can't ever have no context once you've started) which means we
can evict any context object if it's getting in the way.

We must do that since the contexts have to be in global gtt, which is
shared for scanouts. So fragmenting that badly with lots of context
objects and other stuff is a no-go, since that means we'll start to fail
pageflips.

I don't know whether ttm has a ready-made concept for such
opportunistically pinned stuff. I guess you could wire up the "switch to
dflt context" action to the evict/move function if ttm wants to get rid of
the currently used hw context.

Oh and: This is another reason for letting the kernel schedule contexts,
since you can't do this defrag trick if the gpu does all the scheduling
itself.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 14:12         ` Oded Gabbay
  (?)
@ 2014-07-21 15:54           ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 15:54 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel,
	dri-devel, linux-mm

On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
> On 21/07/14 16:39, Christian König wrote:
> >Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>On 20/07/14 20:46, Jerome Glisse wrote:
> >>>On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> >>>>Forgot to cc mailing list on cover letter. Sorry.
> >>>>
> >>>>As a continuation to the existing discussion, here is a v2 patch series
> >>>>restructured with a cleaner history and no totally-different-early-versions
> >>>>of the code.
> >>>>
> >>>>Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> >>>>are modifications to radeon driver and 18 of them include only amdkfd code.
> >>>>There is no code going away or even modified between patches, only added.
> >>>>
> >>>>The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> >>>>drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> >>>>is an AMD-only driver at this point. Having said that, we do foresee a
> >>>>generic hsa framework being implemented in the future and in that case, we
> >>>>will adjust amdkfd to work within that framework.
> >>>>
> >>>>As the amdkfd driver should support multiple AMD gfx drivers, we want to
> >>>>keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> >>>>contained in its own folder. The amdkfd folder was put under the radeon
> >>>>folder because the only AMD gfx driver in the Linux kernel at this point
> >>>>is the radeon driver. Having said that, we will probably need to move it
> >>>>(maybe to be directly under drm) after we integrate with additional AMD gfx
> >>>>drivers.
> >>>>
> >>>>For people who like to review using git, the v2 patch set is located at:
> >>>>http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> >>>>
> >>>>Written by Oded Gabbayh <oded.gabbay@amd.com>
> >>>
> >>>So quick comments before I finish going over all the patches. There are many
> >>>things that need more documentation, especially as right now there is
> >>>no userspace I can go look at.
> >>So quick comments on some of your questions but first of all, thanks for the
> >>time you dedicated to review the code.
> >>>
> >>>There are a few show stoppers; the biggest one is gpu memory pinning. This is a big
> >>>no, and it would need serious arguments for any hope of convincing me on
> >>>that side.
> >>We only do gpu memory pinning for kernel objects. There are no userspace
> >>objects that are pinned on the gpu memory in our driver. If that is the case,
> >>is it still a show stopper ?
> >>
> >>The kernel objects are:
> >>- pipelines (4 per device)
> >>- mqd per hiq (only 1 per device)
> >>- mqd per userspace queue. On KV, we support up to 1K queues per process, for
> >>a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
> >>256 alignment. So total *possible* memory is 128MB
> >>- kernel queue (only 1 per device)
> >>- fence address for kernel queue
> >>- runlists for the CP (1 or 2 per device)
> >
> >The main questions here are whether it's avoidable to pin down the memory and whether the
> >memory is pinned down at driver load, by request from userspace or by anything
> >else.
> >
> >As far as I can see only the "mqd per userspace queue" might be a bit
> >questionable, everything else sounds reasonable.
> >
> >Christian.
> 
> Most of the pin downs are done on device initialization.
> The "mqd per userspace" is done per userspace queue creation. However, as I
> said, it has an upper limit of 128MB on KV, and considering the 2G local
> memory, I think it is OK.
> The runlists are also done on userspace queue creation/deletion, but we only
> have 1 or 2 runlists per device, so it is not that bad.

2G local memory? You cannot assume anything about the user-side configuration; someone
might build an hsa computer with 512M and still expect a functioning
desktop.

I need to go look into what all this mqd is for, what it does and what it is
about. But pinning is really bad, and this is an issue with userspace command
scheduling, an issue that AMD obviously failed to take into account in the design
phase.

> 	Oded
> >
> >>>
> >>>It might be better to add a drivers/gpu/drm/amd directory and add common
> >>>stuff there.
> >>>
> >>>Given that this is not intended to be the final HSA api AFAICT, I would
> >>>say it is far better to avoid the whole kfd module and add ioctls to radeon.
> >>>This would avoid crazy communication between radeon and kfd.
> >>>
> >>>The whole aperture business needs some serious explanation. Especially as
> >>>you want to use userspace addresses, there is nothing to prevent a userspace
> >>>program from allocating things at the addresses you reserve for lds, scratch,
> >>>... the only sane way would be to move those lds and scratch apertures inside
> >>>the virtual address range reserved for the kernel (see the kernel memory map).
> >>>
> >>>The whole business of locking performance counters for exclusive per-process
> >>>access is a big NO. Which leads me to the questionable usefulness of the user
> >>>space command ring.
> >>That's like saying: "Which leads me to the questionable usefulness of HSA". I
> >>find it analogous to a situation where a network maintainer nacks a driver
> >>for a network card because it is slower than a different network card. It doesn't
> >>seem reasonable that this situation would happen. He would still put both
> >>drivers in the kernel because people want to use the H/W and its features. So,
> >>I don't think this is a valid reason to NACK the driver.

Let me rephrase: drop the performance counter ioctl and, modulo memory pinning,
I see no objection. In other words, I am not NACKing the whole patchset, I am
NACKing the performance ioctl.

Again, this is another argument for a round trip to the kernel. Inside the kernel
you could properly do exclusive gpu counter access across a single user cmd buffer
execution.

> >>
> >>>I only see issues with that. First and foremost I would
> >>>need to see solid figures showing that a kernel ioctl or syscall has an
> >>>overhead that is measurable in any meaningful way against a simple
> >>>function call. I know the userspace command ring is a big marketing feature
> >>>that pleases ignorant userspace programmers. But really this only brings issues
> >>>and absolutely no upside afaict.
> >>Really? You think that doing a context switch to kernel space, with all its
> >>overhead, is _not_ more expensive than just calling a function in userspace
> >>which only puts a buffer on a ring and writes a doorbell?

I am saying the overhead is not that big and it probably will not matter in most
use cases. For instance, I wrote the most useless kernel module, which adds two
numbers through an ioctl (http://people.freedesktop.org/~glisse/adder.tar), and
it takes ~0.35 microseconds with the ioctl while the plain function call takes
~0.025 microseconds, so the ioctl is about 13 times slower.

Now if there is enough data showing that a significant percentage of jobs
submitted to the GPU take less than 0.35 microseconds, then yes, userspace
scheduling does make sense. But so far all we have is handwaving, with no data
to support any facts.
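
For anyone who wants to reproduce that kind of number without the adder module,
a minimal userspace sketch along these lines (a stand-in for it, timing a trivial
existing ioctl, FIONREAD on a pipe, against a plain non-inlined function call)
should show the same order of magnitude; the exact figures will of course vary
with hardware and kernel mitigations:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Stand-in micro-benchmark (not the adder module): compare the round-trip
 * cost of a trivial existing ioctl (FIONREAD on a pipe) with a plain call. */

static int __attribute__((noinline)) add_two(volatile int a, volatile int b)
{
    return a + b;
}

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    enum { N = 1000000 };
    int pipefd[2], avail, sink = 0;
    uint64_t t0, t1, t2;

    if (pipe(pipefd))
        return 1;

    t0 = now_ns();
    for (int i = 0; i < N; i++)
        ioctl(pipefd[0], FIONREAD, &avail);     /* one kernel round trip */
    t1 = now_ns();
    for (int i = 0; i < N; i++)
        sink += add_two(i, sink);               /* plain userspace call */
    t2 = now_ns();

    printf("ioctl:    %.3f us/call\n", (t1 - t0) / (N * 1000.0));
    printf("function: %.3f us/call (sink=%d)\n", (t2 - t1) / (N * 1000.0), sink);
    return 0;
}
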


Now if we want to schedule from userspace then you will need to do something
about the pinning, something that gives control to the kernel so that the kernel
can unpin when it wants and move objects when it wants, no matter what userspace
is doing.

> >>>
> >>>So I would rather see a very simple ioctl that writes the doorbell and might
> >>>do more than that in the case of ring/queue overcommit, where it would first
> >>>have to wait for a free ring/queue to schedule stuff. This would also allow a
> >>>sane implementation of things like performance counters that could be acquired
> >>>by the kernel for the duration of a job submitted by userspace. While still not
> >>>optimal, this would be better than userspace locking.
> >>>
> >>>
> >>>I might have more thoughts once i am done with all the patches.
> >>>
> >>>Cheers,
> >>>Jérôme
> >>>
> >>>>
> >>>>Original Cover Letter:
> >>>>
> >>>>This patch set implements a Heterogeneous System Architecture (HSA) driver
> >>>>for radeon-family GPUs.
> >>>>HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
> >>>>system resources more effectively via HW features including shared pageable
> >>>>memory, userspace-accessible work queues, and platform-level atomics. In
> >>>>addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
> >>>>Islands family of GPUs also performs HW-level validation of commands passed
> >>>>in through the queues (aka rings).
> >>>>
> >>>>The code in this patch set is intended to serve both as a sample driver for
> >>>>other HSA-compatible hardware devices and as a production driver for
> >>>>radeon-family processors. The code is architected to support multiple CPUs
> >>>>each with connected GPUs, although the current implementation focuses on a
> >>>>single Kaveri/Berlin APU, and works alongside the existing radeon kernel
> >>>>graphics driver (kgd).
> >>>>AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
> >>>>functionality between HSA compute and regular gfx/compute (memory,
> >>>>interrupts, registers), while other functionality has been added
> >>>>specifically for HSA compute  (hw scheduler for virtualized compute rings).
> >>>>All shared hardware is owned by the radeon graphics driver, and an interface
> >>>>between kfd and kgd allows the kfd to make use of those shared resources,
> >>>>while HSA-specific functionality is managed directly by kfd by submitting
> >>>>packets into an HSA-specific command queue (the "HIQ").
> >>>>
> >>>>During kfd module initialization a char device node (/dev/kfd) is created
> >>>>(surviving until module exit), with ioctls for queue creation & management,
> >>>>and data structures are initialized for managing HSA device topology.
> >>>>The rest of the initialization is driven by calls from the radeon kgd at the
> >>>>following points :
> >>>>
> >>>>- radeon_init (kfd_init)
> >>>>- radeon_exit (kfd_fini)
> >>>>- radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> >>>>- radeon_driver_unload_kms (kfd_device_fini)
> >>>>
> >>>>During the probe and init processing per-device data structures are
> >>>>established which connect to the associated graphics kernel driver. This
> >>>>information is exposed to userspace via sysfs, along with a version number
> >>>>allowing userspace to determine if a topology change has occurred while it
> >>>>was reading from sysfs.
> >>>>The interface between kfd and kgd also allows the kfd to request buffer
> >>>>management services from kgd, and allows kgd to route interrupt requests to
> >>>>kfd code since the interrupt block is shared between regular
> >>>>graphics/compute and HSA compute subsystems in the GPU.
> >>>>
> >>>>The kfd code works with an open source usermode library ("libhsakmt") which
> >>>>is in the final stages of IP review and should be published in a separate
> >>>>repo over the next few days.
> >>>>The code operates in one of three modes, selectable via the sched_policy
> >>>>module parameter :
> >>>>
> >>>>- sched_policy=0 uses a hardware scheduler running in the MEC block within
> >>>>CP, and allows oversubscription (more queues than HW slots)
> >>>>- sched_policy=1 also uses HW scheduling but does not allow
> >>>>oversubscription, so create_queue requests fail when we run out of HW slots
> >>>>- sched_policy=2 does not use HW scheduling, so the driver manually assigns
> >>>>queues to HW slots by programming registers
> >>>>
> >>>>The "no HW scheduling" option is for debug & new hardware bringup only, so
> >>>>has less test coverage than the other options. Default in the current code
> >>>>is "HW scheduling without oversubscription" since that is where we have the
> >>>>most test coverage but we expect to change the default to "HW scheduling
> >>>>with oversubscription" after further testing. This effectively removes the
> >>>>HW limit on the number of work queues available to applications.
> >>>>
> >>>>Programs running on the GPU are associated with an address space through the
> >>>>VMID field, which is translated to a unique PASID at access time via a set
> >>>>of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
> >>>>are partitioned (under control of the radeon kgd) between current
> >>>>gfx/compute and HSA compute, with each getting 8 in the current code. The
> >>>>VMID-to-PASID mapping registers are updated by the HW scheduler when used,
> >>>>and by driver code if HW scheduling is not being used.
> >>>>The Sea Islands compute queues use a new "doorbell" mechanism instead of the
> >>>>earlier kernel-managed write pointer registers. Doorbells use a separate BAR
> >>>>dedicated for this purpose, and pages within the doorbell aperture are
> >>>>mapped to userspace (each page mapped to only one user address space).
> >>>>Writes to the doorbell aperture are intercepted by GPU hardware, allowing
> >>>>userspace code to safely manage work queues (rings) without requiring a
> >>>>kernel call for every ring update.
> >>>>First step for an application process is to open the kfd device. Calls to
> >>>>open create a kfd "process" structure only for the first thread of the
> >>>>process. Subsequent open calls are checked to see if they are from processes
> >>>>using the same mm_struct and, if so, don't do anything. The kfd per-process
> >>>>data lives as long as the mm_struct exists. Each mm_struct is associated
> >>>>with a unique PASID, allowing the IOMMUv2 to make userspace process memory
> >>>>accessible to the GPU.
> >>>>Next step is for the application to collect topology information via sysfs.
> >>>>This gives userspace enough information to be able to identify specific
> >>>>nodes (processors) in subsequent queue management calls. Application
> >>>>processes can create queues on multiple processors, and processors support
> >>>>queues from multiple processes.
> >>>>At this point the application can create work queues in userspace memory and
> >>>>pass them through the usermode library to kfd to have them mapped onto HW
> >>>>queue slots so that commands written to the queues can be executed by the
> >>>>GPU. Queue operations specify a processor node, and so the bulk of this code
> >>>>is device-specific.
> >>>>Written by John Bridgman <John.Bridgman@amd.com>
> >>>>
> >>>>
> >>>>Alexey Skidanov (1):
> >>>>   amdkfd: Implement the Get Process Aperture IOCTL
> >>>>
> >>>>Andrew Lewycky (3):
> >>>>   amdkfd: Add basic modules to amdkfd
> >>>>   amdkfd: Add interrupt handling module
> >>>>   amdkfd: Implement the Set Memory Policy IOCTL
> >>>>
> >>>>Ben Goz (8):
> >>>>   amdkfd: Add queue module
> >>>>   amdkfd: Add mqd_manager module
> >>>>   amdkfd: Add kernel queue module
> >>>>   amdkfd: Add module parameter of scheduling policy
> >>>>   amdkfd: Add packet manager module
> >>>>   amdkfd: Add process queue manager module
> >>>>   amdkfd: Add device queue manager module
> >>>>   amdkfd: Implement the create/destroy/update queue IOCTLs
> >>>>
> >>>>Evgeny Pinchuk (3):
> >>>>   amdkfd: Add topology module to amdkfd
> >>>>   amdkfd: Implement the Get Clock Counters IOCTL
> >>>>   amdkfd: Implement the PMC Acquire/Release IOCTLs
> >>>>
> >>>>Oded Gabbay (10):
> >>>>   mm: Add kfd_process pointer to mm_struct
> >>>>   drm/radeon: reduce number of free VMIDs and pipes in KV
> >>>>   drm/radeon/cik: Don't touch int of pipes 1-7
> >>>>   drm/radeon: Report doorbell configuration to amdkfd
> >>>>   drm/radeon: adding synchronization for GRBM GFX
> >>>>   drm/radeon: Add radeon <--> amdkfd interface
> >>>>   Update MAINTAINERS and CREDITS files with amdkfd info
> >>>>   amdkfd: Add IOCTL set definitions of amdkfd
> >>>>   amdkfd: Add amdkfd skeleton driver
> >>>>   amdkfd: Add binding/unbinding calls to amd_iommu driver
> >>>>
> >>>>  CREDITS                                            |    7 +
> >>>>  MAINTAINERS                                        |   10 +
> >>>>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
> >>>>  drivers/gpu/drm/radeon/Makefile                    |    3 +
> >>>>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
> >>>>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
> >>>>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
> >>>>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
> >>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
> >>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
> >>>>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 ++++++++++++++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
> >>>>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
> >>>>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
> >>>>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
> >>>>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
> >>>>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
> >>>>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
> >>>>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
> >>>>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
> >>>>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
> >>>>  include/linux/mm_types.h                           |   14 +
> >>>>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
> >>>>  43 files changed, 9226 insertions(+), 95 deletions(-)
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
> >>>>  create mode 100644 include/uapi/linux/kfd_ioctl.h
> >>>>
> >>>>--
> >>>>1.9.1
> >>>>
> >>
> >
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-21 15:54           ` Jerome Glisse
  0 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 15:54 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Andrew Lewycky, Michel Dänzer, linux-kernel, dri-devel,
	linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
> On 21/07/14 16:39, Christian König wrote:
> >Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>On 20/07/14 20:46, Jerome Glisse wrote:
> >>>On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> >>>>Forgot to cc mailing list on cover letter. Sorry.
> >>>>
> >>>>As a continuation to the existing discussion, here is a v2 patch series
> >>>>restructured with a cleaner history and no totally-different-early-versions
> >>>>of the code.
> >>>>
> >>>>Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> >>>>are modifications to radeon driver and 18 of them include only amdkfd code.
> >>>>There is no code going away or even modified between patches, only added.
> >>>>
> >>>>The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> >>>>drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> >>>>is an AMD-only driver at this point. Having said that, we do foresee a
> >>>>generic hsa framework being implemented in the future and in that case, we
> >>>>will adjust amdkfd to work within that framework.
> >>>>
> >>>>As the amdkfd driver should support multiple AMD gfx drivers, we want to
> >>>>keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> >>>>contained in its own folder. The amdkfd folder was put under the radeon
> >>>>folder because the only AMD gfx driver in the Linux kernel at this point
> >>>>is the radeon driver. Having said that, we will probably need to move it
> >>>>(maybe to be directly under drm) after we integrate with additional AMD gfx
> >>>>drivers.
> >>>>
> >>>>For people who like to review using git, the v2 patch set is located at:
> >>>>http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> >>>>
> >>>>Written by Oded Gabbayh <oded.gabbay@amd.com>
> >>>
> >>>So quick comments before i finish going over all patches. There is many
> >>>things that need more documentation espacialy as of right now there is
> >>>no userspace i can go look at.
> >>So quick comments on some of your questions but first of all, thanks for the
> >>time you dedicated to review the code.
> >>>
> >>>There few show stopper, biggest one is gpu memory pinning this is a big
> >>>no, that would need serious arguments for any hope of convincing me on
> >>>that side.
> >>We only do gpu memory pinning for kernel objects. There are no userspace
> >>objects that are pinned on the gpu memory in our driver. If that is the case,
> >>is it still a show stopper ?
> >>
> >>The kernel objects are:
> >>- pipelines (4 per device)
> >>- mqd per hiq (only 1 per device)
> >>- mqd per userspace queue. On KV, we support up to 1K queues per process, for
> >>a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
> >>256 alignment. So total *possible* memory is 128MB
> >>- kernel queue (only 1 per device)
> >>- fence address for kernel queue
> >>- runlists for the CP (1 or 2 per device)
> >
> >The main questions here are if it's avoid able to pin down the memory and if the
> >memory is pinned down at driver load, by request from userspace or by anything
> >else.
> >
> >As far as I can see only the "mqd per userspace queue" might be a bit
> >questionable, everything else sounds reasonable.
> >
> >Christian.
> 
> Most of the pin downs are done on device initialization.
> The "mqd per userspace" is done per userspace queue creation. However, as I
> said, it has an upper limit of 128MB on KV, and considering the 2G local
> memory, I think it is OK.
> The runlists are also done on userspace queue creation/deletion, but we only
> have 1 or 2 runlists per device, so it is not that bad.

2G of local memory? You cannot assume anything about the user-side
configuration; someone might build an HSA computer with 512M and still
expect a functioning desktop.

I need to go look into what this mqd is for, what it does and what it is
about. But pinning is really bad, and this is an issue with userspace command
scheduling that AMD apparently failed to take into account in the design
phase.
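
For reference, the 128MB upper bound quoted above is easy to sanity-check.
A minimal userspace sketch, assuming the figures given in the thread (512K
queues total, i.e. 1K per process which implies 512 processes, each mqd
padded out to a 256-byte allocation):

#include <stdio.h>

int main(void)
{
        unsigned long total_queues = 512 * 1024; /* 1K queues x 512 processes */
        unsigned long mqd_bytes    = 256;        /* 151 bytes, 256-byte aligned */
        unsigned long total_bytes  = total_queues * mqd_bytes;

        printf("worst-case pinned mqd memory: %lu MiB\n", total_bytes >> 20);
        return 0;
}

That prints 128 MiB, i.e. the worst case assumes every one of the 512
possible processes actually creates its full complement of 1K queues.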

> 	Oded
> >
> >>>
> >>>It might be better to add a drivers/gpu/drm/amd directory and add common
> >>>stuff there.
> >>>
> >>>Given that this is not intended to be final HSA api AFAICT then i would
> >>>say this far better to avoid the whole kfd module and add ioctl to radeon.
> >>>This would avoid crazy communication btw radeon and kfd.
> >>>
> >>>The whole aperture business needs some serious explanation. Especialy as
> >>>you want to use userspace address there is nothing to prevent userspace
> >>>program from allocating things at address you reserve for lds, scratch,
> >>>... only sane way would be to move those lds, scratch inside the virtual
> >>>address reserved for kernel (see kernel memory map).
> >>>
> >>>The whole business of locking performance counter for exclusive per process
> >>>access is a big NO. Which leads me to the questionable usefullness of user
> >>>space command ring.
> >>That's like saying: "Which leads me to the questionable usefulness of HSA". I
> >>find it analogous to a situation where a network maintainer nacking a driver
> >>for a network card, which is slower than a different network card. Doesn't
> >>seem reasonable this situation is would happen. He would still put both the
> >>drivers in the kernel because people want to use the H/W and its features. So,
> >>I don't think this is a valid reason to NACK the driver.

Let me rephrase: drop the performance counter ioctl and, modulo the memory
pinning, I see no objection. In other words, I am not NACKing the whole
patchset, I am NACKing the performance ioctl.

Again, this is another argument for a round trip to the kernel. Inside the
kernel you could properly do exclusive GPU counter access across a single
user command buffer execution.

> >>
> >>>I only see issues with that. First and foremost i would
> >>>need to see solid figures that kernel ioctl or syscall has a higher an
> >>>overhead that is measurable in any meaning full way against a simple
> >>>function call. I know the userspace command ring is a big marketing features
> >>>that please ignorant userspace programmer. But really this only brings issues
> >>>and for absolutely not upside afaict.
> >>Really ? You think that doing a context switch to kernel space, with all its
> >>overhead, is _not_ more expansive than just calling a function in userspace
> >>which only puts a buffer on a ring and writes a doorbell ?

I am saying the overhead is not that big and it probably will not matter in
most use cases. For instance, I wrote the most useless kernel module, one
that adds two numbers through an ioctl
(http://people.freedesktop.org/~glisse/adder.tar); the ioctl takes ~0.35
microseconds while the plain function call takes ~0.025 microseconds, so the
ioctl is about 13 times slower.

Now, if there is enough data showing that a significant percentage of jobs
submitted to the GPU take less than 0.35 microseconds, then yes, userspace
scheduling does make sense. But so far all we have is handwaving, with no
data to support any of it.
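
For anyone who wants to repeat this kind of measurement without building the
adder module, here is a minimal, self-contained sketch. It compares a plain
(noinline) function call against an existing ioctl, FIONREAD on a pipe, so
the absolute numbers will not match the ~0.35us/~0.025us figures above, but
it shows the same kind of gap:

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define ITERS 1000000

static __attribute__((noinline)) int add_two(int a, int b)
{
        return a + b;
}

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
        int fds[2], avail = 0, i;
        volatile int sink = 0;   /* keep the function-call loop alive */
        uint64_t t0, t1, t2;

        if (pipe(fds))
                return 1;

        t0 = now_ns();
        for (i = 0; i < ITERS; i++)
                sink += add_two(i, i);
        t1 = now_ns();
        for (i = 0; i < ITERS; i++)
                ioctl(fds[0], FIONREAD, &avail);
        t2 = now_ns();

        printf("function call:    %.1f ns/iter\n", (double)(t1 - t0) / ITERS);
        printf("ioctl (FIONREAD): %.1f ns/iter\n", (double)(t2 - t1) / ITERS);
        return 0;
}

Build with gcc -O2 and run it a few times; on most systems the ioctl path
comes out substantially more expensive per call than the direct function
call, which is the overhead being quantified here.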


Now, if we want to schedule from userspace, then you will need to do
something about the pinning: something that gives control to the kernel so
that it can unpin and move objects whenever it wants, no matter what
userspace is doing.

> >>>
> >>>So i would rather see a very simple ioctl that write the doorbell and might
> >>>do more than that in case of ring/queue overcommit where it would first have
> >>>to wait for a free ring/queue to schedule stuff. This would also allow sane
> >>>implementation of things like performance counter that could be acquire by
> >>>kernel for duration of a job submitted by userspace. While still not optimal
> >>>this would be better that userspace locking.
> >>>
> >>>
> >>>I might have more thoughts once i am done with all the patches.
> >>>
> >>>Cheers,
> >>>Jérôme
> >>>
> >>>>
> >>>>Original Cover Letter:
> >>>>
> >>>>This patch set implements a Heterogeneous System Architecture (HSA) driver
> >>>>for radeon-family GPUs.
> >>>>HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
> >>>>system resources more effectively via HW features including shared pageable
> >>>>memory, userspace-accessible work queues, and platform-level atomics. In
> >>>>addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
> >>>>Islands family of GPUs also performs HW-level validation of commands passed
> >>>>in through the queues (aka rings).
> >>>>
> >>>>The code in this patch set is intended to serve both as a sample driver for
> >>>>other HSA-compatible hardware devices and as a production driver for
> >>>>radeon-family processors. The code is architected to support multiple CPUs
> >>>>each with connected GPUs, although the current implementation focuses on a
> >>>>single Kaveri/Berlin APU, and works alongside the existing radeon kernel
> >>>>graphics driver (kgd).
> >>>>AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
> >>>>functionality between HSA compute and regular gfx/compute (memory,
> >>>>interrupts, registers), while other functionality has been added
> >>>>specifically for HSA compute  (hw scheduler for virtualized compute rings).
> >>>>All shared hardware is owned by the radeon graphics driver, and an interface
> >>>>between kfd and kgd allows the kfd to make use of those shared resources,
> >>>>while HSA-specific functionality is managed directly by kfd by submitting
> >>>>packets into an HSA-specific command queue (the "HIQ").
> >>>>
> >>>>During kfd module initialization a char device node (/dev/kfd) is created
> >>>>(surviving until module exit), with ioctls for queue creation & management,
> >>>>and data structures are initialized for managing HSA device topology.
> >>>>The rest of the initialization is driven by calls from the radeon kgd at the
> >>>>following points :
> >>>>
> >>>>- radeon_init (kfd_init)
> >>>>- radeon_exit (kfd_fini)
> >>>>- radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> >>>>- radeon_driver_unload_kms (kfd_device_fini)
> >>>>
> >>>>During the probe and init processing per-device data structures are
> >>>>established which connect to the associated graphics kernel driver. This
> >>>>information is exposed to userspace via sysfs, along with a version number
> >>>>allowing userspace to determine if a topology change has occurred while it
> >>>>was reading from sysfs.
> >>>>The interface between kfd and kgd also allows the kfd to request buffer
> >>>>management services from kgd, and allows kgd to route interrupt requests to
> >>>>kfd code since the interrupt block is shared between regular
> >>>>graphics/compute and HSA compute subsystems in the GPU.
> >>>>
> >>>>The kfd code works with an open source usermode library ("libhsakmt") which
> >>>>is in the final stages of IP review and should be published in a separate
> >>>>repo over the next few days.
> >>>>The code operates in one of three modes, selectable via the sched_policy
> >>>>module parameter :
> >>>>
> >>>>- sched_policy=0 uses a hardware scheduler running in the MEC block within
> >>>>CP, and allows oversubscription (more queues than HW slots)
> >>>>- sched_policy=1 also uses HW scheduling but does not allow
> >>>>oversubscription, so create_queue requests fail when we run out of HW slots
> >>>>- sched_policy=2 does not use HW scheduling, so the driver manually assigns
> >>>>queues to HW slots by programming registers
> >>>>
> >>>>The "no HW scheduling" option is for debug & new hardware bringup only, so
> >>>>has less test coverage than the other options. Default in the current code
> >>>>is "HW scheduling without oversubscription" since that is where we have the
> >>>>most test coverage but we expect to change the default to "HW scheduling
> >>>>with oversubscription" after further testing. This effectively removes the
> >>>>HW limit on the number of work queues available to applications.
> >>>>
> >>>>Programs running on the GPU are associated with an address space through the
> >>>>VMID field, which is translated to a unique PASID at access time via a set
> >>>>of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
> >>>>are partitioned (under control of the radeon kgd) between current
> >>>>gfx/compute and HSA compute, with each getting 8 in the current code. The
> >>>>VMID-to-PASID mapping registers are updated by the HW scheduler when used,
> >>>>and by driver code if HW scheduling is not being used.
> >>>>The Sea Islands compute queues use a new "doorbell" mechanism instead of the
> >>>>earlier kernel-managed write pointer registers. Doorbells use a separate BAR
> >>>>dedicated for this purpose, and pages within the doorbell aperture are
> >>>>mapped to userspace (each page mapped to only one user address space).
> >>>>Writes to the doorbell aperture are intercepted by GPU hardware, allowing
> >>>>userspace code to safely manage work queues (rings) without requiring a
> >>>>kernel call for every ring update.
> >>>>First step for an application process is to open the kfd device. Calls to
> >>>>open create a kfd "process" structure only for the first thread of the
> >>>>process. Subsequent open calls are checked to see if they are from processes
> >>>>using the same mm_struct and, if so, don't do anything. The kfd per-process
> >>>>data lives as long as the mm_struct exists. Each mm_struct is associated
> >>>>with a unique PASID, allowing the IOMMUv2 to make userspace process memory
> >>>>accessible to the GPU.
> >>>>Next step is for the application to collect topology information via sysfs.
> >>>>This gives userspace enough information to be able to identify specific
> >>>>nodes (processors) in subsequent queue management calls. Application
> >>>>processes can create queues on multiple processors, and processors support
> >>>>queues from multiple processes.
> >>>>At this point the application can create work queues in userspace memory and
> >>>>pass them through the usermode library to kfd to have them mapped onto HW
> >>>>queue slots so that commands written to the queues can be executed by the
> >>>>GPU. Queue operations specify a processor node, and so the bulk of this code
> >>>>is device-specific.
> >>>>Written by John Bridgman <John.Bridgman@amd.com>
> >>>>
> >>>>
> >>>>Alexey Skidanov (1):
> >>>>   amdkfd: Implement the Get Process Aperture IOCTL
> >>>>
> >>>>Andrew Lewycky (3):
> >>>>   amdkfd: Add basic modules to amdkfd
> >>>>   amdkfd: Add interrupt handling module
> >>>>   amdkfd: Implement the Set Memory Policy IOCTL
> >>>>
> >>>>Ben Goz (8):
> >>>>   amdkfd: Add queue module
> >>>>   amdkfd: Add mqd_manager module
> >>>>   amdkfd: Add kernel queue module
> >>>>   amdkfd: Add module parameter of scheduling policy
> >>>>   amdkfd: Add packet manager module
> >>>>   amdkfd: Add process queue manager module
> >>>>   amdkfd: Add device queue manager module
> >>>>   amdkfd: Implement the create/destroy/update queue IOCTLs
> >>>>
> >>>>Evgeny Pinchuk (3):
> >>>>   amdkfd: Add topology module to amdkfd
> >>>>   amdkfd: Implement the Get Clock Counters IOCTL
> >>>>   amdkfd: Implement the PMC Acquire/Release IOCTLs
> >>>>
> >>>>Oded Gabbay (10):
> >>>>   mm: Add kfd_process pointer to mm_struct
> >>>>   drm/radeon: reduce number of free VMIDs and pipes in KV
> >>>>   drm/radeon/cik: Don't touch int of pipes 1-7
> >>>>   drm/radeon: Report doorbell configuration to amdkfd
> >>>>   drm/radeon: adding synchronization for GRBM GFX
> >>>>   drm/radeon: Add radeon <--> amdkfd interface
> >>>>   Update MAINTAINERS and CREDITS files with amdkfd info
> >>>>   amdkfd: Add IOCTL set definitions of amdkfd
> >>>>   amdkfd: Add amdkfd skeleton driver
> >>>>   amdkfd: Add binding/unbinding calls to amd_iommu driver
> >>>>
> >>>>  CREDITS                                            |    7 +
> >>>>  MAINTAINERS                                        |   10 +
> >>>>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
> >>>>  drivers/gpu/drm/radeon/Makefile                    |    3 +
> >>>>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
> >>>>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
> >>>>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
> >>>>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
> >>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
> >>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
> >>>>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207
> >>>>++++++++++++++++++++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
> >>>>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
> >>>>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
> >>>>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
> >>>>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
> >>>>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
> >>>>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
> >>>>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
> >>>>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
> >>>>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
> >>>>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
> >>>>  include/linux/mm_types.h                           |   14 +
> >>>>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
> >>>>  43 files changed, 9226 insertions(+), 95 deletions(-)
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
> >>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
> >>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
> >>>>  create mode 100644 include/uapi/linux/kfd_ioctl.h
> >>>>
> >>>>--
> >>>>1.9.1
> >>>>
> >>
> >
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 15:25         ` Daniel Vetter
  (?)
@ 2014-07-21 15:58           ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 15:58 UTC (permalink / raw)
  To: Christian König, Oded Gabbay, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König wrote:
> > Am 21.07.2014 14:36, schrieb Oded Gabbay:
> > >On 20/07/14 20:46, Jerome Glisse wrote:
> > >>On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> > >>>Forgot to cc mailing list on cover letter. Sorry.
> > >>>
> > >>>As a continuation to the existing discussion, here is a v2 patch series
> > >>>restructured with a cleaner history and no
> > >>>totally-different-early-versions
> > >>>of the code.
> > >>>
> > >>>Instead of 83 patches, there are now a total of 25 patches, where 5 of
> > >>>them
> > >>>are modifications to radeon driver and 18 of them include only amdkfd
> > >>>code.
> > >>>There is no code going away or even modified between patches, only
> > >>>added.
> > >>>
> > >>>The driver was renamed from radeon_kfd to amdkfd and moved to reside
> > >>>under
> > >>>drm/radeon/amdkfd. This move was done to emphasize the fact that this
> > >>>driver
> > >>>is an AMD-only driver at this point. Having said that, we do foresee a
> > >>>generic hsa framework being implemented in the future and in that
> > >>>case, we
> > >>>will adjust amdkfd to work within that framework.
> > >>>
> > >>>As the amdkfd driver should support multiple AMD gfx drivers, we want
> > >>>to
> > >>>keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> > >>>contained in its own folder. The amdkfd folder was put under the radeon
> > >>>folder because the only AMD gfx driver in the Linux kernel at this
> > >>>point
> > >>>is the radeon driver. Having said that, we will probably need to move
> > >>>it
> > >>>(maybe to be directly under drm) after we integrate with additional
> > >>>AMD gfx
> > >>>drivers.
> > >>>
> > >>>For people who like to review using git, the v2 patch set is located
> > >>>at:
> > >>>http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> > >>>
> > >>>Written by Oded Gabbayh <oded.gabbay@amd.com>
> > >>
> > >>So quick comments before i finish going over all patches. There is many
> > >>things that need more documentation espacialy as of right now there is
> > >>no userspace i can go look at.
> > >So quick comments on some of your questions but first of all, thanks for
> > >the time you dedicated to review the code.
> > >>
> > >>There few show stopper, biggest one is gpu memory pinning this is a big
> > >>no, that would need serious arguments for any hope of convincing me on
> > >>that side.
> > >We only do gpu memory pinning for kernel objects. There are no userspace
> > >objects that are pinned on the gpu memory in our driver. If that is the
> > >case, is it still a show stopper ?
> > >
> > >The kernel objects are:
> > >- pipelines (4 per device)
> > >- mqd per hiq (only 1 per device)
> > >- mqd per userspace queue. On KV, we support up to 1K queues per process,
> > >for a total of 512K queues. Each mqd is 151 bytes, but the allocation is
> > >done in 256 alignment. So total *possible* memory is 128MB
> > >- kernel queue (only 1 per device)
> > >- fence address for kernel queue
> > >- runlists for the CP (1 or 2 per device)
> > 
> > The main questions here are if it's avoid able to pin down the memory and if
> > the memory is pinned down at driver load, by request from userspace or by
> > anything else.
> > 
> > As far as I can see only the "mqd per userspace queue" might be a bit
> > questionable, everything else sounds reasonable.
> 
> Aside, i915 perspective again (i.e. how we solved this): When scheduling
> away from contexts we unpin them and put them into the lru. And in the
> shrinker we have a last-ditch callback to switch to a default context
> (since you can't ever have no context once you've started) which means we
> can evict any context object if it's getting in the way.

So Intel hardware reports through some interrupt or some channel when it is
not using a context? I.e. the kernel side gets a notification when some user
context is done executing?

The issue with radeon hardware, AFAICT, is that the hardware does not report
anything about the userspace context that is running, i.e. you do not get a
notification when a context is no longer in use. Well, AFAICT; maybe the
hardware does provide that.

For instance, the VMIDs are a limited resource, so you have to bind them
dynamically. Maybe we could allocate a pinned buffer only per VMID, and when
binding a new PASID to a VMID copy the pinned buffer's contents back out to
the previous PASID's unpinned copy.
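
To make the idea concrete, here is a toy userspace model of that scheme (all
names are hypothetical; this only illustrates the data movement, not real
driver code): a fixed pool of VMID slots each owns one pinned buffer, a
process's queue state is copied in when its PASID is bound and copied back
to its unpinned copy when it is unbound.

#include <stdio.h>
#include <string.h>

#define NUM_VMIDS  8    /* HSA compute gets 8 of the 16 VMIDs on KV */
#define STATE_SIZE 256  /* stand-in for an mqd-sized per-process blob */

struct vmid_slot {
        int pasid;                          /* -1 when free */
        unsigned char pinned[STATE_SIZE];   /* the only pinned memory */
};

struct process_state {
        int pasid;
        unsigned char unpinned[STATE_SIZE]; /* normal, evictable memory */
};

static struct vmid_slot slots[NUM_VMIDS];

static int bind_pasid(struct process_state *p)
{
        int i;

        for (i = 0; i < NUM_VMIDS; i++) {
                if (slots[i].pasid == -1) {
                        slots[i].pasid = p->pasid;
                        memcpy(slots[i].pinned, p->unpinned, STATE_SIZE);
                        return i;
                }
        }
        return -1; /* no free VMID: caller would first unbind (evict) someone */
}

static void unbind_vmid(int vmid, struct process_state *p)
{
        memcpy(p->unpinned, slots[vmid].pinned, STATE_SIZE);
        slots[vmid].pasid = -1;
}

int main(void)
{
        struct process_state p = { .pasid = 42 };
        int i, vmid;

        for (i = 0; i < NUM_VMIDS; i++)
                slots[i].pasid = -1;

        vmid = bind_pasid(&p);
        printf("pasid %d bound to vmid %d\n", p.pasid, vmid);
        unbind_vmid(vmid, &p);
        return 0;
}

With something like this only NUM_VMIDS * STATE_SIZE bytes ever need to stay
pinned, independent of how many processes or queues exist.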

Cheers,
Jérôme

> 
> We must do that since the contexts have to be in global gtt, which is
> shared for scanouts. So fragmenting that badly with lots of context
> objects and other stuff is a no-go, since that means we'll start to fail
> pageflips.
> 
> I don't know whether ttm has a ready-made concept for such
> opportunistically pinned stuff. I guess you could wire up the "switch to
> dflt context" action to the evict/move function if ttm wants to get rid of
> the currently used hw context.
> 
> Oh and: This is another reason for letting the kernel schedule contexts,
> since you can't do this defrag trick if the gpu does all the scheduling
> itself.
> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> +41 (0) 79 365 57 48 - http://blog.ffwll.ch
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 15:58           ` Jerome Glisse
  (?)
@ 2014-07-21 17:05             ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-21 17:05 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Oded Gabbay, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
> > On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König wrote:
> > > Am 21.07.2014 14:36, schrieb Oded Gabbay:
> > > >On 20/07/14 20:46, Jerome Glisse wrote:
> > > >>On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> > > >>>Forgot to cc mailing list on cover letter. Sorry.
> > > >>>
> > > >>>As a continuation to the existing discussion, here is a v2 patch series
> > > >>>restructured with a cleaner history and no
> > > >>>totally-different-early-versions
> > > >>>of the code.
> > > >>>
> > > >>>Instead of 83 patches, there are now a total of 25 patches, where 5 of
> > > >>>them
> > > >>>are modifications to radeon driver and 18 of them include only amdkfd
> > > >>>code.
> > > >>>There is no code going away or even modified between patches, only
> > > >>>added.
> > > >>>
> > > >>>The driver was renamed from radeon_kfd to amdkfd and moved to reside
> > > >>>under
> > > >>>drm/radeon/amdkfd. This move was done to emphasize the fact that this
> > > >>>driver
> > > >>>is an AMD-only driver at this point. Having said that, we do foresee a
> > > >>>generic hsa framework being implemented in the future and in that
> > > >>>case, we
> > > >>>will adjust amdkfd to work within that framework.
> > > >>>
> > > >>>As the amdkfd driver should support multiple AMD gfx drivers, we want
> > > >>>to
> > > >>>keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> > > >>>contained in its own folder. The amdkfd folder was put under the radeon
> > > >>>folder because the only AMD gfx driver in the Linux kernel at this
> > > >>>point
> > > >>>is the radeon driver. Having said that, we will probably need to move
> > > >>>it
> > > >>>(maybe to be directly under drm) after we integrate with additional
> > > >>>AMD gfx
> > > >>>drivers.
> > > >>>
> > > >>>For people who like to review using git, the v2 patch set is located
> > > >>>at:
> > > >>>http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> > > >>>
> > > >>>Written by Oded Gabbayh <oded.gabbay@amd.com>
> > > >>
> > > >>So quick comments before i finish going over all patches. There is many
> > > >>things that need more documentation espacialy as of right now there is
> > > >>no userspace i can go look at.
> > > >So quick comments on some of your questions but first of all, thanks for
> > > >the time you dedicated to review the code.
> > > >>
> > > >>There few show stopper, biggest one is gpu memory pinning this is a big
> > > >>no, that would need serious arguments for any hope of convincing me on
> > > >>that side.
> > > >We only do gpu memory pinning for kernel objects. There are no userspace
> > > >objects that are pinned on the gpu memory in our driver. If that is the
> > > >case, is it still a show stopper ?
> > > >
> > > >The kernel objects are:
> > > >- pipelines (4 per device)
> > > >- mqd per hiq (only 1 per device)
> > > >- mqd per userspace queue. On KV, we support up to 1K queues per process,
> > > >for a total of 512K queues. Each mqd is 151 bytes, but the allocation is
> > > >done in 256 alignment. So total *possible* memory is 128MB
> > > >- kernel queue (only 1 per device)
> > > >- fence address for kernel queue
> > > >- runlists for the CP (1 or 2 per device)
> > > 
> > > The main questions here are if it's avoid able to pin down the memory and if
> > > the memory is pinned down at driver load, by request from userspace or by
> > > anything else.
> > > 
> > > As far as I can see only the "mqd per userspace queue" might be a bit
> > > questionable, everything else sounds reasonable.
> > 
> > Aside, i915 perspective again (i.e. how we solved this): When scheduling
> > away from contexts we unpin them and put them into the lru. And in the
> > shrinker we have a last-ditch callback to switch to a default context
> > (since you can't ever have no context once you've started) which means we
> > can evict any context object if it's getting in the way.
> 
> So Intel hardware report through some interrupt or some channel when it is
> not using a context ? ie kernel side get notification when some user context
> is done executing ?

Yes, as long as we do the scheduling with the CPU we get interrupts for
context switches. The mechanism is already published in the execlist patches
currently floating around: we get a special context switch interrupt.

But we already have this unpin logic in the current code, where we switch
contexts through in-line CS commands from the kernel. There we obviously use
the normal batch completion events.

> The issue with radeon hardware AFAICT is that the hardware do not report any
> thing about the userspace context running ie you do not get notification when
> a context is not use. Well AFAICT. Maybe hardware do provide that.

I'm not sure whether we can do the same trick with the hw scheduler. But
then unpinning hw contexts will drain the pipeline anyway, so I guess we
can just stop feeding the hw scheduler until it runs dry. And then unpin
and evict.

> Like the VMID is a limited resources so you have to dynamicly bind them so
> maybe we can only allocate pinned buffer for each VMID and then when binding
> a PASID to a VMID it also copy back pinned buffer to pasid unpinned copy.

Yeah, PASID assignment will be fun. Not sure whether Jesse's patches will do
this already. We _do_ already have fun with ctx id assignments though, since
we move them around (and the hw id is the ggtt address afaik), so we need to
remap them already. Not sure on the details for PASID mapping; iirc it's a
separate field somewhere in the context struct. Jesse knows the details.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 17:05             ` Daniel Vetter
  (?)
@ 2014-07-21 17:28               ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 17:28 UTC (permalink / raw)
  To: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On 21/07/14 20:05, Daniel Vetter wrote:
> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König wrote:
>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>>
>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>>> restructured with a cleaner history and no
>>>>>>> totally-different-early-versions
>>>>>>> of the code.
>>>>>>>
>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of
>>>>>>> them
>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd
>>>>>>> code.
>>>>>>> There is no code going away or even modified between patches, only
>>>>>>> added.
>>>>>>>
>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside
>>>>>>> under
>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this
>>>>>>> driver
>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>>> generic hsa framework being implemented in the future and in that
>>>>>>> case, we
>>>>>>> will adjust amdkfd to work within that framework.
>>>>>>>
>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want
>>>>>>> to
>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this
>>>>>>> point
>>>>>>> is the radeon driver. Having said that, we will probably need to move
>>>>>>> it
>>>>>>> (maybe to be directly under drm) after we integrate with additional
>>>>>>> AMD gfx
>>>>>>> drivers.
>>>>>>>
>>>>>>> For people who like to review using git, the v2 patch set is located
>>>>>>> at:
>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>>
>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>>
>>>>>> So quick comments before i finish going over all patches. There is many
>>>>>> things that need more documentation espacialy as of right now there is
>>>>>> no userspace i can go look at.
>>>>> So quick comments on some of your questions but first of all, thanks for
>>>>> the time you dedicated to review the code.
>>>>>>
>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>>> that side.
>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>>> objects that are pinned on the gpu memory in our driver. If that is the
>>>>> case, is it still a show stopper ?
>>>>>
>>>>> The kernel objects are:
>>>>> - pipelines (4 per device)
>>>>> - mqd per hiq (only 1 per device)
>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process,
>>>>> for a total of 512K queues. Each mqd is 151 bytes, but the allocation is
>>>>> done in 256 alignment. So total *possible* memory is 128MB
>>>>> - kernel queue (only 1 per device)
>>>>> - fence address for kernel queue
>>>>> - runlists for the CP (1 or 2 per device)
>>>>
>>>> The main questions here are if it's avoid able to pin down the memory and if
>>>> the memory is pinned down at driver load, by request from userspace or by
>>>> anything else.
>>>>
>>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>>> questionable, everything else sounds reasonable.
>>>
>>> Aside, i915 perspective again (i.e. how we solved this): When scheduling
>>> away from contexts we unpin them and put them into the lru. And in the
>>> shrinker we have a last-ditch callback to switch to a default context
>>> (since you can't ever have no context once you've started) which means we
>>> can evict any context object if it's getting in the way.
>>
>> So Intel hardware report through some interrupt or some channel when it is
>> not using a context ? ie kernel side get notification when some user context
>> is done executing ?
> 
> Yes, as long as we do the scheduling with the cpu we get interrupts for
> context switches. The mechanism is already published in the execlist
> patches currently floating around: we get a special context switch
> interrupt.
> 
> But we already have this unpin logic in the current code, where we switch
> contexts through in-line cs commands from the kernel. There we obviously
> use the normal batch completion events.
> 
>> The issue with radeon hardware AFAICT is that the hardware do not report any
>> thing about the userspace context running ie you do not get notification when
>> a context is not use. Well AFAICT. Maybe hardware do provide that.
> 
> I'm not sure whether we can do the same trick with the hw scheduler. But
> then unpinning hw contexts will drain the pipeline anyway, so I guess we
> can just stop feeding the hw scheduler until it runs dry. And then unpin
> and evict.
So, I'm afraid we can't do this for AMD Kaveri, because:

a. The hw scheduler doesn't inform us which queues it is going to
execute next. We feed it a runlist of queues, which can be very large
(we have a test that runs 1000 queues on the same runlist, but we can
put a lot more). All the MQDs of those queues must stay pinned in memory
for as long as the runlist is in effect, and the runlist stays in effect
until a queue is deleted or added (or something more extreme happens,
like the process terminating). A rough sketch of what this forces on the
driver follows below.

b. The hw scheduler takes care of the VMID-to-PASID mapping. We don't
program the ATC registers manually; the internal CP does that
dynamically, so we basically have over-subscription of processes as
well. Therefore, we can't pin MQDs based on VMID binding.

I don't see AMD moving back to SW scheduling, as it doesn't scale well
with the number of processes and queues, and our next-gen APU will have a
lot more queues than we have on KV.
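
To make point (a) concrete, here is a minimal self-contained sketch (none
of this is actual amdkfd code; every name is invented): submitting a
runlist pins every MQD it references, and only tearing the runlist down --
on queue add/remove or process exit -- lets them be unpinned again.

#include <stdio.h>

#define MAX_QUEUES 1024

struct mqd {
	int pinned;	/* the CP may fetch this at any time while set */
};

struct runlist {
	struct mqd *mqds[MAX_QUEUES];
	int count;
};

static void runlist_submit(struct runlist *rl, struct mqd **mqds, int n)
{
	rl->count = n;
	for (int i = 0; i < n; i++) {
		rl->mqds[i] = mqds[i];
		mqds[i]->pinned = 1;	/* must stay resident while in effect */
	}
	/* ...hand the runlist to the CP hw scheduler here... */
}

static void runlist_teardown(struct runlist *rl)
{
	for (int i = 0; i < rl->count; i++)
		rl->mqds[i]->pinned = 0;	/* now evictable again */
	rl->count = 0;
}

int main(void)
{
	struct mqd q0 = {0}, q1 = {0};
	struct mqd *qs[] = { &q0, &q1 };
	struct runlist rl = { .count = 0 };

	runlist_submit(&rl, qs, 2);
	printf("submitted: q0 pinned=%d, q1 pinned=%d\n", q0.pinned, q1.pinned);
	runlist_teardown(&rl);
	printf("torn down: q0 pinned=%d\n", q0.pinned);
	return 0;
}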

	Oded
> 
>> Like the VMID is a limited resources so you have to dynamicly bind them so
>> maybe we can only allocate pinned buffer for each VMID and then when binding
>> a PASID to a VMID it also copy back pinned buffer to pasid unpinned copy.
> 
> Yeah, pasid assignment will be fun. Not sure whether Jesse's patches will
> do this already. We _do_ already have fun with ctx id assignments though
> since we move them around (and the hw id is the ggtt address afaik). So we
> need to remap them already. Not sure on the details for pasid mapping,
> iirc it's a separate field somewhere in the context struct. Jesse knows
> the details.
> -Daniel
> 


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 15:54           ` Jerome Glisse
  (?)
@ 2014-07-21 17:42             ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 17:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel,
	dri-devel, linux-mm

On 21/07/14 18:54, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>> On 21/07/14 16:39, Christian König wrote:
>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>
>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>> of the code.
>>>>>>
>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>
>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>> will adjust amdkfd to work within that framework.
>>>>>>
>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>> drivers.
>>>>>>
>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>
>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>
>>>>> So quick comments before i finish going over all patches. There is many
>>>>> things that need more documentation espacialy as of right now there is
>>>>> no userspace i can go look at.
>>>> So quick comments on some of your questions but first of all, thanks for the
>>>> time you dedicated to review the code.
>>>>>
>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>> that side.
>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>> is it still a show stopper ?
>>>>
>>>> The kernel objects are:
>>>> - pipelines (4 per device)
>>>> - mqd per hiq (only 1 per device)
>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
>>>> 256 alignment. So total *possible* memory is 128MB
>>>> - kernel queue (only 1 per device)
>>>> - fence address for kernel queue
>>>> - runlists for the CP (1 or 2 per device)
>>>
>>> The main questions here are if it's avoid able to pin down the memory and if the
>>> memory is pinned down at driver load, by request from userspace or by anything
>>> else.
>>>
>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>> questionable, everything else sounds reasonable.
>>>
>>> Christian.
>>
>> Most of the pin downs are done on device initialization.
>> The "mqd per userspace" is done per userspace queue creation. However, as I
>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>> memory, I think it is OK.
>> The runlists are also done on userspace queue creation/deletion, but we only
>> have 1 or 2 runlists per device, so it is not that bad.
> 
> 2G local memory ? You can not assume anything on userside configuration some
> one might build an hsa computer with 512M and still expect a functioning
> desktop.
First of all, I'm only considering a Kaveri computer, not a generic "hsa"
computer. Second, I would imagine we can build some protection around it,
like checking the total local memory and limiting the number of queues to
some percentage of that total. So, if someone only has 512M, he will be
able to open fewer queues.
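
As a rough illustration of such a cap (not actual driver code; the 256-byte
figure is the aligned MQD allocation size mentioned earlier in the thread,
and the 1% budget is an arbitrary choice for the example):

#include <stdint.h>
#include <stdio.h>

#define MQD_ALLOC_SIZE	256u	/* 151-byte MQD, allocated at 256-byte alignment */
#define MQD_BUDGET_PCT	1u	/* share of local memory allowed for pinned MQDs */

static unsigned int max_queues_for_vram(uint64_t vram_bytes)
{
	uint64_t budget = vram_bytes * MQD_BUDGET_PCT / 100;

	return (unsigned int)(budget / MQD_ALLOC_SIZE);
}

int main(void)
{
	/* the 2G Kaveri carve-out vs. the 512M system mentioned above */
	printf("2G VRAM   -> up to %u queues\n",
	       max_queues_for_vram(2ull << 30));
	printf("512M VRAM -> up to %u queues\n",
	       max_queues_for_vram(512ull << 20));
	return 0;
}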


> 
> I need to go look into what all this mqd is for, what it does and what it is
> about. But pinning is really bad and this is an issue with userspace command
> scheduling an issue that obviously AMD fails to take into account in design
> phase.
Maybe, but that is the H/W design nonetheless. We can't very well
change the H/W.
	Oded
> 
>> 	Oded
>>>
>>>>>
>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>> stuff there.
>>>>>
>>>>> Given that this is not intended to be final HSA api AFAICT then i would
>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
>>>>> This would avoid crazy communication btw radeon and kfd.
>>>>>
>>>>> The whole aperture business needs some serious explanation. Especialy as
>>>>> you want to use userspace address there is nothing to prevent userspace
>>>>> program from allocating things at address you reserve for lds, scratch,
>>>>> ... only sane way would be to move those lds, scratch inside the virtual
>>>>> address reserved for kernel (see kernel memory map).
>>>>>
>>>>> The whole business of locking performance counter for exclusive per process
>>>>> access is a big NO. Which leads me to the questionable usefullness of user
>>>>> space command ring.
>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>> find it analogous to a situation where a network maintainer nacking a driver
>>>> for a network card, which is slower than a different network card. Doesn't
>>>> seem reasonable this situation is would happen. He would still put both the
>>>> drivers in the kernel because people want to use the H/W and its features. So,
>>>> I don't think this is a valid reason to NACK the driver.
> 
> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
> the performance ioctl.
> 
> Again this is another argument for round trip to the kernel. As inside kernel you
> could properly do exclusive gpu counter access accross single user cmd buffer
> execution.
> 
>>>>
>>>>> I only see issues with that. First and foremost i would
>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
>>>>> overhead that is measurable in any meaning full way against a simple
>>>>> function call. I know the userspace command ring is a big marketing features
>>>>> that please ignorant userspace programmer. But really this only brings issues
>>>>> and for absolutely not upside afaict.
>>>> Really ? You think that doing a context switch to kernel space, with all its
>>>> overhead, is _not_ more expansive than just calling a function in userspace
>>>> which only puts a buffer on a ring and writes a doorbell ?
> 
> I am saying the overhead is not that big and it probably will not matter in most
> usecase. For instance i did wrote the most useless kernel module that add two
> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
> ioctl is 13 times slower.
> 
> Now if there is enough data that shows that a significant percentage of jobs
> submited to the GPU will take less that 0.35microsecond then yes userspace
> scheduling does make sense. But so far all we have is handwaving with no data
> to support any facts.
> 
> 
> Now if we want to schedule from userspace than you will need to do something
> about the pinning, something that gives control to kernel so that kernel can
> unpin when it wants and move object when it wants no matter what userspace is
> doing.
> 
>>>>>
>>>>> So i would rather see a very simple ioctl that write the doorbell and might
>>>>> do more than that in case of ring/queue overcommit where it would first have
>>>>> to wait for a free ring/queue to schedule stuff. This would also allow sane
>>>>> implementation of things like performance counter that could be acquire by
>>>>> kernel for duration of a job submitted by userspace. While still not optimal
>>>>> this would be better that userspace locking.
>>>>>
>>>>>
>>>>> I might have more thoughts once i am done with all the patches.
>>>>>
>>>>> Cheers,
>>>>> Jérôme
>>>>>
>>>>>>
>>>>>> Original Cover Letter:
>>>>>>
>>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver
>>>>>> for radeon-family GPUs.
>>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
>>>>>> system resources more effectively via HW features including shared pageable
>>>>>> memory, userspace-accessible work queues, and platform-level atomics. In
>>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
>>>>>> Islands family of GPUs also performs HW-level validation of commands passed
>>>>>> in through the queues (aka rings).
>>>>>>
>>>>>> The code in this patch set is intended to serve both as a sample driver for
>>>>>> other HSA-compatible hardware devices and as a production driver for
>>>>>> radeon-family processors. The code is architected to support multiple CPUs
>>>>>> each with connected GPUs, although the current implementation focuses on a
>>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
>>>>>> graphics driver (kgd).
>>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
>>>>>> functionality between HSA compute and regular gfx/compute (memory,
>>>>>> interrupts, registers), while other functionality has been added
>>>>>> specifically for HSA compute  (hw scheduler for virtualized compute rings).
>>>>>> All shared hardware is owned by the radeon graphics driver, and an interface
>>>>>> between kfd and kgd allows the kfd to make use of those shared resources,
>>>>>> while HSA-specific functionality is managed directly by kfd by submitting
>>>>>> packets into an HSA-specific command queue (the "HIQ").
>>>>>>
>>>>>> During kfd module initialization a char device node (/dev/kfd) is created
>>>>>> (surviving until module exit), with ioctls for queue creation & management,
>>>>>> and data structures are initialized for managing HSA device topology.
>>>>>> The rest of the initialization is driven by calls from the radeon kgd at the
>>>>>> following points :
>>>>>>
>>>>>> - radeon_init (kfd_init)
>>>>>> - radeon_exit (kfd_fini)
>>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>>>>>> - radeon_driver_unload_kms (kfd_device_fini)
>>>>>>
>>>>>> During the probe and init processing per-device data structures are
>>>>>> established which connect to the associated graphics kernel driver. This
>>>>>> information is exposed to userspace via sysfs, along with a version number
>>>>>> allowing userspace to determine if a topology change has occurred while it
>>>>>> was reading from sysfs.
>>>>>> The interface between kfd and kgd also allows the kfd to request buffer
>>>>>> management services from kgd, and allows kgd to route interrupt requests to
>>>>>> kfd code since the interrupt block is shared between regular
>>>>>> graphics/compute and HSA compute subsystems in the GPU.
>>>>>>
>>>>>> The kfd code works with an open source usermode library ("libhsakmt") which
>>>>>> is in the final stages of IP review and should be published in a separate
>>>>>> repo over the next few days.
>>>>>> The code operates in one of three modes, selectable via the sched_policy
>>>>>> module parameter :
>>>>>>
>>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
>>>>>> CP, and allows oversubscription (more queues than HW slots)
>>>>>> - sched_policy=1 also uses HW scheduling but does not allow
>>>>>> oversubscription, so create_queue requests fail when we run out of HW slots
>>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
>>>>>> queues to HW slots by programming registers
>>>>>>
>>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so
>>>>>> has less test coverage than the other options. Default in the current code
>>>>>> is "HW scheduling without oversubscription" since that is where we have the
>>>>>> most test coverage but we expect to change the default to "HW scheduling
>>>>>> with oversubscription" after further testing. This effectively removes the
>>>>>> HW limit on the number of work queues available to applications.
>>>>>>
>>>>>> Programs running on the GPU are associated with an address space through the
>>>>>> VMID field, which is translated to a unique PASID at access time via a set
>>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
>>>>>> are partitioned (under control of the radeon kgd) between current
>>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The
>>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
>>>>>> and by driver code if HW scheduling is not being used.
>>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
>>>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
>>>>>> dedicated for this purpose, and pages within the doorbell aperture are
>>>>>> mapped to userspace (each page mapped to only one user address space).
>>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
>>>>>> userspace code to safely manage work queues (rings) without requiring a
>>>>>> kernel call for every ring update.
>>>>>> First step for an application process is to open the kfd device. Calls to
>>>>>> open create a kfd "process" structure only for the first thread of the
>>>>>> process. Subsequent open calls are checked to see if they are from processes
>>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process
>>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated
>>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
>>>>>> accessible to the GPU.
>>>>>> Next step is for the application to collect topology information via sysfs.
>>>>>> This gives userspace enough information to be able to identify specific
>>>>>> nodes (processors) in subsequent queue management calls. Application
>>>>>> processes can create queues on multiple processors, and processors support
>>>>>> queues from multiple processes.
>>>>>> At this point the application can create work queues in userspace memory and
>>>>>> pass them through the usermode library to kfd to have them mapped onto HW
>>>>>> queue slots so that commands written to the queues can be executed by the
>>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code
>>>>>> is device-specific.
>>>>>> Written by John Bridgman <John.Bridgman@amd.com>
>>>>>>
>>>>>>
>>>>>> Alexey Skidanov (1):
>>>>>>   amdkfd: Implement the Get Process Aperture IOCTL
>>>>>>
>>>>>> Andrew Lewycky (3):
>>>>>>   amdkfd: Add basic modules to amdkfd
>>>>>>   amdkfd: Add interrupt handling module
>>>>>>   amdkfd: Implement the Set Memory Policy IOCTL
>>>>>>
>>>>>> Ben Goz (8):
>>>>>>   amdkfd: Add queue module
>>>>>>   amdkfd: Add mqd_manager module
>>>>>>   amdkfd: Add kernel queue module
>>>>>>   amdkfd: Add module parameter of scheduling policy
>>>>>>   amdkfd: Add packet manager module
>>>>>>   amdkfd: Add process queue manager module
>>>>>>   amdkfd: Add device queue manager module
>>>>>>   amdkfd: Implement the create/destroy/update queue IOCTLs
>>>>>>
>>>>>> Evgeny Pinchuk (3):
>>>>>>   amdkfd: Add topology module to amdkfd
>>>>>>   amdkfd: Implement the Get Clock Counters IOCTL
>>>>>>   amdkfd: Implement the PMC Acquire/Release IOCTLs
>>>>>>
>>>>>> Oded Gabbay (10):
>>>>>>   mm: Add kfd_process pointer to mm_struct
>>>>>>   drm/radeon: reduce number of free VMIDs and pipes in KV
>>>>>>   drm/radeon/cik: Don't touch int of pipes 1-7
>>>>>>   drm/radeon: Report doorbell configuration to amdkfd
>>>>>>   drm/radeon: adding synchronization for GRBM GFX
>>>>>>   drm/radeon: Add radeon <--> amdkfd interface
>>>>>>   Update MAINTAINERS and CREDITS files with amdkfd info
>>>>>>   amdkfd: Add IOCTL set definitions of amdkfd
>>>>>>   amdkfd: Add amdkfd skeleton driver
>>>>>>   amdkfd: Add binding/unbinding calls to amd_iommu driver
>>>>>>
>>>>>>  CREDITS                                            |    7 +
>>>>>>  MAINTAINERS                                        |   10 +
>>>>>>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>>>>>  drivers/gpu/drm/radeon/Makefile                    |    3 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>>>>>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207
>>>>>> ++++++++++++++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>>>>>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>>>>>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>>>>>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>>>>>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>>>>>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>>>>>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>>>>>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>>>>>  include/linux/mm_types.h                           |   14 +
>>>>>>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>>>>>  43 files changed, 9226 insertions(+), 95 deletions(-)
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>>>>>  create mode 100644 include/uapi/linux/kfd_ioctl.h
>>>>>>
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>>
>>>
>>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-21 17:42             ` Oded Gabbay
  0 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 17:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel,
	dri-devel, linux-mm

On 21/07/14 18:54, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>> On 21/07/14 16:39, Christian König wrote:
>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>
>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>> of the code.
>>>>>>
>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>
>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>> will adjust amdkfd to work within that framework.
>>>>>>
>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>> drivers.
>>>>>>
>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>
>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>
>>>>> So quick comments before i finish going over all patches. There is many
>>>>> things that need more documentation espacialy as of right now there is
>>>>> no userspace i can go look at.
>>>> So quick comments on some of your questions but first of all, thanks for the
>>>> time you dedicated to review the code.
>>>>>
>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>> that side.
>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>> is it still a show stopper ?
>>>>
>>>> The kernel objects are:
>>>> - pipelines (4 per device)
>>>> - mqd per hiq (only 1 per device)
>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
>>>> 256 alignment. So total *possible* memory is 128MB
>>>> - kernel queue (only 1 per device)
>>>> - fence address for kernel queue
>>>> - runlists for the CP (1 or 2 per device)
>>>
>>> The main questions here are if it's avoid able to pin down the memory and if the
>>> memory is pinned down at driver load, by request from userspace or by anything
>>> else.
>>>
>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>> questionable, everything else sounds reasonable.
>>>
>>> Christian.
>>
>> Most of the pin downs are done on device initialization.
>> The "mqd per userspace" is done per userspace queue creation. However, as I
>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>> memory, I think it is OK.
>> The runlists are also done on userspace queue creation/deletion, but we only
>> have 1 or 2 runlists per device, so it is not that bad.
> 
> 2G local memory ? You can not assume anything on userside configuration some
> one might build an hsa computer with 512M and still expect a functioning
> desktop.
First of all, I'm only considering a Kaveri computer, not a generic "hsa"
computer. Second, I would imagine we can build some protection around it,
like checking the total local memory and limiting the number of queues to
some percentage of that total. So, if someone only has 512M, he will be
able to open fewer queues.


> 
> I need to go look into what all this mqd is for, what it does and what it is
> about. But pinning is really bad and this is an issue with userspace command
> scheduling an issue that obviously AMD fails to take into account in design
> phase.
Maybe, but that is the H/W design nonetheless. We can't very well
change the H/W.
	Oded
> 
>> 	Oded
>>>
>>>>>
>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>> stuff there.
>>>>>
>>>>> Given that this is not intended to be final HSA api AFAICT then i would
>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
>>>>> This would avoid crazy communication btw radeon and kfd.
>>>>>
>>>>> The whole aperture business needs some serious explanation. Especialy as
>>>>> you want to use userspace address there is nothing to prevent userspace
>>>>> program from allocating things at address you reserve for lds, scratch,
>>>>> ... only sane way would be to move those lds, scratch inside the virtual
>>>>> address reserved for kernel (see kernel memory map).
>>>>>
>>>>> The whole business of locking performance counter for exclusive per process
>>>>> access is a big NO. Which leads me to the questionable usefullness of user
>>>>> space command ring.
>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>> find it analogous to a situation where a network maintainer nacking a driver
>>>> for a network card, which is slower than a different network card. Doesn't
>>>> seem reasonable this situation is would happen. He would still put both the
>>>> drivers in the kernel because people want to use the H/W and its features. So,
>>>> I don't think this is a valid reason to NACK the driver.
> 
> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
> the performance ioctl.
> 
> Again this is another argument for round trip to the kernel. As inside kernel you
> could properly do exclusive gpu counter access accross single user cmd buffer
> execution.
> 
>>>>
>>>>> I only see issues with that. First and foremost i would
>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
>>>>> overhead that is measurable in any meaning full way against a simple
>>>>> function call. I know the userspace command ring is a big marketing features
>>>>> that please ignorant userspace programmer. But really this only brings issues
>>>>> and for absolutely not upside afaict.
>>>> Really ? You think that doing a context switch to kernel space, with all its
>>>> overhead, is _not_ more expansive than just calling a function in userspace
>>>> which only puts a buffer on a ring and writes a doorbell ?
> 
> I am saying the overhead is not that big and it probably will not matter in most
> usecase. For instance i did wrote the most useless kernel module that add two
> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
> ioctl is 13 times slower.
> 
> Now if there is enough data that shows that a significant percentage of jobs
> submited to the GPU will take less that 0.35microsecond then yes userspace
> scheduling does make sense. But so far all we have is handwaving with no data
> to support any facts.
> 
> 
> Now if we want to schedule from userspace than you will need to do something
> about the pinning, something that gives control to kernel so that kernel can
> unpin when it wants and move object when it wants no matter what userspace is
> doing.
> 
>>>>>
>>>>> So i would rather see a very simple ioctl that write the doorbell and might
>>>>> do more than that in case of ring/queue overcommit where it would first have
>>>>> to wait for a free ring/queue to schedule stuff. This would also allow sane
>>>>> implementation of things like performance counter that could be acquire by
>>>>> kernel for duration of a job submitted by userspace. While still not optimal
>>>>> this would be better that userspace locking.
>>>>>
>>>>>
>>>>> I might have more thoughts once i am done with all the patches.
>>>>>
>>>>> Cheers,
>>>>> Jérôme
>>>>>
>>>>>>
>>>>>> Original Cover Letter:
>>>>>>
>>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver
>>>>>> for radeon-family GPUs.
>>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
>>>>>> system resources more effectively via HW features including shared pageable
>>>>>> memory, userspace-accessible work queues, and platform-level atomics. In
>>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
>>>>>> Islands family of GPUs also performs HW-level validation of commands passed
>>>>>> in through the queues (aka rings).
>>>>>>
>>>>>> The code in this patch set is intended to serve both as a sample driver for
>>>>>> other HSA-compatible hardware devices and as a production driver for
>>>>>> radeon-family processors. The code is architected to support multiple CPUs
>>>>>> each with connected GPUs, although the current implementation focuses on a
>>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
>>>>>> graphics driver (kgd).
>>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
>>>>>> functionality between HSA compute and regular gfx/compute (memory,
>>>>>> interrupts, registers), while other functionality has been added
>>>>>> specifically for HSA compute  (hw scheduler for virtualized compute rings).
>>>>>> All shared hardware is owned by the radeon graphics driver, and an interface
>>>>>> between kfd and kgd allows the kfd to make use of those shared resources,
>>>>>> while HSA-specific functionality is managed directly by kfd by submitting
>>>>>> packets into an HSA-specific command queue (the "HIQ").
>>>>>>
>>>>>> During kfd module initialization a char device node (/dev/kfd) is created
>>>>>> (surviving until module exit), with ioctls for queue creation & management,
>>>>>> and data structures are initialized for managing HSA device topology.
>>>>>> The rest of the initialization is driven by calls from the radeon kgd at the
>>>>>> following points :
>>>>>>
>>>>>> - radeon_init (kfd_init)
>>>>>> - radeon_exit (kfd_fini)
>>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>>>>>> - radeon_driver_unload_kms (kfd_device_fini)
>>>>>>
>>>>>> During the probe and init processing per-device data structures are
>>>>>> established which connect to the associated graphics kernel driver. This
>>>>>> information is exposed to userspace via sysfs, along with a version number
>>>>>> allowing userspace to determine if a topology change has occurred while it
>>>>>> was reading from sysfs.
>>>>>> The interface between kfd and kgd also allows the kfd to request buffer
>>>>>> management services from kgd, and allows kgd to route interrupt requests to
>>>>>> kfd code since the interrupt block is shared between regular
>>>>>> graphics/compute and HSA compute subsystems in the GPU.
>>>>>>
>>>>>> The kfd code works with an open source usermode library ("libhsakmt") which
>>>>>> is in the final stages of IP review and should be published in a separate
>>>>>> repo over the next few days.
>>>>>> The code operates in one of three modes, selectable via the sched_policy
>>>>>> module parameter :
>>>>>>
>>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
>>>>>> CP, and allows oversubscription (more queues than HW slots)
>>>>>> - sched_policy=1 also uses HW scheduling but does not allow
>>>>>> oversubscription, so create_queue requests fail when we run out of HW slots
>>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
>>>>>> queues to HW slots by programming registers
>>>>>>
>>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so
>>>>>> has less test coverage than the other options. Default in the current code
>>>>>> is "HW scheduling without oversubscription" since that is where we have the
>>>>>> most test coverage but we expect to change the default to "HW scheduling
>>>>>> with oversubscription" after further testing. This effectively removes the
>>>>>> HW limit on the number of work queues available to applications.
>>>>>>
>>>>>> Programs running on the GPU are associated with an address space through the
>>>>>> VMID field, which is translated to a unique PASID at access time via a set
>>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
>>>>>> are partitioned (under control of the radeon kgd) between current
>>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The
>>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
>>>>>> and by driver code if HW scheduling is not being used.
>>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
>>>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
>>>>>> dedicated for this purpose, and pages within the doorbell aperture are
>>>>>> mapped to userspace (each page mapped to only one user address space).
>>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
>>>>>> userspace code to safely manage work queues (rings) without requiring a
>>>>>> kernel call for every ring update.
>>>>>> First step for an application process is to open the kfd device. Calls to
>>>>>> open create a kfd "process" structure only for the first thread of the
>>>>>> process. Subsequent open calls are checked to see if they are from processes
>>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process
>>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated
>>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
>>>>>> accessible to the GPU.
>>>>>> Next step is for the application to collect topology information via sysfs.
>>>>>> This gives userspace enough information to be able to identify specific
>>>>>> nodes (processors) in subsequent queue management calls. Application
>>>>>> processes can create queues on multiple processors, and processors support
>>>>>> queues from multiple processes.
>>>>>> At this point the application can create work queues in userspace memory and
>>>>>> pass them through the usermode library to kfd to have them mapped onto HW
>>>>>> queue slots so that commands written to the queues can be executed by the
>>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code
>>>>>> is device-specific.
>>>>>> Written by John Bridgman <John.Bridgman@amd.com>
>>>>>>
>>>>>>
>>>>>> Alexey Skidanov (1):
>>>>>>   amdkfd: Implement the Get Process Aperture IOCTL
>>>>>>
>>>>>> Andrew Lewycky (3):
>>>>>>   amdkfd: Add basic modules to amdkfd
>>>>>>   amdkfd: Add interrupt handling module
>>>>>>   amdkfd: Implement the Set Memory Policy IOCTL
>>>>>>
>>>>>> Ben Goz (8):
>>>>>>   amdkfd: Add queue module
>>>>>>   amdkfd: Add mqd_manager module
>>>>>>   amdkfd: Add kernel queue module
>>>>>>   amdkfd: Add module parameter of scheduling policy
>>>>>>   amdkfd: Add packet manager module
>>>>>>   amdkfd: Add process queue manager module
>>>>>>   amdkfd: Add device queue manager module
>>>>>>   amdkfd: Implement the create/destroy/update queue IOCTLs
>>>>>>
>>>>>> Evgeny Pinchuk (3):
>>>>>>   amdkfd: Add topology module to amdkfd
>>>>>>   amdkfd: Implement the Get Clock Counters IOCTL
>>>>>>   amdkfd: Implement the PMC Acquire/Release IOCTLs
>>>>>>
>>>>>> Oded Gabbay (10):
>>>>>>   mm: Add kfd_process pointer to mm_struct
>>>>>>   drm/radeon: reduce number of free VMIDs and pipes in KV
>>>>>>   drm/radeon/cik: Don't touch int of pipes 1-7
>>>>>>   drm/radeon: Report doorbell configuration to amdkfd
>>>>>>   drm/radeon: adding synchronization for GRBM GFX
>>>>>>   drm/radeon: Add radeon <--> amdkfd interface
>>>>>>   Update MAINTAINERS and CREDITS files with amdkfd info
>>>>>>   amdkfd: Add IOCTL set definitions of amdkfd
>>>>>>   amdkfd: Add amdkfd skeleton driver
>>>>>>   amdkfd: Add binding/unbinding calls to amd_iommu driver
>>>>>>
>>>>>>  CREDITS                                            |    7 +
>>>>>>  MAINTAINERS                                        |   10 +
>>>>>>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>>>>>  drivers/gpu/drm/radeon/Makefile                    |    3 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>>>>>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207
>>>>>> ++++++++++++++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>>>>>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>>>>>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>>>>>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>>>>>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>>>>>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>>>>>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>>>>>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>>>>>  include/linux/mm_types.h                           |   14 +
>>>>>>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>>>>>  43 files changed, 9226 insertions(+), 95 deletions(-)
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>>>>>  create mode 100644 include/uapi/linux/kfd_ioctl.h
>>>>>>
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>>
>>>
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-21 17:42             ` Oded Gabbay
  0 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 17:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Lewycky, Michel Dänzer, linux-kernel, dri-devel,
	linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On 21/07/14 18:54, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>> On 21/07/14 16:39, Christian König wrote:
>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>
>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>> of the code.
>>>>>>
>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>
>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>> will adjust amdkfd to work within that framework.
>>>>>>
>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>> drivers.
>>>>>>
>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>
>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>
>>>>> So, quick comments before I finish going over all the patches. There are
>>>>> many things that need more documentation, especially as right now there is
>>>>> no userspace I can go look at.
>>>> So, quick comments on some of your questions, but first of all, thanks for
>>>> the time you dedicated to reviewing the code.
>>>>>
>>>>> There are a few show stoppers; the biggest one is gpu memory pinning. This
>>>>> is a big no, and it would need serious arguments for any hope of convincing
>>>>> me on that side.
>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>> objects that are pinned in gpu memory in our driver. If that is the case,
>>>> is it still a show stopper?
>>>>
>>>> The kernel objects are:
>>>> - pipelines (4 per device)
>>>> - mqd per hiq (only 1 per device)
>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done with
>>>> 256-byte alignment, so the total *possible* memory is 128MB (see the rough
>>>> arithmetic sketched below the list)
>>>> - kernel queue (only 1 per device)
>>>> - fence address for kernel queue
>>>> - runlists for the CP (1 or 2 per device)
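
For concreteness, the 128MB figure above works out as a simple back-of-the-envelope
bound. The 512-process count here is only an assumption implied by "1K queues per
process, 512K queues total"; none of these constants are taken from the patches:

/* Illustrative bound on pinned MQD memory on KV (assumed numbers from
 * the discussion above, not constants from the driver itself). */
#define ASSUMED_MAX_PROCESSES    512    /* 512K queues / 1K queues per process */
#define ASSUMED_QUEUES_PER_PROC  1024   /* "up to 1K queues per process"       */
#define ASSUMED_MQD_ALLOC_SIZE   256    /* 151-byte mqd, 256-byte aligned      */

/* 512 * 1024 * 256 bytes = 134217728 bytes = 128 MiB */
static const unsigned long max_pinned_mqd_bytes =
	(unsigned long)ASSUMED_MAX_PROCESSES *
	ASSUMED_QUEUES_PER_PROC * ASSUMED_MQD_ALLOC_SIZE;
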
>>>
>>> The main questions here are whether pinning down the memory is avoidable, and
>>> whether the memory is pinned down at driver load, by request from userspace or
>>> by anything else.
>>>
>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>> questionable, everything else sounds reasonable.
>>>
>>> Christian.
>>
>> Most of the pin downs are done on device initialization.
>> The "mqd per userspace queue" pinning is done at userspace queue creation.
>> However, as I said, it has an upper limit of 128MB on KV, and considering the
>> 2G local memory, I think it is OK.
>> The runlist pinning is also done on userspace queue creation/deletion, but we
>> only have 1 or 2 runlists per device, so it is not that bad.
> 
> 2G local memory? You cannot assume anything about the user-side configuration;
> someone might build an hsa computer with 512M and still expect a functioning
> desktop.
First of all, I'm only considering a Kaveri computer, not a generic "hsa"
computer. Second, I would imagine we can build some protection around it,
like checking the total local memory and limiting the number of queues based
on some percentage of that total local memory. So, if someone has only 512M,
he will be able to open fewer queues.
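
A minimal sketch of the kind of guard described here; the function name, the 1%
budget and the min() cap are purely hypothetical, not anything in the patch set:

/* Hypothetical cap on userspace queues derived from local memory size. */
static unsigned int kfd_max_queues_for_vram(unsigned long vram_bytes)
{
	unsigned long budget = vram_bytes / 100;   /* e.g. ~1% of local memory */
	unsigned long max_q  = budget / 256;       /* 256 bytes pinned per mqd */

	return min(max_q, 512UL * 1024);           /* never above the HW total */
}
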


> 
> I need to go look into what all this mqd is for, what it does and what it is
> about. But pinning is really bad, and this is an issue with userspace command
> scheduling, an issue that AMD obviously failed to take into account in the
> design phase.
Maybe, but that is the H/W design nonetheless. We can't very well
change the H/W.
	Oded
> 
>> 	Oded
>>>
>>>>>
>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>> stuff there.
>>>>>
>>>>> Given that this is not intended to be the final HSA api AFAICT, I would say
>>>>> it is far better to avoid the whole kfd module and add ioctls to radeon.
>>>>> This would avoid crazy communication between radeon and kfd.
>>>>>
>>>>> The whole aperture business needs some serious explanation. Especially as
>>>>> you want to use userspace addresses, there is nothing to prevent a userspace
>>>>> program from allocating things at the addresses you reserve for lds, scratch,
>>>>> ... The only sane way would be to move those lds and scratch apertures inside
>>>>> the virtual address range reserved for the kernel (see kernel memory map).
>>>>>
>>>>> The whole business of locking performance counters for exclusive per-process
>>>>> access is a big NO. Which leads me to the questionable usefulness of the
>>>>> userspace command ring.
>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>> find it analogous to a situation where a network maintainer nacks a driver
>>>> for a network card because it is slower than a different network card. It
>>>> doesn't seem reasonable that this situation would happen. He would still put
>>>> both drivers in the kernel because people want to use the H/W and its
>>>> features. So, I don't think this is a valid reason to NACK the driver.
> 
> Let me rephrase: drop the performance counter ioctl and, modulo memory pinning,
> I see no objection. In other words, I am not NACKING the whole patchset, I am
> NACKING the performance counter ioctl.
> 
> Again, this is another argument for a round trip to the kernel, as inside the
> kernel you could properly do exclusive gpu counter access across a single user
> cmd buffer execution.
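
To illustrate what "exclusive counter access across a single cmd buffer" could
look like if submission went through the kernel; every name below is
hypothetical, a sketch rather than anything proposed in the patches:

/* Hypothetical kernel-side submission path that holds the performance
 * counters only for the duration of one user command buffer. */
static int submit_cmd_buf_with_counters(struct kfd_process *p,
					struct cmd_buf *cb)
{
	int r;

	r = acquire_perf_counters(p);	/* exclusive, kernel-arbitrated */
	if (r)
		return r;

	r = submit_cmd_buf(cb);
	if (!r)
		r = wait_cmd_buf_idle(cb);

	release_perf_counters(p);	/* always released by the kernel */
	return r;
}
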
> 
>>>>
>>>>> I only see issues with that. First and foremost, I would
>>>>> need to see solid figures showing that a kernel ioctl or syscall has an
>>>>> overhead that is measurably higher, in any meaningful way, than a simple
>>>>> function call. I know the userspace command ring is a big marketing feature
>>>>> that pleases ignorant userspace programmers. But really this only brings
>>>>> issues and absolutely no upside afaict.
>>>> Really? You think that doing a context switch to kernel space, with all its
>>>> overhead, is _not_ more expensive than just calling a function in userspace
>>>> which only puts a buffer on a ring and writes a doorbell?
> 
> I am saying the overhead is not that big and it probably will not matter in
> most use cases. For instance, I wrote the most useless kernel module, which
> adds two numbers through an ioctl (http://people.freedesktop.org/~glisse/adder.tar);
> it takes ~0.35 microseconds with the ioctl while the plain function call is
> ~0.025 microseconds, so the ioctl is 13 times slower.
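
A rough userspace harness in the spirit of that comparison; the /dev/adder node
and the ioctl number are placeholders, not the actual adder module from the
tarball:

/* Micro-benchmark: ioctl round trip vs. plain function call. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>

#define ADDER_DUMMY _IO('x', 0)		/* hypothetical request number */

static int add(volatile int a, volatile int b) { return a + b; }

static double now_us(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
	const int loops = 1000000;
	int fd = open("/dev/adder", O_RDWR);	/* hypothetical device node */
	volatile int sink = 0;
	double t0, t1;
	int i;

	t0 = now_us();
	for (i = 0; i < loops; i++)
		ioctl(fd, ADDER_DUMMY, 0);	/* measures the syscall round trip */
	t1 = now_us();
	printf("ioctl:    %.3f us per call\n", (t1 - t0) / loops);

	t0 = now_us();
	for (i = 0; i < loops; i++)
		sink += add(i, 1);		/* measures a plain function call */
	t1 = now_us();
	printf("function: %.3f us per call\n", (t1 - t0) / loops);

	return sink == 0 && fd < 0;	/* keep results observable */
}
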
> 
> Now, if there is enough data showing that a significant percentage of jobs
> submitted to the GPU will take less than 0.35 microseconds, then yes, userspace
> scheduling does make sense. But so far all we have is handwaving with no data
> to support any facts.
> 
> 
> Now, if we want to schedule from userspace, then you will need to do something
> about the pinning, something that gives control to the kernel so that the kernel
> can unpin when it wants and move objects when it wants, no matter what userspace
> is doing.
> 
>>>>>
>>>>> So I would rather see a very simple ioctl that writes the doorbell and might
>>>>> do more than that in the ring/queue overcommit case, where it would first
>>>>> have to wait for a free ring/queue before scheduling stuff. This would also
>>>>> allow a sane implementation of things like performance counters, which could
>>>>> be acquired by the kernel for the duration of a job submitted by userspace.
>>>>> While still not optimal, this would be better than userspace locking.
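
A bare-bones sketch of such an ioctl path; the struct, the lookup/wait helpers
and the field names are all hypothetical, not part of the posted patches:

/* "Ring the doorbell for me" ioctl, kernel-arbitrated. */
struct kfd_ioctl_ring_doorbell_args {
	__u32 queue_id;		/* which userspace queue */
	__u32 ring_wptr;	/* new write pointer to publish */
};

static long kfd_ioctl_ring_doorbell(struct file *filep, void *data)
{
	struct kfd_ioctl_ring_doorbell_args *args = data;
	struct queue *q;
	int r;

	q = kfd_lookup_queue(filep, args->queue_id);	/* hypothetical */
	if (!q)
		return -EINVAL;

	if (!kfd_queue_has_hw_slot(q)) {		/* overcommit case */
		r = kfd_wait_for_free_hw_slot(q);	/* hypothetical */
		if (r)
			return r;
	}

	writel(args->ring_wptr, q->doorbell_kernel_ptr);	/* ring it */
	return 0;
}
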
>>>>>
>>>>>
>>>>> I might have more thoughts once i am done with all the patches.
>>>>>
>>>>> Cheers,
>>>>> Jérôme
>>>>>
>>>>>>
>>>>>> Original Cover Letter:
>>>>>>
>>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver
>>>>>> for radeon-family GPUs.
>>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
>>>>>> system resources more effectively via HW features including shared pageable
>>>>>> memory, userspace-accessible work queues, and platform-level atomics. In
>>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
>>>>>> Islands family of GPUs also performs HW-level validation of commands passed
>>>>>> in through the queues (aka rings).
>>>>>>
>>>>>> The code in this patch set is intended to serve both as a sample driver for
>>>>>> other HSA-compatible hardware devices and as a production driver for
>>>>>> radeon-family processors. The code is architected to support multiple CPUs
>>>>>> each with connected GPUs, although the current implementation focuses on a
>>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
>>>>>> graphics driver (kgd).
>>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
>>>>>> functionality between HSA compute and regular gfx/compute (memory,
>>>>>> interrupts, registers), while other functionality has been added
>>>>>> specifically for HSA compute  (hw scheduler for virtualized compute rings).
>>>>>> All shared hardware is owned by the radeon graphics driver, and an interface
>>>>>> between kfd and kgd allows the kfd to make use of those shared resources,
>>>>>> while HSA-specific functionality is managed directly by kfd by submitting
>>>>>> packets into an HSA-specific command queue (the "HIQ").
>>>>>>
>>>>>> During kfd module initialization a char device node (/dev/kfd) is created
>>>>>> (surviving until module exit), with ioctls for queue creation & management,
>>>>>> and data structures are initialized for managing HSA device topology.
>>>>>> The rest of the initialization is driven by calls from the radeon kgd at the
>>>>>> following points :
>>>>>>
>>>>>> - radeon_init (kfd_init)
>>>>>> - radeon_exit (kfd_fini)
>>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>>>>>> - radeon_driver_unload_kms (kfd_device_fini)
>>>>>>
>>>>>> During the probe and init processing per-device data structures are
>>>>>> established which connect to the associated graphics kernel driver. This
>>>>>> information is exposed to userspace via sysfs, along with a version number
>>>>>> allowing userspace to determine if a topology change has occurred while it
>>>>>> was reading from sysfs.
>>>>>> The interface between kfd and kgd also allows the kfd to request buffer
>>>>>> management services from kgd, and allows kgd to route interrupt requests to
>>>>>> kfd code since the interrupt block is shared between regular
>>>>>> graphics/compute and HSA compute subsystems in the GPU.
>>>>>>
>>>>>> The kfd code works with an open source usermode library ("libhsakmt") which
>>>>>> is in the final stages of IP review and should be published in a separate
>>>>>> repo over the next few days.
>>>>>> The code operates in one of three modes, selectable via the sched_policy
>>>>>> module parameter :
>>>>>>
>>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
>>>>>> CP, and allows oversubscription (more queues than HW slots)
>>>>>> - sched_policy=1 also uses HW scheduling but does not allow
>>>>>> oversubscription, so create_queue requests fail when we run out of HW slots
>>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
>>>>>> queues to HW slots by programming registers
>>>>>>
>>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so
>>>>>> has less test coverage than the other options. Default in the current code
>>>>>> is "HW scheduling without oversubscription" since that is where we have the
>>>>>> most test coverage but we expect to change the default to "HW scheduling
>>>>>> with oversubscription" after further testing. This effectively removes the
>>>>>> HW limit on the number of work queues available to applications.
>>>>>>
>>>>>> Programs running on the GPU are associated with an address space through the
>>>>>> VMID field, which is translated to a unique PASID at access time via a set
>>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
>>>>>> are partitioned (under control of the radeon kgd) between current
>>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The
>>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
>>>>>> and by driver code if HW scheduling is not being used.
>>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
>>>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
>>>>>> dedicated for this purpose, and pages within the doorbell aperture are
>>>>>> mapped to userspace (each page mapped to only one user address space).
>>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
>>>>>> userspace code to safely manage work queues (rings) without requiring a
>>>>>> kernel call for every ring update.
>>>>>> First step for an application process is to open the kfd device. Calls to
>>>>>> open create a kfd "process" structure only for the first thread of the
>>>>>> process. Subsequent open calls are checked to see if they are from processes
>>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process
>>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated
>>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
>>>>>> accessible to the GPU.
>>>>>> Next step is for the application to collect topology information via sysfs.
>>>>>> This gives userspace enough information to be able to identify specific
>>>>>> nodes (processors) in subsequent queue management calls. Application
>>>>>> processes can create queues on multiple processors, and processors support
>>>>>> queues from multiple processes.
>>>>>> At this point the application can create work queues in userspace memory and
>>>>>> pass them through the usermode library to kfd to have them mapped onto HW
>>>>>> queue slots so that commands written to the queues can be executed by the
>>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code
>>>>>> is device-specific.
>>>>>> Written by John Bridgman <John.Bridgman@amd.com>
>>>>>>
>>>>>>
>>>>>> Alexey Skidanov (1):
>>>>>>   amdkfd: Implement the Get Process Aperture IOCTL
>>>>>>
>>>>>> Andrew Lewycky (3):
>>>>>>   amdkfd: Add basic modules to amdkfd
>>>>>>   amdkfd: Add interrupt handling module
>>>>>>   amdkfd: Implement the Set Memory Policy IOCTL
>>>>>>
>>>>>> Ben Goz (8):
>>>>>>   amdkfd: Add queue module
>>>>>>   amdkfd: Add mqd_manager module
>>>>>>   amdkfd: Add kernel queue module
>>>>>>   amdkfd: Add module parameter of scheduling policy
>>>>>>   amdkfd: Add packet manager module
>>>>>>   amdkfd: Add process queue manager module
>>>>>>   amdkfd: Add device queue manager module
>>>>>>   amdkfd: Implement the create/destroy/update queue IOCTLs
>>>>>>
>>>>>> Evgeny Pinchuk (3):
>>>>>>   amdkfd: Add topology module to amdkfd
>>>>>>   amdkfd: Implement the Get Clock Counters IOCTL
>>>>>>   amdkfd: Implement the PMC Acquire/Release IOCTLs
>>>>>>
>>>>>> Oded Gabbay (10):
>>>>>>   mm: Add kfd_process pointer to mm_struct
>>>>>>   drm/radeon: reduce number of free VMIDs and pipes in KV
>>>>>>   drm/radeon/cik: Don't touch int of pipes 1-7
>>>>>>   drm/radeon: Report doorbell configuration to amdkfd
>>>>>>   drm/radeon: adding synchronization for GRBM GFX
>>>>>>   drm/radeon: Add radeon <--> amdkfd interface
>>>>>>   Update MAINTAINERS and CREDITS files with amdkfd info
>>>>>>   amdkfd: Add IOCTL set definitions of amdkfd
>>>>>>   amdkfd: Add amdkfd skeleton driver
>>>>>>   amdkfd: Add binding/unbinding calls to amd_iommu driver
>>>>>>
>>>>>>  CREDITS                                            |    7 +
>>>>>>  MAINTAINERS                                        |   10 +
>>>>>>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>>>>>  drivers/gpu/drm/radeon/Makefile                    |    3 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>>>>>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207
>>>>>> ++++++++++++++++++++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>>>>>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>>>>>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>>>>>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>>>>>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>>>>>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>>>>>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>>>>>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>>>>>  include/linux/mm_types.h                           |   14 +
>>>>>>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>>>>>  43 files changed, 9226 insertions(+), 95 deletions(-)
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>>>>>  create mode 100644 include/uapi/linux/kfd_ioctl.h
>>>>>>
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>>
>>>
>>

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 17:42             ` Oded Gabbay
  (?)
@ 2014-07-21 18:14               ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 18:14 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel,
	dri-devel, linux-mm

On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
> On 21/07/14 18:54, Jerome Glisse wrote:
> > On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
> >> On 21/07/14 16:39, Christian König wrote:
> >>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> >>>>>> Forgot to cc mailing list on cover letter. Sorry.
> >>>>>>
> >>>>>> As a continuation to the existing discussion, here is a v2 patch series
> >>>>>> restructured with a cleaner history and no totally-different-early-versions
> >>>>>> of the code.
> >>>>>>
> >>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> >>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
> >>>>>> There is no code going away or even modified between patches, only added.
> >>>>>>
> >>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> >>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> >>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
> >>>>>> generic hsa framework being implemented in the future and in that case, we
> >>>>>> will adjust amdkfd to work within that framework.
> >>>>>>
> >>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
> >>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> >>>>>> contained in its own folder. The amdkfd folder was put under the radeon
> >>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
> >>>>>> is the radeon driver. Having said that, we will probably need to move it
> >>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
> >>>>>> drivers.
> >>>>>>
> >>>>>> For people who like to review using git, the v2 patch set is located at:
> >>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> >>>>>>
> >>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
> >>>>>
> >>>>> So quick comments before i finish going over all patches. There is many
> >>>>> things that need more documentation espacialy as of right now there is
> >>>>> no userspace i can go look at.
> >>>> So quick comments on some of your questions but first of all, thanks for the
> >>>> time you dedicated to review the code.
> >>>>>
> >>>>> There few show stopper, biggest one is gpu memory pinning this is a big
> >>>>> no, that would need serious arguments for any hope of convincing me on
> >>>>> that side.
> >>>> We only do gpu memory pinning for kernel objects. There are no userspace
> >>>> objects that are pinned on the gpu memory in our driver. If that is the case,
> >>>> is it still a show stopper ?
> >>>>
> >>>> The kernel objects are:
> >>>> - pipelines (4 per device)
> >>>> - mqd per hiq (only 1 per device)
> >>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
> >>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
> >>>> 256 alignment. So total *possible* memory is 128MB
> >>>> - kernel queue (only 1 per device)
> >>>> - fence address for kernel queue
> >>>> - runlists for the CP (1 or 2 per device)
> >>>
> >>> The main questions here are if it's avoid able to pin down the memory and if the
> >>> memory is pinned down at driver load, by request from userspace or by anything
> >>> else.
> >>>
> >>> As far as I can see only the "mqd per userspace queue" might be a bit
> >>> questionable, everything else sounds reasonable.
> >>>
> >>> Christian.
> >>
> >> Most of the pin downs are done on device initialization.
> >> The "mqd per userspace" is done per userspace queue creation. However, as I
> >> said, it has an upper limit of 128MB on KV, and considering the 2G local
> >> memory, I think it is OK.
> >> The runlists are also done on userspace queue creation/deletion, but we only
> >> have 1 or 2 runlists per device, so it is not that bad.
> > 
> > 2G local memory ? You can not assume anything on userside configuration some
> > one might build an hsa computer with 512M and still expect a functioning
> > desktop.
> First of all, I'm only considering Kaveri computer, not "hsa" computer.
> Second, I would imagine we can build some protection around it, like
> checking total local memory and limit number of queues based on some
> percentage of that total local memory. So, if someone will have only
> 512M, he will be able to open less queues.
> 
> 
> > 
> > I need to go look into what all this mqd is for, what it does and what it is
> > about. But pinning is really bad and this is an issue with userspace command
> > scheduling an issue that obviously AMD fails to take into account in design
> > phase.
> Maybe, but that is the H/W design non-the-less. We can't very well
> change the H/W.

You cannot change the hardware, but that is not an excuse to let bad design
sneak into the software to work around it. So I would rather penalize bad
hardware design and keep command submission in the kernel, until AMD fixes its
hardware to allow proper scheduling and proper control by the kernel. Because
really, where we want to go is having the GPU closer to a CPU in terms of
scheduling capacity, and once we get there we want the kernel to always be able
to take over and do whatever it wants behind the process's back.

> >>>
> >>>>>
> >>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
> >>>>> stuff there.
> >>>>>
> >>>>> Given that this is not intended to be final HSA api AFAICT then i would
> >>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
> >>>>> This would avoid crazy communication btw radeon and kfd.
> >>>>>
> >>>>> The whole aperture business needs some serious explanation. Especialy as
> >>>>> you want to use userspace address there is nothing to prevent userspace
> >>>>> program from allocating things at address you reserve for lds, scratch,
> >>>>> ... only sane way would be to move those lds, scratch inside the virtual
> >>>>> address reserved for kernel (see kernel memory map).
> >>>>>
> >>>>> The whole business of locking performance counter for exclusive per process
> >>>>> access is a big NO. Which leads me to the questionable usefullness of user
> >>>>> space command ring.
> >>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
> >>>> find it analogous to a situation where a network maintainer nacking a driver
> >>>> for a network card, which is slower than a different network card. Doesn't
> >>>> seem reasonable this situation is would happen. He would still put both the
> >>>> drivers in the kernel because people want to use the H/W and its features. So,
> >>>> I don't think this is a valid reason to NACK the driver.
> > 
> > Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
> > i see no objection. In other word, i am not NACKING whole patchset i am NACKING
> > the performance ioctl.
> > 
> > Again this is another argument for round trip to the kernel. As inside kernel you
> > could properly do exclusive gpu counter access accross single user cmd buffer
> > execution.
> > 
> >>>>
> >>>>> I only see issues with that. First and foremost i would
> >>>>> need to see solid figures that kernel ioctl or syscall has a higher an
> >>>>> overhead that is measurable in any meaning full way against a simple
> >>>>> function call. I know the userspace command ring is a big marketing features
> >>>>> that please ignorant userspace programmer. But really this only brings issues
> >>>>> and for absolutely not upside afaict.
> >>>> Really ? You think that doing a context switch to kernel space, with all its
> >>>> overhead, is _not_ more expansive than just calling a function in userspace
> >>>> which only puts a buffer on a ring and writes a doorbell ?
> > 
> > I am saying the overhead is not that big and it probably will not matter in most
> > usecase. For instance i did wrote the most useless kernel module that add two
> > number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
> > it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
> > ioctl is 13 times slower.
> > 
> > Now if there is enough data that shows that a significant percentage of jobs
> > submited to the GPU will take less that 0.35microsecond then yes userspace
> > scheduling does make sense. But so far all we have is handwaving with no data
> > to support any facts.
> > 
> > 
> > Now if we want to schedule from userspace than you will need to do something
> > about the pinning, something that gives control to kernel so that kernel can
> > unpin when it wants and move object when it wants no matter what userspace is
> > doing.
> > 
> >>>>>
> >>>>> So i would rather see a very simple ioctl that write the doorbell and might
> >>>>> do more than that in case of ring/queue overcommit where it would first have
> >>>>> to wait for a free ring/queue to schedule stuff. This would also allow sane
> >>>>> implementation of things like performance counter that could be acquire by
> >>>>> kernel for duration of a job submitted by userspace. While still not optimal
> >>>>> this would be better that userspace locking.
> >>>>>
> >>>>>
> >>>>> I might have more thoughts once i am done with all the patches.
> >>>>>
> >>>>> Cheers,
> >>>>> Jérôme
> >>>>>
> >>>>>>
> >>>>>> Original Cover Letter:
> >>>>>>
> >>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver
> >>>>>> for radeon-family GPUs.
> >>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
> >>>>>> system resources more effectively via HW features including shared pageable
> >>>>>> memory, userspace-accessible work queues, and platform-level atomics. In
> >>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
> >>>>>> Islands family of GPUs also performs HW-level validation of commands passed
> >>>>>> in through the queues (aka rings).
> >>>>>>
> >>>>>> The code in this patch set is intended to serve both as a sample driver for
> >>>>>> other HSA-compatible hardware devices and as a production driver for
> >>>>>> radeon-family processors. The code is architected to support multiple CPUs
> >>>>>> each with connected GPUs, although the current implementation focuses on a
> >>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
> >>>>>> graphics driver (kgd).
> >>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
> >>>>>> functionality between HSA compute and regular gfx/compute (memory,
> >>>>>> interrupts, registers), while other functionality has been added
> >>>>>> specifically for HSA compute  (hw scheduler for virtualized compute rings).
> >>>>>> All shared hardware is owned by the radeon graphics driver, and an interface
> >>>>>> between kfd and kgd allows the kfd to make use of those shared resources,
> >>>>>> while HSA-specific functionality is managed directly by kfd by submitting
> >>>>>> packets into an HSA-specific command queue (the "HIQ").
> >>>>>>
> >>>>>> During kfd module initialization a char device node (/dev/kfd) is created
> >>>>>> (surviving until module exit), with ioctls for queue creation & management,
> >>>>>> and data structures are initialized for managing HSA device topology.
> >>>>>> The rest of the initialization is driven by calls from the radeon kgd at the
> >>>>>> following points :
> >>>>>>
> >>>>>> - radeon_init (kfd_init)
> >>>>>> - radeon_exit (kfd_fini)
> >>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> >>>>>> - radeon_driver_unload_kms (kfd_device_fini)
> >>>>>>
> >>>>>> During the probe and init processing per-device data structures are
> >>>>>> established which connect to the associated graphics kernel driver. This
> >>>>>> information is exposed to userspace via sysfs, along with a version number
> >>>>>> allowing userspace to determine if a topology change has occurred while it
> >>>>>> was reading from sysfs.
> >>>>>> The interface between kfd and kgd also allows the kfd to request buffer
> >>>>>> management services from kgd, and allows kgd to route interrupt requests to
> >>>>>> kfd code since the interrupt block is shared between regular
> >>>>>> graphics/compute and HSA compute subsystems in the GPU.
> >>>>>>
> >>>>>> The kfd code works with an open source usermode library ("libhsakmt") which
> >>>>>> is in the final stages of IP review and should be published in a separate
> >>>>>> repo over the next few days.
> >>>>>> The code operates in one of three modes, selectable via the sched_policy
> >>>>>> module parameter :
> >>>>>>
> >>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
> >>>>>> CP, and allows oversubscription (more queues than HW slots)
> >>>>>> - sched_policy=1 also uses HW scheduling but does not allow
> >>>>>> oversubscription, so create_queue requests fail when we run out of HW slots
> >>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
> >>>>>> queues to HW slots by programming registers
> >>>>>>
> >>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so
> >>>>>> has less test coverage than the other options. Default in the current code
> >>>>>> is "HW scheduling without oversubscription" since that is where we have the
> >>>>>> most test coverage but we expect to change the default to "HW scheduling
> >>>>>> with oversubscription" after further testing. This effectively removes the
> >>>>>> HW limit on the number of work queues available to applications.
> >>>>>>
> >>>>>> Programs running on the GPU are associated with an address space through the
> >>>>>> VMID field, which is translated to a unique PASID at access time via a set
> >>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
> >>>>>> are partitioned (under control of the radeon kgd) between current
> >>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The
> >>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
> >>>>>> and by driver code if HW scheduling is not being used.
> >>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
> >>>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
> >>>>>> dedicated for this purpose, and pages within the doorbell aperture are
> >>>>>> mapped to userspace (each page mapped to only one user address space).
> >>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
> >>>>>> userspace code to safely manage work queues (rings) without requiring a
> >>>>>> kernel call for every ring update.
> >>>>>> First step for an application process is to open the kfd device. Calls to
> >>>>>> open create a kfd "process" structure only for the first thread of the
> >>>>>> process. Subsequent open calls are checked to see if they are from processes
> >>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process
> >>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated
> >>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
> >>>>>> accessible to the GPU.
> >>>>>> Next step is for the application to collect topology information via sysfs.
> >>>>>> This gives userspace enough information to be able to identify specific
> >>>>>> nodes (processors) in subsequent queue management calls. Application
> >>>>>> processes can create queues on multiple processors, and processors support
> >>>>>> queues from multiple processes.
> >>>>>> At this point the application can create work queues in userspace memory and
> >>>>>> pass them through the usermode library to kfd to have them mapped onto HW
> >>>>>> queue slots so that commands written to the queues can be executed by the
> >>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code
> >>>>>> is device-specific.
> >>>>>> Written by John Bridgman <John.Bridgman@amd.com>
> >>>>>>
> >>>>>>
> >>>>>> Alexey Skidanov (1):
> >>>>>>   amdkfd: Implement the Get Process Aperture IOCTL
> >>>>>>
> >>>>>> Andrew Lewycky (3):
> >>>>>>   amdkfd: Add basic modules to amdkfd
> >>>>>>   amdkfd: Add interrupt handling module
> >>>>>>   amdkfd: Implement the Set Memory Policy IOCTL
> >>>>>>
> >>>>>> Ben Goz (8):
> >>>>>>   amdkfd: Add queue module
> >>>>>>   amdkfd: Add mqd_manager module
> >>>>>>   amdkfd: Add kernel queue module
> >>>>>>   amdkfd: Add module parameter of scheduling policy
> >>>>>>   amdkfd: Add packet manager module
> >>>>>>   amdkfd: Add process queue manager module
> >>>>>>   amdkfd: Add device queue manager module
> >>>>>>   amdkfd: Implement the create/destroy/update queue IOCTLs
> >>>>>>
> >>>>>> Evgeny Pinchuk (3):
> >>>>>>   amdkfd: Add topology module to amdkfd
> >>>>>>   amdkfd: Implement the Get Clock Counters IOCTL
> >>>>>>   amdkfd: Implement the PMC Acquire/Release IOCTLs
> >>>>>>
> >>>>>> Oded Gabbay (10):
> >>>>>>   mm: Add kfd_process pointer to mm_struct
> >>>>>>   drm/radeon: reduce number of free VMIDs and pipes in KV
> >>>>>>   drm/radeon/cik: Don't touch int of pipes 1-7
> >>>>>>   drm/radeon: Report doorbell configuration to amdkfd
> >>>>>>   drm/radeon: adding synchronization for GRBM GFX
> >>>>>>   drm/radeon: Add radeon <--> amdkfd interface
> >>>>>>   Update MAINTAINERS and CREDITS files with amdkfd info
> >>>>>>   amdkfd: Add IOCTL set definitions of amdkfd
> >>>>>>   amdkfd: Add amdkfd skeleton driver
> >>>>>>   amdkfd: Add binding/unbinding calls to amd_iommu driver
> >>>>>>
> >>>>>>  CREDITS                                            |    7 +
> >>>>>>  MAINTAINERS                                        |   10 +
> >>>>>>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
> >>>>>>  drivers/gpu/drm/radeon/Makefile                    |    3 +
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
> >>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
> >>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
> >>>>>>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207
> >>>>>> ++++++++++++++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
> >>>>>>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
> >>>>>>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
> >>>>>>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
> >>>>>>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
> >>>>>>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
> >>>>>>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
> >>>>>>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
> >>>>>>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
> >>>>>>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
> >>>>>>  include/linux/mm_types.h                           |   14 +
> >>>>>>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
> >>>>>>  43 files changed, 9226 insertions(+), 95 deletions(-)
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
> >>>>>>  create mode 100644 include/uapi/linux/kfd_ioctl.h
> >>>>>>
> >>>>>> --
> >>>>>> 1.9.1
> >>>>>>
> >>>>
> >>>
> >>
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-21 18:14               ` Jerome Glisse
  0 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 18:14 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel,
	dri-devel, linux-mm

On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
> On 21/07/14 18:54, Jerome Glisse wrote:
> > On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
> >>> On 21/07/14 16:39, Christian König wrote:
> >>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> >>>>>> Forgot to cc mailing list on cover letter. Sorry.
> >>>>>>
> >>>>>> As a continuation to the existing discussion, here is a v2 patch series
> >>>>>> restructured with a cleaner history and no totally-different-early-versions
> >>>>>> of the code.
> >>>>>>
> >>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> >>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
> >>>>>> There is no code going away or even modified between patches, only added.
> >>>>>>
> >>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> >>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> >>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
> >>>>>> generic hsa framework being implemented in the future and in that case, we
> >>>>>> will adjust amdkfd to work within that framework.
> >>>>>>
> >>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
> >>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> >>>>>> contained in its own folder. The amdkfd folder was put under the radeon
> >>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
> >>>>>> is the radeon driver. Having said that, we will probably need to move it
> >>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
> >>>>>> drivers.
> >>>>>>
> >>>>>> For people who like to review using git, the v2 patch set is located at:
> >>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> >>>>>>
> >>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
> >>>>>
> >>>>> So, quick comments before i finish going over all the patches. There are many
> >>>>> things that need more documentation, especially as right now there is
> >>>>> no userspace i can go look at.
> >>>> So quick comments on some of your questions but first of all, thanks for the
> >>>> time you dedicated to review the code.
> >>>>>
> >>>>> There are a few show stoppers; the biggest one is gpu memory pinning, which is a
> >>>>> big no. It would need serious arguments for any hope of convincing me on
> >>>>> that side.
> >>>> We only do gpu memory pinning for kernel objects. There are no userspace
> >>>> objects that are pinned on the gpu memory in our driver. If that is the case,
> >>>> is it still a show stopper ?
> >>>>
> >>>> The kernel objects are:
> >>>> - pipelines (4 per device)
> >>>> - mqd per hiq (only 1 per device)
> >>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
> >>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done with
> >>>> 256-byte alignment. So the total *possible* memory is 128MB
> >>>> - kernel queue (only 1 per device)
> >>>> - fence address for kernel queue
> >>>> - runlists for the CP (1 or 2 per device)
> >>>
> >>> The main questions here are whether it's avoidable to pin down the memory, and whether
> >>> the memory is pinned down at driver load, by request from userspace or by anything
> >>> else.
> >>>
> >>> As far as I can see only the "mqd per userspace queue" might be a bit
> >>> questionable, everything else sounds reasonable.
> >>>
> >>> Christian.
> >>
> >> Most of the pin downs are done on device initialization.
> >> The "mqd per userspace" is done per userspace queue creation. However, as I
> >> said, it has an upper limit of 128MB on KV, and considering the 2G local
> >> memory, I think it is OK.
> >> The runlists are also done on userspace queue creation/deletion, but we only
> >> have 1 or 2 runlists per device, so it is not that bad.
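
[Aside: a quick sanity check of the 128MB upper bound quoted above, as a
minimal standalone C program. The inputs (512K queues total, each 151-byte
MQD padded to a 256-byte allocation) are taken from this thread; nothing
here is from the patch set itself.]

#include <stdio.h>

int main(void)
{
	unsigned long total_queues   = 512 * 1024; /* 1K queues x 512 processes   */
	unsigned long mqd_alloc_size = 256;        /* 151-byte MQD, 256-byte slot */
	unsigned long bytes = total_queues * mqd_alloc_size;

	printf("worst-case pinned MQD memory: %lu MB\n", bytes >> 20); /* 128 MB */
	return 0;
}
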
> > 
> > 2G local memory ? You can not assume anything about the user-side configuration;
> > someone might build an hsa computer with 512M and still expect a functioning
> > desktop.
> First of all, I'm only considering a Kaveri computer, not an "hsa" computer.
> Second, I would imagine we can build some protection around it, like
> checking the total local memory and limiting the number of queues based on some
> percentage of that total local memory. So, if someone has only
> 512M, he will be able to open fewer queues.
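
[Aside: a hypothetical sketch of the kind of protection described above --
capping queues to what a fraction of local memory can back with pinned MQDs.
The helper name and the ~5% budget are made up for illustration; nothing
like this exists in the patch set.]

#include <stdio.h>

/* Cap user queues so their pinned MQDs fit in a small slice of local memory. */
static unsigned long max_queues_for(unsigned long vram_bytes)
{
	const unsigned long mqd_alloc_size = 256;      /* bytes pinned per queue */
	const unsigned long budget = vram_bytes / 20;  /* ~5% of local memory    */

	return budget / mqd_alloc_size;
}

int main(void)
{
	printf("512M local memory -> %lu queues\n", max_queues_for(512UL << 20));
	printf("  2G local memory -> %lu queues\n", max_queues_for(2UL << 30));
	return 0;
}
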
> 
> 
> > 
> > I need to go look into what all this mqd is for, what it does and what it is
> > about. But pinning is really bad, and this is an issue with userspace command
> > scheduling, an issue that AMD obviously failed to take into account in the design
> > phase.
> Maybe, but that is the H/W design nonetheless. We can't very well
> change the H/W.

You can not change the hardware, but that is not an excuse to let bad design
sneak into software to work around it. So i would rather penalize bad hardware
design and have command submission in the kernel, until AMD fixes its hardware to
allow proper scheduling and proper control by the kernel. Because really where we
want to go is having the GPU closer to a CPU in terms of scheduling capacity, and
once we get there we want the kernel to always be able to take over and do
whatever it wants behind the process's back.

> >>>
> >>>>>
> >>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
> >>>>> stuff there.
> >>>>>
> >>>>> Given that this is not intended to be final HSA api AFAICT then i would
> >>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
> >>>>> This would avoid crazy communication btw radeon and kfd.
> >>>>>
> >>>>> The whole aperture business needs some serious explanation. Especially as
> >>>>> you want to use userspace addresses, there is nothing to prevent a userspace
> >>>>> program from allocating things at the addresses you reserve for lds, scratch,
> >>>>> ... the only sane way would be to move those lds and scratch apertures inside the
> >>>>> virtual address range reserved for the kernel (see the kernel memory map).
> >>>>>
> >>>>> The whole business of locking the performance counters for exclusive per-process
> >>>>> access is a big NO. Which leads me to the questionable usefulness of the user
> >>>>> space command ring.
> >>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
> >>>> find it analogous to a situation where a network maintainer NACKs a driver
> >>>> for a network card because it is slower than a different network card. It doesn't
> >>>> seem reasonable that this situation would happen. He would still put both
> >>>> drivers in the kernel because people want to use the H/W and its features. So,
> >>>> I don't think this is a valid reason to NACK the driver.
> > 
> > Let me rephrase: drop the performance counter ioctl and, modulo memory pinning,
> > i see no objection. In other words, i am not NACKING the whole patchset, i am NACKING
> > the performance ioctl.
> > 
> > Again, this is another argument for a round trip to the kernel, as inside the kernel you
> > could properly do exclusive gpu counter access across a single user cmd buffer
> > execution.
> > 
> >>>>
> >>>>> I only see issues with that. First and foremost i would
> >>>>> need to see solid figures showing that a kernel ioctl or syscall has an
> >>>>> overhead that is measurable in any meaningful way against a simple
> >>>>> function call. I know the userspace command ring is a big marketing feature
> >>>>> that pleases ignorant userspace programmers. But really this only brings issues
> >>>>> and absolutely no upside afaict.
> >>>> Really ? You think that doing a context switch to kernel space, with all its
> >>>> overhead, is _not_ more expensive than just calling a function in userspace
> >>>> which only puts a buffer on a ring and writes a doorbell ?
> > 
> > I am saying the overhead is not that big and it probably will not matter in most
> > usecases. For instance i wrote the most useless kernel module, one that adds two
> > numbers through an ioctl (http://people.freedesktop.org/~glisse/adder.tar), and
> > it takes ~0.35 microseconds with the ioctl while the plain function is
> > ~0.025 microseconds, so the ioctl is 13 times slower.
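
[Aside: the adder module from the link above is not reproduced here, but the
userspace side of such a measurement can be sketched roughly as follows. The
/dev/adder node, the adder_args struct and the ADDER_IOC_ADD ioctl number are
assumptions matching the description, not a real interface.]

#include <fcntl.h>
#include <linux/ioctl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

struct adder_args { int a, b, result; };                /* hypothetical uapi     */
#define ADDER_IOC_ADD _IOWR('A', 0, struct adder_args)  /* hypothetical ioctl nr */

int main(void)
{
	struct adder_args args = { .a = 2, .b = 3 };
	struct timespec t0, t1;
	long loops = 1000000, i;
	int fd = open("/dev/adder", O_RDWR);

	if (fd < 0) {
		perror("open /dev/adder");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < loops; i++)
		ioctl(fd, ADDER_IOC_ADD, &args);        /* round trip into the kernel */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%.3f us per ioctl\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / loops / 1e3);
	close(fd);
	return 0;
}
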
> > 
> > Now if there is enough data that shows that a significant percentage of jobs
> > submitted to the GPU will take less than 0.35 microseconds, then yes, userspace
> > scheduling does make sense. But so far all we have is handwaving with no data
> > to support any facts.
> > 
> > 
> > Now if we want to schedule from userspace then you will need to do something
> > about the pinning, something that gives control to the kernel so that the kernel can
> > unpin when it wants and move objects when it wants, no matter what userspace is
> > doing.
> > 
> >>>>>
> >>>>> So i would rather see a very simple ioctl that writes the doorbell and might
> >>>>> do more than that in the case of ring/queue overcommit, where it would first have
> >>>>> to wait for a free ring/queue to schedule stuff. This would also allow a sane
> >>>>> implementation of things like performance counters that could be acquired by the
> >>>>> kernel for the duration of a job submitted by userspace. While still not optimal,
> >>>>> this would be better than userspace locking.
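
[Aside: purely to make that suggestion concrete, a hypothetical uapi for such a
"write the doorbell for me" ioctl might look like the sketch below. This is not
from include/uapi/linux/kfd_ioctl.h in the patch set; every name here is
invented for illustration.]

#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical: userspace asks the kernel to publish a new write pointer. */
struct kfd_ring_doorbell_args {
	__u32 queue_id;   /* which user queue to kick            */
	__u32 ring_wptr;  /* new write pointer to hand to the HW */
};

#define KFD_IOC_RING_DOORBELL \
	_IOW('K', 0x40, struct kfd_ring_doorbell_args)

/*
 * The handler could then validate queue_id, wait for a free HW queue slot if
 * the rings are overcommitted, optionally hold the performance counters for
 * the duration of the submission, and only then write the doorbell register
 * on the process's behalf.
 */
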
> >>>>>
> >>>>>
> >>>>> I might have more thoughts once i am done with all the patches.
> >>>>>
> >>>>> Cheers,
> >>>>> Jérôme
> >>>>>
> >>>>>>
> >>>>>> Original Cover Letter:
> >>>>>>
> >>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver
> >>>>>> for radeon-family GPUs.
> >>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
> >>>>>> system resources more effectively via HW features including shared pageable
> >>>>>> memory, userspace-accessible work queues, and platform-level atomics. In
> >>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
> >>>>>> Islands family of GPUs also performs HW-level validation of commands passed
> >>>>>> in through the queues (aka rings).
> >>>>>>
> >>>>>> The code in this patch set is intended to serve both as a sample driver for
> >>>>>> other HSA-compatible hardware devices and as a production driver for
> >>>>>> radeon-family processors. The code is architected to support multiple CPUs
> >>>>>> each with connected GPUs, although the current implementation focuses on a
> >>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
> >>>>>> graphics driver (kgd).
> >>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
> >>>>>> functionality between HSA compute and regular gfx/compute (memory,
> >>>>>> interrupts, registers), while other functionality has been added
> >>>>>> specifically for HSA compute  (hw scheduler for virtualized compute rings).
> >>>>>> All shared hardware is owned by the radeon graphics driver, and an interface
> >>>>>> between kfd and kgd allows the kfd to make use of those shared resources,
> >>>>>> while HSA-specific functionality is managed directly by kfd by submitting
> >>>>>> packets into an HSA-specific command queue (the "HIQ").
> >>>>>>
> >>>>>> During kfd module initialization a char device node (/dev/kfd) is created
> >>>>>> (surviving until module exit), with ioctls for queue creation & management,
> >>>>>> and data structures are initialized for managing HSA device topology.
> >>>>>> The rest of the initialization is driven by calls from the radeon kgd at the
> >>>>>> following points :
> >>>>>>
> >>>>>> - radeon_init (kfd_init)
> >>>>>> - radeon_exit (kfd_fini)
> >>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> >>>>>> - radeon_driver_unload_kms (kfd_device_fini)
> >>>>>>
> >>>>>> During the probe and init processing per-device data structures are
> >>>>>> established which connect to the associated graphics kernel driver. This
> >>>>>> information is exposed to userspace via sysfs, along with a version number
> >>>>>> allowing userspace to determine if a topology change has occurred while it
> >>>>>> was reading from sysfs.
> >>>>>> The interface between kfd and kgd also allows the kfd to request buffer
> >>>>>> management services from kgd, and allows kgd to route interrupt requests to
> >>>>>> kfd code since the interrupt block is shared between regular
> >>>>>> graphics/compute and HSA compute subsystems in the GPU.
> >>>>>>
> >>>>>> The kfd code works with an open source usermode library ("libhsakmt") which
> >>>>>> is in the final stages of IP review and should be published in a separate
> >>>>>> repo over the next few days.
> >>>>>> The code operates in one of three modes, selectable via the sched_policy
> >>>>>> module parameter :
> >>>>>>
> >>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
> >>>>>> CP, and allows oversubscription (more queues than HW slots)
> >>>>>> - sched_policy=1 also uses HW scheduling but does not allow
> >>>>>> oversubscription, so create_queue requests fail when we run out of HW slots
> >>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
> >>>>>> queues to HW slots by programming registers
> >>>>>>
> >>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so
> >>>>>> has less test coverage than the other options. Default in the current code
> >>>>>> is "HW scheduling without oversubscription" since that is where we have the
> >>>>>> most test coverage but we expect to change the default to "HW scheduling
> >>>>>> with oversubscription" after further testing. This effectively removes the
> >>>>>> HW limit on the number of work queues available to applications.
> >>>>>>
> >>>>>> Programs running on the GPU are associated with an address space through the
> >>>>>> VMID field, which is translated to a unique PASID at access time via a set
> >>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
> >>>>>> are partitioned (under control of the radeon kgd) between current
> >>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The
> >>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
> >>>>>> and by driver code if HW scheduling is not being used.
> >>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
> >>>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
> >>>>>> dedicated for this purpose, and pages within the doorbell aperture are
> >>>>>> mapped to userspace (each page mapped to only one user address space).
> >>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
> >>>>>> userspace code to safely manage work queues (rings) without requiring a
> >>>>>> kernel call for every ring update.
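
[Aside: a conceptual sketch of what the doorbell mechanism buys userspace.
All names below are made up; the real submission path belongs to the usermode
library, and the actual packet format (AQL/PM4) is not shown.]

#include <stdint.h>

struct user_queue {
	uint64_t          *ring;       /* ring buffer in ordinary user memory   */
	uint32_t           ring_mask;  /* ring size in entries, minus one       */
	uint32_t           wptr;       /* software write pointer                */
	volatile uint32_t *doorbell;   /* one slot in the mmapped doorbell page */
};

static void submit(struct user_queue *q, uint64_t packet)
{
	q->ring[q->wptr & q->ring_mask] = packet;  /* 1. place the command       */
	__sync_synchronize();                      /* 2. make it visible to HW   */
	*q->doorbell = ++q->wptr;                  /* 3. ring the doorbell; HW   */
	                                           /*    picks it up, no syscall */
}
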
> >>>>>> First step for an application process is to open the kfd device. Calls to
> >>>>>> open create a kfd "process" structure only for the first thread of the
> >>>>>> process. Subsequent open calls are checked to see if they are from processes
> >>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process
> >>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated
> >>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
> >>>>>> accessible to the GPU.
> >>>>>> Next step is for the application to collect topology information via sysfs.
> >>>>>> This gives userspace enough information to be able to identify specific
> >>>>>> nodes (processors) in subsequent queue management calls. Application
> >>>>>> processes can create queues on multiple processors, and processors support
> >>>>>> queues from multiple processes.
> >>>>>> At this point the application can create work queues in userspace memory and
> >>>>>> pass them through the usermode library to kfd to have them mapped onto HW
> >>>>>> queue slots so that commands written to the queues can be executed by the
> >>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code
> >>>>>> is device-specific.
> >>>>>> Written by John Bridgman <John.Bridgman@amd.com>
> >>>>>>
> >>>>>>
> >>>>>> Alexey Skidanov (1):
> >>>>>>   amdkfd: Implement the Get Process Aperture IOCTL
> >>>>>>
> >>>>>> Andrew Lewycky (3):
> >>>>>>   amdkfd: Add basic modules to amdkfd
> >>>>>>   amdkfd: Add interrupt handling module
> >>>>>>   amdkfd: Implement the Set Memory Policy IOCTL
> >>>>>>
> >>>>>> Ben Goz (8):
> >>>>>>   amdkfd: Add queue module
> >>>>>>   amdkfd: Add mqd_manager module
> >>>>>>   amdkfd: Add kernel queue module
> >>>>>>   amdkfd: Add module parameter of scheduling policy
> >>>>>>   amdkfd: Add packet manager module
> >>>>>>   amdkfd: Add process queue manager module
> >>>>>>   amdkfd: Add device queue manager module
> >>>>>>   amdkfd: Implement the create/destroy/update queue IOCTLs
> >>>>>>
> >>>>>> Evgeny Pinchuk (3):
> >>>>>>   amdkfd: Add topology module to amdkfd
> >>>>>>   amdkfd: Implement the Get Clock Counters IOCTL
> >>>>>>   amdkfd: Implement the PMC Acquire/Release IOCTLs
> >>>>>>
> >>>>>> Oded Gabbay (10):
> >>>>>>   mm: Add kfd_process pointer to mm_struct
> >>>>>>   drm/radeon: reduce number of free VMIDs and pipes in KV
> >>>>>>   drm/radeon/cik: Don't touch int of pipes 1-7
> >>>>>>   drm/radeon: Report doorbell configuration to amdkfd
> >>>>>>   drm/radeon: adding synchronization for GRBM GFX
> >>>>>>   drm/radeon: Add radeon <--> amdkfd interface
> >>>>>>   Update MAINTAINERS and CREDITS files with amdkfd info
> >>>>>>   amdkfd: Add IOCTL set definitions of amdkfd
> >>>>>>   amdkfd: Add amdkfd skeleton driver
> >>>>>>   amdkfd: Add binding/unbinding calls to amd_iommu driver
> >>>>>>
> >>>>>>  CREDITS                                            |    7 +
> >>>>>>  MAINTAINERS                                        |   10 +
> >>>>>>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
> >>>>>>  drivers/gpu/drm/radeon/Makefile                    |    3 +
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
> >>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
> >>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
> >>>>>>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207
> >>>>>> ++++++++++++++++++++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
> >>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
> >>>>>>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
> >>>>>>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
> >>>>>>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
> >>>>>>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
> >>>>>>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
> >>>>>>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
> >>>>>>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
> >>>>>>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
> >>>>>>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
> >>>>>>  include/linux/mm_types.h                           |   14 +
> >>>>>>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
> >>>>>>  43 files changed, 9226 insertions(+), 95 deletions(-)
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
> >>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
> >>>>>>  create mode 100644 include/uapi/linux/kfd_ioctl.h
> >>>>>>
> >>>>>> --
> >>>>>> 1.9.1
> >>>>>>
> >>>>
> >>>
> >>
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 17:28               ` Oded Gabbay
  (?)
@ 2014-07-21 18:22                 ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-21 18:22 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote:
>> I'm not sure whether we can do the same trick with the hw scheduler. But
>> then unpinning hw contexts will drain the pipeline anyway, so I guess we
>> can just stop feeding the hw scheduler until it runs dry. And then unpin
>> and evict.
> So, I'm afraid we can't do this for AMD Kaveri because:

Well as long as you can drain the hw scheduler queue (and you can do
that, worst case you have to unmap all the doorbells and other stuff
to intercept further submission from userspace) you can evict stuff.
And if we don't want compute to be a denial of service on the display
side of the driver we need this ability. Now if you go through an
ioctl instead of the doorbell (I agree with Jerome here, the doorbell
should be supported by benchmarks on linux) this gets a bit easier,
but it's not a requirement really.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
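
[Aside: in rough C, the drain-and-evict sequence sketched above would be along
these lines. Every type and helper is a hypothetical placeholder; none of them
come from radeon or amdkfd.]

struct compute_dev;

void stop_feeding_hw_scheduler(struct compute_dev *dev);
void unmap_doorbell_pages(struct compute_dev *dev);
void wait_for_hw_scheduler_idle(struct compute_dev *dev);
void unpin_and_evict_buffers(struct compute_dev *dev);

int evict_compute_memory(struct compute_dev *dev)
{
	stop_feeding_hw_scheduler(dev);   /* no new runlists to the HW scheduler    */
	unmap_doorbell_pages(dev);        /* intercept further userspace submission */
	wait_for_hw_scheduler_idle(dev);  /* let the pipeline run dry               */
	unpin_and_evict_buffers(dev);     /* memory can now be unpinned and moved   */
	return 0;
}
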

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 18:14               ` Jerome Glisse
  (?)
@ 2014-07-21 18:36                 ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 18:36 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, David Airlie, Alex Deucher, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, Evgeny Pinchuk, linux-kernel,
	dri-devel, linux-mm

On 21/07/14 21:14, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
>> On 21/07/14 18:54, Jerome Glisse wrote:
>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>>>> On 21/07/14 16:39, Christian König wrote:
>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>>>
>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>>>> of the code.
>>>>>>>>
>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>>>
>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>>>> will adjust amdkfd to work within that framework.
>>>>>>>>
>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>>>> keep it as a separate driver from radeon. Therefore, the amdkfd code is
>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>>>> drivers.
>>>>>>>>
>>>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>>>
>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>>>
>>>>>>> So, quick comments before i finish going over all the patches. There are many
>>>>>>> things that need more documentation, especially as right now there is
>>>>>>> no userspace i can go look at.
>>>>>> So quick comments on some of your questions but first of all, thanks for the
>>>>>> time you dedicated to review the code.
>>>>>>>
>>>>>>> There are a few show stoppers; the biggest one is gpu memory pinning, which is a
>>>>>>> big no. It would need serious arguments for any hope of convincing me on
>>>>>>> that side.
>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>>>> is it still a show stopper ?
>>>>>>
>>>>>> The kernel objects are:
>>>>>> - pipelines (4 per device)
>>>>>> - mqd per hiq (only 1 per device)
>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done with
>>>>>> 256-byte alignment. So the total *possible* memory is 128MB
>>>>>> - kernel queue (only 1 per device)
>>>>>> - fence address for kernel queue
>>>>>> - runlists for the CP (1 or 2 per device)
>>>>>
>>>>> The main questions here are whether it's avoidable to pin down the memory, and whether
>>>>> the memory is pinned down at driver load, by request from userspace or by anything
>>>>> else.
>>>>>
>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>>>> questionable, everything else sounds reasonable.
>>>>>
>>>>> Christian.
>>>>
>>>> Most of the pin downs are done on device initialization.
>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>>>> memory, I think it is OK.
>>>> The runlists are also done on userspace queue creation/deletion, but we only
>>>> have 1 or 2 runlists per device, so it is not that bad.
>>>
>>> 2G local memory ? You can not assume anything about the user-side configuration;
>>> someone might build an hsa computer with 512M and still expect a functioning
>>> desktop.
>> First of all, I'm only considering a Kaveri computer, not an "hsa" computer.
>> Second, I would imagine we can build some protection around it, like
>> checking the total local memory and limiting the number of queues based on some
>> percentage of that total local memory. So, if someone has only
>> 512M, he will be able to open fewer queues.
>>
>>
>>>
>>> I need to go look into what all this mqd is for, what it does and what it is
>>> about. But pinning is really bad, and this is an issue with userspace command
>>> scheduling, an issue that AMD obviously failed to take into account in the design
>>> phase.
>> Maybe, but that is the H/W design nonetheless. We can't very well
>> change the H/W.
> 
> You can not change the hardware, but that is not an excuse to let bad design
> sneak into software to work around it. So i would rather penalize bad hardware
> design and have command submission in the kernel, until AMD fixes its hardware to
> allow proper scheduling and proper control by the kernel.
I'm sorry but I do *not* think this is a bad design. S/W scheduling in
the kernel can not, IMO, scale well to 100K queues and 10K processes.

> Because really where we want to go is having the GPU closer to a CPU in terms of
> scheduling capacity, and once we get there we want the kernel to always be able to
> take over and do whatever it wants behind the process's back.
Who do you refer to when you say "we" ? AFAIK, the hw scheduling
direction is where AMD is now and where it is heading in the future.
That doesn't preclude the option to allow the kernel to take over and do
what it wants. I agree that in KV we have a problem where we can't do
mid-wave preemption, so theoretically a long-running compute kernel can
make things messy, but in Carrizo we will have this ability. Having
said that, it will only be through the CP H/W scheduling. So AMD is
_not_ going to abandon H/W scheduling. You can dislike it, but this is
the situation.
> 
>>>>>
>>>>>>>
>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>>>> stuff there.
>>>>>>>
>>>>>>> Given that this is not intended to be the final HSA api AFAICT, I would
>>>>>>> say it is far better to avoid the whole kfd module and add ioctls to
>>>>>>> radeon. This would avoid crazy communication between radeon and kfd.
>>>>>>>
>>>>>>> The whole aperture business needs some serious explanation. Especially as
>>>>>>> you want to use userspace addresses, there is nothing to prevent a
>>>>>>> userspace program from allocating things at the addresses you reserve for
>>>>>>> lds, scratch, ... The only sane way would be to move those lds and scratch
>>>>>>> apertures inside the virtual address range reserved for the kernel (see
>>>>>>> kernel memory map).
>>>>>>>
>>>>>>> The whole business of locking performance counters for exclusive
>>>>>>> per-process access is a big NO, which leads me to the questionable
>>>>>>> usefulness of the userspace command ring.
>>>>>> That's like saying: "Which leads me to the questionable usefulness of
>>>>>> HSA". I find it analogous to a situation where a network maintainer NACKs
>>>>>> a driver for a network card because it is slower than a different network
>>>>>> card. It doesn't seem reasonable that this would happen; he would still
>>>>>> put both drivers in the kernel because people want to use the H/W and its
>>>>>> features. So, I don't think this is a valid reason to NACK the driver.
>>>
>>> Let me rephrase: drop the performance counter ioctl and, modulo memory
>>> pinning, I see no objection. In other words, I am not NACKing the whole
>>> patchset, I am NACKing the performance ioctl.
>>>
>>> Again, this is another argument for a round trip to the kernel, as inside
>>> the kernel you could properly do exclusive gpu counter access across a
>>> single user cmd buffer execution.
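>>>
>>> Conceptually, something like the following (hypothetical names, only to
>>> illustrate the flow the kernel could enforce):
>>>
>>>   static int submit_with_counters(struct kfd_dev *dev, struct job *job)
>>>   {
>>>           int r;
>>>
>>>           r = acquire_gpu_counters(dev);    /* exclusive for this job */
>>>           if (r)
>>>                   return r;
>>>           submit_user_cmd_buffer(dev, job);
>>>           wait_for_job_fence(job);
>>>           release_gpu_counters(dev);        /* free for the next job */
>>>           return 0;
>>>   }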
>>>
>>>>>>
>>>>>>> I only see issues with that. First and foremost, I would need to see
>>>>>>> solid figures showing that a kernel ioctl or syscall has an overhead that
>>>>>>> is measurably higher, in any meaningful way, than a simple function call.
>>>>>>> I know the userspace command ring is a big marketing feature that pleases
>>>>>>> ignorant userspace programmers. But really, this only brings issues and
>>>>>>> absolutely no upside afaict.
>>>>>> Really? You think that doing a context switch to kernel space, with all
>>>>>> its overhead, is _not_ more expensive than just calling a function in
>>>>>> userspace which only puts a buffer on a ring and writes a doorbell?
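>>>>>>
>>>>>> For reference, the userspace fast path being argued about boils down to
>>>>>> something like this (illustrative only, not the actual libhsakmt code):
>>>>>>
>>>>>>   #include <stdint.h>
>>>>>>
>>>>>>   struct user_queue {
>>>>>>           uint64_t *ring;               /* ring buffer in user memory */
>>>>>>           uint32_t size;                /* slots, power of two */
>>>>>>           uint32_t wptr;                /* write index */
>>>>>>           volatile uint32_t *doorbell;  /* mmap()ed doorbell BAR page */
>>>>>>   };
>>>>>>
>>>>>>   static void submit(struct user_queue *q, uint64_t pkt)
>>>>>>   {
>>>>>>           q->ring[q->wptr & (q->size - 1)] = pkt; /* place packet */
>>>>>>           q->wptr++;
>>>>>>           /* a write barrier belongs here on real hardware */
>>>>>>           *q->doorbell = q->wptr;  /* ring doorbell, no syscall */
>>>>>>   }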
>>>
>>> I am saying the overhead is not that big and it probably will not matter
>>> in most usecases. For instance, I wrote the most useless kernel module
>>> possible, one that adds two numbers through an ioctl
>>> (http://people.freedesktop.org/~glisse/adder.tar); it takes
>>> ~0.35 microseconds with the ioctl while the plain function takes
>>> ~0.025 microseconds, so the ioctl is about 13 times slower.
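>>>
>>> The measurement is easy to reproduce, roughly like this (the ioctl request
>>> code is whatever the test module defines, passed in as a parameter here):
>>>
>>>   #include <time.h>
>>>   #include <sys/ioctl.h>
>>>
>>>   #define LOOPS 1000000
>>>
>>>   /* fd: open handle to the test module's device node */
>>>   static double avg_ioctl_ns(int fd, unsigned long request, int args[2])
>>>   {
>>>           struct timespec t0, t1;
>>>           int i;
>>>
>>>           clock_gettime(CLOCK_MONOTONIC, &t0);
>>>           for (i = 0; i < LOOPS; i++)
>>>                   ioctl(fd, request, args);
>>>           clock_gettime(CLOCK_MONOTONIC, &t1);
>>>
>>>           return ((t1.tv_sec - t0.tv_sec) * 1e9 +
>>>                   (t1.tv_nsec - t0.tv_nsec)) / LOOPS;
>>>   }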
>>>
>>> Now, if there is enough data showing that a significant percentage of jobs
>>> submitted to the GPU take less than 0.35 microseconds, then yes, userspace
>>> scheduling does make sense. But so far all we have is handwaving with no
>>> data to support any facts.
>>>
>>>
>>> Now, if we want to schedule from userspace, then you will need to do
>>> something about the pinning, something that gives control to the kernel so
>>> that the kernel can unpin and move objects whenever it wants, no matter
>>> what userspace is doing.
>>>
>>>>>>>
>>>>>>> So I would rather see a very simple ioctl that writes the doorbell and
>>>>>>> might do more than that in the ring/queue overcommit case, where it would
>>>>>>> first have to wait for a free ring/queue before scheduling stuff. This
>>>>>>> would also allow a sane implementation of things like performance
>>>>>>> counters, which could be acquired by the kernel for the duration of a job
>>>>>>> submitted by userspace. While still not optimal, this would be better
>>>>>>> than userspace locking.
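>>>>>>>
>>>>>>> Shape-wise, that proposal amounts to something like the sketch below
>>>>>>> (all names invented for illustration; nothing like this exists in the
>>>>>>> patchset as posted):
>>>>>>>
>>>>>>>   /* userspace passes the queue id and its new ring write pointer */
>>>>>>>   struct kfd_submit_args {
>>>>>>>           uint32_t queue_id;
>>>>>>>           uint32_t ring_wptr;
>>>>>>>   };
>>>>>>>
>>>>>>>   static long kfd_ioctl_submit(struct kfd_process *p,
>>>>>>>                                struct kfd_submit_args *args)
>>>>>>>   {
>>>>>>>           struct queue *q = lookup_queue(p, args->queue_id);
>>>>>>>
>>>>>>>           if (!q)
>>>>>>>                   return -EINVAL;
>>>>>>>           if (!queue_mapped_to_hw(q))
>>>>>>>                   wait_for_free_hw_slot(q);   /* handle overcommit */
>>>>>>>           write_doorbell(q, args->ring_wptr); /* kick the hardware */
>>>>>>>           return 0;
>>>>>>>   }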
>>>>>>>
>>>>>>>
>>>>>>> I might have more thoughts once I am done with all the patches.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jérôme
>>>>>>>
>>>>>>>>
>>>>>>>> Original Cover Letter:
>>>>>>>>
>>>>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver
>>>>>>>> for radeon-family GPUs.
>>>>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
>>>>>>>> system resources more effectively via HW features including shared pageable
>>>>>>>> memory, userspace-accessible work queues, and platform-level atomics. In
>>>>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
>>>>>>>> Islands family of GPUs also performs HW-level validation of commands passed
>>>>>>>> in through the queues (aka rings).
>>>>>>>>
>>>>>>>> The code in this patch set is intended to serve both as a sample driver for
>>>>>>>> other HSA-compatible hardware devices and as a production driver for
>>>>>>>> radeon-family processors. The code is architected to support multiple CPUs
>>>>>>>> each with connected GPUs, although the current implementation focuses on a
>>>>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
>>>>>>>> graphics driver (kgd).
>>>>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
>>>>>>>> functionality between HSA compute and regular gfx/compute (memory,
>>>>>>>> interrupts, registers), while other functionality has been added
>>>>>>>> specifically for HSA compute  (hw scheduler for virtualized compute rings).
>>>>>>>> All shared hardware is owned by the radeon graphics driver, and an interface
>>>>>>>> between kfd and kgd allows the kfd to make use of those shared resources,
>>>>>>>> while HSA-specific functionality is managed directly by kfd by submitting
>>>>>>>> packets into an HSA-specific command queue (the "HIQ").
>>>>>>>>
>>>>>>>> During kfd module initialization a char device node (/dev/kfd) is created
>>>>>>>> (surviving until module exit), with ioctls for queue creation & management,
>>>>>>>> and data structures are initialized for managing HSA device topology.
>>>>>>>> The rest of the initialization is driven by calls from the radeon kgd at the
>>>>>>>> following points :
>>>>>>>>
>>>>>>>> - radeon_init (kfd_init)
>>>>>>>> - radeon_exit (kfd_fini)
>>>>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>>>>>>>> - radeon_driver_unload_kms (kfd_device_fini)
>>>>>>>>
>>>>>>>> During the probe and init processing per-device data structures are
>>>>>>>> established which connect to the associated graphics kernel driver. This
>>>>>>>> information is exposed to userspace via sysfs, along with a version number
>>>>>>>> allowing userspace to determine if a topology change has occurred while it
>>>>>>>> was reading from sysfs.
>>>>>>>> The interface between kfd and kgd also allows the kfd to request buffer
>>>>>>>> management services from kgd, and allows kgd to route interrupt requests to
>>>>>>>> kfd code since the interrupt block is shared between regular
>>>>>>>> graphics/compute and HSA compute subsystems in the GPU.
>>>>>>>>
>>>>>>>> The kfd code works with an open source usermode library ("libhsakmt") which
>>>>>>>> is in the final stages of IP review and should be published in a separate
>>>>>>>> repo over the next few days.
>>>>>>>> The code operates in one of three modes, selectable via the sched_policy
>>>>>>>> module parameter :
>>>>>>>>
>>>>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
>>>>>>>> CP, and allows oversubscription (more queues than HW slots)
>>>>>>>> - sched_policy=1 also uses HW scheduling but does not allow
>>>>>>>> oversubscription, so create_queue requests fail when we run out of HW slots
>>>>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
>>>>>>>> queues to HW slots by programming registers
>>>>>>>>
>>>>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so
>>>>>>>> has less test coverage than the other options. Default in the current code
>>>>>>>> is "HW scheduling without oversubscription" since that is where we have the
>>>>>>>> most test coverage but we expect to change the default to "HW scheduling
>>>>>>>> with oversubscription" after further testing. This effectively removes the
>>>>>>>> HW limit on the number of work queues available to applications.
>>>>>>>>
>>>>>>>> Programs running on the GPU are associated with an address space through the
>>>>>>>> VMID field, which is translated to a unique PASID at access time via a set
>>>>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
>>>>>>>> are partitioned (under control of the radeon kgd) between current
>>>>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The
>>>>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
>>>>>>>> and by driver code if HW scheduling is not being used.
>>>>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
>>>>>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
>>>>>>>> dedicated for this purpose, and pages within the doorbell aperture are
>>>>>>>> mapped to userspace (each page mapped to only one user address space).
>>>>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
>>>>>>>> userspace code to safely manage work queues (rings) without requiring a
>>>>>>>> kernel call for every ring update.
>>>>>>>> First step for an application process is to open the kfd device. Calls to
>>>>>>>> open create a kfd "process" structure only for the first thread of the
>>>>>>>> process. Subsequent open calls are checked to see if they are from processes
>>>>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process
>>>>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated
>>>>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
>>>>>>>> accessible to the GPU.
>>>>>>>> Next step is for the application to collect topology information via sysfs.
>>>>>>>> This gives userspace enough information to be able to identify specific
>>>>>>>> nodes (processors) in subsequent queue management calls. Application
>>>>>>>> processes can create queues on multiple processors, and processors support
>>>>>>>> queues from multiple processes.
>>>>>>>> At this point the application can create work queues in userspace memory and
>>>>>>>> pass them through the usermode library to kfd to have them mapped onto HW
>>>>>>>> queue slots so that commands written to the queues can be executed by the
>>>>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code
>>>>>>>> is device-specific.
>>>>>>>> Written by John Bridgman <John.Bridgman@amd.com>
>>>>>>>>
>>>>>>>>
>>>>>>>> Alexey Skidanov (1):
>>>>>>>>   amdkfd: Implement the Get Process Aperture IOCTL
>>>>>>>>
>>>>>>>> Andrew Lewycky (3):
>>>>>>>>   amdkfd: Add basic modules to amdkfd
>>>>>>>>   amdkfd: Add interrupt handling module
>>>>>>>>   amdkfd: Implement the Set Memory Policy IOCTL
>>>>>>>>
>>>>>>>> Ben Goz (8):
>>>>>>>>   amdkfd: Add queue module
>>>>>>>>   amdkfd: Add mqd_manager module
>>>>>>>>   amdkfd: Add kernel queue module
>>>>>>>>   amdkfd: Add module parameter of scheduling policy
>>>>>>>>   amdkfd: Add packet manager module
>>>>>>>>   amdkfd: Add process queue manager module
>>>>>>>>   amdkfd: Add device queue manager module
>>>>>>>>   amdkfd: Implement the create/destroy/update queue IOCTLs
>>>>>>>>
>>>>>>>> Evgeny Pinchuk (3):
>>>>>>>>   amdkfd: Add topology module to amdkfd
>>>>>>>>   amdkfd: Implement the Get Clock Counters IOCTL
>>>>>>>>   amdkfd: Implement the PMC Acquire/Release IOCTLs
>>>>>>>>
>>>>>>>> Oded Gabbay (10):
>>>>>>>>   mm: Add kfd_process pointer to mm_struct
>>>>>>>>   drm/radeon: reduce number of free VMIDs and pipes in KV
>>>>>>>>   drm/radeon/cik: Don't touch int of pipes 1-7
>>>>>>>>   drm/radeon: Report doorbell configuration to amdkfd
>>>>>>>>   drm/radeon: adding synchronization for GRBM GFX
>>>>>>>>   drm/radeon: Add radeon <--> amdkfd interface
>>>>>>>>   Update MAINTAINERS and CREDITS files with amdkfd info
>>>>>>>>   amdkfd: Add IOCTL set definitions of amdkfd
>>>>>>>>   amdkfd: Add amdkfd skeleton driver
>>>>>>>>   amdkfd: Add binding/unbinding calls to amd_iommu driver
>>>>>>>>
>>>>>>>>  CREDITS                                            |    7 +
>>>>>>>>  MAINTAINERS                                        |   10 +
>>>>>>>>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>>>>>>>  drivers/gpu/drm/radeon/Makefile                    |    3 +
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>>>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>>>>>>>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207
>>>>>>>> ++++++++++++++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>>>>>>>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>>>>>>>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>>>>>>>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>>>>>>>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>>>>>>>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>>>>>>>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>>>>>>>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>>>>>>>  include/linux/mm_types.h                           |   14 +
>>>>>>>>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>>>>>>>  43 files changed, 9226 insertions(+), 95 deletions(-)
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>>>>>>>  create mode 100644 include/uapi/linux/kfd_ioctl.h
>>>>>>>>
>>>>>>>> --
>>>>>>>> 1.9.1
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-21 18:36                 ` Oded Gabbay
  0 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 18:36 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Lewycky, Michel Dänzer, linux-kernel, dri-devel,
	linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On 21/07/14 21:14, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
>> On 21/07/14 18:54, Jerome Glisse wrote:
>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>>>> On 21/07/14 16:39, Christian König wrote:
>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>>>
>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>>>> of the code.
>>>>>>>>
>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>>>
>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>>>> will adjust amdkfd to work within that framework.
>>>>>>>>
>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>>>> drivers.
>>>>>>>>
>>>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>>>
>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>>>
>>>>>>> So quick comments before i finish going over all patches. There is many
>>>>>>> things that need more documentation espacialy as of right now there is
>>>>>>> no userspace i can go look at.
>>>>>> So quick comments on some of your questions but first of all, thanks for the
>>>>>> time you dedicated to review the code.
>>>>>>>
>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>>>> that side.
>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>>>> is it still a show stopper ?
>>>>>>
>>>>>> The kernel objects are:
>>>>>> - pipelines (4 per device)
>>>>>> - mqd per hiq (only 1 per device)
>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
>>>>>> 256 alignment. So total *possible* memory is 128MB
>>>>>> - kernel queue (only 1 per device)
>>>>>> - fence address for kernel queue
>>>>>> - runlists for the CP (1 or 2 per device)
>>>>>
>>>>> The main questions here are if it's avoid able to pin down the memory and if the
>>>>> memory is pinned down at driver load, by request from userspace or by anything
>>>>> else.
>>>>>
>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>>>> questionable, everything else sounds reasonable.
>>>>>
>>>>> Christian.
>>>>
>>>> Most of the pin downs are done on device initialization.
>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>>>> memory, I think it is OK.
>>>> The runlists are also done on userspace queue creation/deletion, but we only
>>>> have 1 or 2 runlists per device, so it is not that bad.
>>>
>>> 2G local memory ? You can not assume anything on userside configuration some
>>> one might build an hsa computer with 512M and still expect a functioning
>>> desktop.
>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
>> Second, I would imagine we can build some protection around it, like
>> checking total local memory and limit number of queues based on some
>> percentage of that total local memory. So, if someone will have only
>> 512M, he will be able to open less queues.
>>
>>
>>>
>>> I need to go look into what all this mqd is for, what it does and what it is
>>> about. But pinning is really bad and this is an issue with userspace command
>>> scheduling an issue that obviously AMD fails to take into account in design
>>> phase.
>> Maybe, but that is the H/W design non-the-less. We can't very well
>> change the H/W.
> 
> You can not change the hardware but it is not an excuse to allow bad design to
> sneak in software to work around that. So i would rather penalize bad hardware
> design and have command submission in the kernel, until AMD fix its hardware to
> allow proper scheduling by the kernel and proper control by the kernel. 
I'm sorry but I do *not* think this is a bad design. S/W scheduling in
the kernel can not, IMO, scale well to 100K queues and 10K processes.

> Because really where we want to go is having GPU closer to a CPU in term of scheduling
> capacity and once we get there we want the kernel to always be able to take over
> and do whatever it wants behind process back.
Who do you refer to when you say "we" ? AFAIK, the hw scheduling
direction is where AMD is now and where it is heading in the future.
That doesn't preclude the option to allow the kernel to take over and do
what he wants. I agree that in KV we have a problem where we can't do a
mid-wave preemption, so theoretically, a long running compute kernel can
make things messy, but in Carrizo, we will have this ability. Having
said that, it will only be through the CP H/W scheduling. So AMD is
_not_ going to abandon H/W scheduling. You can dislike it, but this is
the situation.
> 
>>>>>
>>>>>>>
>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>>>> stuff there.
>>>>>>>
>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
>>>>>>> This would avoid crazy communication btw radeon and kfd.
>>>>>>>
>>>>>>> The whole aperture business needs some serious explanation. Especialy as
>>>>>>> you want to use userspace address there is nothing to prevent userspace
>>>>>>> program from allocating things at address you reserve for lds, scratch,
>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
>>>>>>> address reserved for kernel (see kernel memory map).
>>>>>>>
>>>>>>> The whole business of locking performance counter for exclusive per process
>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
>>>>>>> space command ring.
>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>>>> find it analogous to a situation where a network maintainer nacking a driver
>>>>>> for a network card, which is slower than a different network card. Doesn't
>>>>>> seem reasonable this situation is would happen. He would still put both the
>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
>>>>>> I don't think this is a valid reason to NACK the driver.
>>>
>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
>>> the performance ioctl.
>>>
>>> Again this is another argument for round trip to the kernel. As inside kernel you
>>> could properly do exclusive gpu counter access accross single user cmd buffer
>>> execution.
>>>
>>>>>>
>>>>>>> I only see issues with that. First and foremost i would
>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
>>>>>>> overhead that is measurable in any meaning full way against a simple
>>>>>>> function call. I know the userspace command ring is a big marketing features
>>>>>>> that please ignorant userspace programmer. But really this only brings issues
>>>>>>> and for absolutely not upside afaict.
>>>>>> Really ? You think that doing a context switch to kernel space, with all its
>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
>>>>>> which only puts a buffer on a ring and writes a doorbell ?
>>>
>>> I am saying the overhead is not that big and it probably will not matter in most
>>> usecase. For instance i did wrote the most useless kernel module that add two
>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
>>> ioctl is 13 times slower.
>>>
>>> Now if there is enough data that shows that a significant percentage of jobs
>>> submited to the GPU will take less that 0.35microsecond then yes userspace
>>> scheduling does make sense. But so far all we have is handwaving with no data
>>> to support any facts.
>>>
>>>
>>> Now if we want to schedule from userspace than you will need to do something
>>> about the pinning, something that gives control to kernel so that kernel can
>>> unpin when it wants and move object when it wants no matter what userspace is
>>> doing.
>>>
>>>>>>>
>>>>>>> So i would rather see a very simple ioctl that write the doorbell and might
>>>>>>> do more than that in case of ring/queue overcommit where it would first have
>>>>>>> to wait for a free ring/queue to schedule stuff. This would also allow sane
>>>>>>> implementation of things like performance counter that could be acquire by
>>>>>>> kernel for duration of a job submitted by userspace. While still not optimal
>>>>>>> this would be better that userspace locking.
>>>>>>>
>>>>>>>
>>>>>>> I might have more thoughts once i am done with all the patches.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Jérôme
>>>>>>>
>>>>>>>>
>>>>>>>> Original Cover Letter:
>>>>>>>>
>>>>>>>> This patch set implements a Heterogeneous System Architecture (HSA) driver
>>>>>>>> for radeon-family GPUs.
>>>>>>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to share
>>>>>>>> system resources more effectively via HW features including shared pageable
>>>>>>>> memory, userspace-accessible work queues, and platform-level atomics. In
>>>>>>>> addition to the memory protection mechanisms in GPUVM and IOMMUv2, the Sea
>>>>>>>> Islands family of GPUs also performs HW-level validation of commands passed
>>>>>>>> in through the queues (aka rings).
>>>>>>>>
>>>>>>>> The code in this patch set is intended to serve both as a sample driver for
>>>>>>>> other HSA-compatible hardware devices and as a production driver for
>>>>>>>> radeon-family processors. The code is architected to support multiple CPUs
>>>>>>>> each with connected GPUs, although the current implementation focuses on a
>>>>>>>> single Kaveri/Berlin APU, and works alongside the existing radeon kernel
>>>>>>>> graphics driver (kgd).
>>>>>>>> AMD GPUs designed for use with HSA (Sea Islands and up) share some hardware
>>>>>>>> functionality between HSA compute and regular gfx/compute (memory,
>>>>>>>> interrupts, registers), while other functionality has been added
>>>>>>>> specifically for HSA compute  (hw scheduler for virtualized compute rings).
>>>>>>>> All shared hardware is owned by the radeon graphics driver, and an interface
>>>>>>>> between kfd and kgd allows the kfd to make use of those shared resources,
>>>>>>>> while HSA-specific functionality is managed directly by kfd by submitting
>>>>>>>> packets into an HSA-specific command queue (the "HIQ").
>>>>>>>>
>>>>>>>> During kfd module initialization a char device node (/dev/kfd) is created
>>>>>>>> (surviving until module exit), with ioctls for queue creation & management,
>>>>>>>> and data structures are initialized for managing HSA device topology.
>>>>>>>> The rest of the initialization is driven by calls from the radeon kgd at the
>>>>>>>> following points :
>>>>>>>>
>>>>>>>> - radeon_init (kfd_init)
>>>>>>>> - radeon_exit (kfd_fini)
>>>>>>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
>>>>>>>> - radeon_driver_unload_kms (kfd_device_fini)
>>>>>>>>
>>>>>>>> During the probe and init processing per-device data structures are
>>>>>>>> established which connect to the associated graphics kernel driver. This
>>>>>>>> information is exposed to userspace via sysfs, along with a version number
>>>>>>>> allowing userspace to determine if a topology change has occurred while it
>>>>>>>> was reading from sysfs.
>>>>>>>> The interface between kfd and kgd also allows the kfd to request buffer
>>>>>>>> management services from kgd, and allows kgd to route interrupt requests to
>>>>>>>> kfd code since the interrupt block is shared between regular
>>>>>>>> graphics/compute and HSA compute subsystems in the GPU.
>>>>>>>>
>>>>>>>> The kfd code works with an open source usermode library ("libhsakmt") which
>>>>>>>> is in the final stages of IP review and should be published in a separate
>>>>>>>> repo over the next few days.
>>>>>>>> The code operates in one of three modes, selectable via the sched_policy
>>>>>>>> module parameter :
>>>>>>>>
>>>>>>>> - sched_policy=0 uses a hardware scheduler running in the MEC block within
>>>>>>>> CP, and allows oversubscription (more queues than HW slots)
>>>>>>>> - sched_policy=1 also uses HW scheduling but does not allow
>>>>>>>> oversubscription, so create_queue requests fail when we run out of HW slots
>>>>>>>> - sched_policy=2 does not use HW scheduling, so the driver manually assigns
>>>>>>>> queues to HW slots by programming registers
>>>>>>>>
>>>>>>>> The "no HW scheduling" option is for debug & new hardware bringup only, so
>>>>>>>> has less test coverage than the other options. Default in the current code
>>>>>>>> is "HW scheduling without oversubscription" since that is where we have the
>>>>>>>> most test coverage but we expect to change the default to "HW scheduling
>>>>>>>> with oversubscription" after further testing. This effectively removes the
>>>>>>>> HW limit on the number of work queues available to applications.
>>>>>>>>
>>>>>>>> Programs running on the GPU are associated with an address space through the
>>>>>>>> VMID field, which is translated to a unique PASID at access time via a set
>>>>>>>> of 16 VMID-to-PASID mapping registers. The available VMIDs (currently 16)
>>>>>>>> are partitioned (under control of the radeon kgd) between current
>>>>>>>> gfx/compute and HSA compute, with each getting 8 in the current code. The
>>>>>>>> VMID-to-PASID mapping registers are updated by the HW scheduler when used,
>>>>>>>> and by driver code if HW scheduling is not being used.
>>>>>>>> The Sea Islands compute queues use a new "doorbell" mechanism instead of the
>>>>>>>> earlier kernel-managed write pointer registers. Doorbells use a separate BAR
>>>>>>>> dedicated for this purpose, and pages within the doorbell aperture are
>>>>>>>> mapped to userspace (each page mapped to only one user address space).
>>>>>>>> Writes to the doorbell aperture are intercepted by GPU hardware, allowing
>>>>>>>> userspace code to safely manage work queues (rings) without requiring a
>>>>>>>> kernel call for every ring update.
>>>>>>>> First step for an application process is to open the kfd device. Calls to
>>>>>>>> open create a kfd "process" structure only for the first thread of the
>>>>>>>> process. Subsequent open calls are checked to see if they are from processes
>>>>>>>> using the same mm_struct and, if so, don't do anything. The kfd per-process
>>>>>>>> data lives as long as the mm_struct exists. Each mm_struct is associated
>>>>>>>> with a unique PASID, allowing the IOMMUv2 to make userspace process memory
>>>>>>>> accessible to the GPU.
>>>>>>>> Next step is for the application to collect topology information via sysfs.
>>>>>>>> This gives userspace enough information to be able to identify specific
>>>>>>>> nodes (processors) in subsequent queue management calls. Application
>>>>>>>> processes can create queues on multiple processors, and processors support
>>>>>>>> queues from multiple processes.
>>>>>>>> At this point the application can create work queues in userspace memory and
>>>>>>>> pass them through the usermode library to kfd to have them mapped onto HW
>>>>>>>> queue slots so that commands written to the queues can be executed by the
>>>>>>>> GPU. Queue operations specify a processor node, and so the bulk of this code
>>>>>>>> is device-specific.
>>>>>>>> Written by John Bridgman <John.Bridgman@amd.com>
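
A rough illustration of the userspace submission path implied by the doorbell
description a few paragraphs above. The structure and all names below are
assumptions for the sketch (they do not come from the posted patches or from
libhsakmt), the ring and doorbell page are assumed to have been obtained
beforehand via the create-queue ioctl plus mmap, and a single 64-bit word
stands in for what is really a multi-dword packet:

#include <stdatomic.h>
#include <stdint.h>

/* Illustrative only: a userspace queue whose ring buffer and doorbell
 * page were set up beforehand (create-queue ioctl + mmap). */
struct user_queue {
	uint64_t         *ring;        /* userspace-visible ring buffer      */
	uint32_t          ring_slots;  /* number of packet slots in the ring */
	uint32_t          wptr;        /* software copy of the write pointer */
	_Atomic uint32_t *doorbell;    /* mmapped doorbell page              */
};

static void submit_packet(struct user_queue *q, uint64_t packet)
{
	q->ring[q->wptr % q->ring_slots] = packet;
	q->wptr++;
	/* Release ordering: the packet must be visible before the GPU
	 * observes the new write pointer through the doorbell, so no
	 * kernel call is needed per submission. */
	atomic_store_explicit(q->doorbell, q->wptr, memory_order_release);
}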
>>>>>>>>
>>>>>>>>
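Similarly, a minimal userspace sketch of the application startup flow described
above: open the device node, then read the topology while checking the sysfs
generation counter for concurrent changes. The sysfs path and the retry scheme
are assumptions for illustration, not taken from the posted patches:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static unsigned long read_generation(const char *path)
{
	unsigned long gen = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%lu", &gen) != 1)
			gen = 0;
		fclose(f);
	}
	return gen;
}

int main(void)
{
	/* Path is assumed for the sketch; the real location is defined by
	 * the kfd topology code. */
	const char *gen_path = "/sys/class/kfd/kfd/topology/generation_id";
	int kfd = open("/dev/kfd", O_RDWR);
	unsigned long before, after;

	if (kfd < 0) {
		perror("open /dev/kfd");
		return 1;
	}

	do {
		before = read_generation(gen_path);
		/* ... walk the topology nodes exposed under sysfs here ... */
		after = read_generation(gen_path);
	} while (before != after);	/* retry if topology changed mid-scan */

	close(kfd);
	return 0;
}
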
>>>>>>>> Alexey Skidanov (1):
>>>>>>>>   amdkfd: Implement the Get Process Aperture IOCTL
>>>>>>>>
>>>>>>>> Andrew Lewycky (3):
>>>>>>>>   amdkfd: Add basic modules to amdkfd
>>>>>>>>   amdkfd: Add interrupt handling module
>>>>>>>>   amdkfd: Implement the Set Memory Policy IOCTL
>>>>>>>>
>>>>>>>> Ben Goz (8):
>>>>>>>>   amdkfd: Add queue module
>>>>>>>>   amdkfd: Add mqd_manager module
>>>>>>>>   amdkfd: Add kernel queue module
>>>>>>>>   amdkfd: Add module parameter of scheduling policy
>>>>>>>>   amdkfd: Add packet manager module
>>>>>>>>   amdkfd: Add process queue manager module
>>>>>>>>   amdkfd: Add device queue manager module
>>>>>>>>   amdkfd: Implement the create/destroy/update queue IOCTLs
>>>>>>>>
>>>>>>>> Evgeny Pinchuk (3):
>>>>>>>>   amdkfd: Add topology module to amdkfd
>>>>>>>>   amdkfd: Implement the Get Clock Counters IOCTL
>>>>>>>>   amdkfd: Implement the PMC Acquire/Release IOCTLs
>>>>>>>>
>>>>>>>> Oded Gabbay (10):
>>>>>>>>   mm: Add kfd_process pointer to mm_struct
>>>>>>>>   drm/radeon: reduce number of free VMIDs and pipes in KV
>>>>>>>>   drm/radeon/cik: Don't touch int of pipes 1-7
>>>>>>>>   drm/radeon: Report doorbell configuration to amdkfd
>>>>>>>>   drm/radeon: adding synchronization for GRBM GFX
>>>>>>>>   drm/radeon: Add radeon <--> amdkfd interface
>>>>>>>>   Update MAINTAINERS and CREDITS files with amdkfd info
>>>>>>>>   amdkfd: Add IOCTL set definitions of amdkfd
>>>>>>>>   amdkfd: Add amdkfd skeleton driver
>>>>>>>>   amdkfd: Add binding/unbinding calls to amd_iommu driver
>>>>>>>>
>>>>>>>>  CREDITS                                            |    7 +
>>>>>>>>  MAINTAINERS                                        |   10 +
>>>>>>>>  drivers/gpu/drm/radeon/Kconfig                     |    2 +
>>>>>>>>  drivers/gpu/drm/radeon/Makefile                    |    3 +
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/Kconfig              |   10 +
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/Makefile             |   14 +
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_mqds.h           |  185 +++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/cik_regs.h           |  220 ++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c       |  123 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c        |  518 +++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_crat.h           |  294 +++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_device.c         |  254 ++++
>>>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.c   |  985 ++++++++++++++++
>>>>>>>>  .../drm/radeon/amdkfd/kfd_device_queue_manager.h   |  101 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c       |  264 +++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c      |  161 +++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c   |  305 +++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h   |   66 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_module.c         |  131 +++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c    |  291 +++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h    |   54 +
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c |  488 ++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c          |   97 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h    |  682 +++++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h    |  107 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_priv.h           |  466 ++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_process.c        |  405 +++++++
>>>>>>>>  .../drm/radeon/amdkfd/kfd_process_queue_manager.c  |  343 ++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_queue.c          |  109 ++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.c       | 1207 ++++++++++++++++++++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_topology.h       |  168 +++
>>>>>>>>  drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c         |   96 ++
>>>>>>>>  drivers/gpu/drm/radeon/cik.c                       |  154 +--
>>>>>>>>  drivers/gpu/drm/radeon/cik_reg.h                   |   65 ++
>>>>>>>>  drivers/gpu/drm/radeon/cikd.h                      |   51 +-
>>>>>>>>  drivers/gpu/drm/radeon/radeon.h                    |    9 +
>>>>>>>>  drivers/gpu/drm/radeon/radeon_device.c             |   32 +
>>>>>>>>  drivers/gpu/drm/radeon/radeon_drv.c                |    5 +
>>>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.c                |  566 +++++++++
>>>>>>>>  drivers/gpu/drm/radeon/radeon_kfd.h                |  119 ++
>>>>>>>>  drivers/gpu/drm/radeon/radeon_kms.c                |    7 +
>>>>>>>>  include/linux/mm_types.h                           |   14 +
>>>>>>>>  include/uapi/linux/kfd_ioctl.h                     |  133 +++
>>>>>>>>  43 files changed, 9226 insertions(+), 95 deletions(-)
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Kconfig
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/Makefile
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_mqds.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/cik_regs.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_aperture.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_chardev.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_crat.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_device_queue_manager.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_doorbell.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_interrupt.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_kernel_queue.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_module.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_mqd_manager.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_packet_manager.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pasid.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_headers.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_pm4_opcodes.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_priv.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_process_queue_manager.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_queue.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_topology.h
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/amdkfd/kfd_vidmem.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.c
>>>>>>>>  create mode 100644 drivers/gpu/drm/radeon/radeon_kfd.h
>>>>>>>>  create mode 100644 include/uapi/linux/kfd_ioctl.h
>>>>>>>>
>>>>>>>> --
>>>>>>>> 1.9.1
>>>>>>>>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 18:22                 ` Daniel Vetter
@ 2014-07-21 18:41                   ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 18:41 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On 21/07/14 21:22, Daniel Vetter wrote:
> On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote:
>>> I'm not sure whether we can do the same trick with the hw scheduler. But
>>> then unpinning hw contexts will drain the pipeline anyway, so I guess we
>>> can just stop feeding the hw scheduler until it runs dry. And then unpin
>>> and evict.
>> So, I'm afraid but we can't do this for AMD Kaveri because:
> 
> Well as long as you can drain the hw scheduler queue (and you can do
> that, worst case you have to unmap all the doorbells and other stuff
> to intercept further submission from userspace) you can evict stuff.

I can't drain the hw scheduler queue, as I can't do mid-wave preemption.
Moreover, if I use the dequeue request register to preempt a queue during
a dispatch, it may be that some waves (wave groups, actually) of the
dispatch have not yet been created; when I reactivate the mqd, they should
be created but are not. However, this works fine if you use the HIQ: the
CP ucode correctly saves and restores the state of an outstanding dispatch.
I don't think we have access to that state from software at all, so it's
not a bug, it is "as designed".

> And if we don't want compute to be a denial of service on the display
> side of the driver we need this ability. Now if you go through an
> ioctl instead of the doorbell (I agree with Jerome here, the doorbell
> should be supported by benchmarks on linux) this gets a bit easier,
> but it's not a requirement really.
> -Daniel
> 
On KV, we do have the theoretical possibility of a DoS on the display side,
as we can't do mid-wave preemption. On CZ, we won't have this problem.

	Oded

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 18:36                 ` Oded Gabbay
@ 2014-07-21 18:59                   ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 18:59 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Andrew Lewycky, Michel Dänzer, linux-kernel, dri-devel,
	linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
> On 21/07/14 21:14, Jerome Glisse wrote:
> > On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
> >> On 21/07/14 18:54, Jerome Glisse wrote:
> >>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
> >>>> On 21/07/14 16:39, Christian König wrote:
> >>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> >>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
> >>>>>>>>
> >>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
> >>>>>>>> restructured with a cleaner history and no totally-different-early-versions
> >>>>>>>> of the code.
> >>>>>>>>
> >>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> >>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
> >>>>>>>> There is no code going away or even modified between patches, only added.
> >>>>>>>>
> >>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> >>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> >>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
> >>>>>>>> generic hsa framework being implemented in the future and in that case, we
> >>>>>>>> will adjust amdkfd to work within that framework.
> >>>>>>>>
> >>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
> >>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> >>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
> >>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
> >>>>>>>> is the radeon driver. Having said that, we will probably need to move it
> >>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
> >>>>>>>> drivers.
> >>>>>>>>
> >>>>>>>> For people who like to review using git, the v2 patch set is located at:
> >>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> >>>>>>>>
> >>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
> >>>>>>>
> >>>>>>> So quick comments before i finish going over all patches. There is many
> >>>>>>> things that need more documentation espacialy as of right now there is
> >>>>>>> no userspace i can go look at.
> >>>>>> So quick comments on some of your questions but first of all, thanks for the
> >>>>>> time you dedicated to review the code.
> >>>>>>>
> >>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
> >>>>>>> no, that would need serious arguments for any hope of convincing me on
> >>>>>>> that side.
> >>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
> >>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
> >>>>>> is it still a show stopper ?
> >>>>>>
> >>>>>> The kernel objects are:
> >>>>>> - pipelines (4 per device)
> >>>>>> - mqd per hiq (only 1 per device)
> >>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
> >>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
> >>>>>> 256 alignment. So total *possible* memory is 128MB
> >>>>>> - kernel queue (only 1 per device)
> >>>>>> - fence address for kernel queue
> >>>>>> - runlists for the CP (1 or 2 per device)
> >>>>>
> >>>>> The main questions here are if it's avoid able to pin down the memory and if the
> >>>>> memory is pinned down at driver load, by request from userspace or by anything
> >>>>> else.
> >>>>>
> >>>>> As far as I can see only the "mqd per userspace queue" might be a bit
> >>>>> questionable, everything else sounds reasonable.
> >>>>>
> >>>>> Christian.
> >>>>
> >>>> Most of the pin downs are done on device initialization.
> >>>> The "mqd per userspace" is done per userspace queue creation. However, as I
> >>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
> >>>> memory, I think it is OK.
> >>>> The runlists are also done on userspace queue creation/deletion, but we only
> >>>> have 1 or 2 runlists per device, so it is not that bad.
> >>>
> >>> 2G local memory ? You can not assume anything on userside configuration some
> >>> one might build an hsa computer with 512M and still expect a functioning
> >>> desktop.
> >> First of all, I'm only considering Kaveri computer, not "hsa" computer.
> >> Second, I would imagine we can build some protection around it, like
> >> checking total local memory and limit number of queues based on some
> >> percentage of that total local memory. So, if someone will have only
> >> 512M, he will be able to open less queues.
> >>
> >>
> >>>
> >>> I need to go look into what all this mqd is for, what it does and what it is
> >>> about. But pinning is really bad and this is an issue with userspace command
> >>> scheduling an issue that obviously AMD fails to take into account in design
> >>> phase.
> >> Maybe, but that is the H/W design non-the-less. We can't very well
> >> change the H/W.
> > 
> > You can not change the hardware but it is not an excuse to allow bad design to
> > sneak in software to work around that. So i would rather penalize bad hardware
> > design and have command submission in the kernel, until AMD fix its hardware to
> > allow proper scheduling by the kernel and proper control by the kernel. 
> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
> the kernel can not, IMO, scale well to 100K queues and 10K processes.

I am not advocating for having the kernel decide every last detail. I am
advocating for the kernel being able to preempt at any time and to decrease
or increase a user queue's priority, so that overall the kernel stays in
charge of resource management and can deal with a rogue client properly.

> 
> > Because really where we want to go is having GPU closer to a CPU in term of scheduling
> > capacity and once we get there we want the kernel to always be able to take over
> > and do whatever it wants behind process back.
> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
> direction is where AMD is now and where it is heading in the future.
> That doesn't preclude the option to allow the kernel to take over and do
> what he wants. I agree that in KV we have a problem where we can't do a
> mid-wave preemption, so theoretically, a long running compute kernel can
> make things messy, but in Carrizo, we will have this ability. Having
> said that, it will only be through the CP H/W scheduling. So AMD is
> _not_ going to abandon H/W scheduling. You can dislike it, but this is
> the situation.

"We" was meant as the overall Linux community, but maybe I should not
pretend to speak for everyone interested in having a common standard.

My point is that current hardware does not have appropriate support for
preemption; hence, current hardware should schedule jobs through an ioctl,
and AMD should think a bit more before committing to a design and handwaving
any hardware shortcoming as something that can be worked around in software.
The pinning is broken by design; the only way to work around it is kernel
command queue scheduling. That's a fact.

Once hardware supports proper preemption and allows buffers in use by a
userspace command queue to be moved around/evicted, then we can allow
userspace scheduling. Until then, my personal opinion is that it should not
be allowed and that people will have to pay the ioctl price, which I showed
to be small: if you really have 100K queues, each with one job, I would not
expect all 100K jobs to complete in less time than it takes to execute an
ioctl, i.e. even without the ioctl delay, whatever you schedule will have to
wait on previously submitted jobs.
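
For reference, the kind of round trip being discussed can be estimated with a
few lines of userspace code. This is not the adder module linked above; it
simply times FIONREAD on a pipe, which is about as close to a no-op ioctl as
it gets, so the numbers are only indicative:

#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1000000L

int main(void)
{
	int fds[2], n;
	struct timespec t0, t1;
	double ns;

	if (pipe(fds))
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < ITERATIONS; i++)
		ioctl(fds[0], FIONREAD, &n);	/* kernel round trip, near no-op */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.3f us per ioctl round trip\n", ns / ITERATIONS / 1000.0);
	return 0;
}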

> > 
> >>>>>
> >>>>>>>
> >>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
> >>>>>>> stuff there.
> >>>>>>>
> >>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
> >>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
> >>>>>>> This would avoid crazy communication btw radeon and kfd.
> >>>>>>>
> >>>>>>> The whole aperture business needs some serious explanation. Especialy as
> >>>>>>> you want to use userspace address there is nothing to prevent userspace
> >>>>>>> program from allocating things at address you reserve for lds, scratch,
> >>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
> >>>>>>> address reserved for kernel (see kernel memory map).
> >>>>>>>
> >>>>>>> The whole business of locking performance counter for exclusive per process
> >>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
> >>>>>>> space command ring.
> >>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
> >>>>>> find it analogous to a situation where a network maintainer nacking a driver
> >>>>>> for a network card, which is slower than a different network card. Doesn't
> >>>>>> seem reasonable this situation is would happen. He would still put both the
> >>>>>> drivers in the kernel because people want to use the H/W and its features. So,
> >>>>>> I don't think this is a valid reason to NACK the driver.
> >>>
> >>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
> >>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
> >>> the performance ioctl.
> >>>
> >>> Again this is another argument for round trip to the kernel. As inside kernel you
> >>> could properly do exclusive gpu counter access accross single user cmd buffer
> >>> execution.
> >>>
> >>>>>>
> >>>>>>> I only see issues with that. First and foremost i would
> >>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
> >>>>>>> overhead that is measurable in any meaning full way against a simple
> >>>>>>> function call. I know the userspace command ring is a big marketing features
> >>>>>>> that please ignorant userspace programmer. But really this only brings issues
> >>>>>>> and for absolutely not upside afaict.
> >>>>>> Really ? You think that doing a context switch to kernel space, with all its
> >>>>>> overhead, is _not_ more expansive than just calling a function in userspace
> >>>>>> which only puts a buffer on a ring and writes a doorbell ?
> >>>
> >>> I am saying the overhead is not that big and it probably will not matter in most
> >>> usecase. For instance i did wrote the most useless kernel module that add two
> >>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
> >>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
> >>> ioctl is 13 times slower.
> >>>
> >>> Now if there is enough data that shows that a significant percentage of jobs
> >>> submited to the GPU will take less that 0.35microsecond then yes userspace
> >>> scheduling does make sense. But so far all we have is handwaving with no data
> >>> to support any facts.
> >>>
> >>>
> >>> Now if we want to schedule from userspace than you will need to do something
> >>> about the pinning, something that gives control to kernel so that kernel can
> >>> unpin when it wants and move object when it wants no matter what userspace is
> >>> doing.
> >>>
> >>>>>>>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 18:41                   ` Oded Gabbay
@ 2014-07-21 19:03                     ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 19:03 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Daniel Vetter, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote:
> On 21/07/14 21:22, Daniel Vetter wrote:
> > On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote:
> >>> I'm not sure whether we can do the same trick with the hw scheduler. But
> >>> then unpinning hw contexts will drain the pipeline anyway, so I guess we
> >>> can just stop feeding the hw scheduler until it runs dry. And then unpin
> >>> and evict.
> >> So, I'm afraid but we can't do this for AMD Kaveri because:
> > 
> > Well as long as you can drain the hw scheduler queue (and you can do
> > that, worst case you have to unmap all the doorbells and other stuff
> > to intercept further submission from userspace) you can evict stuff.
> 
> I can't drain the hw scheduler queue, as I can't do mid-wave preemption.
> Moreover, if I use the dequeue request register to preempt a queue during
> a dispatch, it may be that some waves (wave groups, actually) of the
> dispatch have not yet been created; when I reactivate the mqd, they should
> be created but are not. However, this works fine if you use the HIQ: the
> CP ucode correctly saves and restores the state of an outstanding dispatch.
> I don't think we have access to that state from software at all, so it's
> not a bug, it is "as designed".
> 

I think Daniel is suggesting to unmap the doorbell page and track each
write made by userspace to it; while it is unmapped, wait for the GPU to
drain, or use some kind of fence on a special queue. Once the GPU has
drained we can move the pinned buffers, then remap the doorbell and update
it to the last value written by userspace, which resumes execution with the
next job.
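
A minimal sketch of that revoke/drain/remap sequence, to make the idea
concrete. Everything here (the context structure, the drain fence, how the
last doorbell value is tracked) is an assumption for illustration; only
unmap_mapping_range() and writel() are real kernel interfaces, and a real
driver would refault the page from the VMA's fault handler rather than
open-code the flow like this:

#include <linux/io.h>
#include <linux/mm.h>
#include <linux/sched.h>

/* Illustrative context; none of these fields exist in the posted patches. */
struct kfd_doorbell_ctx {
	struct address_space *mapping;  /* mapping backing the user doorbell VMA */
	loff_t offset;                  /* byte offset of the doorbell page      */
	u32 __iomem *doorbell;          /* kernel mapping of the doorbell reg    */
	u32 last_value;                 /* last write pointer seen from userspace*/
	u32 *drain_fence;               /* written back by the scheduler on idle */
	u32 drain_seq;                  /* fence value meaning "runlist drained" */
};

static void kfd_revoke_drain_remap(struct kfd_doorbell_ctx *db)
{
	/* 1. Revoke the user mapping so further submissions fault instead of
	 *    reaching the hardware; the fault path records the last value.  */
	unmap_mapping_range(db->mapping, db->offset, PAGE_SIZE, 1);

	/* 2. Wait for the GPU to run dry (a fence on a special queue; shown
	 *    as a simple poll here for brevity).                            */
	while (READ_ONCE(*db->drain_fence) != db->drain_seq)
		cpu_relax();

	/* 3. Pinned buffers can now be moved or evicted by the caller.      */

	/* 4. Remap the doorbell page (normally done lazily by the VMA fault
	 *    handler) and replay the last value written by userspace, which
	 *    resumes execution with the next job.                           */
	writel(db->last_value, db->doorbell);
}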

> > And if we don't want compute to be a denial of service on the display
> > side of the driver we need this ability. Now if you go through an
> > ioctl instead of the doorbell (I agree with Jerome here, the doorbell
> > should be supported by benchmarks on linux) this gets a bit easier,
> > but it's not a requirement really.
> > -Daniel
> > 
> On KV, we do have the theoretical possibility of a DoS on the display side,
> as we can't do mid-wave preemption. On CZ, we won't have this problem.
> 
> 	Oded

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 18:59                   ` Jerome Glisse
  (?)
@ 2014-07-21 19:23                     ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 19:23 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Lewycky, Michel Dänzer, linux-kernel, dri-devel,
	linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On 21/07/14 21:59, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
>> On 21/07/14 21:14, Jerome Glisse wrote:
>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
>>>> On 21/07/14 18:54, Jerome Glisse wrote:
>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>>>>>> On 21/07/14 16:39, Christian König wrote:
>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>>>>>
>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>>>>>> of the code.
>>>>>>>>>>
>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>>>>>
>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>>>>>> will adjust amdkfd to work within that framework.
>>>>>>>>>>
>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>>>>>> drivers.
>>>>>>>>>>
>>>>>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>>>>>
>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>>>>>
>>>>>>>>> So quick comments before i finish going over all patches. There is many
>>>>>>>>> things that need more documentation espacialy as of right now there is
>>>>>>>>> no userspace i can go look at.
>>>>>>>> So quick comments on some of your questions but first of all, thanks for the
>>>>>>>> time you dedicated to review the code.
>>>>>>>>>
>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>>>>>> that side.
>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>>>>>> is it still a show stopper ?
>>>>>>>>
>>>>>>>> The kernel objects are:
>>>>>>>> - pipelines (4 per device)
>>>>>>>> - mqd per hiq (only 1 per device)
>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
>>>>>>>> 256 alignment. So total *possible* memory is 128MB
>>>>>>>> - kernel queue (only 1 per device)
>>>>>>>> - fence address for kernel queue
>>>>>>>> - runlists for the CP (1 or 2 per device)
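
For reference, a small worked sketch of the worst-case figure quoted above
(the per-queue numbers come from the quote; the 512-process count is implied
by 512K total queues at 1K queues per process):

#include <stdio.h>

int main(void)
{
	const unsigned long queues_per_process = 1024;  /* "up to 1K queues per process" */
	const unsigned long max_processes      = 512;   /* implied by 512K total queues */
	const unsigned long mqd_slot_bytes     = 256;   /* 151-byte mqd, 256-byte aligned */
	unsigned long total = queues_per_process * max_processes * mqd_slot_bytes;

	printf("worst-case pinned mqd memory: %lu MB\n", total >> 20);  /* prints 128 */
	return 0;
}
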
>>>>>>>
>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the
>>>>>>> memory is pinned down at driver load, by request from userspace or by anything
>>>>>>> else.
>>>>>>>
>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>>>>>> questionable, everything else sounds reasonable.
>>>>>>>
>>>>>>> Christian.
>>>>>>
>>>>>> Most of the pin downs are done on device initialization.
>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>>>>>> memory, I think it is OK.
>>>>>> The runlists are also done on userspace queue creation/deletion, but we only
>>>>>> have 1 or 2 runlists per device, so it is not that bad.
>>>>>
>>>>> 2G local memory ? You can not assume anything on userside configuration some
>>>>> one might build an hsa computer with 512M and still expect a functioning
>>>>> desktop.
>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
>>>> Second, I would imagine we can build some protection around it, like
>>>> checking total local memory and limit number of queues based on some
>>>> percentage of that total local memory. So, if someone will have only
>>>> 512M, he will be able to open less queues.
>>>>
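
A minimal sketch of the kind of protection described above, assuming an
invented policy that caps pinned MQD memory at a fixed fraction of local
memory:

#include <stdint.h>

#define MQD_SLOT_BYTES  256u             /* one 256-byte aligned mqd per queue */
#define HW_QUEUE_LIMIT  (512u * 1024u)   /* 512K queues max */

/* Hypothetical policy: let mqds consume at most 1/16 of local memory. */
static uint32_t max_queues_for_vram(uint64_t vram_bytes)
{
	uint64_t budget = vram_bytes / 16;
	uint64_t n = budget / MQD_SLOT_BYTES;

	return n > HW_QUEUE_LIMIT ? HW_QUEUE_LIMIT : (uint32_t)n;
}

/* e.g. 512 MB of VRAM -> 32 MB budget -> 131072 queues; 2 GB -> the full 512K. */
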
>>>>
>>>>>
>>>>> I need to go look into what all this mqd is for, what it does and what it is
>>>>> about. But pinning is really bad and this is an issue with userspace command
>>>>> scheduling an issue that obviously AMD fails to take into account in design
>>>>> phase.
>>>> Maybe, but that is the H/W design non-the-less. We can't very well
>>>> change the H/W.
>>>
>>> You can not change the hardware but it is not an excuse to allow bad design to
>>> sneak in software to work around that. So i would rather penalize bad hardware
>>> design and have command submission in the kernel, until AMD fix its hardware to
>>> allow proper scheduling by the kernel and proper control by the kernel. 
>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
>> the kernel can not, IMO, scale well to 100K queues and 10K processes.
> 
> I am not advocating for having kernel decide down to the very last details. I am
> advocating for kernel being able to preempt at any time and be able to decrease
> or increase user queue priority so overall kernel is in charge of resources
> management and it can handle rogue client in proper fashion.
> 
>>
>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling
>>> capacity and once we get there we want the kernel to always be able to take over
>>> and do whatever it wants behind process back.
>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
>> direction is where AMD is now and where it is heading in the future.
>> That doesn't preclude the option to allow the kernel to take over and do
>> what he wants. I agree that in KV we have a problem where we can't do a
>> mid-wave preemption, so theoretically, a long running compute kernel can
>> make things messy, but in Carrizo, we will have this ability. Having
>> said that, it will only be through the CP H/W scheduling. So AMD is
>> _not_ going to abandon H/W scheduling. You can dislike it, but this is
>> the situation.
> 
> We was for the overall Linux community but maybe i should not pretend to talk
> for anyone interested in having a common standard.
> 
> My point is that current hardware do not have approriate hardware support for
> preemption hence, current hardware should use ioctl to schedule job and AMD
> should think a bit more on commiting to a design and handwaving any hardware
> short coming as something that can be work around in the software. The pinning
> thing is broken by design, only way to work around it is through kernel cmd
> queue scheduling that's a fact.

> 
> Once hardware support proper preemption and allows to move around/evict buffer
> use on behalf of userspace command queue then we can allow userspace scheduling
> but until then my personnal opinion is that it should not be allowed and that
> people will have to pay the ioctl price which i proved to be small, because
> really if you 100K queue each with one job, i would not expect that all those
> 100K job will complete in less time than it takes to execute an ioctl ie by
> even if you do not have the ioctl delay what ever you schedule will have to
> wait on previously submited jobs.

But Jerome, the core problem still remains in effect, even with your
suggestion. If an application, whether via a userspace queue or via an ioctl,
submits a long-running kernel, then the CPU in general can't stop the
GPU from running it. And if that kernel does while(1); then that's it,
game over, no matter how the work was submitted. So I don't really
see the big advantage in your proposal. Only on CZ can we stop such a wave
(and only through CP H/W scheduling). What you are saying is basically that
I shouldn't allow people to use compute on a Linux KV system because it
_may_ get the system stuck.

So even if I really wanted to, and I may agree with you theoretically on
that, I can't fulfill your desire to make the "kernel being able to
preempt at any time and be able to decrease or increase user queue
priority so overall kernel is in charge of resources management and it
can handle rogue client in proper fashion". Not on KV, and I guess not
on CZ either.

	Oded

> 
>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>>>>>> stuff there.
>>>>>>>>>
>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
>>>>>>>>> This would avoid crazy communication btw radeon and kfd.
>>>>>>>>>
>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as
>>>>>>>>> you want to use userspace address there is nothing to prevent userspace
>>>>>>>>> program from allocating things at address you reserve for lds, scratch,
>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
>>>>>>>>> address reserved for kernel (see kernel memory map).
>>>>>>>>>
>>>>>>>>> The whole business of locking performance counter for exclusive per process
>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
>>>>>>>>> space command ring.
>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver
>>>>>>>> for a network card, which is slower than a different network card. Doesn't
>>>>>>>> seem reasonable this situation is would happen. He would still put both the
>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
>>>>>>>> I don't think this is a valid reason to NACK the driver.
>>>>>
>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
>>>>> the performance ioctl.
>>>>>
>>>>> Again this is another argument for round trip to the kernel. As inside kernel you
>>>>> could properly do exclusive gpu counter access accross single user cmd buffer
>>>>> execution.
>>>>>
>>>>>>>>
>>>>>>>>> I only see issues with that. First and foremost i would
>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
>>>>>>>>> overhead that is measurable in any meaning full way against a simple
>>>>>>>>> function call. I know the userspace command ring is a big marketing features
>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues
>>>>>>>>> and for absolutely not upside afaict.
>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its
>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
>>>>>>>> which only puts a buffer on a ring and writes a doorbell ?
>>>>>
>>>>> I am saying the overhead is not that big and it probably will not matter in most
>>>>> usecase. For instance i did wrote the most useless kernel module that add two
>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
>>>>> ioctl is 13 times slower.
>>>>>
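
The measurement described above can be reproduced in spirit with a small
user-space timer; the device node and ioctl number below are placeholders,
not the interface from the linked tarball:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

#define ADDER_IOC_ADD  _IOWR('A', 0, long[2])   /* placeholder ioctl command */
#define ITERATIONS     1000000L

static long add_two(long a, long b) { return a + b; }

static double now_ns(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void)
{
	long args[2] = { 1, 2 };
	volatile long sink = 0;
	int fd = open("/dev/adder", O_RDWR);    /* hypothetical device node */
	double t0, t1;
	long i;

	t0 = now_ns();
	for (i = 0; i < ITERATIONS; i++)
		sink += add_two(args[0], args[1]);
	t1 = now_ns();
	printf("function call:    %.3f us/op\n", (t1 - t0) / ITERATIONS / 1e3);

	if (fd < 0)
		return 1;
	t0 = now_ns();
	for (i = 0; i < ITERATIONS; i++)
		ioctl(fd, ADDER_IOC_ADD, args);
	t1 = now_ns();
	printf("ioctl round trip: %.3f us/op\n", (t1 - t0) / ITERATIONS / 1e3);
	close(fd);
	return 0;
}
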
>>>>> Now if there is enough data that shows that a significant percentage of jobs
>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace
>>>>> scheduling does make sense. But so far all we have is handwaving with no data
>>>>> to support any facts.
>>>>>
>>>>>
>>>>> Now if we want to schedule from userspace than you will need to do something
>>>>> about the pinning, something that gives control to kernel so that kernel can
>>>>> unpin when it wants and move object when it wants no matter what userspace is
>>>>> doing.
>>>>>
>>>>>>>>>
> 


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 19:23                     ` Oded Gabbay
  (?)
@ 2014-07-21 19:28                       ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 19:28 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Andrew Lewycky, Michel Dänzer, linux-kernel, dri-devel,
	linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
> On 21/07/14 21:59, Jerome Glisse wrote:
> > On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
> >> On 21/07/14 21:14, Jerome Glisse wrote:
> >>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
> >>>> On 21/07/14 18:54, Jerome Glisse wrote:
> >>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
> >>>>>> On 21/07/14 16:39, Christian König wrote:
> >>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> >>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
> >>>>>>>>>>
> >>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
> >>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
> >>>>>>>>>> of the code.
> >>>>>>>>>>
> >>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> >>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
> >>>>>>>>>> There is no code going away or even modified between patches, only added.
> >>>>>>>>>>
> >>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> >>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> >>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
> >>>>>>>>>> generic hsa framework being implemented in the future and in that case, we
> >>>>>>>>>> will adjust amdkfd to work within that framework.
> >>>>>>>>>>
> >>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
> >>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> >>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
> >>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
> >>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
> >>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
> >>>>>>>>>> drivers.
> >>>>>>>>>>
> >>>>>>>>>> For people who like to review using git, the v2 patch set is located at:
> >>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> >>>>>>>>>>
> >>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
> >>>>>>>>>
> >>>>>>>>> So quick comments before i finish going over all patches. There is many
> >>>>>>>>> things that need more documentation espacialy as of right now there is
> >>>>>>>>> no userspace i can go look at.
> >>>>>>>> So quick comments on some of your questions but first of all, thanks for the
> >>>>>>>> time you dedicated to review the code.
> >>>>>>>>>
> >>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
> >>>>>>>>> no, that would need serious arguments for any hope of convincing me on
> >>>>>>>>> that side.
> >>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
> >>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
> >>>>>>>> is it still a show stopper ?
> >>>>>>>>
> >>>>>>>> The kernel objects are:
> >>>>>>>> - pipelines (4 per device)
> >>>>>>>> - mqd per hiq (only 1 per device)
> >>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
> >>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
> >>>>>>>> 256 alignment. So total *possible* memory is 128MB
> >>>>>>>> - kernel queue (only 1 per device)
> >>>>>>>> - fence address for kernel queue
> >>>>>>>> - runlists for the CP (1 or 2 per device)
> >>>>>>>
> >>>>>>> The main questions here are if it's avoid able to pin down the memory and if the
> >>>>>>> memory is pinned down at driver load, by request from userspace or by anything
> >>>>>>> else.
> >>>>>>>
> >>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
> >>>>>>> questionable, everything else sounds reasonable.
> >>>>>>>
> >>>>>>> Christian.
> >>>>>>
> >>>>>> Most of the pin downs are done on device initialization.
> >>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
> >>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
> >>>>>> memory, I think it is OK.
> >>>>>> The runlists are also done on userspace queue creation/deletion, but we only
> >>>>>> have 1 or 2 runlists per device, so it is not that bad.
> >>>>>
> >>>>> 2G local memory ? You can not assume anything on userside configuration some
> >>>>> one might build an hsa computer with 512M and still expect a functioning
> >>>>> desktop.
> >>>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
> >>>> Second, I would imagine we can build some protection around it, like
> >>>> checking total local memory and limit number of queues based on some
> >>>> percentage of that total local memory. So, if someone will have only
> >>>> 512M, he will be able to open less queues.
> >>>>
> >>>>
> >>>>>
> >>>>> I need to go look into what all this mqd is for, what it does and what it is
> >>>>> about. But pinning is really bad and this is an issue with userspace command
> >>>>> scheduling an issue that obviously AMD fails to take into account in design
> >>>>> phase.
> >>>> Maybe, but that is the H/W design non-the-less. We can't very well
> >>>> change the H/W.
> >>>
> >>> You can not change the hardware but it is not an excuse to allow bad design to
> >>> sneak in software to work around that. So i would rather penalize bad hardware
> >>> design and have command submission in the kernel, until AMD fix its hardware to
> >>> allow proper scheduling by the kernel and proper control by the kernel. 
> >> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
> >> the kernel can not, IMO, scale well to 100K queues and 10K processes.
> > 
> > I am not advocating for having kernel decide down to the very last details. I am
> > advocating for kernel being able to preempt at any time and be able to decrease
> > or increase user queue priority so overall kernel is in charge of resources
> > management and it can handle rogue client in proper fashion.
> > 
> >>
> >>> Because really where we want to go is having GPU closer to a CPU in term of scheduling
> >>> capacity and once we get there we want the kernel to always be able to take over
> >>> and do whatever it wants behind process back.
> >> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
> >> direction is where AMD is now and where it is heading in the future.
> >> That doesn't preclude the option to allow the kernel to take over and do
> >> what he wants. I agree that in KV we have a problem where we can't do a
> >> mid-wave preemption, so theoretically, a long running compute kernel can
> >> make things messy, but in Carrizo, we will have this ability. Having
> >> said that, it will only be through the CP H/W scheduling. So AMD is
> >> _not_ going to abandon H/W scheduling. You can dislike it, but this is
> >> the situation.
> > 
> > We was for the overall Linux community but maybe i should not pretend to talk
> > for anyone interested in having a common standard.
> > 
> > My point is that current hardware do not have approriate hardware support for
> > preemption hence, current hardware should use ioctl to schedule job and AMD
> > should think a bit more on commiting to a design and handwaving any hardware
> > short coming as something that can be work around in the software. The pinning
> > thing is broken by design, only way to work around it is through kernel cmd
> > queue scheduling that's a fact.
> 
> > 
> > Once hardware support proper preemption and allows to move around/evict buffer
> > use on behalf of userspace command queue then we can allow userspace scheduling
> > but until then my personnal opinion is that it should not be allowed and that
> > people will have to pay the ioctl price which i proved to be small, because
> > really if you 100K queue each with one job, i would not expect that all those
> > 100K job will complete in less time than it takes to execute an ioctl ie by
> > even if you do not have the ioctl delay what ever you schedule will have to
> > wait on previously submited jobs.
> 
> But Jerome, the core problem still remains in effect, even with your
> suggestion. If an application, whether via a userspace queue or via an ioctl,
> submits a long-running kernel, then the CPU in general can't stop the
> GPU from running it. And if that kernel does while(1); then that's it,
> game over, no matter how the work was submitted. So I don't really
> see the big advantage in your proposal. Only on CZ can we stop such a wave
> (and only through CP H/W scheduling). What you are saying is basically that
> I shouldn't allow people to use compute on a Linux KV system because it
> _may_ get the system stuck.
> 
> So even if I really wanted to, and I may agree with you theoretically on
> that, I can't fulfill your desire to make the "kernel being able to
> preempt at any time and be able to decrease or increase user queue
> priority so overall kernel is in charge of resources management and it
> can handle rogue client in proper fashion". Not on KV, and I guess not
> on CZ either.
> 
> 	Oded

I do understand that, but using a kernel ioctl provides the same kind of control
as we have now, i.e. we can bind/unbind buffers on a per-command-buffer-submission
basis, just like with the current graphics or compute stuff.

Yes, current graphics and compute work can launch a while(1)-style kernel and never
return, and yes, currently we have nothing against that, but we should, and the
solution would be simple: just kill the GPU thread.
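
A minimal sketch of the per-submission control that an ioctl path gives the
kernel (all names are hypothetical stubs, not the actual radeon CS interface):
the buffers referenced by a command buffer are validated and pinned only for
the lifetime of that one submission, and the kernel can refuse or throttle
work at the entry point.

#include <stdint.h>

struct submission {
	uint64_t  ib_gpu_addr;     /* command buffer address */
	uint32_t  ib_size_dw;      /* command buffer size in dwords */
	uint32_t  num_buffers;     /* buffers referenced by this command buffer */
	uint64_t *buffer_handles;
};

/* Hypothetical stubs standing in for driver-specific machinery. */
static int  pin_and_validate(const struct submission *s) { (void)s; return 0; }
static void unpin_all(const struct submission *s)        { (void)s; }
static int  kick_ring(const struct submission *s)        { (void)s; return 0; }
static void wait_for_fence(const struct submission *s)   { (void)s; }

int submit_ioctl(const struct submission *s)
{
	int r = pin_and_validate(s);    /* kernel may refuse the work right here */
	if (r)
		return r;

	r = kick_ring(s);
	if (r) {
		unpin_all(s);
		return r;
	}

	wait_for_fence(s);              /* in practice, a fence callback */
	unpin_all(s);                   /* buffers are movable again */
	return 0;
}
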

> 
> > 
> >>>
> >>>>>>>
> >>>>>>>>>
> >>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
> >>>>>>>>> stuff there.
> >>>>>>>>>
> >>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
> >>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
> >>>>>>>>> This would avoid crazy communication btw radeon and kfd.
> >>>>>>>>>
> >>>>>>>>> The whole aperture business needs some serious explanation. Especialy as
> >>>>>>>>> you want to use userspace address there is nothing to prevent userspace
> >>>>>>>>> program from allocating things at address you reserve for lds, scratch,
> >>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
> >>>>>>>>> address reserved for kernel (see kernel memory map).
> >>>>>>>>>
> >>>>>>>>> The whole business of locking performance counter for exclusive per process
> >>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
> >>>>>>>>> space command ring.
> >>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
> >>>>>>>> find it analogous to a situation where a network maintainer nacking a driver
> >>>>>>>> for a network card, which is slower than a different network card. Doesn't
> >>>>>>>> seem reasonable this situation is would happen. He would still put both the
> >>>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
> >>>>>>>> I don't think this is a valid reason to NACK the driver.
> >>>>>
> >>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
> >>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
> >>>>> the performance ioctl.
> >>>>>
> >>>>> Again this is another argument for round trip to the kernel. As inside kernel you
> >>>>> could properly do exclusive gpu counter access accross single user cmd buffer
> >>>>> execution.
> >>>>>
> >>>>>>>>
> >>>>>>>>> I only see issues with that. First and foremost i would
> >>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
> >>>>>>>>> overhead that is measurable in any meaning full way against a simple
> >>>>>>>>> function call. I know the userspace command ring is a big marketing features
> >>>>>>>>> that please ignorant userspace programmer. But really this only brings issues
> >>>>>>>>> and for absolutely not upside afaict.
> >>>>>>>> Really ? You think that doing a context switch to kernel space, with all its
> >>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
> >>>>>>>> which only puts a buffer on a ring and writes a doorbell ?
> >>>>>
> >>>>> I am saying the overhead is not that big and it probably will not matter in most
> >>>>> usecase. For instance i did wrote the most useless kernel module that add two
> >>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
> >>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
> >>>>> ioctl is 13 times slower.
> >>>>>
> >>>>> Now if there is enough data that shows that a significant percentage of jobs
> >>>>> submited to the GPU will take less that 0.35microsecond then yes userspace
> >>>>> scheduling does make sense. But so far all we have is handwaving with no data
> >>>>> to support any facts.
> >>>>>
> >>>>>
> >>>>> Now if we want to schedule from userspace than you will need to do something
> >>>>> about the pinning, something that gives control to kernel so that kernel can
> >>>>> unpin when it wants and move object when it wants no matter what userspace is
> >>>>> doing.
> >>>>>
> >>>>>>>>>
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > 
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 19:28                       ` Jerome Glisse
  (?)
@ 2014-07-21 21:56                         ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 21:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Lewycky, Michel Dänzer, linux-kernel, dri-devel,
	linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On 21/07/14 22:28, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
>> On 21/07/14 21:59, Jerome Glisse wrote:
>>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
>>>> On 21/07/14 21:14, Jerome Glisse wrote:
>>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
>>>>>> On 21/07/14 18:54, Jerome Glisse wrote:
>>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>>>>>>>> On 21/07/14 16:39, Christian König wrote:
>>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>>>>>>>
>>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>>>>>>>> of the code.
>>>>>>>>>>>>
>>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>>>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>>>>>>>
>>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>>>>>>>> will adjust amdkfd to work within that framework.
>>>>>>>>>>>>
>>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>>>>>>>> drivers.
>>>>>>>>>>>>
>>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>>>>>>>
>>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>>>>>>>
>>>>>>>>>>> So quick comments before i finish going over all patches. There is many
>>>>>>>>>>> things that need more documentation espacialy as of right now there is
>>>>>>>>>>> no userspace i can go look at.
>>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the
>>>>>>>>>> time you dedicated to review the code.
>>>>>>>>>>>
>>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>>>>>>>> that side.
>>>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>>>>>>>> is it still a show stopper ?
>>>>>>>>>>
>>>>>>>>>> The kernel objects are:
>>>>>>>>>> - pipelines (4 per device)
>>>>>>>>>> - mqd per hiq (only 1 per device)
>>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
>>>>>>>>>> 256 alignment. So total *possible* memory is 128MB
>>>>>>>>>> - kernel queue (only 1 per device)
>>>>>>>>>> - fence address for kernel queue
>>>>>>>>>> - runlists for the CP (1 or 2 per device)
>>>>>>>>>
>>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the
>>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything
>>>>>>>>> else.
>>>>>>>>>
>>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>>>>>>>> questionable, everything else sounds reasonable.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Most of the pin downs are done on device initialization.
>>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
>>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>>>>>>>> memory, I think it is OK.
>>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only
>>>>>>>> have 1 or 2 runlists per device, so it is not that bad.
>>>>>>>
>>>>>>> 2G local memory ? You can not assume anything on userside configuration some
>>>>>>> one might build an hsa computer with 512M and still expect a functioning
>>>>>>> desktop.
>>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
>>>>>> Second, I would imagine we can build some protection around it, like
>>>>>> checking total local memory and limit number of queues based on some
>>>>>> percentage of that total local memory. So, if someone will have only
>>>>>> 512M, he will be able to open less queues.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I need to go look into what all this mqd is for, what it does and what it is
>>>>>>> about. But pinning is really bad and this is an issue with userspace command
>>>>>>> scheduling an issue that obviously AMD fails to take into account in design
>>>>>>> phase.
>>>>>> Maybe, but that is the H/W design non-the-less. We can't very well
>>>>>> change the H/W.
>>>>>
>>>>> You can not change the hardware but it is not an excuse to allow bad design to
>>>>> sneak in software to work around that. So i would rather penalize bad hardware
>>>>> design and have command submission in the kernel, until AMD fix its hardware to
>>>>> allow proper scheduling by the kernel and proper control by the kernel. 
>>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
>>>> the kernel can not, IMO, scale well to 100K queues and 10K processes.
>>>
>>> I am not advocating for having kernel decide down to the very last details. I am
>>> advocating for kernel being able to preempt at any time and be able to decrease
>>> or increase user queue priority so overall kernel is in charge of resources
>>> management and it can handle rogue client in proper fashion.
>>>
>>>>
>>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling
>>>>> capacity and once we get there we want the kernel to always be able to take over
>>>>> and do whatever it wants behind process back.
>>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
>>>> direction is where AMD is now and where it is heading in the future.
>>>> That doesn't preclude the option to allow the kernel to take over and do
>>>> what he wants. I agree that in KV we have a problem where we can't do a
>>>> mid-wave preemption, so theoretically, a long running compute kernel can
>>>> make things messy, but in Carrizo, we will have this ability. Having
>>>> said that, it will only be through the CP H/W scheduling. So AMD is
>>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is
>>>> the situation.
>>>
>>> We was for the overall Linux community but maybe i should not pretend to talk
>>> for anyone interested in having a common standard.
>>>
>>> My point is that current hardware do not have approriate hardware support for
>>> preemption hence, current hardware should use ioctl to schedule job and AMD
>>> should think a bit more on commiting to a design and handwaving any hardware
>>> short coming as something that can be work around in the software. The pinning
>>> thing is broken by design, only way to work around it is through kernel cmd
>>> queue scheduling that's a fact.
>>
>>>
>>> Once hardware support proper preemption and allows to move around/evict buffer
>>> use on behalf of userspace command queue then we can allow userspace scheduling
>>> but until then my personnal opinion is that it should not be allowed and that
>>> people will have to pay the ioctl price which i proved to be small, because
>>> really if you 100K queue each with one job, i would not expect that all those
>>> 100K job will complete in less time than it takes to execute an ioctl ie by
>>> even if you do not have the ioctl delay what ever you schedule will have to
>>> wait on previously submited jobs.
>>
>> But Jerome, the core problem still remains in effect, even with your
>> suggestion. If an application, either via userspace queue or via ioctl,
>> submits a long-running kernel, than the CPU in general can't stop the
>> GPU from running it. And if that kernel does while(1); than that's it,
>> game's over, and no matter how you submitted the work. So I don't really
>> see the big advantage in your proposal. Only in CZ we can stop this wave
>> (by CP H/W scheduling only). What are you saying is basically I won't
>> allow people to use compute on Linux KV system because it _may_ get the
>> system stuck.
>>
>> So even if I really wanted to, and I may agree with you theoretically on
>> that, I can't fulfill your desire to make the "kernel being able to
>> preempt at any time and be able to decrease or increase user queue
>> priority so overall kernel is in charge of resources management and it
>> can handle rogue client in proper fashion". Not in KV, and I guess not
>> in CZ as well.
>>
>> 	Oded
> 
> I do understand that, but using a kernel ioctl provides the same kind of control
> as we have now, i.e. we can bind/unbind buffers on a per-command-buffer-submission
> basis, just like with the current graphics or compute stuff.
> 
> Yes, current graphics and compute work can launch a while(1) job and never
> return, and yes, we currently have nothing to guard against that, but we should,
> and the solution would be simple: just kill the GPU thread.
> 
OK, so in that case, the kernel can simply unmap all the queues by
writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are
userspace queues, they will no longer be mapped to the internal CP scheduler.
Does that satisfy the level of kernel control you want?

	Oded
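
As an illustration of that path, a rough kernel-side sketch follows. The packet
layout, opcode value and kq_* helpers are placeholders, not the actual CIK PM4
definitions or the amdkfd kernel-queue API; the idea is that the kernel builds an
UNMAP_QUEUES packet and submits it through the HIQ to pull all user queues away
from the CP scheduler:

#include <linux/types.h>

#define PM4_OP_UNMAP_QUEUES   0xA3   /* placeholder opcode               */
#define UNMAP_FILTER_ALL      0x2    /* placeholder "all queues" filter  */

struct pm4_unmap_queues {            /* hypothetical packet layout */
	u32 header;                  /* opcode + dword count       */
	u32 filter;                  /* which queues to preempt    */
	u32 reserved[2];
};

/* Assumed helpers around the kernel queue (HIQ). */
int kq_acquire_packet_buffer(void *hiq, unsigned int dwords, u32 **buf);
void kq_submit_packet(void *hiq);

static int unmap_all_user_queues(void *hiq)
{
	struct pm4_unmap_queues *pkt;
	u32 *buf;
	int r;

	r = kq_acquire_packet_buffer(hiq, sizeof(*pkt) / sizeof(u32), &buf);
	if (r)
		return r;

	pkt = (struct pm4_unmap_queues *)buf;
	pkt->header = PM4_OP_UNMAP_QUEUES | ((sizeof(*pkt) / sizeof(u32)) << 16);
	pkt->filter = UNMAP_FILTER_ALL;  /* pull every mapped queue off the CP */
	pkt->reserved[0] = 0;
	pkt->reserved[1] = 0;

	/* Ring the HIQ doorbell; the CP stops scheduling the user queues. */
	kq_submit_packet(hiq);
	return 0;
}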
>>
>>>
>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>>>>>>>> stuff there.
>>>>>>>>>>>
>>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
>>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
>>>>>>>>>>> This would avoid crazy communication btw radeon and kfd.
>>>>>>>>>>>
>>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as
>>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace
>>>>>>>>>>> program from allocating things at address you reserve for lds, scratch,
>>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
>>>>>>>>>>> address reserved for kernel (see kernel memory map).
>>>>>>>>>>>
>>>>>>>>>>> The whole business of locking performance counter for exclusive per process
>>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
>>>>>>>>>>> space command ring.
>>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver
>>>>>>>>>> for a network card, which is slower than a different network card. Doesn't
>>>>>>>>>> seem reasonable this situation is would happen. He would still put both the
>>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
>>>>>>>>>> I don't think this is a valid reason to NACK the driver.
>>>>>>>
>>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
>>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
>>>>>>> the performance ioctl.
>>>>>>>
>>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you
>>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer
>>>>>>> execution.
>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I only see issues with that. First and foremost i would
>>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
>>>>>>>>>>> overhead that is measurable in any meaning full way against a simple
>>>>>>>>>>> function call. I know the userspace command ring is a big marketing features
>>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues
>>>>>>>>>>> and for absolutely not upside afaict.
>>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its
>>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
>>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ?
>>>>>>>
>>>>>>> I am saying the overhead is not that big and it probably will not matter in most
>>>>>>> usecase. For instance i did wrote the most useless kernel module that add two
>>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
>>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
>>>>>>> ioctl is 13 times slower.
>>>>>>>
>>>>>>> Now if there is enough data that shows that a significant percentage of jobs
>>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace
>>>>>>> scheduling does make sense. But so far all we have is handwaving with no data
>>>>>>> to support any facts.
>>>>>>>
>>>>>>>
>>>>>>> Now if we want to schedule from userspace than you will need to do something
>>>>>>> about the pinning, something that gives control to kernel so that kernel can
>>>>>>> unpin when it wants and move object when it wants no matter what userspace is
>>>>>>> doing.
>>>>>>>
>>>>>>>>>>>
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>
>>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-21 21:56                         ` Oded Gabbay
  0 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-21 21:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Lewycky, Michel Dänzer, linux-kernel, dri-devel,
	linux-mm, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On 21/07/14 22:28, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
>> On 21/07/14 21:59, Jerome Glisse wrote:
>>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
>>>> On 21/07/14 21:14, Jerome Glisse wrote:
>>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
>>>>>> On 21/07/14 18:54, Jerome Glisse wrote:
>>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>>>>>>>> On 21/07/14 16:39, Christian König wrote:
>>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>>>>>>>
>>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>>>>>>>> of the code.
>>>>>>>>>>>>
>>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>>>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>>>>>>>
>>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>>>>>>>> will adjust amdkfd to work within that framework.
>>>>>>>>>>>>
>>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>>>>>>>> drivers.
>>>>>>>>>>>>
>>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>>>>>>>
>>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>>>>>>>
>>>>>>>>>>> So quick comments before i finish going over all patches. There is many
>>>>>>>>>>> things that need more documentation espacialy as of right now there is
>>>>>>>>>>> no userspace i can go look at.
>>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the
>>>>>>>>>> time you dedicated to review the code.
>>>>>>>>>>>
>>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>>>>>>>> that side.
>>>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>>>>>>>> is it still a show stopper ?
>>>>>>>>>>
>>>>>>>>>> The kernel objects are:
>>>>>>>>>> - pipelines (4 per device)
>>>>>>>>>> - mqd per hiq (only 1 per device)
>>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
>>>>>>>>>> 256 alignment. So total *possible* memory is 128MB
>>>>>>>>>> - kernel queue (only 1 per device)
>>>>>>>>>> - fence address for kernel queue
>>>>>>>>>> - runlists for the CP (1 or 2 per device)
>>>>>>>>>
>>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the
>>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything
>>>>>>>>> else.
>>>>>>>>>
>>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>>>>>>>> questionable, everything else sounds reasonable.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>
>>>>>>>> Most of the pin downs are done on device initialization.
>>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
>>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>>>>>>>> memory, I think it is OK.
>>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only
>>>>>>>> have 1 or 2 runlists per device, so it is not that bad.
>>>>>>>
>>>>>>> 2G local memory ? You can not assume anything on userside configuration some
>>>>>>> one might build an hsa computer with 512M and still expect a functioning
>>>>>>> desktop.
>>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
>>>>>> Second, I would imagine we can build some protection around it, like
>>>>>> checking total local memory and limit number of queues based on some
>>>>>> percentage of that total local memory. So, if someone will have only
>>>>>> 512M, he will be able to open less queues.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I need to go look into what all this mqd is for, what it does and what it is
>>>>>>> about. But pinning is really bad and this is an issue with userspace command
>>>>>>> scheduling an issue that obviously AMD fails to take into account in design
>>>>>>> phase.
>>>>>> Maybe, but that is the H/W design non-the-less. We can't very well
>>>>>> change the H/W.
>>>>>
>>>>> You can not change the hardware but it is not an excuse to allow bad design to
>>>>> sneak in software to work around that. So i would rather penalize bad hardware
>>>>> design and have command submission in the kernel, until AMD fix its hardware to
>>>>> allow proper scheduling by the kernel and proper control by the kernel. 
>>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
>>>> the kernel can not, IMO, scale well to 100K queues and 10K processes.
>>>
>>> I am not advocating for having kernel decide down to the very last details. I am
>>> advocating for kernel being able to preempt at any time and be able to decrease
>>> or increase user queue priority so overall kernel is in charge of resources
>>> management and it can handle rogue client in proper fashion.
>>>
>>>>
>>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling
>>>>> capacity and once we get there we want the kernel to always be able to take over
>>>>> and do whatever it wants behind process back.
>>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
>>>> direction is where AMD is now and where it is heading in the future.
>>>> That doesn't preclude the option to allow the kernel to take over and do
>>>> what he wants. I agree that in KV we have a problem where we can't do a
>>>> mid-wave preemption, so theoretically, a long running compute kernel can
>>>> make things messy, but in Carrizo, we will have this ability. Having
>>>> said that, it will only be through the CP H/W scheduling. So AMD is
>>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is
>>>> the situation.
>>>
>>> We was for the overall Linux community but maybe i should not pretend to talk
>>> for anyone interested in having a common standard.
>>>
>>> My point is that current hardware do not have approriate hardware support for
>>> preemption hence, current hardware should use ioctl to schedule job and AMD
>>> should think a bit more on commiting to a design and handwaving any hardware
>>> short coming as something that can be work around in the software. The pinning
>>> thing is broken by design, only way to work around it is through kernel cmd
>>> queue scheduling that's a fact.
>>
>>>
>>> Once hardware support proper preemption and allows to move around/evict buffer
>>> use on behalf of userspace command queue then we can allow userspace scheduling
>>> but until then my personnal opinion is that it should not be allowed and that
>>> people will have to pay the ioctl price which i proved to be small, because
>>> really if you 100K queue each with one job, i would not expect that all those
>>> 100K job will complete in less time than it takes to execute an ioctl ie by
>>> even if you do not have the ioctl delay what ever you schedule will have to
>>> wait on previously submited jobs.
>>
>> But Jerome, the core problem still remains in effect, even with your
>> suggestion. If an application, either via userspace queue or via ioctl,
>> submits a long-running kernel, than the CPU in general can't stop the
>> GPU from running it. And if that kernel does while(1); than that's it,
>> game's over, and no matter how you submitted the work. So I don't really
>> see the big advantage in your proposal. Only in CZ we can stop this wave
>> (by CP H/W scheduling only). What are you saying is basically I won't
>> allow people to use compute on Linux KV system because it _may_ get the
>> system stuck.
>>
>> So even if I really wanted to, and I may agree with you theoretically on
>> that, I can't fulfill your desire to make the "kernel being able to
>> preempt at any time and be able to decrease or increase user queue
>> priority so overall kernel is in charge of resources management and it
>> can handle rogue client in proper fashion". Not in KV, and I guess not
>> in CZ as well.
>>
>> 	Oded
> 
> I do understand that but using kernel ioctl provide the same kind of control
> as we have now ie we can bind/unbind buffer on per command buffer submission
> basis, just like with current graphic or compute stuff.
> 
> Yes current graphic and compute stuff can launch a while and never return back
> and yes currently we have nothing against that but we should and solution would
> be simple just kill the gpu thread.
> 
OK, so in that case, the kernel can simple unmap all the queues by
simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are
userspace, they will not be mapped to the internal CP scheduler.
Does that satisfy the kernel control level you want ?

	Oded
>>
>>>
>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>>>>>>>> stuff there.
>>>>>>>>>>>
>>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
>>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
>>>>>>>>>>> This would avoid crazy communication btw radeon and kfd.
>>>>>>>>>>>
>>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as
>>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace
>>>>>>>>>>> program from allocating things at address you reserve for lds, scratch,
>>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
>>>>>>>>>>> address reserved for kernel (see kernel memory map).
>>>>>>>>>>>
>>>>>>>>>>> The whole business of locking performance counter for exclusive per process
>>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
>>>>>>>>>>> space command ring.
>>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver
>>>>>>>>>> for a network card, which is slower than a different network card. Doesn't
>>>>>>>>>> seem reasonable this situation is would happen. He would still put both the
>>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
>>>>>>>>>> I don't think this is a valid reason to NACK the driver.
>>>>>>>
>>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
>>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
>>>>>>> the performance ioctl.
>>>>>>>
>>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you
>>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer
>>>>>>> execution.
>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I only see issues with that. First and foremost i would
>>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
>>>>>>>>>>> overhead that is measurable in any meaning full way against a simple
>>>>>>>>>>> function call. I know the userspace command ring is a big marketing features
>>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues
>>>>>>>>>>> and for absolutely not upside afaict.
>>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its
>>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
>>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ?
>>>>>>>
>>>>>>> I am saying the overhead is not that big and it probably will not matter in most
>>>>>>> usecase. For instance i did wrote the most useless kernel module that add two
>>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
>>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
>>>>>>> ioctl is 13 times slower.
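For reference, a module of the shape described above can be as small as the
following sketch (illustrative only, this is not the contents of the adder.tar
linked above):

/* Trivial "add two numbers" ioctl, for timing kernel round trips. */
#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/types.h>
#include <linux/ioctl.h>

struct adder_args {
        __s64 a;
        __s64 b;
        __s64 result;
};

#define ADDER_IOC_ADD _IOWR('A', 0, struct adder_args)

static long adder_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
        struct adder_args args;

        if (cmd != ADDER_IOC_ADD)
                return -ENOTTY;
        if (copy_from_user(&args, (void __user *)arg, sizeof(args)))
                return -EFAULT;
        args.result = args.a + args.b;
        if (copy_to_user((void __user *)arg, &args, sizeof(args)))
                return -EFAULT;
        return 0;
}

static const struct file_operations adder_fops = {
        .owner          = THIS_MODULE,
        .unlocked_ioctl = adder_ioctl,
};

static struct miscdevice adder_dev = {
        .minor = MISC_DYNAMIC_MINOR,
        .name  = "adder",
        .fops  = &adder_fops,
};

static int __init adder_init(void)
{
        return misc_register(&adder_dev);
}

static void __exit adder_exit(void)
{
        misc_deregister(&adder_dev);
}

module_init(adder_init);
module_exit(adder_exit);
MODULE_LICENSE("GPL");

A userspace benchmark just opens the device node and calls
ioctl(fd, ADDER_IOC_ADD, &args) in a tight loop; the ~0.35 microsecond figure
quoted above is that per-call cost.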
>>>>>>>
>>>>>>> Now if there is enough data that shows that a significant percentage of jobs
>>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace
>>>>>>> scheduling does make sense. But so far all we have is handwaving with no data
>>>>>>> to support any facts.
>>>>>>>
>>>>>>>
>>>>>>> Now if we want to schedule from userspace than you will need to do something
>>>>>>> about the pinning, something that gives control to kernel so that kernel can
>>>>>>> unpin when it wants and move object when it wants no matter what userspace is
>>>>>>> doing.
>>>>>>>
>>>>>>>>>>>
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>
>>



* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 21:56                         ` Oded Gabbay
  (?)
@ 2014-07-21 23:05                           ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 23:05 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Andrew Lewycky, linux-mm, Michel Dänzer, linux-kernel,
	dri-devel, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote:
> On 21/07/14 22:28, Jerome Glisse wrote:
> > On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
> >> On 21/07/14 21:59, Jerome Glisse wrote:
> >>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
> >>>> On 21/07/14 21:14, Jerome Glisse wrote:
> >>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
> >>>>>> On 21/07/14 18:54, Jerome Glisse wrote:
> >>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
> >>>>>>>> On 21/07/14 16:39, Christian König wrote:
> >>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> >>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
> >>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
> >>>>>>>>>>>> of the code.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> >>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
> >>>>>>>>>>>> There is no code going away or even modified between patches, only added.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> >>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> >>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
> >>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we
> >>>>>>>>>>>> will adjust amdkfd to work within that framework.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
> >>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> >>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
> >>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
> >>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
> >>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
> >>>>>>>>>>>> drivers.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at:
> >>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> >>>>>>>>>>>>
> >>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
> >>>>>>>>>>>
> >>>>>>>>>>> So quick comments before i finish going over all patches. There is many
> >>>>>>>>>>> things that need more documentation espacialy as of right now there is
> >>>>>>>>>>> no userspace i can go look at.
> >>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the
> >>>>>>>>>> time you dedicated to review the code.
> >>>>>>>>>>>
> >>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
> >>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on
> >>>>>>>>>>> that side.
> >>>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
> >>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
> >>>>>>>>>> is it still a show stopper ?
> >>>>>>>>>>
> >>>>>>>>>> The kernel objects are:
> >>>>>>>>>> - pipelines (4 per device)
> >>>>>>>>>> - mqd per hiq (only 1 per device)
> >>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
> >>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
> >>>>>>>>>> 256 alignment. So total *possible* memory is 128MB
> >>>>>>>>>> - kernel queue (only 1 per device)
> >>>>>>>>>> - fence address for kernel queue
> >>>>>>>>>> - runlists for the CP (1 or 2 per device)
> >>>>>>>>>
> >>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the
> >>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything
> >>>>>>>>> else.
> >>>>>>>>>
> >>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
> >>>>>>>>> questionable, everything else sounds reasonable.
> >>>>>>>>>
> >>>>>>>>> Christian.
> >>>>>>>>
> >>>>>>>> Most of the pin downs are done on device initialization.
> >>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
> >>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
> >>>>>>>> memory, I think it is OK.
> >>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only
> >>>>>>>> have 1 or 2 runlists per device, so it is not that bad.
> >>>>>>>
> >>>>>>> 2G local memory ? You can not assume anything on userside configuration some
> >>>>>>> one might build an hsa computer with 512M and still expect a functioning
> >>>>>>> desktop.
> >>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
> >>>>>> Second, I would imagine we can build some protection around it, like
> >>>>>> checking total local memory and limit number of queues based on some
> >>>>>> percentage of that total local memory. So, if someone will have only
> >>>>>> 512M, he will be able to open less queues.
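(A sketch of the kind of protection being described here; the percentage and
names are made up for illustration, not part of the patch set:)

#include <linux/kernel.h>
#include <linux/types.h>

#define MQD_ALLOC_SIZE      256          /* bytes reserved per MQD */
#define QUEUE_MEM_PERCENT   5            /* assumed budget: 5% of local memory */
#define MAX_QUEUES_HARD_CAP (512 * 1024) /* current upper bound on KV */

/* Scale the number of user queues with the local memory actually present,
 * instead of assuming a 2G carve-out is always there. */
static unsigned int kfd_max_user_queues(u64 local_mem_bytes)
{
        u64 budget = local_mem_bytes * QUEUE_MEM_PERCENT / 100;

        return min_t(u64, budget / MQD_ALLOC_SIZE, MAX_QUEUES_HARD_CAP);
}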
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> I need to go look into what all this mqd is for, what it does and what it is
> >>>>>>> about. But pinning is really bad and this is an issue with userspace command
> >>>>>>> scheduling an issue that obviously AMD fails to take into account in design
> >>>>>>> phase.
> >>>>>> Maybe, but that is the H/W design non-the-less. We can't very well
> >>>>>> change the H/W.
> >>>>>
> >>>>> You can not change the hardware but it is not an excuse to allow bad design to
> >>>>> sneak in software to work around that. So i would rather penalize bad hardware
> >>>>> design and have command submission in the kernel, until AMD fix its hardware to
> >>>>> allow proper scheduling by the kernel and proper control by the kernel. 
> >>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
> >>>> the kernel can not, IMO, scale well to 100K queues and 10K processes.
> >>>
> >>> I am not advocating for having kernel decide down to the very last details. I am
> >>> advocating for kernel being able to preempt at any time and be able to decrease
> >>> or increase user queue priority so overall kernel is in charge of resources
> >>> management and it can handle rogue client in proper fashion.
> >>>
> >>>>
> >>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling
> >>>>> capacity and once we get there we want the kernel to always be able to take over
> >>>>> and do whatever it wants behind process back.
> >>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
> >>>> direction is where AMD is now and where it is heading in the future.
> >>>> That doesn't preclude the option to allow the kernel to take over and do
> >>>> what he wants. I agree that in KV we have a problem where we can't do a
> >>>> mid-wave preemption, so theoretically, a long running compute kernel can
> >>>> make things messy, but in Carrizo, we will have this ability. Having
> >>>> said that, it will only be through the CP H/W scheduling. So AMD is
> >>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is
> >>>> the situation.
> >>>
> >>> We was for the overall Linux community but maybe i should not pretend to talk
> >>> for anyone interested in having a common standard.
> >>>
> >>> My point is that current hardware do not have approriate hardware support for
> >>> preemption hence, current hardware should use ioctl to schedule job and AMD
> >>> should think a bit more on commiting to a design and handwaving any hardware
> >>> short coming as something that can be work around in the software. The pinning
> >>> thing is broken by design, only way to work around it is through kernel cmd
> >>> queue scheduling that's a fact.
> >>
> >>>
> >>> Once hardware support proper preemption and allows to move around/evict buffer
> >>> use on behalf of userspace command queue then we can allow userspace scheduling
> >>> but until then my personnal opinion is that it should not be allowed and that
> >>> people will have to pay the ioctl price which i proved to be small, because
> >>> really if you 100K queue each with one job, i would not expect that all those
> >>> 100K job will complete in less time than it takes to execute an ioctl ie by
> >>> even if you do not have the ioctl delay what ever you schedule will have to
> >>> wait on previously submited jobs.
> >>
> >> But Jerome, the core problem still remains in effect, even with your
> >> suggestion. If an application, either via userspace queue or via ioctl,
> >> submits a long-running kernel, than the CPU in general can't stop the
> >> GPU from running it. And if that kernel does while(1); than that's it,
> >> game's over, and no matter how you submitted the work. So I don't really
> >> see the big advantage in your proposal. Only in CZ we can stop this wave
> >> (by CP H/W scheduling only). What are you saying is basically I won't
> >> allow people to use compute on Linux KV system because it _may_ get the
> >> system stuck.
> >>
> >> So even if I really wanted to, and I may agree with you theoretically on
> >> that, I can't fulfill your desire to make the "kernel being able to
> >> preempt at any time and be able to decrease or increase user queue
> >> priority so overall kernel is in charge of resources management and it
> >> can handle rogue client in proper fashion". Not in KV, and I guess not
> >> in CZ as well.
> >>
> >> 	Oded
> > 
> > I do understand that but using kernel ioctl provide the same kind of control
> > as we have now ie we can bind/unbind buffer on per command buffer submission
> > basis, just like with current graphic or compute stuff.
> > 
> > Yes current graphic and compute stuff can launch a while and never return back
> > and yes currently we have nothing against that but we should and solution would
> > be simple just kill the gpu thread.
> > 
> OK, so in that case, the kernel can simple unmap all the queues by
> simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are
> userspace, they will not be mapped to the internal CP scheduler.
> Does that satisfy the kernel control level you want ?

This raises questions. What happens to currently running threads when you
unmap a queue ? Do they keep running until they are done ? If not, then this
will break user applications, and that is not an acceptable solution.

Otherwise, infrastructure inside radeon would be needed to force this queue
unmap on bo_pin failure so that gfx pinning can be retried.
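Something along these lines, say (the names are loosely modelled on the
kfd-kgd interface but are assumptions for this sketch, not proposed code):

/* Sketch: a hook radeon could call when pinning fails, so kfd unmaps the
 * user queues, TTM can move buffers, and the pin is retried once. */
#include <linux/errno.h>

struct kgd_dev;                                  /* opaque shared device handle */

struct kfd_quiesce_ops {
        int (*quiesce_queues)(struct kgd_dev *kgd);  /* e.g. UNMAP_QUEUES on HIQ */
        int (*resume_queues)(struct kgd_dev *kgd);   /* re-map after eviction */
};

static int radeon_pin_with_retry(struct kgd_dev *kgd,
                                 const struct kfd_quiesce_ops *kfd,
                                 int (*pin)(struct kgd_dev *kgd))
{
        int r = pin(kgd);

        if (r != -ENOMEM)
                return r;

        /* Pinning failed: park the user queues, then retry once. */
        r = kfd->quiesce_queues(kgd);
        if (r)
                return r;
        r = pin(kgd);
        kfd->resume_queues(kgd);
        return r;
}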

Also, how do you cope with doorbell exhaustion ? Do you just plan to error out ?
In that case this is another DDOS vector, albeit one only affecting the gpu.
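To make the "error out" option concrete, a sketch of a per-device doorbell pool
that simply fails once the slots run out (the pool size and names are
assumptions, not the proposed amdkfd code):

#include <linux/bitmap.h>
#include <linux/errno.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define DOORBELLS_PER_DEVICE 1024        /* assumed size of the doorbell aperture */

struct doorbell_pool {
        spinlock_t lock;
        DECLARE_BITMAP(used, DOORBELLS_PER_DEVICE);
};

/* Returns a free doorbell slot, or -ENOSPC once the pool is exhausted. */
static int doorbell_alloc(struct doorbell_pool *pool)
{
        int slot;

        spin_lock(&pool->lock);
        slot = find_first_zero_bit(pool->used, DOORBELLS_PER_DEVICE);
        if (slot < DOORBELLS_PER_DEVICE)
                set_bit(slot, pool->used);
        spin_unlock(&pool->lock);

        return slot < DOORBELLS_PER_DEVICE ? slot : -ENOSPC;
}

Whether returning -ENOSPC is acceptable here, or whether slots would have to be
reclaimed from idle processes, is exactly the resource management question being
raised.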

And there are many other questions that need answers, like my kernel memory map
question, because as of right now I assume that kfd allows any thread on the gpu
to access any kernel memory.

Another thing: how are ill-formatted packets handled by the hardware ? I do not
see any mechanism to deal with SIGBUS or SIGSEGV.


Also, it is a worrisome prospect to see resource management completely ignored
for future AMD hardware. The kernel exists for a reason ! The kernel's main
purpose is to provide resource management. If AMD fails to understand that, this
does not look good in the long term; I expect none of the HSA technology will
gain momentum, and I would certainly advocate against any use of it inside the
products I work on.

Cheers,
Jérôme

> 
> 	Oded
> >>
> >>>
> >>>>>
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
> >>>>>>>>>>> stuff there.
> >>>>>>>>>>>
> >>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
> >>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
> >>>>>>>>>>> This would avoid crazy communication btw radeon and kfd.
> >>>>>>>>>>>
> >>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as
> >>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace
> >>>>>>>>>>> program from allocating things at address you reserve for lds, scratch,
> >>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
> >>>>>>>>>>> address reserved for kernel (see kernel memory map).
> >>>>>>>>>>>
> >>>>>>>>>>> The whole business of locking performance counter for exclusive per process
> >>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
> >>>>>>>>>>> space command ring.
> >>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
> >>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver
> >>>>>>>>>> for a network card, which is slower than a different network card. Doesn't
> >>>>>>>>>> seem reasonable this situation is would happen. He would still put both the
> >>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
> >>>>>>>>>> I don't think this is a valid reason to NACK the driver.
> >>>>>>>
> >>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
> >>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
> >>>>>>> the performance ioctl.
> >>>>>>>
> >>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you
> >>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer
> >>>>>>> execution.
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> I only see issues with that. First and foremost i would
> >>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
> >>>>>>>>>>> overhead that is measurable in any meaning full way against a simple
> >>>>>>>>>>> function call. I know the userspace command ring is a big marketing features
> >>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues
> >>>>>>>>>>> and for absolutely not upside afaict.
> >>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its
> >>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
> >>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ?
> >>>>>>>
> >>>>>>> I am saying the overhead is not that big and it probably will not matter in most
> >>>>>>> usecase. For instance i did wrote the most useless kernel module that add two
> >>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
> >>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
> >>>>>>> ioctl is 13 times slower.
> >>>>>>>
> >>>>>>> Now if there is enough data that shows that a significant percentage of jobs
> >>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace
> >>>>>>> scheduling does make sense. But so far all we have is handwaving with no data
> >>>>>>> to support any facts.
> >>>>>>>
> >>>>>>>
> >>>>>>> Now if we want to schedule from userspace than you will need to do something
> >>>>>>> about the pinning, something that gives control to kernel so that kernel can
> >>>>>>> unpin when it wants and move object when it wants no matter what userspace is
> >>>>>>> doing.
> >>>>>>>
> >>>>>>>>>>>
> >>>
> >>> --
> >>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >>> the body to majordomo@kvack.org.  For more info on Linux MM,
> >>> see: http://www.linux-mm.org/ .
> >>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >>>
> >>
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel



* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-21 23:05                           ` Jerome Glisse
  0 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 23:05 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Andrew Lewycky, linux-mm, Michel Dänzer, linux-kernel,
	dri-devel, Evgeny Pinchuk, Alexey Skidanov, Andrew Morton

On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote:
> On 21/07/14 22:28, Jerome Glisse wrote:
> > On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
> >> On 21/07/14 21:59, Jerome Glisse wrote:
> >>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
> >>>> On 21/07/14 21:14, Jerome Glisse wrote:
> >>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
> >>>>>> On 21/07/14 18:54, Jerome Glisse wrote:
> >>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
> >>>>>>>> On 21/07/14 16:39, Christian König wrote:
> >>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
> >>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
> >>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
> >>>>>>>>>>>> of the code.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
> >>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
> >>>>>>>>>>>> There is no code going away or even modified between patches, only added.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
> >>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
> >>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
> >>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we
> >>>>>>>>>>>> will adjust amdkfd to work within that framework.
> >>>>>>>>>>>>
> >>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
> >>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
> >>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
> >>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
> >>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
> >>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
> >>>>>>>>>>>> drivers.
> >>>>>>>>>>>>
> >>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at:
> >>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
> >>>>>>>>>>>>
> >>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
> >>>>>>>>>>>
> >>>>>>>>>>> So quick comments before i finish going over all patches. There is many
> >>>>>>>>>>> things that need more documentation espacialy as of right now there is
> >>>>>>>>>>> no userspace i can go look at.
> >>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the
> >>>>>>>>>> time you dedicated to review the code.
> >>>>>>>>>>>
> >>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
> >>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on
> >>>>>>>>>>> that side.
> >>>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
> >>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
> >>>>>>>>>> is it still a show stopper ?
> >>>>>>>>>>
> >>>>>>>>>> The kernel objects are:
> >>>>>>>>>> - pipelines (4 per device)
> >>>>>>>>>> - mqd per hiq (only 1 per device)
> >>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
> >>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
> >>>>>>>>>> 256 alignment. So total *possible* memory is 128MB
> >>>>>>>>>> - kernel queue (only 1 per device)
> >>>>>>>>>> - fence address for kernel queue
> >>>>>>>>>> - runlists for the CP (1 or 2 per device)
> >>>>>>>>>
> >>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the
> >>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything
> >>>>>>>>> else.
> >>>>>>>>>
> >>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
> >>>>>>>>> questionable, everything else sounds reasonable.
> >>>>>>>>>
> >>>>>>>>> Christian.
> >>>>>>>>
> >>>>>>>> Most of the pin downs are done on device initialization.
> >>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
> >>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
> >>>>>>>> memory, I think it is OK.
> >>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only
> >>>>>>>> have 1 or 2 runlists per device, so it is not that bad.
> >>>>>>>
> >>>>>>> 2G local memory ? You can not assume anything on userside configuration some
> >>>>>>> one might build an hsa computer with 512M and still expect a functioning
> >>>>>>> desktop.
> >>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
> >>>>>> Second, I would imagine we can build some protection around it, like
> >>>>>> checking total local memory and limit number of queues based on some
> >>>>>> percentage of that total local memory. So, if someone will have only
> >>>>>> 512M, he will be able to open less queues.
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> I need to go look into what all this mqd is for, what it does and what it is
> >>>>>>> about. But pinning is really bad and this is an issue with userspace command
> >>>>>>> scheduling an issue that obviously AMD fails to take into account in design
> >>>>>>> phase.
> >>>>>> Maybe, but that is the H/W design non-the-less. We can't very well
> >>>>>> change the H/W.
> >>>>>
> >>>>> You can not change the hardware but it is not an excuse to allow bad design to
> >>>>> sneak in software to work around that. So i would rather penalize bad hardware
> >>>>> design and have command submission in the kernel, until AMD fix its hardware to
> >>>>> allow proper scheduling by the kernel and proper control by the kernel. 
> >>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
> >>>> the kernel can not, IMO, scale well to 100K queues and 10K processes.
> >>>
> >>> I am not advocating for having kernel decide down to the very last details. I am
> >>> advocating for kernel being able to preempt at any time and be able to decrease
> >>> or increase user queue priority so overall kernel is in charge of resources
> >>> management and it can handle rogue client in proper fashion.
> >>>
> >>>>
> >>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling
> >>>>> capacity and once we get there we want the kernel to always be able to take over
> >>>>> and do whatever it wants behind process back.
> >>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
> >>>> direction is where AMD is now and where it is heading in the future.
> >>>> That doesn't preclude the option to allow the kernel to take over and do
> >>>> what he wants. I agree that in KV we have a problem where we can't do a
> >>>> mid-wave preemption, so theoretically, a long running compute kernel can
> >>>> make things messy, but in Carrizo, we will have this ability. Having
> >>>> said that, it will only be through the CP H/W scheduling. So AMD is
> >>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is
> >>>> the situation.
> >>>
> >>> We was for the overall Linux community but maybe i should not pretend to talk
> >>> for anyone interested in having a common standard.
> >>>
> >>> My point is that current hardware do not have approriate hardware support for
> >>> preemption hence, current hardware should use ioctl to schedule job and AMD
> >>> should think a bit more on commiting to a design and handwaving any hardware
> >>> short coming as something that can be work around in the software. The pinning
> >>> thing is broken by design, only way to work around it is through kernel cmd
> >>> queue scheduling that's a fact.
> >>
> >>>
> >>> Once hardware support proper preemption and allows to move around/evict buffer
> >>> use on behalf of userspace command queue then we can allow userspace scheduling
> >>> but until then my personnal opinion is that it should not be allowed and that
> >>> people will have to pay the ioctl price which i proved to be small, because
> >>> really if you 100K queue each with one job, i would not expect that all those
> >>> 100K job will complete in less time than it takes to execute an ioctl ie by
> >>> even if you do not have the ioctl delay what ever you schedule will have to
> >>> wait on previously submited jobs.
> >>
> >> But Jerome, the core problem still remains in effect, even with your
> >> suggestion. If an application, either via userspace queue or via ioctl,
> >> submits a long-running kernel, than the CPU in general can't stop the
> >> GPU from running it. And if that kernel does while(1); than that's it,
> >> game's over, and no matter how you submitted the work. So I don't really
> >> see the big advantage in your proposal. Only in CZ we can stop this wave
> >> (by CP H/W scheduling only). What are you saying is basically I won't
> >> allow people to use compute on Linux KV system because it _may_ get the
> >> system stuck.
> >>
> >> So even if I really wanted to, and I may agree with you theoretically on
> >> that, I can't fulfill your desire to make the "kernel being able to
> >> preempt at any time and be able to decrease or increase user queue
> >> priority so overall kernel is in charge of resources management and it
> >> can handle rogue client in proper fashion". Not in KV, and I guess not
> >> in CZ as well.
> >>
> >> 	Oded
> > 
> > I do understand that but using kernel ioctl provide the same kind of control
> > as we have now ie we can bind/unbind buffer on per command buffer submission
> > basis, just like with current graphic or compute stuff.
> > 
> > Yes current graphic and compute stuff can launch a while and never return back
> > and yes currently we have nothing against that but we should and solution would
> > be simple just kill the gpu thread.
> > 
> OK, so in that case, the kernel can simple unmap all the queues by
> simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are
> userspace, they will not be mapped to the internal CP scheduler.
> Does that satisfy the kernel control level you want ?

This raises questions: what happens to currently running threads when you
unmap a queue ? Do they keep running until done ? If not, this will break
user applications, and that is not an acceptable solution.

Otherwise, infrastructure inside radeon would be needed to force this queue
unmap on bo_pin failure so the gfx pinning can be retried.
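
As a rough sketch only, with hypothetical helpers (kfd_evict_queues() and
kfd_restore_queues() are not existing radeon or amdkfd interfaces), the kind
of retry path this implies would look roughly like:

    /* Hypothetical sketch: retry a gfx pin after asking kfd to unmap its
     * queues and release its pinned objects (mqds, runlists, ...).
     * Caller is assumed to hold the bo reserved. */
    static int radeon_pin_with_kfd_evict(struct radeon_bo *bo, u32 domain)
    {
            int r;

            r = radeon_bo_pin(bo, domain, NULL);
            if (!r)
                    return 0;

            /* Pin failed: force kfd to unmap its queues so its pinned
             * objects can be released, then retry the pin once. */
            r = kfd_evict_queues(bo->rdev);         /* hypothetical */
            if (r)
                    return r;

            r = radeon_bo_pin(bo, domain, NULL);

            /* Let kfd map its queues again regardless of the outcome. */
            kfd_restore_queues(bo->rdev);           /* hypothetical */

            return r;
    }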

Also, how do you cope with doorbell exhaustion ? Do you just plan to error out ?
In that case this is another DDOS vector, though one only affecting the gpu.

And there are many other questions that need answers, like my kernel memory map
question, because as of right now I assume that kfd allows any thread on the gpu
to access any kernel memory.

Other things: how are ill-formatted packets handled by the hardware ? I do not
see any mechanism to deal with SIGBUS or SIGSEGV.


Also, it is a worrisome prospect to see resource management completely ignored
for future AMD hardware. The kernel exists for a reason ! The kernel's main
purpose is to provide resource management; if AMD fails to understand that, this
is not looking good on the long term, and I expect none of the HSA technology
will get momentum, and I would certainly advocate against any use of it inside
products I work on.

Cheers,
Jérôme

> 
> 	Oded
> >>
> >>>
> >>>>>
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
> >>>>>>>>>>> stuff there.
> >>>>>>>>>>>
> >>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
> >>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
> >>>>>>>>>>> This would avoid crazy communication btw radeon and kfd.
> >>>>>>>>>>>
> >>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as
> >>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace
> >>>>>>>>>>> program from allocating things at address you reserve for lds, scratch,
> >>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
> >>>>>>>>>>> address reserved for kernel (see kernel memory map).
> >>>>>>>>>>>
> >>>>>>>>>>> The whole business of locking performance counter for exclusive per process
> >>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
> >>>>>>>>>>> space command ring.
> >>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
> >>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver
> >>>>>>>>>> for a network card, which is slower than a different network card. Doesn't
> >>>>>>>>>> seem reasonable this situation is would happen. He would still put both the
> >>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
> >>>>>>>>>> I don't think this is a valid reason to NACK the driver.
> >>>>>>>
> >>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
> >>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
> >>>>>>> the performance ioctl.
> >>>>>>>
> >>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you
> >>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer
> >>>>>>> execution.
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> I only see issues with that. First and foremost i would
> >>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
> >>>>>>>>>>> overhead that is measurable in any meaning full way against a simple
> >>>>>>>>>>> function call. I know the userspace command ring is a big marketing features
> >>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues
> >>>>>>>>>>> and for absolutely not upside afaict.
> >>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its
> >>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
> >>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ?
> >>>>>>>
> >>>>>>> I am saying the overhead is not that big and it probably will not matter in most
> >>>>>>> usecase. For instance i did wrote the most useless kernel module that add two
> >>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
> >>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
> >>>>>>> ioctl is 13 times slower.
> >>>>>>>
> >>>>>>> Now if there is enough data that shows that a significant percentage of jobs
> >>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace
> >>>>>>> scheduling does make sense. But so far all we have is handwaving with no data
> >>>>>>> to support any facts.
> >>>>>>>
> >>>>>>>
> >>>>>>> Now if we want to schedule from userspace than you will need to do something
> >>>>>>> about the pinning, something that gives control to kernel so that kernel can
> >>>>>>> unpin when it wants and move object when it wants no matter what userspace is
> >>>>>>> doing.
> >>>>>>>
> >>>>>>>>>>>
> >>>
> >>> --
> >>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >>> the body to majordomo@kvack.org.  For more info on Linux MM,
> >>> see: http://www.linux-mm.org/ .
> >>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >>>
> >>
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 23:05                           ` Jerome Glisse
@ 2014-07-21 23:29                             ` Bridgman, John
  -1 siblings, 0 replies; 148+ messages in thread
From: Bridgman, John @ 2014-07-21 23:29 UTC (permalink / raw)
  To: Jerome Glisse, Gabbay, Oded
  Cc: Lewycky, Andrew, Pinchuk, Evgeny, Daenzer, Michel, linux-kernel,
	dri-devel, linux-mm, Skidanov, Alexey, Andrew Morton



>-----Original Message-----
>From: dri-devel [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf
>Of Jerome Glisse
>Sent: Monday, July 21, 2014 7:06 PM
>To: Gabbay, Oded
>Cc: Lewycky, Andrew; Pinchuk, Evgeny; Daenzer, Michel; linux-
>kernel@vger.kernel.org; dri-devel@lists.freedesktop.org; linux-mm;
>Skidanov, Alexey; Andrew Morton
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
>On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote:
>> On 21/07/14 22:28, Jerome Glisse wrote:
>> > On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
>> >> On 21/07/14 21:59, Jerome Glisse wrote:
>> >>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
>> >>>> On 21/07/14 21:14, Jerome Glisse wrote:
>> >>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
>> >>>>>> On 21/07/14 18:54, Jerome Glisse wrote:
>> >>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>> >>>>>>>> On 21/07/14 16:39, Christian König wrote:
>> >>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>> >>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>> >>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>> >>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> As a continuation to the existing discussion, here is a
>> >>>>>>>>>>>> v2 patch series restructured with a cleaner history and
>> >>>>>>>>>>>> no totally-different-early-versions of the code.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Instead of 83 patches, there are now a total of 25
>> >>>>>>>>>>>> patches, where 5 of them are modifications to radeon driver
>and 18 of them include only amdkfd code.
>> >>>>>>>>>>>> There is no code going away or even modified between
>patches, only added.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and
>> >>>>>>>>>>>> moved to reside under drm/radeon/amdkfd. This move was
>> >>>>>>>>>>>> done to emphasize the fact that this driver is an
>> >>>>>>>>>>>> AMD-only driver at this point. Having said that, we do
>> >>>>>>>>>>>> foresee a generic hsa framework being implemented in the
>future and in that case, we will adjust amdkfd to work within that
>framework.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx
>> >>>>>>>>>>>> drivers, we want to keep it as a seperate driver from
>> >>>>>>>>>>>> radeon. Therefore, the amdkfd code is contained in its
>> >>>>>>>>>>>> own folder. The amdkfd folder was put under the radeon
>> >>>>>>>>>>>> folder because the only AMD gfx driver in the Linux
>> >>>>>>>>>>>> kernel at this point is the radeon driver. Having said
>> >>>>>>>>>>>> that, we will probably need to move it (maybe to be directly
>under drm) after we integrate with additional AMD gfx drivers.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> For people who like to review using git, the v2 patch set is
>located at:
>> >>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-nex
>> >>>>>>>>>>>> t-3.17-v2
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>> >>>>>>>>>>>
>> >>>>>>>>>>> So quick comments before i finish going over all patches.
>> >>>>>>>>>>> There is many things that need more documentation
>> >>>>>>>>>>> espacialy as of right now there is no userspace i can go look at.
>> >>>>>>>>>> So quick comments on some of your questions but first of
>> >>>>>>>>>> all, thanks for the time you dedicated to review the code.
>> >>>>>>>>>>>
>> >>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning
>> >>>>>>>>>>> this is a big no, that would need serious arguments for
>> >>>>>>>>>>> any hope of convincing me on that side.
>> >>>>>>>>>> We only do gpu memory pinning for kernel objects. There are
>> >>>>>>>>>> no userspace objects that are pinned on the gpu memory in
>> >>>>>>>>>> our driver. If that is the case, is it still a show stopper ?
>> >>>>>>>>>>
>> >>>>>>>>>> The kernel objects are:
>> >>>>>>>>>> - pipelines (4 per device)
>> >>>>>>>>>> - mqd per hiq (only 1 per device)
>> >>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K
>> >>>>>>>>>> queues per process, for a total of 512K queues. Each mqd is
>> >>>>>>>>>> 151 bytes, but the allocation is done in
>> >>>>>>>>>> 256 alignment. So total *possible* memory is 128MB
>> >>>>>>>>>> - kernel queue (only 1 per device)
>> >>>>>>>>>> - fence address for kernel queue
>> >>>>>>>>>> - runlists for the CP (1 or 2 per device)
>> >>>>>>>>>
>> >>>>>>>>> The main questions here are if it's avoid able to pin down
>> >>>>>>>>> the memory and if the memory is pinned down at driver load,
>> >>>>>>>>> by request from userspace or by anything else.
>> >>>>>>>>>
>> >>>>>>>>> As far as I can see only the "mqd per userspace queue" might
>> >>>>>>>>> be a bit questionable, everything else sounds reasonable.
>> >>>>>>>>>
>> >>>>>>>>> Christian.
>> >>>>>>>>
>> >>>>>>>> Most of the pin downs are done on device initialization.
>> >>>>>>>> The "mqd per userspace" is done per userspace queue creation.
>> >>>>>>>> However, as I said, it has an upper limit of 128MB on KV, and
>> >>>>>>>> considering the 2G local memory, I think it is OK.
>> >>>>>>>> The runlists are also done on userspace queue
>> >>>>>>>> creation/deletion, but we only have 1 or 2 runlists per device, so
>it is not that bad.
>> >>>>>>>
>> >>>>>>> 2G local memory ? You can not assume anything on userside
>> >>>>>>> configuration some one might build an hsa computer with 512M
>> >>>>>>> and still expect a functioning desktop.
>> >>>>>> First of all, I'm only considering Kaveri computer, not "hsa"
>computer.
>> >>>>>> Second, I would imagine we can build some protection around it,
>> >>>>>> like checking total local memory and limit number of queues
>> >>>>>> based on some percentage of that total local memory. So, if
>> >>>>>> someone will have only 512M, he will be able to open less queues.
>> >>>>>>
>> >>>>>>
>> >>>>>>>
>> >>>>>>> I need to go look into what all this mqd is for, what it does
>> >>>>>>> and what it is about. But pinning is really bad and this is an
>> >>>>>>> issue with userspace command scheduling an issue that
>> >>>>>>> obviously AMD fails to take into account in design phase.
>> >>>>>> Maybe, but that is the H/W design non-the-less. We can't very
>> >>>>>> well change the H/W.
>> >>>>>
>> >>>>> You can not change the hardware but it is not an excuse to allow
>> >>>>> bad design to sneak in software to work around that. So i would
>> >>>>> rather penalize bad hardware design and have command submission
>> >>>>> in the kernel, until AMD fix its hardware to allow proper scheduling
>by the kernel and proper control by the kernel.
>> >>>> I'm sorry but I do *not* think this is a bad design. S/W
>> >>>> scheduling in the kernel can not, IMO, scale well to 100K queues and
>10K processes.
>> >>>
>> >>> I am not advocating for having kernel decide down to the very last
>> >>> details. I am advocating for kernel being able to preempt at any
>> >>> time and be able to decrease or increase user queue priority so
>> >>> overall kernel is in charge of resources management and it can handle
>rogue client in proper fashion.
>> >>>
>> >>>>
>> >>>>> Because really where we want to go is having GPU closer to a CPU
>> >>>>> in term of scheduling capacity and once we get there we want the
>> >>>>> kernel to always be able to take over and do whatever it wants
>behind process back.
>> >>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
>> >>>> direction is where AMD is now and where it is heading in the future.
>> >>>> That doesn't preclude the option to allow the kernel to take over
>> >>>> and do what he wants. I agree that in KV we have a problem where
>> >>>> we can't do a mid-wave preemption, so theoretically, a long
>> >>>> running compute kernel can make things messy, but in Carrizo, we
>> >>>> will have this ability. Having said that, it will only be through
>> >>>> the CP H/W scheduling. So AMD is _not_ going to abandon H/W
>> >>>> scheduling. You can dislike it, but this is the situation.
>> >>>
>> >>> We was for the overall Linux community but maybe i should not
>> >>> pretend to talk for anyone interested in having a common standard.
>> >>>
>> >>> My point is that current hardware do not have approriate hardware
>> >>> support for preemption hence, current hardware should use ioctl to
>> >>> schedule job and AMD should think a bit more on commiting to a
>> >>> design and handwaving any hardware short coming as something that
>> >>> can be work around in the software. The pinning thing is broken by
>> >>> design, only way to work around it is through kernel cmd queue
>scheduling that's a fact.
>> >>
>> >>>
>> >>> Once hardware support proper preemption and allows to move
>> >>> around/evict buffer use on behalf of userspace command queue then
>> >>> we can allow userspace scheduling but until then my personnal
>> >>> opinion is that it should not be allowed and that people will have
>> >>> to pay the ioctl price which i proved to be small, because really
>> >>> if you 100K queue each with one job, i would not expect that all
>> >>> those 100K job will complete in less time than it takes to execute
>> >>> an ioctl ie by even if you do not have the ioctl delay what ever you
>schedule will have to wait on previously submited jobs.
>> >>
>> >> But Jerome, the core problem still remains in effect, even with
>> >> your suggestion. If an application, either via userspace queue or
>> >> via ioctl, submits a long-running kernel, than the CPU in general
>> >> can't stop the GPU from running it. And if that kernel does
>> >> while(1); than that's it, game's over, and no matter how you
>> >> submitted the work. So I don't really see the big advantage in your
>> >> proposal. Only in CZ we can stop this wave (by CP H/W scheduling
>> >> only). What are you saying is basically I won't allow people to use
>> >> compute on Linux KV system because it _may_ get the system stuck.
>> >>
>> >> So even if I really wanted to, and I may agree with you
>> >> theoretically on that, I can't fulfill your desire to make the
>> >> "kernel being able to preempt at any time and be able to decrease
>> >> or increase user queue priority so overall kernel is in charge of
>> >> resources management and it can handle rogue client in proper
>> >> fashion". Not in KV, and I guess not in CZ as well.
>> >>
>> >> 	Oded
>> >
>> > I do understand that but using kernel ioctl provide the same kind of
>> > control as we have now ie we can bind/unbind buffer on per command
>> > buffer submission basis, just like with current graphic or compute stuff.
>> >
>> > Yes current graphic and compute stuff can launch a while and never
>> > return back and yes currently we have nothing against that but we
>> > should and solution would be simple just kill the gpu thread.
>> >
>> OK, so in that case, the kernel can simple unmap all the queues by
>> simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues
>> are userspace, they will not be mapped to the internal CP scheduler.
>> Does that satisfy the kernel control level you want ?
>
>This raises questions, what does happen to currently running thread when
>you unmap queue ? Do they keep running until done ? If not than this means
>this will break user application and those is not an acceptable solution.
>
>Otherwise, infrastructre inside radeon would be needed to force this queue
>unmap on bo_pin failure so gfx pinning can be retry.
>
>Also how do you cope with doorbell exhaustion ? Do you just plan to error
>out ?
>In which case this is another DDOS vector but only affecting the gpu.
>
>And there is many other questions that need answer, like my kernel memory
>map question because as of right now i assume that kfd allow any thread on
>the gpu to access any kernel memory.
>
>Otherthings are how ill formated packet are handled by the hardware ? I do
>not see any mecanism to deal with SIGBUS or SIGFAULT.
>
>
>Also it is a worrisome prospect of seeing resource management completely
>ignore for future AMD hardware. Kernel exist for a reason ! Kernel main
>purpose is to provide resource management if AMD fails to understand that,
>this is not looking good on long term and i expect none of the HSA
>technology will get momentum and i would certainly advocate against any
>use of it inside product i work on.

Hi Jerome;

I was following along until the above comment. It seems to be the exact opposite of what Oded has been saying, which is that future AMD hardware *does* have more capabilities for resource management and that we do have some capabilities today. Can you help me understand what that comment was based on ?

Thanks,
JB
>
>Cheers,
>Jérôme
>
>>
>> 	Oded
>> >>
>> >>>
>> >>>>>
>> >>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory
>> >>>>>>>>>>> and add common stuff there.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT
>> >>>>>>>>>>> then i would say this far better to avoid the whole kfd module
>and add ioctl to radeon.
>> >>>>>>>>>>> This would avoid crazy communication btw radeon and kfd.
>> >>>>>>>>>>>
>> >>>>>>>>>>> The whole aperture business needs some serious
>> >>>>>>>>>>> explanation. Especialy as you want to use userspace
>> >>>>>>>>>>> address there is nothing to prevent userspace program from
>> >>>>>>>>>>> allocating things at address you reserve for lds, scratch,
>> >>>>>>>>>>> ... only sane way would be to move those lds, scratch inside
>the virtual address reserved for kernel (see kernel memory map).
>> >>>>>>>>>>>
>> >>>>>>>>>>> The whole business of locking performance counter for
>> >>>>>>>>>>> exclusive per process access is a big NO. Which leads me
>> >>>>>>>>>>> to the questionable usefullness of user space command ring.
>> >>>>>>>>>> That's like saying: "Which leads me to the questionable
>> >>>>>>>>>> usefulness of HSA". I find it analogous to a situation
>> >>>>>>>>>> where a network maintainer nacking a driver for a network
>> >>>>>>>>>> card, which is slower than a different network card.
>> >>>>>>>>>> Doesn't seem reasonable this situation is would happen. He
>> >>>>>>>>>> would still put both the drivers in the kernel because people
>want to use the H/W and its features. So, I don't think this is a valid reason to
>NACK the driver.
>> >>>>>>>
>> >>>>>>> Let me rephrase, drop the the performance counter ioctl and
>> >>>>>>> modulo memory pinning i see no objection. In other word, i am
>> >>>>>>> not NACKING whole patchset i am NACKING the performance ioctl.
>> >>>>>>>
>> >>>>>>> Again this is another argument for round trip to the kernel.
>> >>>>>>> As inside kernel you could properly do exclusive gpu counter
>> >>>>>>> access accross single user cmd buffer execution.
>> >>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> I only see issues with that. First and foremost i would
>> >>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has
>> >>>>>>>>>>> a higher an overhead that is measurable in any meaning
>> >>>>>>>>>>> full way against a simple function call. I know the
>> >>>>>>>>>>> userspace command ring is a big marketing features that
>> >>>>>>>>>>> please ignorant userspace programmer. But really this only
>brings issues and for absolutely not upside afaict.
>> >>>>>>>>>> Really ? You think that doing a context switch to kernel
>> >>>>>>>>>> space, with all its overhead, is _not_ more expansive than
>> >>>>>>>>>> just calling a function in userspace which only puts a buffer on a
>ring and writes a doorbell ?
>> >>>>>>>
>> >>>>>>> I am saying the overhead is not that big and it probably will
>> >>>>>>> not matter in most usecase. For instance i did wrote the most
>> >>>>>>> useless kernel module that add two number through an ioctl
>> >>>>>>> (http://people.freedesktop.org/~glisse/adder.tar) and it takes
>> >>>>>>> ~0.35microseconds with ioctl while function is ~0.025microseconds
>so ioctl is 13 times slower.
>> >>>>>>>
>> >>>>>>> Now if there is enough data that shows that a significant
>> >>>>>>> percentage of jobs submited to the GPU will take less that
>> >>>>>>> 0.35microsecond then yes userspace scheduling does make sense.
>> >>>>>>> But so far all we have is handwaving with no data to support any
>facts.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Now if we want to schedule from userspace than you will need
>> >>>>>>> to do something about the pinning, something that gives
>> >>>>>>> control to kernel so that kernel can unpin when it wants and
>> >>>>>>> move object when it wants no matter what userspace is doing.
>> >>>>>>>
>> >>>>>>>>>>>
>> >>>
>> >>> --
>> >>> To unsubscribe, send a message with 'unsubscribe linux-mm' in the
>> >>> body to majordomo@kvack.org.  For more info on Linux MM,
>> >>> see: http://www.linux-mm.org/ .
>> >>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>> >>>
>> >>
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>_______________________________________________
>dri-devel mailing list
>dri-devel@lists.freedesktop.org
>http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 23:29                             ` Bridgman, John
  (?)
@ 2014-07-21 23:36                               ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-21 23:36 UTC (permalink / raw)
  To: Bridgman, John
  Cc: Gabbay, Oded, Lewycky, Andrew, Pinchuk, Evgeny, Daenzer, Michel,
	linux-kernel, dri-devel, linux-mm, Skidanov, Alexey,
	Andrew Morton

On Mon, Jul 21, 2014 at 11:29:23PM +0000, Bridgman, John wrote:
> >> >> So even if I really wanted to, and I may agree with you
> >> >> theoretically on that, I can't fulfill your desire to make the
> >> >> "kernel being able to preempt at any time and be able to decrease
> >> >> or increase user queue priority so overall kernel is in charge of
> >> >> resources management and it can handle rogue client in proper
> >> >> fashion". Not in KV, and I guess not in CZ as well.

                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> >
> >Also it is a worrisome prospect of seeing resource management completely
> >ignore for future AMD hardware. Kernel exist for a reason ! Kernel main
> >purpose is to provide resource management if AMD fails to understand that,
> >this is not looking good on long term and i expect none of the HSA
> >technology will get momentum and i would certainly advocate against any
> >use of it inside product i work on.
> 
> Hi Jerome;
> 
> I was following along until the above comment. It seems to be the exact opposite of what Oded has been saying, which is that future AMD hardware *does* have more capabilities for resource management and that we do have some capabilities today. Can you help me understand what the comment it was based on ?


Highlighted above.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 19:23                     ` Oded Gabbay
  (?)
@ 2014-07-22  7:23                       ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-22  7:23 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Jerome Glisse, Andrew Lewycky, Michel Dänzer, linux-kernel,
	dri-devel, linux-mm, Evgeny Pinchuk, Alexey Skidanov,
	Andrew Morton

On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
> But Jerome, the core problem still remains in effect, even with your
> suggestion. If an application, either via userspace queue or via ioctl,
> submits a long-running kernel, than the CPU in general can't stop the
> GPU from running it. And if that kernel does while(1); than that's it,
> game's over, and no matter how you submitted the work. So I don't really
> see the big advantage in your proposal. Only in CZ we can stop this wave
> (by CP H/W scheduling only). What are you saying is basically I won't
> allow people to use compute on Linux KV system because it _may_ get the
> system stuck.
> 
> So even if I really wanted to, and I may agree with you theoretically on
> that, I can't fulfill your desire to make the "kernel being able to
> preempt at any time and be able to decrease or increase user queue
> priority so overall kernel is in charge of resources management and it
> can handle rogue client in proper fashion". Not in KV, and I guess not
> in CZ as well.

At least on intel the execlist stuff which is used for preemption can be
used by both the cpu and the firmware scheduler. So we can actually
preempt when doing cpu scheduling.

It sounds like current amd hw doesn't have any preemption at all. And
without preemption I don't think we should ever consider allowing
userspace to directly submit stuff to the hw and overload it. Imo the kernel
_must_ sit in between and reject clients that don't behave. Of course you
can only ever react (worst case with a gpu reset, there's code floating
around for that on intel-gfx), but at least you can do something.
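
As an illustration only, with hypothetical names (this is not the actual i915
or amdkfd code): with an ioctl submission path the kernel can track per-context
hangs from the reset path and simply refuse further work from a client that
misbehaves:

    #include <linux/types.h>
    #include <linux/errno.h>

    /* Hypothetical per-context state, not an existing driver structure. */
    struct gpu_ctx {
            unsigned int hang_count;   /* bumped by the hangcheck/reset path */
            bool banned;
    };

    /* Called from the reset handler for the context that caused the hang. */
    static void gpu_ctx_mark_guilty(struct gpu_ctx *ctx)
    {
            if (++ctx->hang_count >= 3)
                    ctx->banned = true;
    }

    /* Submission ioctl: the kernel sits in between and can just say no. */
    static int gpu_submit(struct gpu_ctx *ctx, const void *cmds, size_t len)
    {
            if (ctx->banned)
                    return -EIO;   /* no more work scheduled for this client */

            return gpu_hw_queue(ctx, cmds, len);   /* hypothetical */
    }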

If userspace has a direct submit path to the hw then this gets really
tricky, if not impossible.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 19:03                     ` Jerome Glisse
  (?)
@ 2014-07-22  7:28                       ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-22  7:28 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Oded Gabbay, Daniel Vetter, Christian König, David Airlie,
	Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel,
	Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov,
	Evgeny Pinchuk, linux-kernel, dri-devel, linux-mm

On Mon, Jul 21, 2014 at 03:03:07PM -0400, Jerome Glisse wrote:
> On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote:
> > On 21/07/14 21:22, Daniel Vetter wrote:
> > > On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote:
> > >>> I'm not sure whether we can do the same trick with the hw scheduler. But
> > >>> then unpinning hw contexts will drain the pipeline anyway, so I guess we
> > >>> can just stop feeding the hw scheduler until it runs dry. And then unpin
> > >>> and evict.
> > >> So, I'm afraid but we can't do this for AMD Kaveri because:
> > > 
> > > Well as long as you can drain the hw scheduler queue (and you can do
> > > that, worst case you have to unmap all the doorbells and other stuff
> > > to intercept further submission from userspace) you can evict stuff.
> > 
> > I can't drain the hw scheduler queue, as I can't do mid-wave preemption.
> > Moreover, if I use the dequeue request register to preempt a queue
> > during a dispatch it may be that some waves (wave groups actually) of
> > the dispatch have not yet been created, and when I reactivate the mqd,
> > they should be created but are not. However, this works fine if you use
> > the HIQ. the CP ucode correctly saves and restores the state of an
> > outstanding dispatch. I don't think we have access to the state from
> > software at all, so it's not a bug, it is "as designed".
> > 
> 
> I think here Daniel is suggesting to unmapp the doorbell page, and track
> each write made by userspace to it and while unmapped wait for the gpu to
> drain or use some kind of fence on a special queue. Once GPU is drain we
> can move pinned buffer, then remap the doorbell and update it to the last
> value written by userspace which will resume execution to the next job.

Exactly, just prevent userspace from submitting more. And if you have
misbehaving userspace that submits too much, reset the gpu and tell it
that you're sorry but you won't schedule any more work.

We have this already in i915 (since like all other gpus we're not
preempting right now) and it works. There's some code floating around to
even restrict the reset to _just_ the offending submission context, with
nothing else getting corrupted.

You can do all this with the doorbells and unmapping them, but it's a
pain. Much easier if you have a real ioctl, and I haven't seen anyone with
perf data indicating that an ioctl would be too much overhead on linux.
Neither in this thread nor internally here at intel.
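
A rough sketch of that unmap-and-drain sequence, just to spell the idea out;
every helper named below is a hypothetical placeholder, not an existing radeon
or amdkfd function, and struct kfd_process stands in for whatever per-process
bookkeeping the driver keeps:

/* Hypothetical sketch only -- none of these helpers exist as such. */
static int drain_and_evict(struct kfd_process *p)
{
        int ret;

        /* Take the doorbell page away so further userspace writes fault
         * and get tracked instead of reaching the hardware. */
        unmap_doorbell_page(p);

        /* Wait until the CP has consumed everything already submitted,
         * e.g. by fencing a special queue. */
        ret = wait_for_cp_drain(p);
        if (ret)
                goto remap;

        /* Nothing is in flight any more, so pinned buffers can be moved. */
        ret = move_pinned_buffers(p);

remap:
        /* Remap the doorbell page and replay the last write pointer the
         * process wrote while it was unmapped; the CP resumes from there. */
        remap_doorbell_page(p);
        write_doorbell(p, last_tracked_wptr(p));

        return ret;
}
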
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-22  7:28                       ` Daniel Vetter
@ 2014-07-22  7:40                         ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-22  7:40 UTC (permalink / raw)
  To: Jerome Glisse, Oded Gabbay, Christian König, David Airlie,
	Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel,
	Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov,
	Evgeny Pinchuk, linux-kernel, dri-devel, linux-mm

On Tue, Jul 22, 2014 at 09:28:51AM +0200, Daniel Vetter wrote:
> On Mon, Jul 21, 2014 at 03:03:07PM -0400, Jerome Glisse wrote:
> > On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote:
> > > On 21/07/14 21:22, Daniel Vetter wrote:
> > > > On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote:
> > > >>> I'm not sure whether we can do the same trick with the hw scheduler. But
> > > >>> then unpinning hw contexts will drain the pipeline anyway, so I guess we
> > > >>> can just stop feeding the hw scheduler until it runs dry. And then unpin
> > > >>> and evict.
> > > >> So, I'm afraid but we can't do this for AMD Kaveri because:
> > > > 
> > > > Well as long as you can drain the hw scheduler queue (and you can do
> > > > that, worst case you have to unmap all the doorbells and other stuff
> > > > to intercept further submission from userspace) you can evict stuff.
> > > 
> > > I can't drain the hw scheduler queue, as I can't do mid-wave preemption.
> > > Moreover, if I use the dequeue request register to preempt a queue
> > > during a dispatch it may be that some waves (wave groups actually) of
> > > the dispatch have not yet been created, and when I reactivate the mqd,
> > > they should be created but are not. However, this works fine if you use
> > > the HIQ. the CP ucode correctly saves and restores the state of an
> > > outstanding dispatch. I don't think we have access to the state from
> > > software at all, so it's not a bug, it is "as designed".
> > > 
> > 
> > I think here Daniel is suggesting to unmapp the doorbell page, and track
> > each write made by userspace to it and while unmapped wait for the gpu to
> > drain or use some kind of fence on a special queue. Once GPU is drain we
> > can move pinned buffer, then remap the doorbell and update it to the last
> > value written by userspace which will resume execution to the next job.
> 
> Exactly, just prevent userspace from submitting more. And if you have
> misbehaving userspace that submits too much, reset the gpu and tell it
> that you're sorry but won't schedule any more work.
> 
> We have this already in i915 (since like all other gpus we're not
> preempting right now) and it works. There's some code floating around to
> even restrict the reset to _just_ the offending submission context, with
> nothing else getting corrupted.
> 
> You can do all this with the doorbells and unmapping them, but it's a
> pain. Much easier if you have a real ioctl, and I haven't seen anyone with
> perf data indicating that an ioctl would be too much overhead on linux.
> Neither in this thread nor internally here at intel.

Aside: Another reason why the ioctl is better than the doorbell is
integration with other drivers. Yeah I know this is about compute, but
sooner or later someone will want to e.g. post-proc video frames between
the v4l capture device and the gpu mpeg encoder. Or something else fancy.

Then you want to be able to somehow integrate into a cross-driver fence
framework like android syncpts, and you can't do that without an ioctl for
the compute submissions.
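
To illustrate the point with a purely hypothetical uapi sketch (neither the
struct nor the ioctl below exists anywhere): a submission ioctl can hand back
a fence that other drivers and a cross-driver framework can wait on, which a
raw doorbell write has no way to do.

#include <linux/ioctl.h>
#include <linux/types.h>

/* Hypothetical example only -- not part of the amdkfd uapi. */
struct hypothetical_kfd_submit {
        __u64 queue_id;         /* which user queue to kick                 */
        __u64 new_wptr;         /* write pointer after the appended packets */
        __s32 out_fence_fd;     /* returned: fd other drivers can wait on   */
        __u32 pad;
};

#define HYPOTHETICAL_KFD_IOC_SUBMIT \
        _IOWR('K', 0x40, struct hypothetical_kfd_submit)
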
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 23:05                           ` Jerome Glisse
  (?)
@ 2014-07-22  8:05                             ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-22  8:05 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Lewycky, linux-mm, Michel Dänzer, linux-kernel,
	dri-devel, Alexey Skidanov, Andrew Morton, Dave Airlie, Bridgman,
	John, Deucher, Alexander, Joerg Roedel, Ben Goz,
	Christian König, Daniel Vetter, Sellek, Tom

On 22/07/14 02:05, Jerome Glisse wrote:
> On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote:
>> On 21/07/14 22:28, Jerome Glisse wrote:
>>> On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
>>>> On 21/07/14 21:59, Jerome Glisse wrote:
>>>>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
>>>>>> On 21/07/14 21:14, Jerome Glisse wrote:
>>>>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
>>>>>>>> On 21/07/14 18:54, Jerome Glisse wrote:
>>>>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>>>>>>>>>> On 21/07/14 16:39, Christian König wrote:
>>>>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>>>>>>>>>> of the code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>>>>>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>>>>>>>>>> will adjust amdkfd to work within that framework.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>>>>>>>>>> drivers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> So quick comments before i finish going over all patches. There is many
>>>>>>>>>>>>> things that need more documentation espacialy as of right now there is
>>>>>>>>>>>>> no userspace i can go look at.
>>>>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the
>>>>>>>>>>>> time you dedicated to review the code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>>>>>>>>>> that side.
>>>>>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>>>>>>>>>> is it still a show stopper ?
>>>>>>>>>>>>
>>>>>>>>>>>> The kernel objects are:
>>>>>>>>>>>> - pipelines (4 per device)
>>>>>>>>>>>> - mqd per hiq (only 1 per device)
>>>>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
>>>>>>>>>>>> 256 alignment. So total *possible* memory is 128MB
>>>>>>>>>>>> - kernel queue (only 1 per device)
>>>>>>>>>>>> - fence address for kernel queue
>>>>>>>>>>>> - runlists for the CP (1 or 2 per device)
>>>>>>>>>>>
>>>>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the
>>>>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything
>>>>>>>>>>> else.
>>>>>>>>>>>
>>>>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>>>>>>>>>> questionable, everything else sounds reasonable.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Most of the pin downs are done on device initialization.
>>>>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
>>>>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>>>>>>>>>> memory, I think it is OK.
>>>>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only
>>>>>>>>>> have 1 or 2 runlists per device, so it is not that bad.
>>>>>>>>>
>>>>>>>>> 2G local memory ? You can not assume anything on userside configuration some
>>>>>>>>> one might build an hsa computer with 512M and still expect a functioning
>>>>>>>>> desktop.
>>>>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
>>>>>>>> Second, I would imagine we can build some protection around it, like
>>>>>>>> checking total local memory and limit number of queues based on some
>>>>>>>> percentage of that total local memory. So, if someone will have only
>>>>>>>> 512M, he will be able to open less queues.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I need to go look into what all this mqd is for, what it does and what it is
>>>>>>>>> about. But pinning is really bad and this is an issue with userspace command
>>>>>>>>> scheduling an issue that obviously AMD fails to take into account in design
>>>>>>>>> phase.
>>>>>>>> Maybe, but that is the H/W design non-the-less. We can't very well
>>>>>>>> change the H/W.
>>>>>>>
>>>>>>> You can not change the hardware but it is not an excuse to allow bad design to
>>>>>>> sneak in software to work around that. So i would rather penalize bad hardware
>>>>>>> design and have command submission in the kernel, until AMD fix its hardware to
>>>>>>> allow proper scheduling by the kernel and proper control by the kernel.
>>>>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
>>>>>> the kernel can not, IMO, scale well to 100K queues and 10K processes.
>>>>>
>>>>> I am not advocating for having kernel decide down to the very last details. I am
>>>>> advocating for kernel being able to preempt at any time and be able to decrease
>>>>> or increase user queue priority so overall kernel is in charge of resources
>>>>> management and it can handle rogue client in proper fashion.
>>>>>
>>>>>>
>>>>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling
>>>>>>> capacity and once we get there we want the kernel to always be able to take over
>>>>>>> and do whatever it wants behind process back.
>>>>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
>>>>>> direction is where AMD is now and where it is heading in the future.
>>>>>> That doesn't preclude the option to allow the kernel to take over and do
>>>>>> what he wants. I agree that in KV we have a problem where we can't do a
>>>>>> mid-wave preemption, so theoretically, a long running compute kernel can
>>>>>> make things messy, but in Carrizo, we will have this ability. Having
>>>>>> said that, it will only be through the CP H/W scheduling. So AMD is
>>>>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is
>>>>>> the situation.
>>>>>
>>>>> We was for the overall Linux community but maybe i should not pretend to talk
>>>>> for anyone interested in having a common standard.
>>>>>
>>>>> My point is that current hardware do not have approriate hardware support for
>>>>> preemption hence, current hardware should use ioctl to schedule job and AMD
>>>>> should think a bit more on commiting to a design and handwaving any hardware
>>>>> short coming as something that can be work around in the software. The pinning
>>>>> thing is broken by design, only way to work around it is through kernel cmd
>>>>> queue scheduling that's a fact.
>>>>
>>>>>
>>>>> Once hardware support proper preemption and allows to move around/evict buffer
>>>>> use on behalf of userspace command queue then we can allow userspace scheduling
>>>>> but until then my personnal opinion is that it should not be allowed and that
>>>>> people will have to pay the ioctl price which i proved to be small, because
>>>>> really if you 100K queue each with one job, i would not expect that all those
>>>>> 100K job will complete in less time than it takes to execute an ioctl ie by
>>>>> even if you do not have the ioctl delay what ever you schedule will have to
>>>>> wait on previously submited jobs.
>>>>
>>>> But Jerome, the core problem still remains in effect, even with your
>>>> suggestion. If an application, either via userspace queue or via ioctl,
>>>> submits a long-running kernel, than the CPU in general can't stop the
>>>> GPU from running it. And if that kernel does while(1); than that's it,
>>>> game's over, and no matter how you submitted the work. So I don't really
>>>> see the big advantage in your proposal. Only in CZ we can stop this wave
>>>> (by CP H/W scheduling only). What are you saying is basically I won't
>>>> allow people to use compute on Linux KV system because it _may_ get the
>>>> system stuck.
>>>>
>>>> So even if I really wanted to, and I may agree with you theoretically on
>>>> that, I can't fulfill your desire to make the "kernel being able to
>>>> preempt at any time and be able to decrease or increase user queue
>>>> priority so overall kernel is in charge of resources management and it
>>>> can handle rogue client in proper fashion". Not in KV, and I guess not
>>>> in CZ as well.
>>>>
>>>> 	Oded
>>>
>>> I do understand that but using kernel ioctl provide the same kind of control
>>> as we have now ie we can bind/unbind buffer on per command buffer submission
>>> basis, just like with current graphic or compute stuff.
>>>
>>> Yes current graphic and compute stuff can launch a while and never return back
>>> and yes currently we have nothing against that but we should and solution would
>>> be simple just kill the gpu thread.
>>>
>> OK, so in that case, the kernel can simple unmap all the queues by
>> simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are
>> userspace, they will not be mapped to the internal CP scheduler.
>> Does that satisfy the kernel control level you want ?
>
> This raises questions, what does happen to currently running thread when you
> unmap queue ? Do they keep running until done ? If not than this means this
> will break user application and those is not an acceptable solution.

They keep running until they are done. However, any further submission of 
workloads to their queues has no effect, of course.
Maybe I should explain how this works from the userspace POV. When the userspace 
app wants to submit work to a queue, it writes to two different locations: the 
doorbell and a wptr shadow (which is in system memory, viewable by the GPU). 
Every write to the doorbell triggers the CP (and other stuff) in the GPU. The CP 
then checks whether the doorbell's queue is mapped. If it is, it handles the 
write. If not, it simply ignores it.
So, when we unmap the queues, the CP will ignore the doorbell writes by the 
userspace app, but the app will not know that (unless it specifically waits 
for results). When the queue is re-mapped, the CP will take the wptr shadow and 
use that to re-synchronize itself with the queue.
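
A minimal sketch of that submit path, to make it concrete; the names and the
layout below are illustrative assumptions, not the actual userspace runtime
code, and memory barriers are omitted for brevity:

#include <stdint.h>

struct user_queue {
        uint32_t *ring;                 /* ring buffer mapped in userspace      */
        uint32_t ring_dwords;           /* size of the ring in dwords           */
        volatile uint32_t *wptr_shadow; /* in system memory, visible to the GPU */
        volatile uint32_t *doorbell;    /* mapped doorbell                      */
        uint32_t wptr;                  /* local write pointer, in dwords       */
};

static void queue_submit(struct user_queue *q, const uint32_t *pkt, uint32_t ndw)
{
        uint32_t i;

        /* 1. Copy the packet into the ring at the current write pointer. */
        for (i = 0; i < ndw; i++)
                q->ring[(q->wptr + i) % q->ring_dwords] = pkt[i];
        q->wptr += ndw;

        /* 2. Update the wptr shadow so the CP can re-synchronize from it
         *    when a queue that was unmapped gets mapped again. */
        *q->wptr_shadow = q->wptr;

        /* 3. Ring the doorbell. If the queue is mapped the CP fetches the
         *    new work; if it is unmapped the write is ignored and the wptr
         *    shadow is used on remap. */
        *q->doorbell = q->wptr;
}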

>
> Otherwise, infrastructre inside radeon would be needed to force this queue
> unmap on bo_pin failure so gfx pinning can be retry.
If we fail to bo_pin then we of course unmap the queue and return -ENOMEM.
I would like to add another piece of information here that is relevant. I checked 
the code again, and the "mqd per userspace queue" allocation is done only on 
RADEON_GEM_DOMAIN_GTT, which AFAIK is *system memory* that is also mapped (and 
pinned) on the GART address space. Does that still count as GPU memory from 
your POV? Are you really concerned about GART address space being exhausted?

Moreover, in all of our code, I don't see us using RADEON_GEM_DOMAIN_VRAM. We 
have a function in radeon_kfd.c called pool_to_domain, and you can see there 
that we map KGD_POOL_FRAMEBUFFER to RADEON_GEM_DOMAIN_VRAM. However, if you 
search for KGD_POOL_FRAMEBUFFER, you will see that we don't use it anywhere.
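
To spell that out, the mapping reads roughly like the sketch below; only the
KGD_POOL_FRAMEBUFFER -> RADEON_GEM_DOMAIN_VRAM case is the one referred to
above, while the enum type name and the default case are simplifying
assumptions rather than the exact code from radeon_kfd.c:

/* Simplified sketch, not the exact function from radeon_kfd.c. */
static uint32_t pool_to_domain(enum kgd_memory_pool p)  /* enum name assumed */
{
        switch (p) {
        case KGD_POOL_FRAMEBUFFER:
                /* The only pool that would land in VRAM -- and, as noted
                 * above, nothing actually uses it today. */
                return RADEON_GEM_DOMAIN_VRAM;
        default:
                /* Everything we actually allocate ends up in GTT, i.e.
                 * system memory pinned and mapped through the GART. */
                return RADEON_GEM_DOMAIN_GTT;
        }
}
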
>
> Also how do you cope with doorbell exhaustion ? Do you just plan to error out ?
> In which case this is another DDOS vector but only affecting the gpu.
Yes, we plan to error out, but I don't see how we can defend against that. For a 
single process, we limit the number of queues to 1K (as we assign 1 doorbell page 
per process, and each doorbell is 4 bytes). However, if someone were to fork a lot 
of processes, and each of them registered and opened 1K queues, then that would be 
a problem. But how can we recognize such an event and differentiate it from 
normal operation? Did you have something specific in mind?
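
For reference, the arithmetic behind that 1K limit; the 4 KB doorbell page
size is an assumption of this example, since the text above only fixes one
page per process and 4 bytes per doorbell:

/* Sketch of the per-process queue limit; 4 KB page size is assumed. */
#define DOORBELL_PAGE_SIZE      4096    /* one doorbell page per process */
#define DOORBELL_ENTRY_SIZE     4       /* bytes per doorbell            */
#define MAX_QUEUES_PER_PROCESS  (DOORBELL_PAGE_SIZE / DOORBELL_ENTRY_SIZE) /* = 1024 */
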
>
> And there is many other questions that need answer, like my kernel memory map
> question because as of right now i assume that kfd allow any thread on the gpu
> to access any kernel memory.
Actually, no. We don't allow any access from gpu kernels to Linux kernel 
memory.

Let me explain more. In KV, the GPU is responsible for telling the IOMMU whether 
the access is privileged or not. If the access is privileged, then the IOMMU can 
allow the GPU to access kernel memory. However, we never configure the GPU in 
our driver to issue privileged accesses. In CZ, this is solved by configuring 
the IOMMU to not allow privileged accesses.

>
> Otherthings are how ill formated packet are handled by the hardware ? I do not
> see any mecanism to deal with SIGBUS or SIGFAULT.
You are correct when you say you don't see any mechanism. We are now developing 
it :) Basically, there will be two new modules. The first one is the event 
module, which is already written and working. The second module is the exception 
handling module, which is now being developed and will be built upon the event 
module. The exception handling module will take care of ill-formatted packets and 
other exceptions from the GPU (that are not handled by radeon).
>
>
> Also it is a worrisome prospect of seeing resource management completely ignore
> for future AMD hardware. Kernel exist for a reason ! Kernel main purpose is to
> provide resource management if AMD fails to understand that, this is not looking
> good on long term and i expect none of the HSA technology will get momentum and
> i would certainly advocate against any use of it inside product i work on.
>
So I made a mistake in writing that: "Not in KV, and I guess not in CZ as well" 
and I apologize for misleading you. What I should have written was:

"In KV, as a first generation HSA APU, we have limited ability to allow the 
kernel to preempt at any time and control user queue priority. However, in CZ we 
have dramatically improved control and resource management capabilities, that 
will allow the kernel to preempt at any time and also control user queue priority."

So, as you can see, AMD fully understands that the kernel's main purpose is to 
provide resource management, and I hope this will make you recommend AMD H/W now 
and in the future.

	Oded

> Cheers,
> Jérôme
>
>>
>> 	Oded
>>>>
>>>>>
>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>>>>>>>>>> stuff there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
>>>>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
>>>>>>>>>>>>> This would avoid crazy communication btw radeon and kfd.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as
>>>>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace
>>>>>>>>>>>>> program from allocating things at address you reserve for lds, scratch,
>>>>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
>>>>>>>>>>>>> address reserved for kernel (see kernel memory map).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The whole business of locking performance counter for exclusive per process
>>>>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
>>>>>>>>>>>>> space command ring.
>>>>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver
>>>>>>>>>>>> for a network card, which is slower than a different network card. Doesn't
>>>>>>>>>>>> seem reasonable this situation is would happen. He would still put both the
>>>>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
>>>>>>>>>>>> I don't think this is a valid reason to NACK the driver.
>>>>>>>>>
>>>>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
>>>>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
>>>>>>>>> the performance ioctl.
>>>>>>>>>
>>>>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you
>>>>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer
>>>>>>>>> execution.
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I only see issues with that. First and foremost i would
>>>>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
>>>>>>>>>>>>> overhead that is measurable in any meaning full way against a simple
>>>>>>>>>>>>> function call. I know the userspace command ring is a big marketing features
>>>>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues
>>>>>>>>>>>>> and for absolutely not upside afaict.
>>>>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its
>>>>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
>>>>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ?
>>>>>>>>>
>>>>>>>>> I am saying the overhead is not that big and it probably will not matter in most
>>>>>>>>> usecase. For instance i did wrote the most useless kernel module that add two
>>>>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
>>>>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
>>>>>>>>> ioctl is 13 times slower.
>>>>>>>>>
>>>>>>>>> Now if there is enough data that shows that a significant percentage of jobs
>>>>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace
>>>>>>>>> scheduling does make sense. But so far all we have is handwaving with no data
>>>>>>>>> to support any facts.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Now if we want to schedule from userspace than you will need to do something
>>>>>>>>> about the pinning, something that gives control to kernel so that kernel can
>>>>>>>>> unpin when it wants and move object when it wants no matter what userspace is
>>>>>>>>> doing.
>>>>>>>>>
>>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>>>> see: http://www.linux-mm.org/ .
>>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>>>
>>>>
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-22  8:05                             ` Oded Gabbay
  0 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-22  8:05 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Lewycky, linux-mm, Michel Dänzer, linux-kernel,
	dri-devel, Alexey Skidanov, Andrew Morton, Dave Airlie, Bridgman,
	John, Deucher, Alexander, Joerg Roedel, Ben Goz,
	Christian König, Daniel Vetter, Sellek, Tom

On 22/07/14 02:05, Jerome Glisse wrote:
> On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote:
>> On 21/07/14 22:28, Jerome Glisse wrote:
>>> On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
>>>> On 21/07/14 21:59, Jerome Glisse wrote:
>>>>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
>>>>>> On 21/07/14 21:14, Jerome Glisse wrote:
>>>>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
>>>>>>>> On 21/07/14 18:54, Jerome Glisse wrote:
>>>>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>>>>>>>>>> On 21/07/14 16:39, Christian König wrote:
>>>>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>>>>>>>>>> of the code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>>>>>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>>>>>>>>>> will adjust amdkfd to work within that framework.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>>>>>>>>>> drivers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> So quick comments before i finish going over all patches. There is many
>>>>>>>>>>>>> things that need more documentation espacialy as of right now there is
>>>>>>>>>>>>> no userspace i can go look at.
>>>>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the
>>>>>>>>>>>> time you dedicated to review the code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>>>>>>>>>> that side.
>>>>>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>>>>>>>>>> is it still a show stopper ?
>>>>>>>>>>>>
>>>>>>>>>>>> The kernel objects are:
>>>>>>>>>>>> - pipelines (4 per device)
>>>>>>>>>>>> - mqd per hiq (only 1 per device)
>>>>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
>>>>>>>>>>>> 256 alignment. So total *possible* memory is 128MB
>>>>>>>>>>>> - kernel queue (only 1 per device)
>>>>>>>>>>>> - fence address for kernel queue
>>>>>>>>>>>> - runlists for the CP (1 or 2 per device)
>>>>>>>>>>>
>>>>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the
>>>>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything
>>>>>>>>>>> else.
>>>>>>>>>>>
>>>>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>>>>>>>>>> questionable, everything else sounds reasonable.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Most of the pin downs are done on device initialization.
>>>>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
>>>>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>>>>>>>>>> memory, I think it is OK.
>>>>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only
>>>>>>>>>> have 1 or 2 runlists per device, so it is not that bad.
>>>>>>>>>
>>>>>>>>> 2G local memory ? You can not assume anything on userside configuration some
>>>>>>>>> one might build an hsa computer with 512M and still expect a functioning
>>>>>>>>> desktop.
>>>>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
>>>>>>>> Second, I would imagine we can build some protection around it, like
>>>>>>>> checking total local memory and limit number of queues based on some
>>>>>>>> percentage of that total local memory. So, if someone will have only
>>>>>>>> 512M, he will be able to open less queues.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I need to go look into what all this mqd is for, what it does and what it is
>>>>>>>>> about. But pinning is really bad and this is an issue with userspace command
>>>>>>>>> scheduling an issue that obviously AMD fails to take into account in design
>>>>>>>>> phase.
>>>>>>>> Maybe, but that is the H/W design non-the-less. We can't very well
>>>>>>>> change the H/W.
>>>>>>>
>>>>>>> You can not change the hardware but it is not an excuse to allow bad design to
>>>>>>> sneak in software to work around that. So i would rather penalize bad hardware
>>>>>>> design and have command submission in the kernel, until AMD fix its hardware to
>>>>>>> allow proper scheduling by the kernel and proper control by the kernel.
>>>>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
>>>>>> the kernel can not, IMO, scale well to 100K queues and 10K processes.
>>>>>
>>>>> I am not advocating for having kernel decide down to the very last details. I am
>>>>> advocating for kernel being able to preempt at any time and be able to decrease
>>>>> or increase user queue priority so overall kernel is in charge of resources
>>>>> management and it can handle rogue client in proper fashion.
>>>>>
>>>>>>
>>>>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling
>>>>>>> capacity and once we get there we want the kernel to always be able to take over
>>>>>>> and do whatever it wants behind process back.
>>>>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
>>>>>> direction is where AMD is now and where it is heading in the future.
>>>>>> That doesn't preclude the option to allow the kernel to take over and do
>>>>>> what he wants. I agree that in KV we have a problem where we can't do a
>>>>>> mid-wave preemption, so theoretically, a long running compute kernel can
>>>>>> make things messy, but in Carrizo, we will have this ability. Having
>>>>>> said that, it will only be through the CP H/W scheduling. So AMD is
>>>>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is
>>>>>> the situation.
>>>>>
>>>>> We was for the overall Linux community but maybe i should not pretend to talk
>>>>> for anyone interested in having a common standard.
>>>>>
>>>>> My point is that current hardware do not have approriate hardware support for
>>>>> preemption hence, current hardware should use ioctl to schedule job and AMD
>>>>> should think a bit more on commiting to a design and handwaving any hardware
>>>>> short coming as something that can be work around in the software. The pinning
>>>>> thing is broken by design, only way to work around it is through kernel cmd
>>>>> queue scheduling that's a fact.
>>>>
>>>>>
>>>>> Once hardware support proper preemption and allows to move around/evict buffer
>>>>> use on behalf of userspace command queue then we can allow userspace scheduling
>>>>> but until then my personnal opinion is that it should not be allowed and that
>>>>> people will have to pay the ioctl price which i proved to be small, because
>>>>> really if you 100K queue each with one job, i would not expect that all those
>>>>> 100K job will complete in less time than it takes to execute an ioctl ie by
>>>>> even if you do not have the ioctl delay what ever you schedule will have to
>>>>> wait on previously submited jobs.
>>>>
>>>> But Jerome, the core problem still remains in effect, even with your
>>>> suggestion. If an application, either via userspace queue or via ioctl,
>>>> submits a long-running kernel, than the CPU in general can't stop the
>>>> GPU from running it. And if that kernel does while(1); than that's it,
>>>> game's over, and no matter how you submitted the work. So I don't really
>>>> see the big advantage in your proposal. Only in CZ we can stop this wave
>>>> (by CP H/W scheduling only). What are you saying is basically I won't
>>>> allow people to use compute on Linux KV system because it _may_ get the
>>>> system stuck.
>>>>
>>>> So even if I really wanted to, and I may agree with you theoretically on
>>>> that, I can't fulfill your desire to make the "kernel being able to
>>>> preempt at any time and be able to decrease or increase user queue
>>>> priority so overall kernel is in charge of resources management and it
>>>> can handle rogue client in proper fashion". Not in KV, and I guess not
>>>> in CZ as well.
>>>>
>>>> 	Oded
>>>
>>> I do understand that but using kernel ioctl provide the same kind of control
>>> as we have now ie we can bind/unbind buffer on per command buffer submission
>>> basis, just like with current graphic or compute stuff.
>>>
>>> Yes current graphic and compute stuff can launch a while and never return back
>>> and yes currently we have nothing against that but we should and solution would
>>> be simple just kill the gpu thread.
>>>
>> OK, so in that case, the kernel can simple unmap all the queues by
>> simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are
>> userspace, they will not be mapped to the internal CP scheduler.
>> Does that satisfy the kernel control level you want ?
>
> This raises questions, what does happen to currently running thread when you
> unmap queue ? Do they keep running until done ? If not than this means this
> will break user application and those is not an acceptable solution.

They keep running until they are done. However, their submission of workloads to 
their queues has no effect, of course.
Maybe I should explain how this works from the userspace POV. When the userspace 
app wants to submit a work to the queue, it writes to 2 different locations, the 
doorbell and a wptr shadow (which is in system memory, viewable by the GPU). 
Every write to the doorbell triggers the CP (and other stuff) in the GPU. The CP 
then checks if the doorbell's queue is mapped. If so, than it handles this 
write. If not, it simply ignores it.
So, when we do unmap queues, the CP will ignore the doorbell writes by the 
userspace app, however the app will not know that (unless it specifically waits 
for results). When the queue is re-mapped, the CP will take the wptr shadow and 
use that to re-synchronize itself with the queue.

>
> Otherwise, infrastructre inside radeon would be needed to force this queue
> unmap on bo_pin failure so gfx pinning can be retry.
If we fail to bo_pin than we of course unmap the queue and return -ENOMEM.
I would like to add another information here that is relevant. I checked the 
code again, and the "mqd per userspace queue" allocation is done only on 
RADEON_GEM_DOMAIN_GTT, which AFAIK is *system memory* that is also mapped (and 
pinned) on the GART address space. Does that still counts as GPU memory from 
your POV ? Are you really concern about GART address space being exhausted ?

Moreover, in all of our code, I don't see us using RADEON_GEM_DOMAIN_VRAM. We 
have a function in radeon_kfd.c called pool_to_domain, and you can see there 
that we map KGD_POOL_FRAMEBUFFER to RADEON_GEM_DOMAIN_VRAM. However, if you 
search for KGD_POOL_FRAMEBUFFER, you will see that we don't use it anywhere.
>
> Also how do you cope with doorbell exhaustion ? Do you just plan to error out ?
> In which case this is another DDOS vector but only affecting the gpu.
Yes, we plan to error out, but I don't see how we can defend from that. For a 
single process, we limit the queues to be 1K (as we assign 1 doorbell page per 
process, and each doorbell is 4 bytes). However, if someone would fork a lot of 
processes, and each of them will register and open 1K queues, than that would be 
a problem. But how can we recognize such an event and differentiate it from 
normal operation ? Did you have something specific in mind ?
>
> And there is many other questions that need answer, like my kernel memory map
> question because as of right now i assume that kfd allow any thread on the gpu
> to access any kernel memory.
Actually, no. We don't allow any access from gpu kernels to the Linux kernel 
memory.

Let me explain more. In KV, the GPU is responsible of telling the IOMMU whether 
the access is privileged or not. If the access is privileged, than the IOMMU can 
allow the GPU to access kernel memory. However, we never configure the GPU in 
our driver to issue privileged accesses. In CZ, this is solved by configuring 
the IOMMU to not allow privileged accesses.

>
> Otherthings are how ill formated packet are handled by the hardware ? I do not
> see any mecanism to deal with SIGBUS or SIGFAULT.
You are correct when you say you don't see any mechanism. We are now developing 
it :) Basically, there will be two new modules. The first one is the event 
module, which is already written and working. The second module is the exception 
handling module, which is now being developed and will be build upon the event 
module. The exception handling module will take care of ill formated packets and 
other exceptions from the GPU (that are not handled by radeon).
>
>
> Also it is a worrisome prospect of seeing resource management completely ignore
> for future AMD hardware. Kernel exist for a reason ! Kernel main purpose is to
> provide resource management if AMD fails to understand that, this is not looking
> good on long term and i expect none of the HSA technology will get momentum and
> i would certainly advocate against any use of it inside product i work on.
>
So I made a mistake in writing that: "Not in KV, and I guess not in CZ as well" 
and I apologize for misleading you. What I needed to write was:

"In KV, as a first generation HSA APU, we have limited ability to allow the 
kernel to preempt at any time and control user queue priority. However, in CZ we 
have dramatically improved control and resource management capabilities, that 
will allow the kernel to preempt at any time and also control user queue priority."

So, as you can see, AMD fully understands that the kernel main purpose is to 
provide resource management and I hope this will make you recommend AMD H/W now 
and in the future.

	Oded

> Cheers,
> Jérôme
>
>>
>> 	Oded
>>>>
>>>>>
>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>>>>>>>>>> stuff there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
>>>>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
>>>>>>>>>>>>> This would avoid crazy communication btw radeon and kfd.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as
>>>>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace
>>>>>>>>>>>>> program from allocating things at address you reserve for lds, scratch,
>>>>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
>>>>>>>>>>>>> address reserved for kernel (see kernel memory map).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The whole business of locking performance counter for exclusive per process
>>>>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
>>>>>>>>>>>>> space command ring.
>>>>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver
>>>>>>>>>>>> for a network card, which is slower than a different network card. Doesn't
>>>>>>>>>>>> seem reasonable this situation is would happen. He would still put both the
>>>>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
>>>>>>>>>>>> I don't think this is a valid reason to NACK the driver.
>>>>>>>>>
>>>>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
>>>>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
>>>>>>>>> the performance ioctl.
>>>>>>>>>
>>>>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you
>>>>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer
>>>>>>>>> execution.
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I only see issues with that. First and foremost i would
>>>>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
>>>>>>>>>>>>> overhead that is measurable in any meaning full way against a simple
>>>>>>>>>>>>> function call. I know the userspace command ring is a big marketing features
>>>>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues
>>>>>>>>>>>>> and for absolutely not upside afaict.
>>>>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its
>>>>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
>>>>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ?
>>>>>>>>>
>>>>>>>>> I am saying the overhead is not that big and it probably will not matter in most
>>>>>>>>> usecase. For instance i did wrote the most useless kernel module that add two
>>>>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
>>>>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
>>>>>>>>> ioctl is 13 times slower.
>>>>>>>>>
>>>>>>>>> Now if there is enough data that shows that a significant percentage of jobs
>>>>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace
>>>>>>>>> scheduling does make sense. But so far all we have is handwaving with no data
>>>>>>>>> to support any facts.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Now if we want to schedule from userspace than you will need to do something
>>>>>>>>> about the pinning, something that gives control to kernel so that kernel can
>>>>>>>>> unpin when it wants and move object when it wants no matter what userspace is
>>>>>>>>> doing.
>>>>>>>>>
>>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>>>> see: http://www.linux-mm.org/ .
>>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>>>
>>>>
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-22  8:05                             ` Oded Gabbay
  0 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-22  8:05 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Lewycky, Sellek, Tom, Michel Dänzer, linux-kernel,
	dri-devel, linux-mm, Alexey Skidanov, Deucher, Alexander,
	Dave Airlie, Andrew Morton, Christian König

On 22/07/14 02:05, Jerome Glisse wrote:
> On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote:
>> On 21/07/14 22:28, Jerome Glisse wrote:
>>> On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
>>>> On 21/07/14 21:59, Jerome Glisse wrote:
>>>>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote:
>>>>>> On 21/07/14 21:14, Jerome Glisse wrote:
>>>>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote:
>>>>>>>> On 21/07/14 18:54, Jerome Glisse wrote:
>>>>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote:
>>>>>>>>>> On 21/07/14 16:39, Christian König wrote:
>>>>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote:
>>>>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As a continuation to the existing discussion, here is a v2 patch series
>>>>>>>>>>>>>> restructured with a cleaner history and no totally-different-early-versions
>>>>>>>>>>>>>> of the code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 patches, where 5 of them
>>>>>>>>>>>>>> are modifications to radeon driver and 18 of them include only amdkfd code.
>>>>>>>>>>>>>> There is no code going away or even modified between patches, only added.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and moved to reside under
>>>>>>>>>>>>>> drm/radeon/amdkfd. This move was done to emphasize the fact that this driver
>>>>>>>>>>>>>> is an AMD-only driver at this point. Having said that, we do foresee a
>>>>>>>>>>>>>> generic hsa framework being implemented in the future and in that case, we
>>>>>>>>>>>>>> will adjust amdkfd to work within that framework.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx drivers, we want to
>>>>>>>>>>>>>> keep it as a seperate driver from radeon. Therefore, the amdkfd code is
>>>>>>>>>>>>>> contained in its own folder. The amdkfd folder was put under the radeon
>>>>>>>>>>>>>> folder because the only AMD gfx driver in the Linux kernel at this point
>>>>>>>>>>>>>> is the radeon driver. Having said that, we will probably need to move it
>>>>>>>>>>>>>> (maybe to be directly under drm) after we integrate with additional AMD gfx
>>>>>>>>>>>>>> drivers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For people who like to review using git, the v2 patch set is located at:
>>>>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-next-3.17-v2
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@amd.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> So quick comments before i finish going over all patches. There is many
>>>>>>>>>>>>> things that need more documentation espacialy as of right now there is
>>>>>>>>>>>>> no userspace i can go look at.
>>>>>>>>>>>> So quick comments on some of your questions but first of all, thanks for the
>>>>>>>>>>>> time you dedicated to review the code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning this is a big
>>>>>>>>>>>>> no, that would need serious arguments for any hope of convincing me on
>>>>>>>>>>>>> that side.
>>>>>>>>>>>> We only do gpu memory pinning for kernel objects. There are no userspace
>>>>>>>>>>>> objects that are pinned on the gpu memory in our driver. If that is the case,
>>>>>>>>>>>> is it still a show stopper ?
>>>>>>>>>>>>
>>>>>>>>>>>> The kernel objects are:
>>>>>>>>>>>> - pipelines (4 per device)
>>>>>>>>>>>> - mqd per hiq (only 1 per device)
>>>>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K queues per process, for
>>>>>>>>>>>> a total of 512K queues. Each mqd is 151 bytes, but the allocation is done in
>>>>>>>>>>>> 256 alignment. So total *possible* memory is 128MB
>>>>>>>>>>>> - kernel queue (only 1 per device)
>>>>>>>>>>>> - fence address for kernel queue
>>>>>>>>>>>> - runlists for the CP (1 or 2 per device)
>>>>>>>>>>>
>>>>>>>>>>> The main questions here are if it's avoid able to pin down the memory and if the
>>>>>>>>>>> memory is pinned down at driver load, by request from userspace or by anything
>>>>>>>>>>> else.
>>>>>>>>>>>
>>>>>>>>>>> As far as I can see only the "mqd per userspace queue" might be a bit
>>>>>>>>>>> questionable, everything else sounds reasonable.
>>>>>>>>>>>
>>>>>>>>>>> Christian.
>>>>>>>>>>
>>>>>>>>>> Most of the pin downs are done on device initialization.
>>>>>>>>>> The "mqd per userspace" is done per userspace queue creation. However, as I
>>>>>>>>>> said, it has an upper limit of 128MB on KV, and considering the 2G local
>>>>>>>>>> memory, I think it is OK.
>>>>>>>>>> The runlists are also done on userspace queue creation/deletion, but we only
>>>>>>>>>> have 1 or 2 runlists per device, so it is not that bad.
>>>>>>>>>
>>>>>>>>> 2G local memory ? You can not assume anything on userside configuration some
>>>>>>>>> one might build an hsa computer with 512M and still expect a functioning
>>>>>>>>> desktop.
>>>>>>>> First of all, I'm only considering Kaveri computer, not "hsa" computer.
>>>>>>>> Second, I would imagine we can build some protection around it, like
>>>>>>>> checking total local memory and limit number of queues based on some
>>>>>>>> percentage of that total local memory. So, if someone will have only
>>>>>>>> 512M, he will be able to open less queues.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I need to go look into what all this mqd is for, what it does and what it is
>>>>>>>>> about. But pinning is really bad and this is an issue with userspace command
>>>>>>>>> scheduling an issue that obviously AMD fails to take into account in design
>>>>>>>>> phase.
>>>>>>>> Maybe, but that is the H/W design non-the-less. We can't very well
>>>>>>>> change the H/W.
>>>>>>>
>>>>>>> You can not change the hardware but it is not an excuse to allow bad design to
>>>>>>> sneak in software to work around that. So i would rather penalize bad hardware
>>>>>>> design and have command submission in the kernel, until AMD fix its hardware to
>>>>>>> allow proper scheduling by the kernel and proper control by the kernel.
>>>>>> I'm sorry but I do *not* think this is a bad design. S/W scheduling in
>>>>>> the kernel can not, IMO, scale well to 100K queues and 10K processes.
>>>>>
>>>>> I am not advocating for having kernel decide down to the very last details. I am
>>>>> advocating for kernel being able to preempt at any time and be able to decrease
>>>>> or increase user queue priority so overall kernel is in charge of resources
>>>>> management and it can handle rogue client in proper fashion.
>>>>>
>>>>>>
>>>>>>> Because really where we want to go is having GPU closer to a CPU in term of scheduling
>>>>>>> capacity and once we get there we want the kernel to always be able to take over
>>>>>>> and do whatever it wants behind process back.
>>>>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling
>>>>>> direction is where AMD is now and where it is heading in the future.
>>>>>> That doesn't preclude the option to allow the kernel to take over and do
>>>>>> what he wants. I agree that in KV we have a problem where we can't do a
>>>>>> mid-wave preemption, so theoretically, a long running compute kernel can
>>>>>> make things messy, but in Carrizo, we will have this ability. Having
>>>>>> said that, it will only be through the CP H/W scheduling. So AMD is
>>>>>> _not_ going to abandon H/W scheduling. You can dislike it, but this is
>>>>>> the situation.
>>>>>
>>>>> We was for the overall Linux community but maybe i should not pretend to talk
>>>>> for anyone interested in having a common standard.
>>>>>
>>>>> My point is that current hardware do not have approriate hardware support for
>>>>> preemption hence, current hardware should use ioctl to schedule job and AMD
>>>>> should think a bit more on commiting to a design and handwaving any hardware
>>>>> short coming as something that can be work around in the software. The pinning
>>>>> thing is broken by design, only way to work around it is through kernel cmd
>>>>> queue scheduling that's a fact.
>>>>
>>>>>
>>>>> Once hardware support proper preemption and allows to move around/evict buffer
>>>>> use on behalf of userspace command queue then we can allow userspace scheduling
>>>>> but until then my personnal opinion is that it should not be allowed and that
>>>>> people will have to pay the ioctl price which i proved to be small, because
>>>>> really if you 100K queue each with one job, i would not expect that all those
>>>>> 100K job will complete in less time than it takes to execute an ioctl ie by
>>>>> even if you do not have the ioctl delay what ever you schedule will have to
>>>>> wait on previously submited jobs.
>>>>
>>>> But Jerome, the core problem still remains in effect, even with your
>>>> suggestion. If an application, either via userspace queue or via ioctl,
>>>> submits a long-running kernel, than the CPU in general can't stop the
>>>> GPU from running it. And if that kernel does while(1); than that's it,
>>>> game's over, and no matter how you submitted the work. So I don't really
>>>> see the big advantage in your proposal. Only in CZ we can stop this wave
>>>> (by CP H/W scheduling only). What are you saying is basically I won't
>>>> allow people to use compute on Linux KV system because it _may_ get the
>>>> system stuck.
>>>>
>>>> So even if I really wanted to, and I may agree with you theoretically on
>>>> that, I can't fulfill your desire to make the "kernel being able to
>>>> preempt at any time and be able to decrease or increase user queue
>>>> priority so overall kernel is in charge of resources management and it
>>>> can handle rogue client in proper fashion". Not in KV, and I guess not
>>>> in CZ as well.
>>>>
>>>> 	Oded
>>>
>>> I do understand that but using kernel ioctl provide the same kind of control
>>> as we have now ie we can bind/unbind buffer on per command buffer submission
>>> basis, just like with current graphic or compute stuff.
>>>
>>> Yes current graphic and compute stuff can launch a while and never return back
>>> and yes currently we have nothing against that but we should and solution would
>>> be simple just kill the gpu thread.
>>>
>> OK, so in that case, the kernel can simple unmap all the queues by
>> simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues are
>> userspace, they will not be mapped to the internal CP scheduler.
>> Does that satisfy the kernel control level you want ?
>
> This raises questions, what does happen to currently running thread when you
> unmap queue ? Do they keep running until done ? If not than this means this
> will break user application and those is not an acceptable solution.

They keep running until they are done. However, further submission of workloads 
to their queues has no effect, of course.
Maybe I should explain how this works from the userspace POV. When the userspace 
app wants to submit work to a queue, it writes to two different locations: the 
doorbell and a wptr shadow (which is in system memory, viewable by the GPU). 
Every write to the doorbell triggers the CP (and other stuff) in the GPU. The CP 
then checks whether the doorbell's queue is mapped. If so, then it handles the 
write. If not, it simply ignores it.
So, when we unmap the queues, the CP will ignore the doorbell writes by the 
userspace app; however, the app will not know that (unless it specifically waits 
for results). When the queue is re-mapped, the CP will take the wptr shadow and 
use that to re-synchronize itself with the queue.
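
To make that concrete, here is a minimal userspace sketch of the submission path 
described above. All names and structures are made up for illustration (this is 
not the actual userspace library), and real code would also need memory barriers 
between the three writes:

#include <stdint.h>

struct user_queue {
	uint32_t *ring;			/* ring buffer in system memory */
	uint32_t ring_size;		/* size in dwords, power of two */
	volatile uint32_t *wptr_shadow;	/* system memory, visible to the GPU */
	volatile uint32_t *doorbell;	/* 4-byte doorbell in the mapped doorbell page */
	uint32_t wptr;			/* local copy of the write pointer */
};

static void submit_packet(struct user_queue *q, const uint32_t *pkt, uint32_t ndw)
{
	uint32_t i;

	/* 1. Copy the packet into the ring at the current write pointer. */
	for (i = 0; i < ndw; i++)
		q->ring[(q->wptr + i) & (q->ring_size - 1)] = pkt[i];
	q->wptr += ndw;

	/* 2. Update the wptr shadow in system memory. If the queue is
	 *    currently unmapped, the CP uses this value to re-synchronize
	 *    with the queue once it is mapped again. */
	*q->wptr_shadow = q->wptr;

	/* 3. Ring the doorbell. If the queue is unmapped, the CP simply
	 *    ignores this write. */
	*q->doorbell = q->wptr;
}

The important property is that the wptr shadow, not the doorbell, is the 
authoritative record the CP re-reads after a remap.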

>
> Otherwise, infrastructre inside radeon would be needed to force this queue
> unmap on bo_pin failure so gfx pinning can be retry.
If we fail to pin the BO, then we of course unmap the queue and return -ENOMEM.
I would like to add another piece of relevant information here. I checked the 
code again, and the "mqd per userspace queue" allocation is done only in 
RADEON_GEM_DOMAIN_GTT, which AFAIK is *system memory* that is also mapped (and 
pinned) in the GART address space. Does that still count as GPU memory from 
your POV ? Are you really concerned about GART address space being exhausted ?

Moreover, in all of our code, I don't see us using RADEON_GEM_DOMAIN_VRAM. We 
have a function in radeon_kfd.c called pool_to_domain, and you can see there 
that we map KGD_POOL_FRAMEBUFFER to RADEON_GEM_DOMAIN_VRAM. However, if you 
search for KGD_POOL_FRAMEBUFFER, you will see that we don't use it anywhere.
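
As a quick sanity check on the numbers quoted earlier in this thread (1K queues 
per process and 512K queues in total, i.e. up to 512 processes, with each mqd 
being 151 bytes allocated at 256-byte alignment), here is the worst-case GART 
footprint of those mqds. This is plain arithmetic, not driver code:

#include <stdio.h>

int main(void)
{
	const unsigned long mqd_size      = 151;	/* bytes per mqd           */
	const unsigned long mqd_align     = 256;	/* allocation granularity  */
	const unsigned long queues_per_p  = 1024;	/* 1K user queues/process  */
	const unsigned long max_processes = 512;	/* 512K queues total on KV */

	unsigned long per_queue   = ((mqd_size + mqd_align - 1) / mqd_align) * mqd_align;
	unsigned long per_process = per_queue * queues_per_p;
	unsigned long total       = per_process * max_processes;

	/* prints: per queue: 256 B, per process: 256 KB, worst case: 128 MB */
	printf("per queue: %lu B, per process: %lu KB, worst case: %lu MB\n",
	       per_queue, per_process >> 10, total >> 20);
	return 0;
}
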
>
> Also how do you cope with doorbell exhaustion ? Do you just plan to error out ?
> In which case this is another DDOS vector but only affecting the gpu.
Yes, we plan to error out, but I don't see how we can defend against that. For a 
single process, we limit the number of queues to 1K (as we assign one doorbell 
page per process, and each doorbell is 4 bytes). However, if someone were to fork 
a lot of processes, and each of them registered and opened 1K queues, then that 
would be a problem. But how can we recognize such an event and differentiate it 
from normal operation ? Did you have something specific in mind ?
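
For reference, this is where the 1K figure comes from, together with a minimal 
sketch of the per-process check (all names here are hypothetical, this is not 
the actual kfd code):

#include <errno.h>

#define DOORBELL_PAGE_SIZE	4096	/* one doorbell page assigned per process */
#define DOORBELL_SIZE		4	/* bytes per doorbell                     */
#define MAX_QUEUES_PER_PROCESS	(DOORBELL_PAGE_SIZE / DOORBELL_SIZE)	/* 1024 */

struct process_queues_sketch {
	unsigned int queue_count;	/* user queues currently open by this process */
};

/* Called on queue creation; returns 0 or -ENOSPC once the doorbell page is full. */
static int reserve_doorbell_slot(struct process_queues_sketch *p)
{
	if (p->queue_count >= MAX_QUEUES_PER_PROCESS)
		return -ENOSPC;
	p->queue_count++;
	return 0;
}
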
>
> And there is many other questions that need answer, like my kernel memory map
> question because as of right now i assume that kfd allow any thread on the gpu
> to access any kernel memory.
Actually, no. We don't allow any access from gpu kernels to Linux kernel 
memory.

Let me explain more. In KV, the GPU is responsible for telling the IOMMU whether 
the access is privileged or not. If the access is privileged, then the IOMMU can 
allow the GPU to access kernel memory. However, we never configure the GPU in 
our driver to issue privileged accesses. In CZ, this is solved by configuring 
the IOMMU to not allow privileged accesses.

>
> Otherthings are how ill formated packet are handled by the hardware ? I do not
> see any mecanism to deal with SIGBUS or SIGFAULT.
You are correct when you say you don't see any mechanism. We are now developing 
it :) Basically, there will be two new modules. The first one is the event 
module, which is already written and working. The second module is the exception 
handling module, which is now being developed and will be built upon the event 
module. The exception handling module will take care of ill-formatted packets 
and other exceptions from the GPU (that are not handled by radeon).
>
>
> Also it is a worrisome prospect of seeing resource management completely ignore
> for future AMD hardware. Kernel exist for a reason ! Kernel main purpose is to
> provide resource management if AMD fails to understand that, this is not looking
> good on long term and i expect none of the HSA technology will get momentum and
> i would certainly advocate against any use of it inside product i work on.
>
So I made a mistake in writing that: "Not in KV, and I guess not in CZ as well", 
and I apologize for misleading you. What I needed to write was:

"In KV, as a first-generation HSA APU, we have limited ability to allow the 
kernel to preempt at any time and control user queue priority. However, in CZ we 
have dramatically improved control and resource management capabilities that 
will allow the kernel to preempt at any time and also control user queue priority."

So, as you can see, AMD fully understands that the kernel's main purpose is to 
provide resource management, and I hope this will make you recommend AMD H/W now 
and in the future.

	Oded

> Cheers,
> Jérôme
>
>>
>> 	Oded
>>>>
>>>>>
>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory and add common
>>>>>>>>>>>>> stuff there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT then i would
>>>>>>>>>>>>> say this far better to avoid the whole kfd module and add ioctl to radeon.
>>>>>>>>>>>>> This would avoid crazy communication btw radeon and kfd.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The whole aperture business needs some serious explanation. Especialy as
>>>>>>>>>>>>> you want to use userspace address there is nothing to prevent userspace
>>>>>>>>>>>>> program from allocating things at address you reserve for lds, scratch,
>>>>>>>>>>>>> ... only sane way would be to move those lds, scratch inside the virtual
>>>>>>>>>>>>> address reserved for kernel (see kernel memory map).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The whole business of locking performance counter for exclusive per process
>>>>>>>>>>>>> access is a big NO. Which leads me to the questionable usefullness of user
>>>>>>>>>>>>> space command ring.
>>>>>>>>>>>> That's like saying: "Which leads me to the questionable usefulness of HSA". I
>>>>>>>>>>>> find it analogous to a situation where a network maintainer nacking a driver
>>>>>>>>>>>> for a network card, which is slower than a different network card. Doesn't
>>>>>>>>>>>> seem reasonable this situation is would happen. He would still put both the
>>>>>>>>>>>> drivers in the kernel because people want to use the H/W and its features. So,
>>>>>>>>>>>> I don't think this is a valid reason to NACK the driver.
>>>>>>>>>
>>>>>>>>> Let me rephrase, drop the the performance counter ioctl and modulo memory pinning
>>>>>>>>> i see no objection. In other word, i am not NACKING whole patchset i am NACKING
>>>>>>>>> the performance ioctl.
>>>>>>>>>
>>>>>>>>> Again this is another argument for round trip to the kernel. As inside kernel you
>>>>>>>>> could properly do exclusive gpu counter access accross single user cmd buffer
>>>>>>>>> execution.
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I only see issues with that. First and foremost i would
>>>>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has a higher an
>>>>>>>>>>>>> overhead that is measurable in any meaning full way against a simple
>>>>>>>>>>>>> function call. I know the userspace command ring is a big marketing features
>>>>>>>>>>>>> that please ignorant userspace programmer. But really this only brings issues
>>>>>>>>>>>>> and for absolutely not upside afaict.
>>>>>>>>>>>> Really ? You think that doing a context switch to kernel space, with all its
>>>>>>>>>>>> overhead, is _not_ more expansive than just calling a function in userspace
>>>>>>>>>>>> which only puts a buffer on a ring and writes a doorbell ?
>>>>>>>>>
>>>>>>>>> I am saying the overhead is not that big and it probably will not matter in most
>>>>>>>>> usecase. For instance i did wrote the most useless kernel module that add two
>>>>>>>>> number through an ioctl (http://people.freedesktop.org/~glisse/adder.tar) and
>>>>>>>>> it takes ~0.35microseconds with ioctl while function is ~0.025microseconds so
>>>>>>>>> ioctl is 13 times slower.
>>>>>>>>>
>>>>>>>>> Now if there is enough data that shows that a significant percentage of jobs
>>>>>>>>> submited to the GPU will take less that 0.35microsecond then yes userspace
>>>>>>>>> scheduling does make sense. But so far all we have is handwaving with no data
>>>>>>>>> to support any facts.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Now if we want to schedule from userspace than you will need to do something
>>>>>>>>> about the pinning, something that gives control to kernel so that kernel can
>>>>>>>>> unpin when it wants and move object when it wants no matter what userspace is
>>>>>>>>> doing.
>>>>>>>>>
>>>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>>>> see: http://www.linux-mm.org/ .
>>>>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>>>>
>>>>
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-22  7:23                       ` Daniel Vetter
@ 2014-07-22  8:10                         ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-22  8:10 UTC (permalink / raw)
  To: Jerome Glisse, Andrew Lewycky, Michel Dänzer, linux-kernel,
	dri-devel, linux-mm, Alexey Skidanov, Andrew Morton, Bridgman,
	John, Dave Airlie, Christian König, Joerg Roedel,
	Daniel Vetter, Sellek, Tom, Deucher, Alexander

On 22/07/14 10:23, Daniel Vetter wrote:
> On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote:
>> But Jerome, the core problem still remains in effect, even with your
>> suggestion. If an application, either via userspace queue or via ioctl,
>> submits a long-running kernel, than the CPU in general can't stop the
>> GPU from running it. And if that kernel does while(1); than that's it,
>> game's over, and no matter how you submitted the work. So I don't really
>> see the big advantage in your proposal. Only in CZ we can stop this wave
>> (by CP H/W scheduling only). What are you saying is basically I won't
>> allow people to use compute on Linux KV system because it _may_ get the
>> system stuck.
>>
>> So even if I really wanted to, and I may agree with you theoretically on
>> that, I can't fulfill your desire to make the "kernel being able to
>> preempt at any time and be able to decrease or increase user queue
>> priority so overall kernel is in charge of resources management and it
>> can handle rogue client in proper fashion". Not in KV, and I guess not
>> in CZ as well.
>
> At least on intel the execlist stuff which is used for preemption can be
> used by both the cpu and the firmware scheduler. So we can actually
> preempt when doing cpu scheduling.
>
> It sounds like current amd hw doesn't have any preemption at all. And
> without preemption I don't think we should ever consider to allow
> userspace to directly submit stuff to the hw and overload. Imo the kernel
> _must_ sit in between and reject clients that don't behave. Of course you
> can only ever react (worst case with a gpu reset, there's code floating
> around for that on intel-gfx), but at least you can do something.
>
> If userspace has a direct submit path to the hw then this gets really
> tricky, if not impossible.
> -Daniel
>

Hi Daniel,
See the email I just sent to Jerome regarding preemption. Bottom line, in KV, we 
can preempt running queues, except in the case of a stuck gpu kernel. In CZ, 
this was solved.

So, in this regard, I don't think there is any difference between userspace 
queues and ioctls.

	Oded

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-22  7:28                       ` Daniel Vetter
@ 2014-07-22  8:19                         ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-22  8:19 UTC (permalink / raw)
  To: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On 22/07/14 10:28, Daniel Vetter wrote:
> On Mon, Jul 21, 2014 at 03:03:07PM -0400, Jerome Glisse wrote:
>> On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote:
>>> On 21/07/14 21:22, Daniel Vetter wrote:
>>>> On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote:
>>>>>> I'm not sure whether we can do the same trick with the hw scheduler. But
>>>>>> then unpinning hw contexts will drain the pipeline anyway, so I guess we
>>>>>> can just stop feeding the hw scheduler until it runs dry. And then unpin
>>>>>> and evict.
>>>>> So, I'm afraid but we can't do this for AMD Kaveri because:
>>>>
>>>> Well as long as you can drain the hw scheduler queue (and you can do
>>>> that, worst case you have to unmap all the doorbells and other stuff
>>>> to intercept further submission from userspace) you can evict stuff.
>>>
>>> I can't drain the hw scheduler queue, as I can't do mid-wave preemption.
>>> Moreover, if I use the dequeue request register to preempt a queue
>>> during a dispatch it may be that some waves (wave groups actually) of
>>> the dispatch have not yet been created, and when I reactivate the mqd,
>>> they should be created but are not. However, this works fine if you use
>>> the HIQ. the CP ucode correctly saves and restores the state of an
>>> outstanding dispatch. I don't think we have access to the state from
>>> software at all, so it's not a bug, it is "as designed".
>>>
>>
>> I think here Daniel is suggesting to unmapp the doorbell page, and track
>> each write made by userspace to it and while unmapped wait for the gpu to
>> drain or use some kind of fence on a special queue. Once GPU is drain we
>> can move pinned buffer, then remap the doorbell and update it to the last
>> value written by userspace which will resume execution to the next job.
>
> Exactly, just prevent userspace from submitting more. And if you have
> misbehaving userspace that submits too much, reset the gpu and tell it
> that you're sorry but won't schedule any more work.

I'm not sure how you intend to know if a userspace misbehaves or not. Can you 
elaborate ?

	Oded
>
> We have this already in i915 (since like all other gpus we're not
> preempting right now) and it works. There's some code floating around to
> even restrict the reset to _just_ the offending submission context, with
> nothing else getting corrupted.
>
> You can do all this with the doorbells and unmapping them, but it's a
> pain. Much easier if you have a real ioctl, and I haven't seen anyone with
> perf data indicating that an ioctl would be too much overhead on linux.
> Neither in this thread nor internally here at intel.
> -Daniel
>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-22  7:40                         ` Daniel Vetter
@ 2014-07-22  8:21                           ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-22  8:21 UTC (permalink / raw)
  To: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Sellek, Tom,
	linux-kernel, dri-devel, linux-mm, Christian König

On 22/07/14 10:40, Daniel Vetter wrote:
> On Tue, Jul 22, 2014 at 09:28:51AM +0200, Daniel Vetter wrote:
>> On Mon, Jul 21, 2014 at 03:03:07PM -0400, Jerome Glisse wrote:
>>> On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote:
>>>> On 21/07/14 21:22, Daniel Vetter wrote:
>>>>> On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@amd.com> wrote:
>>>>>>> I'm not sure whether we can do the same trick with the hw scheduler. But
>>>>>>> then unpinning hw contexts will drain the pipeline anyway, so I guess we
>>>>>>> can just stop feeding the hw scheduler until it runs dry. And then unpin
>>>>>>> and evict.
>>>>>> So, I'm afraid but we can't do this for AMD Kaveri because:
>>>>>
>>>>> Well as long as you can drain the hw scheduler queue (and you can do
>>>>> that, worst case you have to unmap all the doorbells and other stuff
>>>>> to intercept further submission from userspace) you can evict stuff.
>>>>
>>>> I can't drain the hw scheduler queue, as I can't do mid-wave preemption.
>>>> Moreover, if I use the dequeue request register to preempt a queue
>>>> during a dispatch it may be that some waves (wave groups actually) of
>>>> the dispatch have not yet been created, and when I reactivate the mqd,
>>>> they should be created but are not. However, this works fine if you use
>>>> the HIQ. the CP ucode correctly saves and restores the state of an
>>>> outstanding dispatch. I don't think we have access to the state from
>>>> software at all, so it's not a bug, it is "as designed".
>>>>
>>>
>>> I think here Daniel is suggesting to unmapp the doorbell page, and track
>>> each write made by userspace to it and while unmapped wait for the gpu to
>>> drain or use some kind of fence on a special queue. Once GPU is drain we
>>> can move pinned buffer, then remap the doorbell and update it to the last
>>> value written by userspace which will resume execution to the next job.
>>
>> Exactly, just prevent userspace from submitting more. And if you have
>> misbehaving userspace that submits too much, reset the gpu and tell it
>> that you're sorry but won't schedule any more work.
>>
>> We have this already in i915 (since like all other gpus we're not
>> preempting right now) and it works. There's some code floating around to
>> even restrict the reset to _just_ the offending submission context, with
>> nothing else getting corrupted.
>>
>> You can do all this with the doorbells and unmapping them, but it's a
>> pain. Much easier if you have a real ioctl, and I haven't seen anyone with
>> perf data indicating that an ioctl would be too much overhead on linux.
>> Neither in this thread nor internally here at intel.
>
> Aside: Another reason why the ioctl is better than the doorbell is
> integration with other drivers. Yeah I know this is about compute, but
> sooner or later someone will want to e.g. post-proc video frames between
> the v4l capture device and the gpu mpeg encoder. Or something else fancy.
>
> Then you want to be able to somehow integrate into a cross-driver fence
> framework like android syncpts, and you can't do that without an ioctl for
> the compute submissions.
> -Daniel
>

I assume you are talking about interop between graphics and compute. For that, 
we have a module that is now being tested, and it indeed uses an ioctl to map a 
graphics object into the compute process address space. However, after the 
translation is done, the work is done only in userspace.

	Oded

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-22  8:19                         ` Oded Gabbay
  (?)
@ 2014-07-22  9:21                           ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-22  9:21 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote:
>> Exactly, just prevent userspace from submitting more. And if you have
>> misbehaving userspace that submits too much, reset the gpu and tell it
>> that you're sorry but won't schedule any more work.
>
> I'm not sure how you intend to know if a userspace misbehaves or not. Can
> you elaborate ?

Well, that's mostly policy. Currently in i915 we only have a check for
hangs, and if userspace hangs a bit too often then we stop it. I guess
you can do that with the queue unmapping you've described in reply to
Jerome's mail.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-22  9:21                           ` Daniel Vetter
  (?)
@ 2014-07-22  9:24                             ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-22  9:24 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, Evgeny Pinchuk,
	linux-kernel, dri-devel, linux-mm

On Tue, Jul 22, 2014 at 11:21 AM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote:
>>> Exactly, just prevent userspace from submitting more. And if you have
>>> misbehaving userspace that submits too much, reset the gpu and tell it
>>> that you're sorry but won't schedule any more work.
>>
>> I'm not sure how you intend to know if a userspace misbehaves or not. Can
>> you elaborate ?
>
> Well that's mostly policy, currently in i915 we only have a check for
> hangs, and if userspace hangs a bit too often then we stop it. I guess
> you can do that with the queue unmapping you've describe in reply to
> Jerome's mail.

Not just graphics, and especially not just graphics from amd. My
experience is that soc designers are _really_ good at stitching
random stuff together. So you very likely need to deal with non-radeon
drivers, too.

Also the real problem isn't really the memory sharing - we have
dma-buf already and could add a special mmap flag to make sure it will
work with svm/iommuv2. The problem is synchronization (either with the
new struct fence stuff from Maarten or with android syncpoints or
something like that). And for that to be possible you need to go
through the kernel.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
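
Purely as an illustration of the kind of kernel entry point this implies 
(nothing like this exists in the patch set; every name below is hypothetical), 
a compute submit ioctl could hand back a fence file descriptor that other 
drivers can wait on:

#include <linux/ioctl.h>
#include <linux/types.h>

struct hypothetical_submit_args {
	__u64 packets_ptr;	/* userspace pointer to the packet data        */
	__u32 packets_size;	/* size of the packet data, in bytes           */
	__u32 queue_id;		/* which user queue to submit to               */
	__s32 out_fence_fd;	/* returned: fd that signals when the job ends */
	__u32 pad;
};

#define HYPOTHETICAL_IOC_SUBMIT \
	_IOWR('K', 0x42, struct hypothetical_submit_args)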

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-22  9:21                           ` Daniel Vetter
  (?)
@ 2014-07-22  9:52                             ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-22  9:52 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel,
	dri-devel, linux-mm, Sellek, Tom

On 22/07/14 12:21, Daniel Vetter wrote:
> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote:
>>> Exactly, just prevent userspace from submitting more. And if you have
>>> misbehaving userspace that submits too much, reset the gpu and tell it
>>> that you're sorry but won't schedule any more work.
>>
>> I'm not sure how you intend to know if a userspace misbehaves or not. Can
>> you elaborate ?
>
> Well that's mostly policy, currently in i915 we only have a check for
> hangs, and if userspace hangs a bit too often then we stop it. I guess
> you can do that with the queue unmapping you've describe in reply to
> Jerome's mail.
> -Daniel
>
What do you mean by hang ? Like the tdr mechanism in Windows (checks if a gpu 
job takes more than 2 seconds, I think, and if so, terminates the job).

	Oded

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-22  9:52                             ` Oded Gabbay
@ 2014-07-22 11:15                               ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-22 11:15 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Daniel Vetter, Jerome Glisse, Christian König, David Airlie,
	Alex Deucher, Andrew Morton, John Bridgman, Joerg Roedel,
	Andrew Lewycky, Michel Dänzer, Ben Goz, Alexey Skidanov,
	linux-kernel, dri-devel, linux-mm, Sellek, Tom

On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
> On 22/07/14 12:21, Daniel Vetter wrote:
> >On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote:
> >>>Exactly, just prevent userspace from submitting more. And if you have
> >>>misbehaving userspace that submits too much, reset the gpu and tell it
> >>>that you're sorry but won't schedule any more work.
> >>
> >>I'm not sure how you intend to know if a userspace misbehaves or not. Can
> >>you elaborate ?
> >
> >Well that's mostly policy, currently in i915 we only have a check for
> >hangs, and if userspace hangs a bit too often then we stop it. I guess
> >you can do that with the queue unmapping you've describe in reply to
> >Jerome's mail.
> >-Daniel
> >
> What do you mean by hang ? Like the tdr mechanism in Windows (checks if a
> gpu job takes more than 2 seconds, I think, and if so, terminates the job).

Essentially yes. But we also have some hw features to kill jobs quicker,
e.g. for media workloads.
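
Very roughly, the policy side looks something like the sketch below. All the
names, thresholds and hardware hooks are made up for illustration; this is not
the actual i915 hangcheck code, just the shape of "sample progress periodically,
reset on a stall, ban a context that hangs too often":

/*
 * Sketch of a periodic hangcheck plus "ban after repeated hangs" policy.
 * The seqno read, the guilty-context lookup and the thresholds are all
 * hypothetical; only the workqueue plumbing is real kernel API.  The
 * delayed work is assumed to be set up once with INIT_DELAYED_WORK() and
 * kicked off at driver load.
 */
#include <linux/kernel.h>
#include <linux/types.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

#define HANGCHECK_PERIOD_MS     1500    /* how often progress is sampled */
#define HANGS_BEFORE_BAN        3       /* ban a context after this many hangs */

struct gpu_context {
        int hang_count;
        bool banned;                    /* banned contexts may not submit again */
};

struct hangcheck {
        struct delayed_work work;
        u32 last_seqno;                 /* last completed seqno we observed */
        bool was_busy;                  /* was work still outstanding last time? */
};

/* Hypothetical helpers a real driver would provide. */
u32 hw_completed_seqno(void);
bool hw_work_pending(void);
struct gpu_context *hw_guilty_context(void);
void hw_reset_engine(void);

static void hangcheck_fn(struct work_struct *work)
{
        struct hangcheck *hc = container_of(work, struct hangcheck, work.work);
        u32 seqno = hw_completed_seqno();

        if (hc->was_busy && hw_work_pending() && seqno == hc->last_seqno) {
                /* No forward progress since the last sample: treat it as a hang. */
                struct gpu_context *ctx = hw_guilty_context();

                hw_reset_engine();
                if (++ctx->hang_count >= HANGS_BEFORE_BAN)
                        ctx->banned = true;     /* refuse its future submissions */
        }

        hc->last_seqno = seqno;
        hc->was_busy = hw_work_pending();
        schedule_delayed_work(&hc->work, msecs_to_jiffies(HANGCHECK_PERIOD_MS));
}

The interesting part is only the policy (when a stall counts as a hang, how many
hangs before a ban); the mechanics are trivial.
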
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-22 11:15                               ` Daniel Vetter
@ 2014-07-23  6:50                                 ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-23  6:50 UTC (permalink / raw)
  To: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel,
	dri-devel, linux-mm, Sellek, Tom

On 22/07/14 14:15, Daniel Vetter wrote:
> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>> On 22/07/14 12:21, Daniel Vetter wrote:
>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> wrote:
>>>>> Exactly, just prevent userspace from submitting more. And if you have
>>>>> misbehaving userspace that submits too much, reset the gpu and tell it
>>>>> that you're sorry but won't schedule any more work.
>>>>
>>>> I'm not sure how you intend to know if a userspace misbehaves or not. Can
>>>> you elaborate ?
>>>
>>> Well that's mostly policy, currently in i915 we only have a check for
>>> hangs, and if userspace hangs a bit too often then we stop it. I guess
>>> you can do that with the queue unmapping you've describe in reply to
>>> Jerome's mail.
>>> -Daniel
>>>
>> What do you mean by hang ? Like the tdr mechanism in Windows (checks if a
>> gpu job takes more than 2 seconds, I think, and if so, terminates the job).
>
> Essentially yes. But we also have some hw features to kill jobs quicker,
> e.g. for media workloads.
> -Daniel
>

Yeah, so this is what I'm talking about when I say that you and Jerome come from 
a graphics POV and amdkfd comes from a compute POV, no offense intended.

For compute jobs, we simply can't use this logic to terminate jobs. Graphics is 
mostly real-time, while compute jobs can take anywhere from a few ms to a few 
hours! And I'm not talking about an entire application's runtime, but about a 
single submission of jobs by the userspace app. We have tests with jobs that take 
20-30 minutes to complete. In theory, we can even imagine a compute job which 
takes 1 or 2 days (on larger APUs).

Now, I understand the question of how we prevent the compute job from 
monopolizing the GPU, and internally here we have some ideas that we will 
probably share in the next few days, but my point is that I don't think we can 
terminate a compute job just because it has been running for more than x seconds. 
That would be like terminating a CPU process simply because it has run for more 
than x seconds.

I think this is a *very* important discussion (detecting a misbehaved compute 
process) and I would like to continue it, but I don't think moving the job 
submission from userspace control to kernel control will solve this core problem.

	Oded

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23  6:50                                 ` Oded Gabbay
@ 2014-07-23  7:04                                   ` Christian König
  -1 siblings, 0 replies; 148+ messages in thread
From: Christian König @ 2014-07-23  7:04 UTC (permalink / raw)
  To: Oded Gabbay, Jerome Glisse, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel,
	dri-devel, linux-mm, Sellek, Tom

Am 23.07.2014 08:50, schrieb Oded Gabbay:
> On 22/07/14 14:15, Daniel Vetter wrote:
>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>>> On 22/07/14 12:21, Daniel Vetter wrote:
>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com> 
>>>> wrote:
>>>>>> Exactly, just prevent userspace from submitting more. And if you 
>>>>>> have
>>>>>> misbehaving userspace that submits too much, reset the gpu and 
>>>>>> tell it
>>>>>> that you're sorry but won't schedule any more work.
>>>>>
>>>>> I'm not sure how you intend to know if a userspace misbehaves or 
>>>>> not. Can
>>>>> you elaborate ?
>>>>
>>>> Well that's mostly policy, currently in i915 we only have a check for
>>>> hangs, and if userspace hangs a bit too often then we stop it. I guess
>>>> you can do that with the queue unmapping you've describe in reply to
>>>> Jerome's mail.
>>>> -Daniel
>>>>
>>> What do you mean by hang ? Like the tdr mechanism in Windows (checks 
>>> if a
>>> gpu job takes more than 2 seconds, I think, and if so, terminates 
>>> the job).
>>
>> Essentially yes. But we also have some hw features to kill jobs quicker,
>> e.g. for media workloads.
>> -Daniel
>>
>
> Yeah, so this is what I'm talking about when I say that you and Jerome 
> come from a graphics POV and amdkfd come from a compute POV, no 
> offense intended.
>
> For compute jobs, we simply can't use this logic to terminate jobs. 
> Graphics are mostly Real-Time while compute jobs can take from a few 
> ms to a few hours!!! And I'm not talking about an entire application 
> runtime but on a single submission of jobs by the userspace app. We 
> have tests with jobs that take between 20-30 minutes to complete. In 
> theory, we can even imagine a compute job which takes 1 or 2 days (on 
> larger APUs).
>
> Now, I understand the question of how do we prevent the compute job 
> from monopolizing the GPU, and internally here we have some ideas that 
> we will probably share in the next few days, but my point is that I 
> don't think we can terminate a compute job because it is running for 
> more than x seconds. It is like you would terminate a CPU process 
> which runs more than x seconds.

Yeah, that's why one of the first things I did was make the timeout 
configurable in the radeon module.

But it doesn't necessarily need to be a timeout; we should also kill a running 
job submission if the CPU process associated with the job is killed.
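
Just to illustrate the shape of the timeout part (the parameter name and the
wait path below are made up; the real radeon plumbing is structured
differently):

/*
 * Sketch: a user-configurable job timeout exposed as a module parameter,
 * and the check a driver could make while waiting for a job to finish.
 */
#include <linux/module.h>
#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

static int job_timeout_ms = 10000;              /* 0 = wait forever */
module_param(job_timeout_ms, int, 0644);
MODULE_PARM_DESC(job_timeout_ms,
                 "ms before a job is declared hung (0 = disabled)");

struct gpu_job {
        struct completion done;                 /* completed by the IRQ handler */
};

void gpu_kill_job(struct gpu_job *job);         /* hypothetical teardown hook */

static int gpu_wait_job(struct gpu_job *job)
{
        if (!job_timeout_ms) {                  /* timeout disabled */
                wait_for_completion(&job->done);
                return 0;
        }

        if (!wait_for_completion_timeout(&job->done,
                                         msecs_to_jiffies(job_timeout_ms))) {
                gpu_kill_job(job);              /* declared hung: kill/reset it */
                return -ETIMEDOUT;
        }
        return 0;
}

The process-exit case is even simpler, since the per-process teardown can hang
off the release of the file descriptor the process used to create its queues.
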

> I think this is a *very* important discussion (detecting a misbehaved 
> compute process) and I would like to continue it, but I don't think 
> moving the job submission from userspace control to kernel control 
> will solve this core problem.

We need to get this topic solved, otherwise the driver won't make it 
upstream. Allowing userspace to monopolize resources, whether memory, CPU or 
GPU time, or special things like counters etc., is a strict no-go for a kernel 
module.

I agree that moving the job submission from userspace to the kernel wouldn't 
solve this problem. As Daniel and I have pointed out multiple times now, it's 
rather easy to prevent further job submissions from userspace, in the worst 
case by unmapping the doorbell page.
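
Something like this, purely as a sketch (the structure and the offset layout
are invented; the point is just that unmap_mapping_range() zaps the user's
PTEs, so the next doorbell write faults instead of reaching the hardware):

/*
 * Sketch: stop a process from submitting by unmapping its doorbell page.
 */
#include <linux/mm.h>
#include <linux/fs.h>

struct kfd_process_sketch {
        struct inode *dev_inode;        /* inode of the /dev/kfd char device */
        loff_t doorbell_offset;         /* where this process's doorbells are mmapped */
        loff_t doorbell_size;           /* one page per process in this sketch */
};

static void stop_process_submissions(struct kfd_process_sketch *p)
{
        /*
         * Zap every user mapping of this process's doorbell range.  Any
         * further submission attempt faults, and the fault handler can
         * simply refuse to re-insert the page.
         */
        unmap_mapping_range(p->dev_inode->i_mapping,
                            p->doorbell_offset, p->doorbell_size, 1);
}
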

Moving it to an IOCTL would just make it a bit less complicated.

Christian.

>
>     Oded


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23  6:50                                 ` Oded Gabbay
  (?)
@ 2014-07-23  7:05                                   ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-23  7:05 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel,
	dri-devel, linux-mm, Sellek, Tom

On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com> wrote:
> On 22/07/14 14:15, Daniel Vetter wrote:
>>
>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>>>
>>> On 22/07/14 12:21, Daniel Vetter wrote:
>>>>
>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com>
>>>> wrote:
>>>>>>
>>>>>> Exactly, just prevent userspace from submitting more. And if you have
>>>>>> misbehaving userspace that submits too much, reset the gpu and tell it
>>>>>> that you're sorry but won't schedule any more work.
>>>>>
>>>>>
>>>>> I'm not sure how you intend to know if a userspace misbehaves or not.
>>>>> Can
>>>>> you elaborate ?
>>>>
>>>>
>>>> Well that's mostly policy, currently in i915 we only have a check for
>>>> hangs, and if userspace hangs a bit too often then we stop it. I guess
>>>> you can do that with the queue unmapping you've describe in reply to
>>>> Jerome's mail.
>>>> -Daniel
>>>>
>>> What do you mean by hang ? Like the tdr mechanism in Windows (checks if a
>>> gpu job takes more than 2 seconds, I think, and if so, terminates the
>>> job).
>>
>>
>> Essentially yes. But we also have some hw features to kill jobs quicker,
>> e.g. for media workloads.
>> -Daniel
>>
>
> Yeah, so this is what I'm talking about when I say that you and Jerome come
> from a graphics POV and amdkfd come from a compute POV, no offense intended.
>
> For compute jobs, we simply can't use this logic to terminate jobs. Graphics
> are mostly Real-Time while compute jobs can take from a few ms to a few
> hours!!! And I'm not talking about an entire application runtime but on a
> single submission of jobs by the userspace app. We have tests with jobs that
> take between 20-30 minutes to complete. In theory, we can even imagine a
> compute job which takes 1 or 2 days (on larger APUs).
>
> Now, I understand the question of how do we prevent the compute job from
> monopolizing the GPU, and internally here we have some ideas that we will
> probably share in the next few days, but my point is that I don't think we
> can terminate a compute job because it is running for more than x seconds.
> It is like you would terminate a CPU process which runs more than x seconds.
>
> I think this is a *very* important discussion (detecting a misbehaved
> compute process) and I would like to continue it, but I don't think moving
> the job submission from userspace control to kernel control will solve this
> core problem.

Well, graphics gets away with cooperative scheduling since people usually
want to see stuff within a few frames, so we can legitimately
kill jobs after a fairly short timeout. IMO, if you want to allow
userspace to submit compute jobs that are atomic, take a few
minutes to hours with no break-up in between, and give the hw no means to
preempt, then that design is screwed up. We really can't tell the core
vm "sorry, we will hold onto these gobloads of memory you really
need right now for another few hours". Pinning memory like that, essentially
without a time limit, is restricted to root.
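
For reference, the usual check looks roughly like this (simplified, and with
driver-local accounting instead of the real mm counters):

/*
 * Sketch: long-term pinning is a privileged operation.  Unprivileged users
 * are bounded by RLIMIT_MEMLOCK; only CAP_IPC_LOCK may pin without limit.
 */
#include <linux/sched.h>
#include <linux/capability.h>
#include <linux/resource.h>
#include <linux/errno.h>

static unsigned long pinned_bytes;      /* driver-local accounting (sketch only) */

static int check_pin_allowed(unsigned long nr_bytes)
{
        unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK);

        if (capable(CAP_IPC_LOCK))      /* root (or equivalent) may pin freely */
                return 0;

        if (lock_limit != RLIM_INFINITY &&
            pinned_bytes + nr_bytes > lock_limit)
                return -ENOMEM;         /* would exceed the memlock limit */

        pinned_bytes += nr_bytes;
        return 0;
}
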
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23  7:05                                   ` Daniel Vetter
  (?)
@ 2014-07-23  8:35                                     ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-23  8:35 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, John Bridgman, Joerg Roedel, Andrew Lewycky,
	Michel Dänzer, Ben Goz, Alexey Skidanov, linux-kernel,
	dri-devel, linux-mm, Sellek, Tom

On 23/07/14 10:05, Daniel Vetter wrote:
> On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com> wrote:
>> On 22/07/14 14:15, Daniel Vetter wrote:
>>>
>>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>>>>
>>>> On 22/07/14 12:21, Daniel Vetter wrote:
>>>>>
>>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com>
>>>>> wrote:
>>>>>>>
>>>>>>> Exactly, just prevent userspace from submitting more. And if you have
>>>>>>> misbehaving userspace that submits too much, reset the gpu and tell it
>>>>>>> that you're sorry but won't schedule any more work.
>>>>>>
>>>>>>
>>>>>> I'm not sure how you intend to know if a userspace misbehaves or not.
>>>>>> Can
>>>>>> you elaborate ?
>>>>>
>>>>>
>>>>> Well that's mostly policy, currently in i915 we only have a check for
>>>>> hangs, and if userspace hangs a bit too often then we stop it. I guess
>>>>> you can do that with the queue unmapping you've describe in reply to
>>>>> Jerome's mail.
>>>>> -Daniel
>>>>>
>>>> What do you mean by hang ? Like the tdr mechanism in Windows (checks if a
>>>> gpu job takes more than 2 seconds, I think, and if so, terminates the
>>>> job).
>>>
>>>
>>> Essentially yes. But we also have some hw features to kill jobs quicker,
>>> e.g. for media workloads.
>>> -Daniel
>>>
>>
>> Yeah, so this is what I'm talking about when I say that you and Jerome come
>> from a graphics POV and amdkfd come from a compute POV, no offense intended.
>>
>> For compute jobs, we simply can't use this logic to terminate jobs. Graphics
>> are mostly Real-Time while compute jobs can take from a few ms to a few
>> hours!!! And I'm not talking about an entire application runtime but on a
>> single submission of jobs by the userspace app. We have tests with jobs that
>> take between 20-30 minutes to complete. In theory, we can even imagine a
>> compute job which takes 1 or 2 days (on larger APUs).
>>
>> Now, I understand the question of how do we prevent the compute job from
>> monopolizing the GPU, and internally here we have some ideas that we will
>> probably share in the next few days, but my point is that I don't think we
>> can terminate a compute job because it is running for more than x seconds.
>> It is like you would terminate a CPU process which runs more than x seconds.
>>
>> I think this is a *very* important discussion (detecting a misbehaved
>> compute process) and I would like to continue it, but I don't think moving
>> the job submission from userspace control to kernel control will solve this
>> core problem.
>
> Well graphics gets away with cooperative scheduling since usually
> people want to see stuff within a few frames, so we can legitimately
> kill jobs after a fairly short timeout. Imo if you want to allow
> userspace to submit compute jobs that are atomic and take a few
> minutes to hours with no break-up in between and no hw means to
> preempt then that design is screwed up. We really can't tell the core
> vm that "sorry we will hold onto these gobloads of memory you really
> need now for another few hours". Pinning memory like that essentially
> without a time limit is restricted to root.
> -Daniel
>

First of all, I don't see the relation to memory pinning here. I already said on 
this thread that amdkfd does NOT pin local memory. The only memory we allocate 
is system memory, which we map into the GART, and we can limit that memory by 
limiting the max # of queues and the max # of processes through kernel 
parameters. Most of the memory used is allocated by userspace via regular means, 
and it is usually pageable.
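
To make that concrete, the limits are just module parameters plus a check at
queue-creation time, roughly like this (parameter and struct names here are
simplified, not the exact ones in the patches):

/*
 * Sketch: cap the GART-backed per-process/per-queue structures through
 * module parameters, and fail queue creation once a process hits its quota.
 */
#include <linux/module.h>
#include <linux/errno.h>

static int max_processes = 32;
module_param(max_processes, int, 0444);
MODULE_PARM_DESC(max_processes, "Max number of HSA processes");

static int max_queues_per_process = 128;
module_param(max_queues_per_process, int, 0444);
MODULE_PARM_DESC(max_queues_per_process, "Max queues per HSA process");

struct hsa_process_sketch {
        int num_queues;                 /* queues created by this process so far */
};

static int create_queue(struct hsa_process_sketch *p)
{
        if (p->num_queues >= max_queues_per_process)
                return -ENOSPC;         /* per-process quota exhausted */

        p->num_queues++;
        /* ...allocate the (small) per-queue structures in the GART here... */
        return 0;
}
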

Second, it is important to remember that this problem only exists on KV. On CZ, 
the GPU can context switch between waves (by doing mid-wave preemption), so even 
long-running waves get switched on and off constantly and there is no 
monopolizing of GPU resources.

Third, even on KV, we can kill waves. The question is when to do it and how to 
recognize the need. I think it would be sufficient for now if we expose this 
ability to the kernel.

	Oded

^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23  7:05                                   ` Daniel Vetter
  (?)
@ 2014-07-23 13:33                                     ` Bridgman, John
  -1 siblings, 0 replies; 148+ messages in thread
From: Bridgman, John @ 2014-07-23 13:33 UTC (permalink / raw)
  To: Daniel Vetter, Gabbay, Oded
  Cc: Jerome Glisse, Christian König, David Airlie, Alex Deucher,
	Andrew Morton, Joerg Roedel, Lewycky, Andrew, Daenzer, Michel,
	Goz, Ben, Skidanov, Alexey, linux-kernel, dri-devel, linux-mm,
	Sellek, Tom




>-----Original Message-----
>From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch]
>Sent: Wednesday, July 23, 2014 3:06 AM
>To: Gabbay, Oded
>Cc: Jerome Glisse; Christian König; David Airlie; Alex Deucher; Andrew
>Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew; Daenzer, Michel;
>Goz, Ben; Skidanov, Alexey; linux-kernel@vger.kernel.org; dri-
>devel@lists.freedesktop.org; linux-mm; Sellek, Tom
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
>On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com>
>wrote:
>> On 22/07/14 14:15, Daniel Vetter wrote:
>>>
>>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>>>>
>>>> On 22/07/14 12:21, Daniel Vetter wrote:
>>>>>
>>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay
><oded.gabbay@amd.com>
>>>>> wrote:
>>>>>>>
>>>>>>> Exactly, just prevent userspace from submitting more. And if you
>>>>>>> have misbehaving userspace that submits too much, reset the gpu
>>>>>>> and tell it that you're sorry but won't schedule any more work.
>>>>>>
>>>>>>
>>>>>> I'm not sure how you intend to know if a userspace misbehaves or not.
>>>>>> Can
>>>>>> you elaborate ?
>>>>>
>>>>>
>>>>> Well that's mostly policy, currently in i915 we only have a check
>>>>> for hangs, and if userspace hangs a bit too often then we stop it.
>>>>> I guess you can do that with the queue unmapping you've describe in
>>>>> reply to Jerome's mail.
>>>>> -Daniel
>>>>>
>>>> What do you mean by hang ? Like the tdr mechanism in Windows (checks
>>>> if a gpu job takes more than 2 seconds, I think, and if so,
>>>> terminates the job).
>>>
>>>
>>> Essentially yes. But we also have some hw features to kill jobs
>>> quicker, e.g. for media workloads.
>>> -Daniel
>>>
>>
>> Yeah, so this is what I'm talking about when I say that you and Jerome
>> come from a graphics POV and amdkfd come from a compute POV, no
>offense intended.
>>
>> For compute jobs, we simply can't use this logic to terminate jobs.
>> Graphics are mostly Real-Time while compute jobs can take from a few
>> ms to a few hours!!! And I'm not talking about an entire application
>> runtime but on a single submission of jobs by the userspace app. We
>> have tests with jobs that take between 20-30 minutes to complete. In
>> theory, we can even imagine a compute job which takes 1 or 2 days (on
>larger APUs).
>>
>> Now, I understand the question of how do we prevent the compute job
>> from monopolizing the GPU, and internally here we have some ideas that
>> we will probably share in the next few days, but my point is that I
>> don't think we can terminate a compute job because it is running for more
>than x seconds.
>> It is like you would terminate a CPU process which runs more than x
>seconds.
>>
>> I think this is a *very* important discussion (detecting a misbehaved
>> compute process) and I would like to continue it, but I don't think
>> moving the job submission from userspace control to kernel control
>> will solve this core problem.
>
>Well graphics gets away with cooperative scheduling since usually people
>want to see stuff within a few frames, so we can legitimately kill jobs after a
>fairly short timeout. Imo if you want to allow userspace to submit compute
>jobs that are atomic and take a few minutes to hours with no break-up in
>between and no hw means to preempt then that design is screwed up. We
>really can't tell the core vm that "sorry we will hold onto these gobloads of
>memory you really need now for another few hours". Pinning memory like
>that essentially without a time limit is restricted to root.

Hi Daniel;

I don't really understand the reference to "gobloads of memory". Unlike radeon graphics, the userspace data for HSA applications is maintained in pageable system memory and accessed via the IOMMUv2 (ATC/PRI). The IOMMUv2 driver and the mm subsystem take care of faulting in memory pages as needed; nothing is long-term pinned.
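
Roughly, the binding looks like the sketch below (the amd_iommu_v2 calls are
quoted from memory, so treat the exact signatures as approximate):

/*
 * Sketch: each HSA process is bound to a PASID, so GPU accesses tagged with
 * that PASID are translated through the process's own page tables (ATC/PRI)
 * and faulted in on demand by the IOMMUv2 driver and the mm subsystem.
 */
#include <linux/amd-iommu.h>
#include <linux/pci.h>
#include <linux/sched.h>

static int hsa_device_init(struct pci_dev *gpu_pdev)
{
        /* Once per device: reserve a number of PASIDs (16 is arbitrary here). */
        return amd_iommu_init_device(gpu_pdev, 16);
}

static int hsa_bind_process(struct pci_dev *gpu_pdev, int pasid)
{
        /* Bind the calling process's address space to the PASID; nothing
         * here pins the process's memory. */
        return amd_iommu_bind_pasid(gpu_pdev, pasid, current);
}
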

The only pinned memory we are talking about here is the per-queue and per-process data structures in the driver, which are tiny by comparison. Oded provided the "hardware limits" (i.e. an insane number of processes & threads) for context, but real-world limits will be one or two orders of magnitude lower. I agree we should have included those limits in the initial code; that would have made the "real world" memory footprint much more visible.

Make sense?

>-Daniel
>--
>Daniel Vetter
>Software Engineer, Intel Corporation
>+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23  7:04                                   ` Christian König
@ 2014-07-23 13:39                                     ` Bridgman, John
  -1 siblings, 0 replies; 148+ messages in thread
From: Bridgman, John @ 2014-07-23 13:39 UTC (permalink / raw)
  To: Christian König, Gabbay, Oded, Jerome Glisse, David Airlie,
	Alex Deucher, Andrew Morton, Joerg Roedel, Lewycky, Andrew,
	Daenzer, Michel, Goz, Ben, Skidanov, Alexey, linux-kernel,
	dri-devel, linux-mm, Sellek, Tom



>-----Original Message-----
>From: Christian König [mailto:deathsimple@vodafone.de]
>Sent: Wednesday, July 23, 2014 3:04 AM
>To: Gabbay, Oded; Jerome Glisse; David Airlie; Alex Deucher; Andrew
>Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew; Daenzer, Michel;
>Goz, Ben; Skidanov, Alexey; linux-kernel@vger.kernel.org; dri-
>devel@lists.freedesktop.org; linux-mm; Sellek, Tom
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
>Am 23.07.2014 08:50, schrieb Oded Gabbay:
>> On 22/07/14 14:15, Daniel Vetter wrote:
>>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>>>> On 22/07/14 12:21, Daniel Vetter wrote:
>>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay
><oded.gabbay@amd.com>
>>>>> wrote:
>>>>>>> Exactly, just prevent userspace from submitting more. And if you
>>>>>>> have misbehaving userspace that submits too much, reset the gpu
>>>>>>> and tell it that you're sorry but won't schedule any more work.
>>>>>>
>>>>>> I'm not sure how you intend to know if a userspace misbehaves or
>>>>>> not. Can you elaborate ?
>>>>>
>>>>> Well that's mostly policy, currently in i915 we only have a check
>>>>> for hangs, and if userspace hangs a bit too often then we stop it.
>>>>> I guess you can do that with the queue unmapping you've describe in
>>>>> reply to Jerome's mail.
>>>>> -Daniel
>>>>>
>>>> What do you mean by hang ? Like the tdr mechanism in Windows (checks
>>>> if a gpu job takes more than 2 seconds, I think, and if so,
>>>> terminates the job).
>>>
>>> Essentially yes. But we also have some hw features to kill jobs
>>> quicker, e.g. for media workloads.
>>> -Daniel
>>>
>>
>> Yeah, so this is what I'm talking about when I say that you and Jerome
>> come from a graphics POV and amdkfd come from a compute POV, no
>> offense intended.
>>
>> For compute jobs, we simply can't use this logic to terminate jobs.
>> Graphics are mostly Real-Time while compute jobs can take from a few
>> ms to a few hours!!! And I'm not talking about an entire application
>> runtime but on a single submission of jobs by the userspace app. We
>> have tests with jobs that take between 20-30 minutes to complete. In
>> theory, we can even imagine a compute job which takes 1 or 2 days (on
>> larger APUs).
>>
>> Now, I understand the question of how do we prevent the compute job
>> from monopolizing the GPU, and internally here we have some ideas that
>> we will probably share in the next few days, but my point is that I
>> don't think we can terminate a compute job because it is running for
>> more than x seconds. It is like you would terminate a CPU process
>> which runs more than x seconds.
>
>Yeah that's why one of the first things I've did was making the timeout
>configurable in the radeon module.
>
>But it doesn't necessary needs be a timeout, we should also kill a running job
>submission if the CPU process associated with the job is killed.
>
>> I think this is a *very* important discussion (detecting a misbehaved
>> compute process) and I would like to continue it, but I don't think
>> moving the job submission from userspace control to kernel control
>> will solve this core problem.
>
>We need to get this topic solved, otherwise the driver won't make it
>upstream. Allowing userpsace to monopolizing resources either memory,
>CPU or GPU time or special things like counters etc... is a strict no go for a
>kernel module.
>
>I agree that moving the job submission from userpsace to kernel wouldn't
>solve this problem. As Daniel and I pointed out now multiple times it's rather
>easily possible to prevent further job submissions from userspace, in the
>worst case by unmapping the doorbell page.
>
>Moving it to an IOCTL would just make it a bit less complicated.

Hi Christian;

HSA uses usermode queues so that programs running on the GPU can dispatch work to themselves or to other GPUs through the same dispatch mechanism used by CPU code. We could potentially use s_msg and trap every GPU dispatch back through CPU code, but that gets slow and ugly very quickly.
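
For illustration, a user-mode dispatch boils down to something like the sketch below; the packet layout, the doorbell semantics and the barrier placement are simplified assumptions, not the real HSA packet format or the amdkfd interface:

/* Simplified sketch: write a packet into a user-mapped ring buffer and
 * ring the doorbell.  All details here are illustrative assumptions. */
#include <stdint.h>

struct dispatch_packet {              /* hypothetical, AQL-like packet */
    uint16_t header;
    uint16_t setup;
    uint32_t grid[3];
    uint64_t kernel_object;
    uint64_t kernarg_address;
};

void submit(struct dispatch_packet *ring, uint64_t ring_size,
            volatile uint64_t *doorbell, uint64_t *write_index,
            const struct dispatch_packet *pkt)
{
    uint64_t idx = (*write_index)++;  /* claim a slot in the ring          */

    ring[idx % ring_size] = *pkt;     /* copy the packet into the ring     */
    __sync_synchronize();             /* packet visible before the doorbell */
    *doorbell = idx + 1;              /* MMIO write; HW picks up the work  */
}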

>
>Christian.
>
>>
>>     Oded


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 13:33                                     ` Bridgman, John
  (?)
@ 2014-07-23 14:41                                       ` Daniel Vetter
  -1 siblings, 0 replies; 148+ messages in thread
From: Daniel Vetter @ 2014-07-23 14:41 UTC (permalink / raw)
  To: Bridgman, John
  Cc: Daniel Vetter, Gabbay, Oded, Jerome Glisse, Christian König,
	David Airlie, Alex Deucher, Andrew Morton, Joerg Roedel, Lewycky,
	Andrew, Daenzer, Michel, Goz, Ben, Skidanov, Alexey,
	linux-kernel, dri-devel, linux-mm, Sellek, Tom

On Wed, Jul 23, 2014 at 01:33:24PM +0000, Bridgman, John wrote:
> 
> 
> >-----Original Message-----
> >From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch]
> >Sent: Wednesday, July 23, 2014 3:06 AM
> >To: Gabbay, Oded
> >Cc: Jerome Glisse; Christian König; David Airlie; Alex Deucher; Andrew
> >Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew; Daenzer, Michel;
> >Goz, Ben; Skidanov, Alexey; linux-kernel@vger.kernel.org; dri-
> >devel@lists.freedesktop.org; linux-mm; Sellek, Tom
> >Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
> >
> >On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com>
> >wrote:
> >> On 22/07/14 14:15, Daniel Vetter wrote:
> >>>
> >>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
> >>>>
> >>>> On 22/07/14 12:21, Daniel Vetter wrote:
> >>>>>
> >>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay
> ><oded.gabbay@amd.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>> Exactly, just prevent userspace from submitting more. And if you
> >>>>>>> have misbehaving userspace that submits too much, reset the gpu
> >>>>>>> and tell it that you're sorry but won't schedule any more work.
> >>>>>>
> >>>>>>
> >>>>>> I'm not sure how you intend to know if a userspace misbehaves or not.
> >>>>>> Can
> >>>>>> you elaborate ?
> >>>>>
> >>>>>
> >>>>> Well that's mostly policy, currently in i915 we only have a check
> >>>>> for hangs, and if userspace hangs a bit too often then we stop it.
> >>>>> I guess you can do that with the queue unmapping you've describe in
> >>>>> reply to Jerome's mail.
> >>>>> -Daniel
> >>>>>
> >>>> What do you mean by hang ? Like the tdr mechanism in Windows (checks
> >>>> if a gpu job takes more than 2 seconds, I think, and if so,
> >>>> terminates the job).
> >>>
> >>>
> >>> Essentially yes. But we also have some hw features to kill jobs
> >>> quicker, e.g. for media workloads.
> >>> -Daniel
> >>>
> >>
> >> Yeah, so this is what I'm talking about when I say that you and Jerome
> >> come from a graphics POV and amdkfd come from a compute POV, no
> >offense intended.
> >>
> >> For compute jobs, we simply can't use this logic to terminate jobs.
> >> Graphics are mostly Real-Time while compute jobs can take from a few
> >> ms to a few hours!!! And I'm not talking about an entire application
> >> runtime but on a single submission of jobs by the userspace app. We
> >> have tests with jobs that take between 20-30 minutes to complete. In
> >> theory, we can even imagine a compute job which takes 1 or 2 days (on
> >larger APUs).
> >>
> >> Now, I understand the question of how do we prevent the compute job
> >> from monopolizing the GPU, and internally here we have some ideas that
> >> we will probably share in the next few days, but my point is that I
> >> don't think we can terminate a compute job because it is running for more
> >than x seconds.
> >> It is like you would terminate a CPU process which runs more than x
> >seconds.
> >>
> >> I think this is a *very* important discussion (detecting a misbehaved
> >> compute process) and I would like to continue it, but I don't think
> >> moving the job submission from userspace control to kernel control
> >> will solve this core problem.
> >
> >Well graphics gets away with cooperative scheduling since usually people
> >want to see stuff within a few frames, so we can legitimately kill jobs after a
> >fairly short timeout. Imo if you want to allow userspace to submit compute
> >jobs that are atomic and take a few minutes to hours with no break-up in
> >between and no hw means to preempt then that design is screwed up. We
> >really can't tell the core vm that "sorry we will hold onto these gobloads of
> >memory you really need now for another few hours". Pinning memory like
> >that essentially without a time limit is restricted to root.
> 
> Hi Daniel;
> 
> I don't really understand the reference to "gobloads of memory". Unlike
> radeon graphics, the userspace data for HSA applications is maintained
> in pageable system memory and accessed via the IOMMUv2 (ATC/PRI). The
> IOMMUv2 driver and mm subsystem takes care of faulting in memory pages
> as needed, nothing is long-term pinned.

Yeah, I've lost that part of the equation a bit, since I've always thought
that proper faulting support without preemption is not really possible. I
guess those platforms completely stall on a fault until the PTEs are all
set up?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23  7:04                                   ` Christian König
  (?)
@ 2014-07-23 14:56                                     ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-23 14:56 UTC (permalink / raw)
  To: Christian König
  Cc: Oded Gabbay, David Airlie, Alex Deucher, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, linux-kernel, dri-devel, linux-mm,
	Sellek, Tom

On Wed, Jul 23, 2014 at 09:04:24AM +0200, Christian König wrote:
> Am 23.07.2014 08:50, schrieb Oded Gabbay:
> >On 22/07/14 14:15, Daniel Vetter wrote:
> >>On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
> >>>On 22/07/14 12:21, Daniel Vetter wrote:
> >>>>On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com>
> >>>>wrote:
> >>>>>>Exactly, just prevent userspace from submitting more. And if you
> >>>>>>have
> >>>>>>misbehaving userspace that submits too much, reset the gpu and
> >>>>>>tell it
> >>>>>>that you're sorry but won't schedule any more work.
> >>>>>
> >>>>>I'm not sure how you intend to know if a userspace misbehaves or
> >>>>>not. Can
> >>>>>you elaborate ?
> >>>>
> >>>>Well that's mostly policy, currently in i915 we only have a check for
> >>>>hangs, and if userspace hangs a bit too often then we stop it. I guess
> >>>>you can do that with the queue unmapping you've describe in reply to
> >>>>Jerome's mail.
> >>>>-Daniel
> >>>>
> >>>What do you mean by hang ? Like the tdr mechanism in Windows (checks
> >>>if a
> >>>gpu job takes more than 2 seconds, I think, and if so, terminates the
> >>>job).
> >>
> >>Essentially yes. But we also have some hw features to kill jobs quicker,
> >>e.g. for media workloads.
> >>-Daniel
> >>
> >
> >Yeah, so this is what I'm talking about when I say that you and Jerome
> >come from a graphics POV and amdkfd come from a compute POV, no offense
> >intended.
> >
> >For compute jobs, we simply can't use this logic to terminate jobs.
> >Graphics are mostly Real-Time while compute jobs can take from a few ms to
> >a few hours!!! And I'm not talking about an entire application runtime but
> >on a single submission of jobs by the userspace app. We have tests with
> >jobs that take between 20-30 minutes to complete. In theory, we can even
> >imagine a compute job which takes 1 or 2 days (on larger APUs).
> >
> >Now, I understand the question of how do we prevent the compute job from
> >monopolizing the GPU, and internally here we have some ideas that we will
> >probably share in the next few days, but my point is that I don't think we
> >can terminate a compute job because it is running for more than x seconds.
> >It is like you would terminate a CPU process which runs more than x
> >seconds.
> 
> Yeah that's why one of the first things I've did was making the timeout
> configurable in the radeon module.
> 
> But it doesn't necessary needs be a timeout, we should also kill a running
> job submission if the CPU process associated with the job is killed.
> 
> >I think this is a *very* important discussion (detecting a misbehaved
> >compute process) and I would like to continue it, but I don't think moving
> >the job submission from userspace control to kernel control will solve
> >this core problem.
> 
> We need to get this topic solved, otherwise the driver won't make it
> upstream. Allowing userpsace to monopolizing resources either memory, CPU or
> GPU time or special things like counters etc... is a strict no go for a
> kernel module.
> 
> I agree that moving the job submission from userpsace to kernel wouldn't
> solve this problem. As Daniel and I pointed out now multiple times it's
> rather easily possible to prevent further job submissions from userspace, in
> the worst case by unmapping the doorbell page.
> 
> Moving it to an IOCTL would just make it a bit less complicated.
> 

It is not only complexity; my main concern is not really the amount of memory
pinned (well, it would be if it were VRAM, which, by the way, is why you need to
remove the API that allows allocating VRAM, just to make it clear that VRAM is
not allowed).

The issue is GPU address space fragmentation: a new process's HSA queue might be
allocated in the middle of GTT space and stay there for so long that it forbids
any big buffer from being bound to GTT. Though with virtual address space for
graphics this is less of an issue and only the kernel suffers, it might still
block the kernel from evicting some VRAM, because a system buffer big enough
cannot be bound to GTT while some GTT space is taken up by an HSA queue.

To mitigate this, at the very least you need to implement a special memory
allocation path inside ttm and radeon that forces these per-queue allocations
to come, for instance, from the top of GTT space, e.g. reserve the top 8M of
GTT and have the reservation grow/shrink depending on the number of queues.
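
Purely as an illustration of the idea (the window size, the trivial bump allocator and all names are assumptions; the real change would have to go through the ttm/radeon placement code), something along these lines:

#include <stdint.h>

#define GTT_SIZE        (1024ull << 20)   /* assumed 1 GiB of GTT          */
#define HSA_WINDOW_SIZE (8ull << 20)      /* reserve the top 8M for queues */

struct hsa_queue_region {
    uint64_t window_start;   /* GTT offset where the reserved window begins */
    uint64_t next_free;      /* simple bump pointer inside the window       */
};

void hsa_region_init(struct hsa_queue_region *r)
{
    r->window_start = GTT_SIZE - HSA_WINDOW_SIZE;
    r->next_free    = r->window_start;
}

/* Returns a GTT offset for a queue structure, or 0 on failure. */
uint64_t hsa_region_alloc(struct hsa_queue_region *r, uint64_t size)
{
    uint64_t off = r->next_free;

    if (off + size > GTT_SIZE)
        return 0;
    r->next_free += size;
    return off;
}

Keeping the queue structures packed at the top like this leaves the bulk of GTT contiguous for large graphics buffers.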

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 14:41                                       ` Daniel Vetter
  (?)
@ 2014-07-23 15:06                                         ` Bridgman, John
  -1 siblings, 0 replies; 148+ messages in thread
From: Bridgman, John @ 2014-07-23 15:06 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, Gabbay, Oded, Jerome Glisse, Christian König,
	David Airlie, Alex Deucher, Andrew Morton, Joerg Roedel, Lewycky,
	Andrew, Daenzer, Michel, Goz, Ben, Skidanov, Alexey,
	linux-kernel, dri-devel, linux-mm, Sellek, Tom



>-----Original Message-----
>From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
>Vetter
>Sent: Wednesday, July 23, 2014 10:42 AM
>To: Bridgman, John
>Cc: Daniel Vetter; Gabbay, Oded; Jerome Glisse; Christian König; David Airlie;
>Alex Deucher; Andrew Morton; Joerg Roedel; Lewycky, Andrew; Daenzer,
>Michel; Goz, Ben; Skidanov, Alexey; linux-kernel@vger.kernel.org; dri-
>devel@lists.freedesktop.org; linux-mm; Sellek, Tom
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
>On Wed, Jul 23, 2014 at 01:33:24PM +0000, Bridgman, John wrote:
>>
>>
>> >-----Original Message-----
>> >From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch]
>> >Sent: Wednesday, July 23, 2014 3:06 AM
>> >To: Gabbay, Oded
>> >Cc: Jerome Glisse; Christian König; David Airlie; Alex Deucher;
>> >Andrew Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew;
>> >Daenzer, Michel; Goz, Ben; Skidanov, Alexey;
>> >linux-kernel@vger.kernel.org; dri- devel@lists.freedesktop.org;
>> >linux-mm; Sellek, Tom
>> >Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>> >
>> >On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com>
>> >wrote:
>> >> On 22/07/14 14:15, Daniel Vetter wrote:
>> >>>
>> >>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>> >>>>
>> >>>> On 22/07/14 12:21, Daniel Vetter wrote:
>> >>>>>
>> >>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay
>> ><oded.gabbay@amd.com>
>> >>>>> wrote:
>> >>>>>>>
>> >>>>>>> Exactly, just prevent userspace from submitting more. And if
>> >>>>>>> you have misbehaving userspace that submits too much, reset
>> >>>>>>> the gpu and tell it that you're sorry but won't schedule any more
>work.
>> >>>>>>
>> >>>>>>
>> >>>>>> I'm not sure how you intend to know if a userspace misbehaves or
>not.
>> >>>>>> Can
>> >>>>>> you elaborate ?
>> >>>>>
>> >>>>>
>> >>>>> Well that's mostly policy, currently in i915 we only have a
>> >>>>> check for hangs, and if userspace hangs a bit too often then we stop
>it.
>> >>>>> I guess you can do that with the queue unmapping you've describe
>> >>>>> in reply to Jerome's mail.
>> >>>>> -Daniel
>> >>>>>
>> >>>> What do you mean by hang ? Like the tdr mechanism in Windows
>> >>>> (checks if a gpu job takes more than 2 seconds, I think, and if
>> >>>> so, terminates the job).
>> >>>
>> >>>
>> >>> Essentially yes. But we also have some hw features to kill jobs
>> >>> quicker, e.g. for media workloads.
>> >>> -Daniel
>> >>>
>> >>
>> >> Yeah, so this is what I'm talking about when I say that you and
>> >> Jerome come from a graphics POV and amdkfd come from a compute
>POV,
>> >> no
>> >offense intended.
>> >>
>> >> For compute jobs, we simply can't use this logic to terminate jobs.
>> >> Graphics are mostly Real-Time while compute jobs can take from a
>> >> few ms to a few hours!!! And I'm not talking about an entire
>> >> application runtime but on a single submission of jobs by the
>> >> userspace app. We have tests with jobs that take between 20-30
>> >> minutes to complete. In theory, we can even imagine a compute job
>> >> which takes 1 or 2 days (on
>> >larger APUs).
>> >>
>> >> Now, I understand the question of how do we prevent the compute job
>> >> from monopolizing the GPU, and internally here we have some ideas
>> >> that we will probably share in the next few days, but my point is
>> >> that I don't think we can terminate a compute job because it is
>> >> running for more
>> >than x seconds.
>> >> It is like you would terminate a CPU process which runs more than x
>> >seconds.
>> >>
>> >> I think this is a *very* important discussion (detecting a
>> >> misbehaved compute process) and I would like to continue it, but I
>> >> don't think moving the job submission from userspace control to
>> >> kernel control will solve this core problem.
>> >
>> >Well graphics gets away with cooperative scheduling since usually
>> >people want to see stuff within a few frames, so we can legitimately
>> >kill jobs after a fairly short timeout. Imo if you want to allow
>> >userspace to submit compute jobs that are atomic and take a few
>> >minutes to hours with no break-up in between and no hw means to
>> >preempt then that design is screwed up. We really can't tell the core
>> >vm that "sorry we will hold onto these gobloads of memory you really
>> >need now for another few hours". Pinning memory like that essentially
>without a time limit is restricted to root.
>>
>> Hi Daniel;
>>
>> I don't really understand the reference to "gobloads of memory".
>> Unlike radeon graphics, the userspace data for HSA applications is
>> maintained in pageable system memory and accessed via the IOMMUv2
>> (ATC/PRI). The
>> IOMMUv2 driver and mm subsystem takes care of faulting in memory pages
>> as needed, nothing is long-term pinned.
>
>Yeah I've lost that part of the equation a bit since I've always thought that
>proper faulting support without preemption is not really possible. I guess
>those platforms completely stall on a fault until the ptes are all set up?

Correct. The GPU thread accessing the faulted page definitely stalls, but processing can continue on other GPU threads.

I don't remember offhand how much of the GPU=>ATC=>IOMMUv2=>system RAM path gets stalled (i.e. whether other HSA apps get blocked), but AFAIK graphics processing (assuming it is not using the ATC path to system memory) is not affected. I will double-check that, though; I haven't asked internally for a couple of years, but I do remember concluding something along the lines of "OK, that'll do" ;)
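
To spell the flow out in pseudo-C: the helpers below (fault_in_user_page, complete_translation) are hypothetical placeholders, not IOMMUv2 or amdkfd functions; they only illustrate "stall the one thread, fault the page in on the CPU, then let the translation retry":

/* Hypothetical helpers, declared only so the sketch is self-contained. */
int fault_in_user_page(int pasid, unsigned long address, int write);
void complete_translation(int pasid, unsigned long address);

struct gpu_fault_request {
    unsigned long address;   /* faulting virtual address        */
    int write;               /* non-zero for a write access     */
    int pasid;               /* identifies the faulting process */
};

int service_gpu_fault(struct gpu_fault_request *req)
{
    /* Fault the page into the process address space; other GPU
     * wavefronts and graphics work keep running while this executes. */
    if (fault_in_user_page(req->pasid, req->address, req->write))
        return -1;           /* unrecoverable: report the failure to HW */

    /* The PTE is now valid; answer the PRI request so the stalled
     * GPU thread retries its translation and continues. */
    complete_translation(req->pasid, req->address);
    return 0;
}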
 
>-Daniel
>--
>Daniel Vetter
>Software Engineer, Intel Corporation
>+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-23 15:06                                         ` Bridgman, John
  0 siblings, 0 replies; 148+ messages in thread
From: Bridgman, John @ 2014-07-23 15:06 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Daniel Vetter, Gabbay, Oded, Jerome Glisse, Christian König,
	David Airlie, Alex Deucher, Andrew Morton, Joerg Roedel, Lewycky,
	Andrew, Daenzer, Michel, Goz, Ben, Skidanov, Alexey,
	linux-kernel, dri-devel, linux-mm, Sellek, Tom



>-----Original Message-----
>From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
>Vetter
>Sent: Wednesday, July 23, 2014 10:42 AM
>To: Bridgman, John
>Cc: Daniel Vetter; Gabbay, Oded; Jerome Glisse; Christian König; David Airlie;
>Alex Deucher; Andrew Morton; Joerg Roedel; Lewycky, Andrew; Daenzer,
>Michel; Goz, Ben; Skidanov, Alexey; linux-kernel@vger.kernel.org; dri-
>devel@lists.freedesktop.org; linux-mm; Sellek, Tom
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
>On Wed, Jul 23, 2014 at 01:33:24PM +0000, Bridgman, John wrote:
>>
>>
>> >-----Original Message-----
>> >From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch]
>> >Sent: Wednesday, July 23, 2014 3:06 AM
>> >To: Gabbay, Oded
>> >Cc: Jerome Glisse; Christian König; David Airlie; Alex Deucher;
>> >Andrew Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew;
>> >Daenzer, Michel; Goz, Ben; Skidanov, Alexey;
>> >linux-kernel@vger.kernel.org; dri- devel@lists.freedesktop.org;
>> >linux-mm; Sellek, Tom
>> >Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>> >
>> >On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com>
>> >wrote:
>> >> On 22/07/14 14:15, Daniel Vetter wrote:
>> >>>
>> >>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>> >>>>
>> >>>> On 22/07/14 12:21, Daniel Vetter wrote:
>> >>>>>
>> >>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay
>> ><oded.gabbay@amd.com>
>> >>>>> wrote:
>> >>>>>>>
>> >>>>>>> Exactly, just prevent userspace from submitting more. And if
>> >>>>>>> you have misbehaving userspace that submits too much, reset
>> >>>>>>> the gpu and tell it that you're sorry but won't schedule any more
>work.
>> >>>>>>
>> >>>>>>
>> >>>>>> I'm not sure how you intend to know if a userspace misbehaves or
>not.
>> >>>>>> Can
>> >>>>>> you elaborate ?
>> >>>>>
>> >>>>>
>> >>>>> Well that's mostly policy, currently in i915 we only have a
>> >>>>> check for hangs, and if userspace hangs a bit too often then we stop
>it.
>> >>>>> I guess you can do that with the queue unmapping you've describe
>> >>>>> in reply to Jerome's mail.
>> >>>>> -Daniel
>> >>>>>
>> >>>> What do you mean by hang ? Like the tdr mechanism in Windows
>> >>>> (checks if a gpu job takes more than 2 seconds, I think, and if
>> >>>> so, terminates the job).
>> >>>
>> >>>
>> >>> Essentially yes. But we also have some hw features to kill jobs
>> >>> quicker, e.g. for media workloads.
>> >>> -Daniel
>> >>>
>> >>
>> >> Yeah, so this is what I'm talking about when I say that you and
>> >> Jerome come from a graphics POV and amdkfd come from a compute
>POV,
>> >> no
>> >offense intended.
>> >>
>> >> For compute jobs, we simply can't use this logic to terminate jobs.
>> >> Graphics are mostly Real-Time while compute jobs can take from a
>> >> few ms to a few hours!!! And I'm not talking about an entire
>> >> application runtime but on a single submission of jobs by the
>> >> userspace app. We have tests with jobs that take between 20-30
>> >> minutes to complete. In theory, we can even imagine a compute job
>> >> which takes 1 or 2 days (on
>> >larger APUs).
>> >>
>> >> Now, I understand the question of how do we prevent the compute job
>> >> from monopolizing the GPU, and internally here we have some ideas
>> >> that we will probably share in the next few days, but my point is
>> >> that I don't think we can terminate a compute job because it is
>> >> running for more
>> >than x seconds.
>> >> It is like you would terminate a CPU process which runs more than x
>> >seconds.
>> >>
>> >> I think this is a *very* important discussion (detecting a
>> >> misbehaved compute process) and I would like to continue it, but I
>> >> don't think moving the job submission from userspace control to
>> >> kernel control will solve this core problem.
>> >
>> >Well graphics gets away with cooperative scheduling since usually
>> >people want to see stuff within a few frames, so we can legitimately
>> >kill jobs after a fairly short timeout. Imo if you want to allow
>> >userspace to submit compute jobs that are atomic and take a few
>> >minutes to hours with no break-up in between and no hw means to
>> >preempt then that design is screwed up. We really can't tell the core
>> >vm that "sorry we will hold onto these gobloads of memory you really
>> >need now for another few hours". Pinning memory like that essentially
>without a time limit is restricted to root.
>>
>> Hi Daniel;
>>
>> I don't really understand the reference to "gobloads of memory".
>> Unlike radeon graphics, the userspace data for HSA applications is
>> maintained in pageable system memory and accessed via the IOMMUv2
>> (ATC/PRI). The
>> IOMMUv2 driver and mm subsystem takes care of faulting in memory pages
>> as needed, nothing is long-term pinned.
>
>Yeah I've lost that part of the equation a bit since I've always thought that
>proper faulting support without preemption is not really possible. I guess
>those platforms completely stall on a fault until the ptes are all set up?

Correct. The GPU thread accessing the faulted page definitely stalls but processing can continue on other GPU threads. 

I don't remember offhand how much of the GPU=>ATC=>IOMMUv2=>system RAM path gets stalled (ie whether other HSA apps get blocked) but AFAIK graphics processing (assuming it is not using ATC path to system memory) is not affected. I will double-check that though, haven't asked internally for a couple of years but I do remember concluding something along the lines of "OK, that'll do" ;)
 
>-Daniel
>--
>Daniel Vetter
>Software Engineer, Intel Corporation
>+41 (0) 79 365 57 48 - http://blog.ffwll.ch

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-23 15:06                                         ` Bridgman, John
  0 siblings, 0 replies; 148+ messages in thread
From: Bridgman, John @ 2014-07-23 15:06 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Lewycky, Andrew, linux-mm, Daniel Vetter, Daenzer, Michel,
	linux-kernel, Sellek, Tom, Skidanov, Alexey, dri-devel,
	Andrew Morton



>-----Original Message-----
>From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
>Vetter
>Sent: Wednesday, July 23, 2014 10:42 AM
>To: Bridgman, John
>Cc: Daniel Vetter; Gabbay, Oded; Jerome Glisse; Christian König; David Airlie;
>Alex Deucher; Andrew Morton; Joerg Roedel; Lewycky, Andrew; Daenzer,
>Michel; Goz, Ben; Skidanov, Alexey; linux-kernel@vger.kernel.org; dri-
>devel@lists.freedesktop.org; linux-mm; Sellek, Tom
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
>On Wed, Jul 23, 2014 at 01:33:24PM +0000, Bridgman, John wrote:
>>
>>
>> >-----Original Message-----
>> >From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch]
>> >Sent: Wednesday, July 23, 2014 3:06 AM
>> >To: Gabbay, Oded
>> >Cc: Jerome Glisse; Christian König; David Airlie; Alex Deucher;
>> >Andrew Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew;
>> >Daenzer, Michel; Goz, Ben; Skidanov, Alexey;
>> >linux-kernel@vger.kernel.org; dri- devel@lists.freedesktop.org;
>> >linux-mm; Sellek, Tom
>> >Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>> >
>> >On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com>
>> >wrote:
>> >> On 22/07/14 14:15, Daniel Vetter wrote:
>> >>>
>> >>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>> >>>>
>> >>>> On 22/07/14 12:21, Daniel Vetter wrote:
>> >>>>>
>> >>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay
>> ><oded.gabbay@amd.com>
>> >>>>> wrote:
>> >>>>>>>
>> >>>>>>> Exactly, just prevent userspace from submitting more. And if
>> >>>>>>> you have misbehaving userspace that submits too much, reset
>> >>>>>>> the gpu and tell it that you're sorry but won't schedule any more
>work.
>> >>>>>>
>> >>>>>>
>> >>>>>> I'm not sure how you intend to know if a userspace misbehaves or
>not.
>> >>>>>> Can
>> >>>>>> you elaborate ?
>> >>>>>
>> >>>>>
>> >>>>> Well that's mostly policy, currently in i915 we only have a
>> >>>>> check for hangs, and if userspace hangs a bit too often then we stop
>it.
>> >>>>> I guess you can do that with the queue unmapping you've describe
>> >>>>> in reply to Jerome's mail.
>> >>>>> -Daniel
>> >>>>>
>> >>>> What do you mean by hang ? Like the tdr mechanism in Windows
>> >>>> (checks if a gpu job takes more than 2 seconds, I think, and if
>> >>>> so, terminates the job).
>> >>>
>> >>>
>> >>> Essentially yes. But we also have some hw features to kill jobs
>> >>> quicker, e.g. for media workloads.
>> >>> -Daniel
>> >>>
>> >>
>> >> Yeah, so this is what I'm talking about when I say that you and
>> >> Jerome come from a graphics POV and amdkfd come from a compute
>POV,
>> >> no
>> >offense intended.
>> >>
>> >> For compute jobs, we simply can't use this logic to terminate jobs.
>> >> Graphics are mostly Real-Time while compute jobs can take from a
>> >> few ms to a few hours!!! And I'm not talking about an entire
>> >> application runtime but on a single submission of jobs by the
>> >> userspace app. We have tests with jobs that take between 20-30
>> >> minutes to complete. In theory, we can even imagine a compute job
>> >> which takes 1 or 2 days (on
>> >larger APUs).
>> >>
>> >> Now, I understand the question of how do we prevent the compute job
>> >> from monopolizing the GPU, and internally here we have some ideas
>> >> that we will probably share in the next few days, but my point is
>> >> that I don't think we can terminate a compute job because it is
>> >> running for more
>> >than x seconds.
>> >> It is like you would terminate a CPU process which runs more than x
>> >seconds.
>> >>
>> >> I think this is a *very* important discussion (detecting a
>> >> misbehaved compute process) and I would like to continue it, but I
>> >> don't think moving the job submission from userspace control to
>> >> kernel control will solve this core problem.
>> >
>> >Well graphics gets away with cooperative scheduling since usually
>> >people want to see stuff within a few frames, so we can legitimately
>> >kill jobs after a fairly short timeout. Imo if you want to allow
>> >userspace to submit compute jobs that are atomic and take a few
>> >minutes to hours with no break-up in between and no hw means to
>> >preempt then that design is screwed up. We really can't tell the core
>> >vm that "sorry we will hold onto these gobloads of memory you really
>> >need now for another few hours". Pinning memory like that essentially
>without a time limit is restricted to root.
>>
>> Hi Daniel;
>>
>> I don't really understand the reference to "gobloads of memory".
>> Unlike radeon graphics, the userspace data for HSA applications is
>> maintained in pageable system memory and accessed via the IOMMUv2
>> (ATC/PRI). The
>> IOMMUv2 driver and mm subsystem takes care of faulting in memory pages
>> as needed, nothing is long-term pinned.
>
>Yeah I've lost that part of the equation a bit since I've always thought that
>proper faulting support without preemption is not really possible. I guess
>those platforms completely stall on a fault until the ptes are all set up?

Correct. The GPU thread accessing the faulted page definitely stalls, but processing can continue on other GPU threads.

I don't remember offhand how much of the GPU=>ATC=>IOMMUv2=>system RAM path gets stalled (i.e., whether other HSA apps get blocked), but AFAIK graphics processing (assuming it is not using the ATC path to system memory) is not affected. I will double-check that, though; I haven't asked internally for a couple of years, but I do remember concluding something along the lines of "OK, that'll do" ;)
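
As a rough illustration of that pageable-memory path, here is a minimal
sketch (not the amdkfd code) of how a driver can hand a process's
address space to IOMMUv2; MY_PASID_LIMIT, my_pasid and my_bind_process()
are made-up names, and the amd_iommu_* prototypes may differ slightly
between kernel versions:

#include <linux/pci.h>
#include <linux/sched.h>
#include <linux/amd-iommu.h>

#define MY_PASID_LIMIT 32	/* max concurrent address spaces, example only */

static int my_bind_process(struct pci_dev *pdev, int my_pasid)
{
	int r;

	/* Declare that this device will use PASIDs / PRI faulting. */
	r = amd_iommu_init_device(pdev, MY_PASID_LIMIT);
	if (r < 0)
		return r;

	/*
	 * Bind the current process's mm to the chosen PASID.  Device
	 * accesses tagged with this PASID then walk the CPU page tables;
	 * misses raise PRI requests that are serviced like CPU page
	 * faults, so nothing has to stay long-term pinned.
	 */
	r = amd_iommu_bind_pasid(pdev, my_pasid, current);
	if (r < 0)
		amd_iommu_free_device(pdev);

	return r;
}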
 
>-Daniel
>--
>Daniel Vetter
>Software Engineer, Intel Corporation
>+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 15:06                                         ` Bridgman, John
  (?)
@ 2014-07-23 15:12                                           ` Bridgman, John
  -1 siblings, 0 replies; 148+ messages in thread
From: Bridgman, John @ 2014-07-23 15:12 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Lewycky, Andrew, linux-mm, Daniel Vetter, Daenzer, Michel,
	linux-kernel, Sellek, Tom, Skidanov, Alexey, dri-devel,
	Andrew Morton



>-----Original Message-----
>From: dri-devel [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf
>Of Bridgman, John
>Sent: Wednesday, July 23, 2014 11:07 AM
>To: Daniel Vetter
>Cc: Lewycky, Andrew; linux-mm; Daniel Vetter; Daenzer, Michel; linux-
>kernel@vger.kernel.org; Sellek, Tom; Skidanov, Alexey; dri-
>devel@lists.freedesktop.org; Andrew Morton
>Subject: RE: [PATCH v2 00/25] AMDKFD kernel driver
>
>
>
>>-----Original Message-----
>>From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
>>Vetter
>>Sent: Wednesday, July 23, 2014 10:42 AM
>>To: Bridgman, John
>>Cc: Daniel Vetter; Gabbay, Oded; Jerome Glisse; Christian König; David
>>Airlie; Alex Deucher; Andrew Morton; Joerg Roedel; Lewycky, Andrew;
>>Daenzer, Michel; Goz, Ben; Skidanov, Alexey;
>>linux-kernel@vger.kernel.org; dri- devel@lists.freedesktop.org;
>>linux-mm; Sellek, Tom
>>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>>
>>On Wed, Jul 23, 2014 at 01:33:24PM +0000, Bridgman, John wrote:
>>>
>>>
>>> >-----Original Message-----
>>> >From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch]
>>> >Sent: Wednesday, July 23, 2014 3:06 AM
>>> >To: Gabbay, Oded
>>> >Cc: Jerome Glisse; Christian König; David Airlie; Alex Deucher;
>>> >Andrew Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew;
>>> >Daenzer, Michel; Goz, Ben; Skidanov, Alexey;
>>> >linux-kernel@vger.kernel.org; dri- devel@lists.freedesktop.org;
>>> >linux-mm; Sellek, Tom
>>> >Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>>> >
>>> >On Wed, Jul 23, 2014 at 8:50 AM, Oded Gabbay <oded.gabbay@amd.com>
>>> >wrote:
>>> >> On 22/07/14 14:15, Daniel Vetter wrote:
>>> >>>
>>> >>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>>> >>>>
>>> >>>> On 22/07/14 12:21, Daniel Vetter wrote:
>>> >>>>>
>>> >>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay
>>> ><oded.gabbay@amd.com>
>>> >>>>> wrote:
>>> >>>>>>>
>>> >>>>>>> Exactly, just prevent userspace from submitting more. And if
>>> >>>>>>> you have misbehaving userspace that submits too much, reset
>>> >>>>>>> the gpu and tell it that you're sorry but won't schedule any
>>> >>>>>>> more
>>work.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> I'm not sure how you intend to know if a userspace misbehaves
>>> >>>>>> or
>>not.
>>> >>>>>> Can
>>> >>>>>> you elaborate ?
>>> >>>>>
>>> >>>>>
>>> >>>>> Well that's mostly policy, currently in i915 we only have a
>>> >>>>> check for hangs, and if userspace hangs a bit too often then we
>>> >>>>> stop
>>it.
>>> >>>>> I guess you can do that with the queue unmapping you've
>>> >>>>> describe in reply to Jerome's mail.
>>> >>>>> -Daniel
>>> >>>>>
>>> >>>> What do you mean by hang ? Like the tdr mechanism in Windows
>>> >>>> (checks if a gpu job takes more than 2 seconds, I think, and if
>>> >>>> so, terminates the job).
>>> >>>
>>> >>>
>>> >>> Essentially yes. But we also have some hw features to kill jobs
>>> >>> quicker, e.g. for media workloads.
>>> >>> -Daniel
>>> >>>
>>> >>
>>> >> Yeah, so this is what I'm talking about when I say that you and
>>> >> Jerome come from a graphics POV and amdkfd come from a compute
>>POV,
>>> >> no
>>> >offense intended.
>>> >>
>>> >> For compute jobs, we simply can't use this logic to terminate jobs.
>>> >> Graphics are mostly Real-Time while compute jobs can take from a
>>> >> few ms to a few hours!!! And I'm not talking about an entire
>>> >> application runtime but on a single submission of jobs by the
>>> >> userspace app. We have tests with jobs that take between 20-30
>>> >> minutes to complete. In theory, we can even imagine a compute job
>>> >> which takes 1 or 2 days (on
>>> >larger APUs).
>>> >>
>>> >> Now, I understand the question of how do we prevent the compute
>>> >> job from monopolizing the GPU, and internally here we have some
>>> >> ideas that we will probably share in the next few days, but my
>>> >> point is that I don't think we can terminate a compute job because
>>> >> it is running for more
>>> >than x seconds.
>>> >> It is like you would terminate a CPU process which runs more than
>>> >> x
>>> >seconds.
>>> >>
>>> >> I think this is a *very* important discussion (detecting a
>>> >> misbehaved compute process) and I would like to continue it, but I
>>> >> don't think moving the job submission from userspace control to
>>> >> kernel control will solve this core problem.
>>> >
>>> >Well graphics gets away with cooperative scheduling since usually
>>> >people want to see stuff within a few frames, so we can legitimately
>>> >kill jobs after a fairly short timeout. Imo if you want to allow
>>> >userspace to submit compute jobs that are atomic and take a few
>>> >minutes to hours with no break-up in between and no hw means to
>>> >preempt then that design is screwed up. We really can't tell the
>>> >core vm that "sorry we will hold onto these gobloads of memory you
>>> >really need now for another few hours". Pinning memory like that
>>> >essentially
>>without a time limit is restricted to root.
>>>
>>> Hi Daniel;
>>>
>>> I don't really understand the reference to "gobloads of memory".
>>> Unlike radeon graphics, the userspace data for HSA applications is
>>> maintained in pageable system memory and accessed via the IOMMUv2
>>> (ATC/PRI). The
>>> IOMMUv2 driver and mm subsystem takes care of faulting in memory
>>> pages as needed, nothing is long-term pinned.
>>
>>Yeah I've lost that part of the equation a bit since I've always
>>thought that proper faulting support without preemption is not really
>>possible. I guess those platforms completely stall on a fault until the ptes
>are all set up?
>
>Correct. The GPU thread accessing the faulted page definitely stalls but
>processing can continue on other GPU threads.

Sorry, this may be oversimplified -- I'll double-check that internally. We may stall the CU for the duration of the fault processing. Stay tuned.
 
>
>I don't remember offhand how much of the GPU=>ATC=>IOMMUv2=>system
>RAM path gets stalled (ie whether other HSA apps get blocked) but AFAIK
>graphics processing (assuming it is not using ATC path to system memory) is
>not affected. I will double-check that though, haven't asked internally for a
>couple of years but I do remember concluding something along the lines of
>"OK, that'll do" ;)
>
>>-Daniel
>>--
>>Daniel Vetter
>>Software Engineer, Intel Corporation
>>+41 (0) 79 365 57 48 - http://blog.ffwll.ch
>_______________________________________________
>dri-devel mailing list
>dri-devel@lists.freedesktop.org
>http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 14:56                                     ` Jerome Glisse
@ 2014-07-23 19:49                                       ` Alex Deucher
  -1 siblings, 0 replies; 148+ messages in thread
From: Alex Deucher @ 2014-07-23 19:49 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christian König, Oded Gabbay, David Airlie, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, linux-kernel, dri-devel, linux-mm,
	Sellek, Tom

On Wed, Jul 23, 2014 at 10:56 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Wed, Jul 23, 2014 at 09:04:24AM +0200, Christian König wrote:
>> Am 23.07.2014 08:50, schrieb Oded Gabbay:
>> >On 22/07/14 14:15, Daniel Vetter wrote:
>> >>On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>> >>>On 22/07/14 12:21, Daniel Vetter wrote:
>> >>>>On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com>
>> >>>>wrote:
>> >>>>>>Exactly, just prevent userspace from submitting more. And if you
>> >>>>>>have
>> >>>>>>misbehaving userspace that submits too much, reset the gpu and
>> >>>>>>tell it
>> >>>>>>that you're sorry but won't schedule any more work.
>> >>>>>
>> >>>>>I'm not sure how you intend to know if a userspace misbehaves or
>> >>>>>not. Can
>> >>>>>you elaborate ?
>> >>>>
>> >>>>Well that's mostly policy, currently in i915 we only have a check for
>> >>>>hangs, and if userspace hangs a bit too often then we stop it. I guess
>> >>>>you can do that with the queue unmapping you've describe in reply to
>> >>>>Jerome's mail.
>> >>>>-Daniel
>> >>>>
>> >>>What do you mean by hang ? Like the tdr mechanism in Windows (checks
>> >>>if a
>> >>>gpu job takes more than 2 seconds, I think, and if so, terminates the
>> >>>job).
>> >>
>> >>Essentially yes. But we also have some hw features to kill jobs quicker,
>> >>e.g. for media workloads.
>> >>-Daniel
>> >>
>> >
>> >Yeah, so this is what I'm talking about when I say that you and Jerome
>> >come from a graphics POV and amdkfd come from a compute POV, no offense
>> >intended.
>> >
>> >For compute jobs, we simply can't use this logic to terminate jobs.
>> >Graphics are mostly Real-Time while compute jobs can take from a few ms to
>> >a few hours!!! And I'm not talking about an entire application runtime but
>> >on a single submission of jobs by the userspace app. We have tests with
>> >jobs that take between 20-30 minutes to complete. In theory, we can even
>> >imagine a compute job which takes 1 or 2 days (on larger APUs).
>> >
>> >Now, I understand the question of how do we prevent the compute job from
>> >monopolizing the GPU, and internally here we have some ideas that we will
>> >probably share in the next few days, but my point is that I don't think we
>> >can terminate a compute job because it is running for more than x seconds.
>> >It is like you would terminate a CPU process which runs more than x
>> >seconds.
>>
>> Yeah that's why one of the first things I've did was making the timeout
>> configurable in the radeon module.
>>
>> But it doesn't necessary needs be a timeout, we should also kill a running
>> job submission if the CPU process associated with the job is killed.
>>
>> >I think this is a *very* important discussion (detecting a misbehaved
>> >compute process) and I would like to continue it, but I don't think moving
>> >the job submission from userspace control to kernel control will solve
>> >this core problem.
>>
>> We need to get this topic solved, otherwise the driver won't make it
>> upstream. Allowing userpsace to monopolizing resources either memory, CPU or
>> GPU time or special things like counters etc... is a strict no go for a
>> kernel module.
>>
>> I agree that moving the job submission from userpsace to kernel wouldn't
>> solve this problem. As Daniel and I pointed out now multiple times it's
>> rather easily possible to prevent further job submissions from userspace, in
>> the worst case by unmapping the doorbell page.
>>
>> Moving it to an IOCTL would just make it a bit less complicated.
>>
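
For concreteness, a minimal sketch (not from these patches) of the
doorbell-revocation idea above; my_revoke_doorbells() and its offset and
size arguments are illustrative, only unmap_mapping_range() is an
existing kernel helper:

#include <linux/fs.h>
#include <linux/mm.h>

static void my_revoke_doorbells(struct inode *dev_inode,
				loff_t doorbell_off, loff_t doorbell_size)
{
	/*
	 * Zap every PTE mapping this range of the char device file.  The
	 * next doorbell write from userspace faults, and the fault handler
	 * can refuse to re-establish the mapping for the offending process.
	 */
	unmap_mapping_range(dev_inode->i_mapping, doorbell_off,
			    doorbell_size, 1);
}
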
>
> It is not only complexity, my main concern is not really the amount of memory
> pinned (well it would be if it was vram which by the way you need to remove
> the api that allow to allocate vram just so that it clearly shows that vram is
> not allowed).
>
> Issue is with GPU address space fragmentation, new process hsa queue might be
> allocated in middle of gtt space and stays there for so long that i will forbid
> any big buffer to be bind to gtt. Thought with virtual address space for graphics
> this is less of an issue and only the kernel suffer but still it might block the
> kernel from evicting some VRAM because i can not bind a system buffer big enough
> to GTT because some GTT space is taken by some HSA queue.
>
> To mitigate this at very least, you need to implement special memory allocation
> inside ttm and radeon to force this per queue to be allocate for instance from
> top of GTT space. Like reserve top 8M of GTT and have it grow/shrink depending
> on number of queue.

This same sort of thing can already happen with gfx, although it's
less likely since the workloads are usually shorter.  That said, we
can issue compute jobs today with the current CS ioctl and we may
end up with a buffer pinned in an inopportune spot.  I'm not sure
reserving a static pool at init really helps that much.  If you
aren't running any HSA apps, it just wastes gtt space.  So you have a
trade-off: waste memory on a possibly unused MQD descriptor pool, or
allocate MQD descriptors on the fly but possibly end up with a
long-running one stuck in a bad location.  Additionally, we already
have a ttm flag for whether we want to allocate from the top or
bottom of the pool.  We use it today for gfx depending on the buffer
size (e.g., buffers smaller than 512k are allocated from the bottom
and buffers larger than 512k are allocated from the top).  So we
can't really resize a static buffer easily as there may already be
other buffers pinned up there.
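
To make that top-vs-bottom split concrete, a hedged sketch of the
placement policy; my_gtt_placement_flags() and the 512k threshold simply
mirror the description above, and how the flags are fed into
ttm_placement differs between kernel versions:

#include <linux/types.h>
#include <drm/ttm/ttm_placement.h>

#define MY_TOPDOWN_THRESHOLD	(512 * 1024)

static u32 my_gtt_placement_flags(unsigned long bo_size)
{
	u32 flags = TTM_PL_FLAG_CACHED | TTM_PL_FLAG_TT;

	/*
	 * Large (and typically long-lived) buffers are placed from the top
	 * of the range so they don't fragment the bottom, where small,
	 * frequently recycled buffers live.
	 */
	if (bo_size > MY_TOPDOWN_THRESHOLD)
		flags |= TTM_PL_FLAG_TOPDOWN;

	return flags;
}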

If we add sysfs controls to limit the number of HSA processes and the
number of queues per process, you could use those to dynamically limit
the maximum amount of gtt memory that would be in use for MQD
descriptors.
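
A hedged sketch of what such limit knobs could look like; the parameter
names and defaults are invented for illustration rather than taken from
the amdkfd patches:

#include <linux/module.h>
#include <linux/moduleparam.h>

static int max_hsa_processes = 32;
module_param(max_hsa_processes, int, 0444);
MODULE_PARM_DESC(max_hsa_processes,
		 "Maximum number of HSA processes (bounds MQD pool size)");

static int max_queues_per_process = 128;
module_param(max_queues_per_process, int, 0444);
MODULE_PARM_DESC(max_queues_per_process,
		 "Maximum number of queues per HSA process");

/* Upper bound on GTT space the MQD pool may consume, one MQD per queue. */
static unsigned long my_mqd_pool_limit(unsigned long mqd_size)
{
	return (unsigned long)max_hsa_processes *
	       max_queues_per_process * mqd_size;
}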

Alex

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 19:49                                       ` Alex Deucher
  (?)
@ 2014-07-23 20:25                                         ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-23 20:25 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, Oded Gabbay, David Airlie, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, linux-kernel, dri-devel, linux-mm,
	Sellek, Tom

On Wed, Jul 23, 2014 at 03:49:57PM -0400, Alex Deucher wrote:
> On Wed, Jul 23, 2014 at 10:56 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Wed, Jul 23, 2014 at 09:04:24AM +0200, Christian König wrote:
> >> Am 23.07.2014 08:50, schrieb Oded Gabbay:
> >> >On 22/07/14 14:15, Daniel Vetter wrote:
> >> >>On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
> >> >>>On 22/07/14 12:21, Daniel Vetter wrote:
> >> >>>>On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com>
> >> >>>>wrote:
> >> >>>>>>Exactly, just prevent userspace from submitting more. And if you
> >> >>>>>>have
> >> >>>>>>misbehaving userspace that submits too much, reset the gpu and
> >> >>>>>>tell it
> >> >>>>>>that you're sorry but won't schedule any more work.
> >> >>>>>
> >> >>>>>I'm not sure how you intend to know if a userspace misbehaves or
> >> >>>>>not. Can
> >> >>>>>you elaborate ?
> >> >>>>
> >> >>>>Well that's mostly policy, currently in i915 we only have a check for
> >> >>>>hangs, and if userspace hangs a bit too often then we stop it. I guess
> >> >>>>you can do that with the queue unmapping you've describe in reply to
> >> >>>>Jerome's mail.
> >> >>>>-Daniel
> >> >>>>
> >> >>>What do you mean by hang ? Like the tdr mechanism in Windows (checks
> >> >>>if a
> >> >>>gpu job takes more than 2 seconds, I think, and if so, terminates the
> >> >>>job).
> >> >>
> >> >>Essentially yes. But we also have some hw features to kill jobs quicker,
> >> >>e.g. for media workloads.
> >> >>-Daniel
> >> >>
> >> >
> >> >Yeah, so this is what I'm talking about when I say that you and Jerome
> >> >come from a graphics POV and amdkfd come from a compute POV, no offense
> >> >intended.
> >> >
> >> >For compute jobs, we simply can't use this logic to terminate jobs.
> >> >Graphics are mostly Real-Time while compute jobs can take from a few ms to
> >> >a few hours!!! And I'm not talking about an entire application runtime but
> >> >on a single submission of jobs by the userspace app. We have tests with
> >> >jobs that take between 20-30 minutes to complete. In theory, we can even
> >> >imagine a compute job which takes 1 or 2 days (on larger APUs).
> >> >
> >> >Now, I understand the question of how do we prevent the compute job from
> >> >monopolizing the GPU, and internally here we have some ideas that we will
> >> >probably share in the next few days, but my point is that I don't think we
> >> >can terminate a compute job because it is running for more than x seconds.
> >> >It is like you would terminate a CPU process which runs more than x
> >> >seconds.
> >>
> >> Yeah that's why one of the first things I've did was making the timeout
> >> configurable in the radeon module.
> >>
> >> But it doesn't necessary needs be a timeout, we should also kill a running
> >> job submission if the CPU process associated with the job is killed.
> >>
> >> >I think this is a *very* important discussion (detecting a misbehaved
> >> >compute process) and I would like to continue it, but I don't think moving
> >> >the job submission from userspace control to kernel control will solve
> >> >this core problem.
> >>
> >> We need to get this topic solved, otherwise the driver won't make it
> >> upstream. Allowing userpsace to monopolizing resources either memory, CPU or
> >> GPU time or special things like counters etc... is a strict no go for a
> >> kernel module.
> >>
> >> I agree that moving the job submission from userpsace to kernel wouldn't
> >> solve this problem. As Daniel and I pointed out now multiple times it's
> >> rather easily possible to prevent further job submissions from userspace, in
> >> the worst case by unmapping the doorbell page.
> >>
> >> Moving it to an IOCTL would just make it a bit less complicated.
> >>
> >
> > It is not only complexity, my main concern is not really the amount of memory
> > pinned (well it would be if it was vram which by the way you need to remove
> > the api that allow to allocate vram just so that it clearly shows that vram is
> > not allowed).
> >
> > Issue is with GPU address space fragmentation, new process hsa queue might be
> > allocated in middle of gtt space and stays there for so long that i will forbid
> > any big buffer to be bind to gtt. Thought with virtual address space for graphics
> > this is less of an issue and only the kernel suffer but still it might block the
> > kernel from evicting some VRAM because i can not bind a system buffer big enough
> > to GTT because some GTT space is taken by some HSA queue.
> >
> > To mitigate this at very least, you need to implement special memory allocation
> > inside ttm and radeon to force this per queue to be allocate for instance from
> > top of GTT space. Like reserve top 8M of GTT and have it grow/shrink depending
> > on number of queue.
> 
> This same sort of thing can already happen with gfx, although it's
> less likely since the workloads are usually shorter.  That said, we
> can issue compute jobs right today with the current CS ioctl and we
> may end up with a buffer pinned in an inopportune spot.

I thought compute was using the virtual address space (well, on cayman and newer at least).

> I'm not sure
> reserving a static pool at init really helps that much.  If you aren't
> using any HSA apps, it just wastes gtt space.  So you have a trade
> off: waste memory for a possibly unused MQD descriptor pool or
> allocate MQD descriptors on the fly, but possibly end up with a long
> running one stuck in a bad location.  Additionally, we already have a
> ttm flag for whether we want to allocate from the top or bottom of the
> pool.  We use it today for gfx depending on the buffer (e.g., buffers
> smaller than 512k are allocated from the bottom and buffers larger
> than 512 are allocated from the top).  So we can't really re-size a
> static buffer easily as there may already be other buffers pinned up
> there.

Again, IIRC only the kernel uses the GTT space; everything else
(userspace) uses virtual address space, or am I forgetting something?

My point was not so much to make it static, but to enforce allocating
from one end of the address space and to have the reservation
shrink/grow depending on usage, forcing anything else out of that
range.
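
As a back-of-the-envelope sketch of that grow/shrink reservation (pure
bookkeeping, invented names, not radeon/ttm code):

#include <linux/kernel.h>
#include <linux/types.h>

struct my_hsa_gtt_reserve {
	u64 gtt_size;		/* total GTT aperture size */
	u64 reserved;		/* bytes currently held at the top */
	u64 chunk;		/* growth granularity, e.g. 8 MiB */
	unsigned int queues;	/* live HSA queues */
	u64 bytes_per_queue;	/* MQD + ring + whatever each queue pins */
};

/* Lowest GTT offset that normal (non-HSA) allocations may still use. */
static u64 my_reserve_floor(const struct my_hsa_gtt_reserve *r)
{
	return r->gtt_size - r->reserved;
}

/* Called on queue create/destroy; resizes the reserved top-of-GTT range. */
static void my_reserve_update(struct my_hsa_gtt_reserve *r, int delta)
{
	u64 need;

	r->queues += delta;
	need = (u64)r->queues * r->bytes_per_queue;

	/*
	 * Round the reservation up to whole chunks so it grows/shrinks in
	 * steps instead of on every single queue change.
	 */
	r->reserved = roundup(need, r->chunk);
}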

On GPUs with VM, the only thing left using the "global" GTT is the
kernel, which uses it for rings and for moving buffers around. I would
assume that pinning ring buffers at the beginning of the address space,
no matter what their size is, would be a good idea, as those will not
fragment anyway; their lifetime is the lifetime of the driver.

My point is that the HSA queue buffers can have a lifetime way longer
than anything we have now; today we can bind/unbind any buffer between
CS submissions, modulo OpenCL tasks.

> 
> If we add sysfs controls to limit the amount of hsa processes, and
> queues per process so you could use this to dynamically limit the max
> amount gtt memory that would be in use for MQD descriptors.

No, this cannot be set dynamically; once a process has created its
queue it has it, and I see no channel to tell userspace: "sorry buddy,
but no more room for you".

> 
> Alex

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-23 20:25                                         ` Jerome Glisse
  0 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-23 20:25 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, Oded Gabbay, David Airlie, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, linux-kernel, dri-devel, linux-mm,
	Sellek, Tom

On Wed, Jul 23, 2014 at 03:49:57PM -0400, Alex Deucher wrote:
> On Wed, Jul 23, 2014 at 10:56 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Wed, Jul 23, 2014 at 09:04:24AM +0200, Christian Konig wrote:
> >> Am 23.07.2014 08:50, schrieb Oded Gabbay:
> >> >On 22/07/14 14:15, Daniel Vetter wrote:
> >> >>On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
> >> >>>On 22/07/14 12:21, Daniel Vetter wrote:
> >> >>>>On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com>
> >> >>>>wrote:
> >> >>>>>>Exactly, just prevent userspace from submitting more. And if you
> >> >>>>>>have
> >> >>>>>>misbehaving userspace that submits too much, reset the gpu and
> >> >>>>>>tell it
> >> >>>>>>that you're sorry but won't schedule any more work.
> >> >>>>>
> >> >>>>>I'm not sure how you intend to know if a userspace misbehaves or
> >> >>>>>not. Can
> >> >>>>>you elaborate ?
> >> >>>>
> >> >>>>Well that's mostly policy, currently in i915 we only have a check for
> >> >>>>hangs, and if userspace hangs a bit too often then we stop it. I guess
> >> >>>>you can do that with the queue unmapping you've describe in reply to
> >> >>>>Jerome's mail.
> >> >>>>-Daniel
> >> >>>>
> >> >>>What do you mean by hang ? Like the tdr mechanism in Windows (checks
> >> >>>if a
> >> >>>gpu job takes more than 2 seconds, I think, and if so, terminates the
> >> >>>job).
> >> >>
> >> >>Essentially yes. But we also have some hw features to kill jobs quicker,
> >> >>e.g. for media workloads.
> >> >>-Daniel
> >> >>
> >> >
> >> >Yeah, so this is what I'm talking about when I say that you and Jerome
> >> >come from a graphics POV and amdkfd come from a compute POV, no offense
> >> >intended.
> >> >
> >> >For compute jobs, we simply can't use this logic to terminate jobs.
> >> >Graphics are mostly Real-Time while compute jobs can take from a few ms to
> >> >a few hours!!! And I'm not talking about an entire application runtime but
> >> >on a single submission of jobs by the userspace app. We have tests with
> >> >jobs that take between 20-30 minutes to complete. In theory, we can even
> >> >imagine a compute job which takes 1 or 2 days (on larger APUs).
> >> >
> >> >Now, I understand the question of how do we prevent the compute job from
> >> >monopolizing the GPU, and internally here we have some ideas that we will
> >> >probably share in the next few days, but my point is that I don't think we
> >> >can terminate a compute job because it is running for more than x seconds.
> >> >It is like you would terminate a CPU process which runs more than x
> >> >seconds.
> >>
> >> Yeah that's why one of the first things I've did was making the timeout
> >> configurable in the radeon module.
> >>
> >> But it doesn't necessary needs be a timeout, we should also kill a running
> >> job submission if the CPU process associated with the job is killed.
> >>
> >> >I think this is a *very* important discussion (detecting a misbehaved
> >> >compute process) and I would like to continue it, but I don't think moving
> >> >the job submission from userspace control to kernel control will solve
> >> >this core problem.
> >>
> >> We need to get this topic solved, otherwise the driver won't make it
> >> upstream. Allowing userpsace to monopolizing resources either memory, CPU or
> >> GPU time or special things like counters etc... is a strict no go for a
> >> kernel module.
> >>
> >> I agree that moving the job submission from userpsace to kernel wouldn't
> >> solve this problem. As Daniel and I pointed out now multiple times it's
> >> rather easily possible to prevent further job submissions from userspace, in
> >> the worst case by unmapping the doorbell page.
> >>
> >> Moving it to an IOCTL would just make it a bit less complicated.
> >>
> >
> > It is not only complexity, my main concern is not really the amount of memory
> > pinned (well it would be if it was vram which by the way you need to remove
> > the api that allow to allocate vram just so that it clearly shows that vram is
> > not allowed).
> >
> > Issue is with GPU address space fragmentation, new process hsa queue might be
> > allocated in middle of gtt space and stays there for so long that i will forbid
> > any big buffer to be bind to gtt. Thought with virtual address space for graphics
> > this is less of an issue and only the kernel suffer but still it might block the
> > kernel from evicting some VRAM because i can not bind a system buffer big enough
> > to GTT because some GTT space is taken by some HSA queue.
> >
> > To mitigate this at very least, you need to implement special memory allocation
> > inside ttm and radeon to force this per queue to be allocate for instance from
> > top of GTT space. Like reserve top 8M of GTT and have it grow/shrink depending
> > on number of queue.
> 
> This same sort of thing can already happen with gfx, although it's
> less likely since the workloads are usually shorter.  That said, we
> can issue compute jobs right today with the current CS ioctl and we
> may end up with a buffer pinned in an inopportune spot.

I thought compute was using virtual address space (well on > cayman at least).

> I'm not sure
> reserving a static pool at init really helps that much.  If you aren't
> using any HSA apps, it just wastes gtt space.  So you have a trade
> off: waste memory for a possibly unused MQD descriptor pool or
> allocate MQD descriptors on the fly, but possibly end up with a long
> running one stuck in a bad location.  Additionally, we already have a
> ttm flag for whether we want to allocate from the top or bottom of the
> pool.  We use it today for gfx depending on the buffer (e.g., buffers
> smaller than 512k are allocated from the bottom and buffers larger
> than 512 are allocated from the top).  So we can't really re-size a
> static buffer easily as there may already be other buffers pinned up
> there.

Again here iirc only kernel use the GTT space everything else (userspace)
is using virtual address space or am i forgeting something ?

My point was not so much to be static but to enforce doing it from one
end of the address space and to have shrink/grow depending on usage forcing
anything else out of that range.

On VM GPU only thing left using the "global" GTT is the kernel, it uses it
for ring and for moving buffer around. I would assume that pining ring buffers
at begining of address space no matter what there size is would be a good idea
as anyway those will not fragment ie there lifetime is the lifetime of the
driver.

My point is that all the HSA queue buffer can have a lifetime way bigger than
anything we have now, really now we can bind/unbind any buffer btw cs submission
modulo OpenCL task.

> 
> If we add sysfs controls to limit the amount of hsa processes, and
> queues per process so you could use this to dynamically limit the max
> amount gtt memory that would be in use for MQD descriptors.

No this can not be set dynamicly, once a process has created its queue
it has it and i see no channel to tell userspace: "sorry buddy but no
more room for you"

> 
> Alex

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
@ 2014-07-23 20:25                                         ` Jerome Glisse
  0 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-23 20:25 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, Oded Gabbay, David Airlie, Andrew Morton,
	John Bridgman, Joerg Roedel, Andrew Lewycky, Michel Dänzer,
	Ben Goz, Alexey Skidanov, linux-kernel, dri-devel, linux-mm,
	Sellek, Tom

On Wed, Jul 23, 2014 at 03:49:57PM -0400, Alex Deucher wrote:
> On Wed, Jul 23, 2014 at 10:56 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Wed, Jul 23, 2014 at 09:04:24AM +0200, Christian König wrote:
> >> Am 23.07.2014 08:50, schrieb Oded Gabbay:
> >> >On 22/07/14 14:15, Daniel Vetter wrote:
> >> >>On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
> >> >>>On 22/07/14 12:21, Daniel Vetter wrote:
> >> >>>>On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@amd.com>
> >> >>>>wrote:
> >> >>>>>>Exactly, just prevent userspace from submitting more. And if you
> >> >>>>>>have
> >> >>>>>>misbehaving userspace that submits too much, reset the gpu and
> >> >>>>>>tell it
> >> >>>>>>that you're sorry but won't schedule any more work.
> >> >>>>>
> >> >>>>>I'm not sure how you intend to know if a userspace misbehaves or
> >> >>>>>not. Can
> >> >>>>>you elaborate ?
> >> >>>>
> >> >>>>Well that's mostly policy, currently in i915 we only have a check for
> >> >>>>hangs, and if userspace hangs a bit too often then we stop it. I guess
> >> >>>>you can do that with the queue unmapping you've describe in reply to
> >> >>>>Jerome's mail.
> >> >>>>-Daniel
> >> >>>>
> >> >>>What do you mean by hang ? Like the tdr mechanism in Windows (checks
> >> >>>if a
> >> >>>gpu job takes more than 2 seconds, I think, and if so, terminates the
> >> >>>job).
> >> >>
> >> >>Essentially yes. But we also have some hw features to kill jobs quicker,
> >> >>e.g. for media workloads.
> >> >>-Daniel
> >> >>
> >> >
> >> >Yeah, so this is what I'm talking about when I say that you and Jerome
> >> >come from a graphics POV and amdkfd come from a compute POV, no offense
> >> >intended.
> >> >
> >> >For compute jobs, we simply can't use this logic to terminate jobs.
> >> >Graphics are mostly Real-Time while compute jobs can take from a few ms to
> >> >a few hours!!! And I'm not talking about an entire application runtime but
> >> >on a single submission of jobs by the userspace app. We have tests with
> >> >jobs that take between 20-30 minutes to complete. In theory, we can even
> >> >imagine a compute job which takes 1 or 2 days (on larger APUs).
> >> >
> >> >Now, I understand the question of how do we prevent the compute job from
> >> >monopolizing the GPU, and internally here we have some ideas that we will
> >> >probably share in the next few days, but my point is that I don't think we
> >> >can terminate a compute job because it is running for more than x seconds.
> >> >It is like you would terminate a CPU process which runs more than x
> >> >seconds.
> >>
> >> Yeah that's why one of the first things I've did was making the timeout
> >> configurable in the radeon module.
> >>
> >> But it doesn't necessary needs be a timeout, we should also kill a running
> >> job submission if the CPU process associated with the job is killed.
> >>
> >> >I think this is a *very* important discussion (detecting a misbehaved
> >> >compute process) and I would like to continue it, but I don't think moving
> >> >the job submission from userspace control to kernel control will solve
> >> >this core problem.
> >>
> >> We need to get this topic solved, otherwise the driver won't make it
> >> upstream. Allowing userpsace to monopolizing resources either memory, CPU or
> >> GPU time or special things like counters etc... is a strict no go for a
> >> kernel module.
> >>
> >> I agree that moving the job submission from userpsace to kernel wouldn't
> >> solve this problem. As Daniel and I pointed out now multiple times it's
> >> rather easily possible to prevent further job submissions from userspace, in
> >> the worst case by unmapping the doorbell page.
> >>
> >> Moving it to an IOCTL would just make it a bit less complicated.
> >>
> >
> > It is not only complexity, my main concern is not really the amount of memory
> > pinned (well it would be if it was vram which by the way you need to remove
> > the api that allow to allocate vram just so that it clearly shows that vram is
> > not allowed).
> >
> > Issue is with GPU address space fragmentation, new process hsa queue might be
> > allocated in middle of gtt space and stays there for so long that i will forbid
> > any big buffer to be bind to gtt. Thought with virtual address space for graphics
> > this is less of an issue and only the kernel suffer but still it might block the
> > kernel from evicting some VRAM because i can not bind a system buffer big enough
> > to GTT because some GTT space is taken by some HSA queue.
> >
> > To mitigate this at very least, you need to implement special memory allocation
> > inside ttm and radeon to force this per queue to be allocate for instance from
> > top of GTT space. Like reserve top 8M of GTT and have it grow/shrink depending
> > on number of queue.
> 
> This same sort of thing can already happen with gfx, although it's
> less likely since the workloads are usually shorter.  That said, we
> can issue compute jobs right today with the current CS ioctl and we
> may end up with a buffer pinned in an inopportune spot.

I thought compute was using virtual address space (well on > cayman at least).

> I'm not sure
> reserving a static pool at init really helps that much.  If you aren't
> running any HSA apps, it just wastes gtt space.  So you have a
> trade-off: waste memory on a possibly unused MQD pool, or allocate
> MQDs on the fly but possibly end up with a long-running one stuck in a
> bad location.  Additionally, we already have a ttm flag for whether we
> want to allocate from the top or bottom of the pool.  We use it today
> for gfx depending on the buffer (e.g., buffers smaller than 512k are
> allocated from the bottom and buffers larger than 512k are allocated
> from the top).  So we can't really re-size a static buffer easily, as
> there may already be other buffers pinned up there.

Again, here IIRC only the kernel uses the GTT space; everything else
(userspace) is using virtual address space, or am I forgetting something?

My point was not so much to make it static, but to enforce doing it from one
end of the address space and to have it shrink/grow depending on usage,
forcing anything else out of that range.

On a VM GPU the only thing left using the "global" GTT is the kernel; it uses
it for rings and for moving buffers around. I would assume that pinning ring
buffers at the beginning of the address space, no matter what their size is,
would be a good idea, as those will not fragment anyway, i.e. their lifetime
is the lifetime of the driver.

My point is that the HSA queue buffers can have a lifetime far longer than
anything we have now; today we can bind/unbind any buffer between CS
submissions, modulo OpenCL tasks.

> 
> If we add sysfs controls to limit the number of hsa processes and
> queues per process, you could use those to dynamically limit the max
> amount of gtt memory in use for MQDs.

No, this cannot be set dynamically. Once a process has created its queue
it has it, and I see no channel to tell userspace: "sorry buddy, but no
more room for you".

> 
> Alex


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-21 17:05             ` Daniel Vetter
                               ` (2 preceding siblings ...)
  (?)
@ 2014-07-23 20:59             ` Jesse Barnes
  2014-07-23 21:46               ` Bridgman, John
  -1 siblings, 1 reply; 148+ messages in thread
From: Jesse Barnes @ 2014-07-23 20:59 UTC (permalink / raw)
  To: dri-devel

On Mon, 21 Jul 2014 19:05:46 +0200
daniel at ffwll.ch (Daniel Vetter) wrote:

> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
> > On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
> > > On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König wrote:
> > > > Am 21.07.2014 14:36, schrieb Oded Gabbay:
> > > > >On 20/07/14 20:46, Jerome Glisse wrote:

[snip!!]

> > > > 
> > > > The main questions here are if it's avoid able to pin down the memory and if
> > > > the memory is pinned down at driver load, by request from userspace or by
> > > > anything else.
> > > > 
> > > > As far as I can see only the "mqd per userspace queue" might be a bit
> > > > questionable, everything else sounds reasonable.
> > > 
> > > Aside, i915 perspective again (i.e. how we solved this): When scheduling
> > > away from contexts we unpin them and put them into the lru. And in the
> > > shrinker we have a last-ditch callback to switch to a default context
> > > (since you can't ever have no context once you've started) which means we
> > > can evict any context object if it's getting in the way.
> > 
> > So Intel hardware report through some interrupt or some channel when it is
> > not using a context ? ie kernel side get notification when some user context
> > is done executing ?
> 
> Yes, as long as we do the scheduling with the cpu we get interrupts for
> context switches. The mechanic is already published in the execlist
> patches currently floating around. We get a special context switch
> interrupt.
> 
> But we have this unpin logic already on the current code where we switch
> contexts through in-line cs commands from the kernel. There we obviously
> use the normal batch completion events.

Yeah and we can continue that going forward.  And of course if your hw
can do page faulting, you don't need to pin the normal data buffers.

Usually there are some special buffers that need to be pinned for
longer periods though, anytime the context could be active.  Sounds
like in this case the userland queues, which makes some sense.  But
maybe for smaller systems the size limit could be clamped to something
smaller than 128M.  Or tie it into the rlimit somehow, just like we do
for mlock() stuff.
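
For reference, the mlock()-style accounting would look roughly like the
sketch below (illustrative only, not kfd code; kfd_account_pinned() is a
made-up helper and the exact locking would depend on the driver).  The nice
property is that an admin can already tune the budget per user via
RLIMIT_MEMLOCK / ulimit -l.

/*
 * Rough sketch: charge pinned queue memory against RLIMIT_MEMLOCK, the
 * same way mlock()/RDMA-style pinning is usually accounted.
 * (Needs linux/sched.h, linux/capability.h, linux/mm_types.h.)
 */
static int kfd_account_pinned(unsigned long nr_pages)
{
	unsigned long locked, limit;
	int ret = 0;

	limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	down_write(&current->mm->mmap_sem);
	locked = current->mm->pinned_vm + nr_pages;
	if (locked > limit && !capable(CAP_IPC_LOCK))
		ret = -ENOMEM;		/* over the user's mlock budget */
	else
		current->mm->pinned_vm = locked;
	up_write(&current->mm->mmap_sem);

	return ret;
}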

> > The issue with radeon hardware AFAICT is that the hardware do not report any
> > thing about the userspace context running ie you do not get notification when
> > a context is not use. Well AFAICT. Maybe hardware do provide that.
> 
> I'm not sure whether we can do the same trick with the hw scheduler. But
> then unpinning hw contexts will drain the pipeline anyway, so I guess we
> can just stop feeding the hw scheduler until it runs dry. And then unpin
> and evict.

Yeah we should have an idea which contexts have been fed to the
scheduler, at least with kernel based submission.  With userspace
submission we'll be in a tougher spot...  but as you say we can always
idle things and unpin everything under pressure.  That's a really big
hammer to apply though.

> > Like the VMID is a limited resources so you have to dynamicly bind them so
> > maybe we can only allocate pinned buffer for each VMID and then when binding
> > a PASID to a VMID it also copy back pinned buffer to pasid unpinned copy.
> 
> Yeah, pasid assignment will be fun. Not sure whether Jesse's patches will
> do this already. We _do_ already have fun with ctx id assigments though
> since we move them around (and the hw id is the ggtt address afaik). So we
> need to remap them already. Not sure on the details for pasid mapping,
> iirc it's a separate field somewhere in the context struct. Jesse knows
> the details.

The PASID space is a bit bigger, 20 bits iirc.  So we probably won't
run out quickly or often.  But when we do I thought we could apply the
same trick Linux uses for ASID management on SPARC and ia64 (iirc on
sparc anyway, maybe MIPS too): "allocate" a PASID everytime you need
one, but don't tie it to the process at all, just use it as a counter
that lets you know when you need to do a full TLB flush, then start the
allocation process over.  This lets you minimize TLB flushing and
gracefully handles oversubscription.

My current code doesn't bother though; context creation will fail if we
run out of PASIDs on a given device.
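
For illustration, the generation trick boils down to something like this
(toy sketch, not the actual i915 or kfd code; flush_all_gpu_tlbs() is a
made-up hook):

/*
 * Generation-based PASID allocation: hand out IDs from a counter and,
 * when the space is exhausted, bump the generation, do one full TLB
 * flush and start over.  Contexts remember the generation they were
 * allocated in and simply reallocate when it goes stale.
 */
#define PASID_BITS	20
#define PASID_LIMIT	(1u << PASID_BITS)

static DEFINE_SPINLOCK(pasid_lock);
static u32 pasid_generation = 1;
static u32 pasid_next = 1;		/* 0 is reserved/invalid */

static u32 pasid_alloc(u32 *generation)
{
	u32 pasid;

	spin_lock(&pasid_lock);
	if (pasid_next >= PASID_LIMIT) {
		pasid_generation++;
		pasid_next = 1;
		flush_all_gpu_tlbs();	/* made-up hook, for illustration */
	}
	pasid = pasid_next++;
	*generation = pasid_generation;
	spin_unlock(&pasid_lock);

	return pasid;
}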

-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 20:59             ` Jesse Barnes
@ 2014-07-23 21:46               ` Bridgman, John
  2014-07-23 22:01                 ` Oded Gabbay
  0 siblings, 1 reply; 148+ messages in thread
From: Bridgman, John @ 2014-07-23 21:46 UTC (permalink / raw)
  To: Jesse Barnes, dri-devel


>-----Original Message-----
>From: dri-devel [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf
>Of Jesse Barnes
>Sent: Wednesday, July 23, 2014 5:00 PM
>To: dri-devel@lists.freedesktop.org
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
>On Mon, 21 Jul 2014 19:05:46 +0200
>daniel at ffwll.ch (Daniel Vetter) wrote:
>
>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
>> > On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
>> > > On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König wrote:
>> > > > Am 21.07.2014 14:36, schrieb Oded Gabbay:
>> > > > >On 20/07/14 20:46, Jerome Glisse wrote:
>
>[snip!!]
My BlackBerry thumb thanks you ;)
>
>> > > >
>> > > > The main questions here are if it's avoid able to pin down the
>> > > > memory and if the memory is pinned down at driver load, by
>> > > > request from userspace or by anything else.
>> > > >
>> > > > As far as I can see only the "mqd per userspace queue" might be
>> > > > a bit questionable, everything else sounds reasonable.
>> > >
>> > > Aside, i915 perspective again (i.e. how we solved this): When
>> > > scheduling away from contexts we unpin them and put them into the
>> > > lru. And in the shrinker we have a last-ditch callback to switch
>> > > to a default context (since you can't ever have no context once
>> > > you've started) which means we can evict any context object if it's
>getting in the way.
>> >
>> > So Intel hardware report through some interrupt or some channel when
>> > it is not using a context ? ie kernel side get notification when
>> > some user context is done executing ?
>>
>> Yes, as long as we do the scheduling with the cpu we get interrupts
>> for context switches. The mechanic is already published in the
>> execlist patches currently floating around. We get a special context
>> switch interrupt.
>>
>> But we have this unpin logic already on the current code where we
>> switch contexts through in-line cs commands from the kernel. There we
>> obviously use the normal batch completion events.
>
>Yeah and we can continue that going forward.  And of course if your hw can
>do page faulting, you don't need to pin the normal data buffers.
>
>Usually there are some special buffers that need to be pinned for longer
>periods though, anytime the context could be active.  Sounds like in this case
>the userland queues, which makes some sense.  But maybe for smaller
>systems the size limit could be clamped to something smaller than 128M.  Or
>tie it into the rlimit somehow, just like we do for mlock() stuff.
>
Yeah, even the queues are in pageable memory; it's just a ~256 byte structure
per queue (the Memory Queue Descriptor) that describes the queue to hardware,
plus a couple of pages for each process using HSA to hold things like
doorbells. Current thinking is to limit the # of processes using HSA to ~256
and the # of queues per process to ~1024 by default in the initial code,
although my guess is that we could take the # of queues per process default
limit even lower.
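
Back-of-the-envelope, assuming ~256 bytes per MQD and two 4K doorbell pages
per process, those defaults would come to roughly:

  256 processes * 1024 queues * 256 bytes = 64 MB of MQDs
  256 processes *    2 pages  * 4 KB      =  2 MB of doorbell pages

so the per-process queue limit is what dominates if the MQDs have to stay
resident.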

>> > The issue with radeon hardware AFAICT is that the hardware do not
>> > report any thing about the userspace context running ie you do not
>> > get notification when a context is not use. Well AFAICT. Maybe hardware
>do provide that.
>>
>> I'm not sure whether we can do the same trick with the hw scheduler.
>> But then unpinning hw contexts will drain the pipeline anyway, so I
>> guess we can just stop feeding the hw scheduler until it runs dry. And
>> then unpin and evict.
>
>Yeah we should have an idea which contexts have been fed to the scheduler,
>at least with kernel based submission.  With userspace submission we'll be in a
>tougher spot...  but as you say we can always idle things and unpin everything
>under pressure.  That's a really big hammer to apply though.
>
>> > Like the VMID is a limited resources so you have to dynamicly bind
>> > them so maybe we can only allocate pinned buffer for each VMID and
>> > then when binding a PASID to a VMID it also copy back pinned buffer to
>pasid unpinned copy.
>>
>> Yeah, pasid assignment will be fun. Not sure whether Jesse's patches
>> will do this already. We _do_ already have fun with ctx id assigments
>> though since we move them around (and the hw id is the ggtt address
>> afaik). So we need to remap them already. Not sure on the details for
>> pasid mapping, iirc it's a separate field somewhere in the context
>> struct. Jesse knows the details.
>
>The PASID space is a bit bigger, 20 bits iirc.  So we probably won't run out
>quickly or often.  But when we do I thought we could apply the same trick
>Linux uses for ASID management on SPARC and ia64 (iirc on sparc anyway,
>maybe MIPS too): "allocate" a PASID everytime you need one, but don't tie it
>to the process at all, just use it as a counter that lets you know when you need
>to do a full TLB flush, then start the allocation process over.  This lets you
>minimize TLB flushing and gracefully handles oversubscription.

IIRC we have a 9-bit limit for PASID on current hardware, although that will go up in the future.
>
>My current code doesn't bother though; context creation will fail if we run out
>of PASIDs on a given device.
>
>--
>Jesse Barnes, Intel Open Source Technology Center
>_______________________________________________
>dri-devel mailing list
>dri-devel@lists.freedesktop.org
>http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 21:46               ` Bridgman, John
@ 2014-07-23 22:01                 ` Oded Gabbay
  2014-07-24 15:44                     ` Jerome Glisse
  0 siblings, 1 reply; 148+ messages in thread
From: Oded Gabbay @ 2014-07-23 22:01 UTC (permalink / raw)
  To: Bridgman, John, Jesse Barnes, dri-devel, Alex Deucher,
	Jerome Glisse, Christian König, Lewycky, Andrew,
	David Airlie, linux-kernel

On 24/07/14 00:46, Bridgman, John wrote:
> 
>> -----Original Message----- From: dri-devel
>> [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf Of Jesse
>> Barnes Sent: Wednesday, July 23, 2014 5:00 PM To:
>> dri-devel@lists.freedesktop.org Subject: Re: [PATCH v2 00/25]
>> AMDKFD kernel driver
>> 
>> On Mon, 21 Jul 2014 19:05:46 +0200 daniel at ffwll.ch (Daniel
>> Vetter) wrote:
>> 
>>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
>>>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
>>>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König
>>>>> wrote:
>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>> 
>> [snip!!]
> My BlackBerry thumb thanks you ;)
>> 
>>>>>> 
>>>>>> The main questions here are if it's avoid able to pin down
>>>>>> the memory and if the memory is pinned down at driver load,
>>>>>> by request from userspace or by anything else.
>>>>>> 
>>>>>> As far as I can see only the "mqd per userspace queue"
>>>>>> might be a bit questionable, everything else sounds
>>>>>> reasonable.
>>>>> 
>>>>> Aside, i915 perspective again (i.e. how we solved this):
>>>>> When scheduling away from contexts we unpin them and put them
>>>>> into the lru. And in the shrinker we have a last-ditch
>>>>> callback to switch to a default context (since you can't ever
>>>>> have no context once you've started) which means we can evict
>>>>> any context object if it's
>> getting in the way.
>>>> 
>>>> So Intel hardware report through some interrupt or some channel
>>>> when it is not using a context ? ie kernel side get
>>>> notification when some user context is done executing ?
>>> 
>>> Yes, as long as we do the scheduling with the cpu we get
>>> interrupts for context switches. The mechanic is already
>>> published in the execlist patches currently floating around. We
>>> get a special context switch interrupt.
>>> 
>>> But we have this unpin logic already on the current code where
>>> we switch contexts through in-line cs commands from the kernel.
>>> There we obviously use the normal batch completion events.
>> 
>> Yeah and we can continue that going forward.  And of course if your
>> hw can do page faulting, you don't need to pin the normal data
>> buffers.
>> 
>> Usually there are some special buffers that need to be pinned for
>> longer periods though, anytime the context could be active.  Sounds
>> like in this case the userland queues, which makes some sense.  But
>> maybe for smaller systems the size limit could be clamped to
>> something smaller than 128M.  Or tie it into the rlimit somehow,
>> just like we do for mlock() stuff.
>> 
> Yeah, even the queues are in pageable memory, it's just a ~256 byte
> structure per queue (the Memory Queue Descriptor) that describes the
> queue to hardware, plus a couple of pages for each process using HSA
> to hold things like doorbells. Current thinking is to limit #
> processes using HSA to ~256 and #queues per process to ~1024 by
> default in the initial code, although my guess is that we could take
> the #queues per process default limit even lower.
> 

So, my mistake: struct cik_mqd is actually 604 bytes, and it is allocated
on a 256-byte boundary.
I had in mind to reserve 64MB of gart by default, which translates to
512 queues per process, with 128 processes. Add 2 kernel module
parameters, # of max-queues-per-process and # of max-processes (defaults,
as I said, of 512 and 128), to give the system admin better control.
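
Concretely, that would just be the usual module_param boilerplate, something
like the following (parameter names and defaults are illustrative, not
final):

/* Illustrative only -- names and defaults are not the final interface. */
static int max_num_of_processes = 128;
module_param(max_num_of_processes, int, 0444);
MODULE_PARM_DESC(max_num_of_processes,
	"Maximum number of processes that may use HSA (default 128)");

static int max_num_of_queues_per_process = 512;
module_param(max_num_of_queues_per_process, int, 0444);
MODULE_PARM_DESC(max_num_of_queues_per_process,
	"Maximum number of queues a single process may create (default 512)");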

	Oded

>>>> The issue with radeon hardware AFAICT is that the hardware do
>>>> not report any thing about the userspace context running ie you
>>>> do not get notification when a context is not use. Well AFAICT.
>>>> Maybe hardware
>> do provide that.
>>> 
>>> I'm not sure whether we can do the same trick with the hw
>>> scheduler. But then unpinning hw contexts will drain the pipeline
>>> anyway, so I guess we can just stop feeding the hw scheduler
>>> until it runs dry. And then unpin and evict.
>> 
>> Yeah we should have an idea which contexts have been fed to the
>> scheduler, at least with kernel based submission.  With userspace
>> submission we'll be in a tougher spot...  but as you say we can
>> always idle things and unpin everything under pressure.  That's a
>> really big hammer to apply though.
>> 
>>>> Like the VMID is a limited resources so you have to dynamicly
>>>> bind them so maybe we can only allocate pinned buffer for each
>>>> VMID and then when binding a PASID to a VMID it also copy back
>>>> pinned buffer to
>> pasid unpinned copy.
>>> 
>>> Yeah, pasid assignment will be fun. Not sure whether Jesse's
>>> patches will do this already. We _do_ already have fun with ctx
>>> id assigments though since we move them around (and the hw id is
>>> the ggtt address afaik). So we need to remap them already. Not
>>> sure on the details for pasid mapping, iirc it's a separate field
>>> somewhere in the context struct. Jesse knows the details.
>> 
>> The PASID space is a bit bigger, 20 bits iirc.  So we probably
>> won't run out quickly or often.  But when we do I thought we could
>> apply the same trick Linux uses for ASID management on SPARC and
>> ia64 (iirc on sparc anyway, maybe MIPS too): "allocate" a PASID
>> everytime you need one, but don't tie it to the process at all,
>> just use it as a counter that lets you know when you need to do a
>> full TLB flush, then start the allocation process over.  This lets
>> you minimize TLB flushing and gracefully handles oversubscription.
> 
> IIRC we have a 9-bit limit for PASID on current hardware, although
> that will go up in future.
>> 
>> My current code doesn't bother though; context creation will fail
>> if we run out of PASIDs on a given device.
>> 
>> -- Jesse Barnes, Intel Open Source Technology Center 
>> _______________________________________________ dri-devel mailing
>> list dri-devel@lists.freedesktop.org 
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
> _______________________________________________ dri-devel mailing
> list dri-devel@lists.freedesktop.org 
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
> 


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-23 22:01                 ` Oded Gabbay
@ 2014-07-24 15:44                     ` Jerome Glisse
  0 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-24 15:44 UTC (permalink / raw)
  To: Oded Gabbay
  Cc: Bridgman, John, Jesse Barnes, dri-devel, Alex Deucher,
	Christian König, Lewycky, Andrew, David Airlie,
	linux-kernel

On Thu, Jul 24, 2014 at 01:01:41AM +0300, Oded Gabbay wrote:
> On 24/07/14 00:46, Bridgman, John wrote:
> > 
> >> -----Original Message----- From: dri-devel
> >> [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf Of Jesse
> >> Barnes Sent: Wednesday, July 23, 2014 5:00 PM To:
> >> dri-devel@lists.freedesktop.org Subject: Re: [PATCH v2 00/25]
> >> AMDKFD kernel driver
> >> 
> >> On Mon, 21 Jul 2014 19:05:46 +0200 daniel at ffwll.ch (Daniel
> >> Vetter) wrote:
> >> 
> >>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
> >>>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
> >>>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König
> >>>>> wrote:
> >>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >> 
> >> [snip!!]
> > My BlackBerry thumb thanks you ;)
> >> 
> >>>>>> 
> >>>>>> The main questions here are if it's avoid able to pin down
> >>>>>> the memory and if the memory is pinned down at driver load,
> >>>>>> by request from userspace or by anything else.
> >>>>>> 
> >>>>>> As far as I can see only the "mqd per userspace queue"
> >>>>>> might be a bit questionable, everything else sounds
> >>>>>> reasonable.
> >>>>> 
> >>>>> Aside, i915 perspective again (i.e. how we solved this):
> >>>>> When scheduling away from contexts we unpin them and put them
> >>>>> into the lru. And in the shrinker we have a last-ditch
> >>>>> callback to switch to a default context (since you can't ever
> >>>>> have no context once you've started) which means we can evict
> >>>>> any context object if it's
> >> getting in the way.
> >>>> 
> >>>> So Intel hardware report through some interrupt or some channel
> >>>> when it is not using a context ? ie kernel side get
> >>>> notification when some user context is done executing ?
> >>> 
> >>> Yes, as long as we do the scheduling with the cpu we get
> >>> interrupts for context switches. The mechanic is already
> >>> published in the execlist patches currently floating around. We
> >>> get a special context switch interrupt.
> >>> 
> >>> But we have this unpin logic already on the current code where
> >>> we switch contexts through in-line cs commands from the kernel.
> >>> There we obviously use the normal batch completion events.
> >> 
> >> Yeah and we can continue that going forward.  And of course if your
> >> hw can do page faulting, you don't need to pin the normal data
> >> buffers.
> >> 
> >> Usually there are some special buffers that need to be pinned for
> >> longer periods though, anytime the context could be active.  Sounds
> >> like in this case the userland queues, which makes some sense.  But
> >> maybe for smaller systems the size limit could be clamped to
> >> something smaller than 128M.  Or tie it into the rlimit somehow,
> >> just like we do for mlock() stuff.
> >> 
> > Yeah, even the queues are in pageable memory, it's just a ~256 byte
> > structure per queue (the Memory Queue Descriptor) that describes the
> > queue to hardware, plus a couple of pages for each process using HSA
> > to hold things like doorbells. Current thinking is to limit #
> > processes using HSA to ~256 and #queues per process to ~1024 by
> > default in the initial code, although my guess is that we could take
> > the #queues per process default limit even lower.
> > 
> 
> So my mistake. struct cik_mqd is actually 604 bytes, and it is allocated
> on 256 boundary.
> I had in mind to reserve 64MB of gart by default, which translates to
> 512 queues per process, with 128 processes. Add 2 kernel module
> parameters, # of max-queues-per-process and # of max-processes (default
> is, as I said, 512 and 128) for better control of system admin.
> 

So as I said somewhere else in this thread, this should not be reserved
but should use a special allocation. Any HSA GPU uses virtual address space
for userspace, so the only issue is the kernel-side GTT.

What I would like to see is the radeon pinned GTT allocations at the bottom
of GTT space (i.e. all ring buffers and the IB pool buffer), and then an
allocator that allocates new queues from the top of the GTT address space,
growing toward the bottom.

It should not statically reserve 64M or anything. When doing an allocation
it should move any ttm buffers that are in the region it wants to allocate
to a different location.

As this needs some work, I am not against reserving some small amount
(a couple of MB) as a first stage, but anything more would need a proper
solution like the one I just described.
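
To make that concrete, the allocator I have in mind behaves roughly like the
toy sketch below (not ttm/radeon code; moving buffers that already sit in the
target range, and freeing/compaction of the top region, are left out):

/*
 * Pinned kernel buffers (rings, IB pool) are carved from the bottom of
 * the GTT range, HSA queue objects from the top.  The top watermark only
 * grows as queues are created and shrinks as they are destroyed, instead
 * of a fixed 64M reservation.
 */
struct gtt_range {
	u64 start;		/* bottom of GTT */
	u64 end;		/* top of GTT (exclusive) */
	u64 bottom_used;	/* grows upward: rings, IB pool */
	u64 top_used;		/* grows downward: HSA queues, starts at end */
};

static int gtt_alloc_topdown(struct gtt_range *gtt, u64 size, u64 *offset)
{
	u64 avail = gtt->top_used - (gtt->start + gtt->bottom_used);

	if (size > avail)
		return -ENOSPC;	/* would collide with pinned kernel buffers */

	/* A real version would first evict any ttm buffer in the way. */
	gtt->top_used -= size;
	*offset = gtt->top_used;
	return 0;
}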

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-24 15:44                     ` Jerome Glisse
@ 2014-07-24 17:35                       ` Alex Deucher
  -1 siblings, 0 replies; 148+ messages in thread
From: Alex Deucher @ 2014-07-24 17:35 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Oded Gabbay, Bridgman, John, Jesse Barnes, dri-devel,
	Christian König, Lewycky, Andrew, David Airlie,
	linux-kernel

On Thu, Jul 24, 2014 at 11:44 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Thu, Jul 24, 2014 at 01:01:41AM +0300, Oded Gabbay wrote:
>> On 24/07/14 00:46, Bridgman, John wrote:
>> >
>> >> -----Original Message----- From: dri-devel
>> >> [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf Of Jesse
>> >> Barnes Sent: Wednesday, July 23, 2014 5:00 PM To:
>> >> dri-devel@lists.freedesktop.org Subject: Re: [PATCH v2 00/25]
>> >> AMDKFD kernel driver
>> >>
>> >> On Mon, 21 Jul 2014 19:05:46 +0200 daniel at ffwll.ch (Daniel
>> >> Vetter) wrote:
>> >>
>> >>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
>> >>>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
>> >>>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König
>> >>>>> wrote:
>> >>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>> >>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>> >>
>> >> [snip!!]
>> > My BlackBerry thumb thanks you ;)
>> >>
>> >>>>>>
>> >>>>>> The main questions here are if it's avoid able to pin down
>> >>>>>> the memory and if the memory is pinned down at driver load,
>> >>>>>> by request from userspace or by anything else.
>> >>>>>>
>> >>>>>> As far as I can see only the "mqd per userspace queue"
>> >>>>>> might be a bit questionable, everything else sounds
>> >>>>>> reasonable.
>> >>>>>
>> >>>>> Aside, i915 perspective again (i.e. how we solved this):
>> >>>>> When scheduling away from contexts we unpin them and put them
>> >>>>> into the lru. And in the shrinker we have a last-ditch
>> >>>>> callback to switch to a default context (since you can't ever
>> >>>>> have no context once you've started) which means we can evict
>> >>>>> any context object if it's
>> >> getting in the way.
>> >>>>
>> >>>> So Intel hardware report through some interrupt or some channel
>> >>>> when it is not using a context ? ie kernel side get
>> >>>> notification when some user context is done executing ?
>> >>>
>> >>> Yes, as long as we do the scheduling with the cpu we get
>> >>> interrupts for context switches. The mechanic is already
>> >>> published in the execlist patches currently floating around. We
>> >>> get a special context switch interrupt.
>> >>>
>> >>> But we have this unpin logic already on the current code where
>> >>> we switch contexts through in-line cs commands from the kernel.
>> >>> There we obviously use the normal batch completion events.
>> >>
>> >> Yeah and we can continue that going forward.  And of course if your
>> >> hw can do page faulting, you don't need to pin the normal data
>> >> buffers.
>> >>
>> >> Usually there are some special buffers that need to be pinned for
>> >> longer periods though, anytime the context could be active.  Sounds
>> >> like in this case the userland queues, which makes some sense.  But
>> >> maybe for smaller systems the size limit could be clamped to
>> >> something smaller than 128M.  Or tie it into the rlimit somehow,
>> >> just like we do for mlock() stuff.
>> >>
>> > Yeah, even the queues are in pageable memory, it's just a ~256 byte
>> > structure per queue (the Memory Queue Descriptor) that describes the
>> > queue to hardware, plus a couple of pages for each process using HSA
>> > to hold things like doorbells. Current thinking is to limit #
>> > processes using HSA to ~256 and #queues per process to ~1024 by
>> > default in the initial code, although my guess is that we could take
>> > the #queues per process default limit even lower.
>> >
>>
>> So my mistake. struct cik_mqd is actually 604 bytes, and it is allocated
>> on 256 boundary.
>> I had in mind to reserve 64MB of gart by default, which translates to
>> 512 queues per process, with 128 processes. Add 2 kernel module
>> parameters, # of max-queues-per-process and # of max-processes (default
>> is, as I said, 512 and 128) for better control of system admin.
>>
>
> So as i said somewhere else in this thread, this should not be reserved
> but use a special allocation. Any HSA GPU use virtual address space for
> userspace so only issue is for kernel side GTT.
>
> What i would like is seeing radeon pinned GTT allocation at bottom of
> GTT space (ie all ring buffer and the ib pool buffer). Then have an
> allocator that allocate new queue from top of GTT address space and
> grow to the bottom.
>
> It should not staticly reserved 64M or anything. When doing allocation
> it should move any ttm buffer that are in the region it want to allocate
> to a different location.
>
>
> As this needs some work, i am not against reserving some small amount
> (couple MB) as a first stage but anything more would need a proper solution
> like the one i just described.

It's still a trade-off.  Even if we reserve a couple of megs, it'll be
wasted if we are not running HSA apps. And even today, if we run a
compute job using the current interfaces, we could end up in the same
case.  So while I think it's definitely a good goal to come up with
some solution for fragmentation, I don't think it should be a
show-stopper right now.

A better solution to deal with fragmentation of GTT, and to provide a
better way to allocate larger buffers in vram, would be to break up
vram <-> system pool transfers into multiple transfers depending on
the available GTT size.  Or use GPUVM dynamically for vram <-> system
transfers.
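
Roughly, the split-transfer idea looks like this (sketch only;
radeon_gtt_window_size() and radeon_copy_chunk() are placeholders, not
existing radeon functions):

/*
 * Split a VRAM <-> system copy into GTT-sized pieces so a single
 * transfer never needs a contiguous GTT window as large as the whole
 * buffer.  Both helpers are placeholders for whatever staging mechanism
 * ends up being used.
 */
static int copy_vram_to_system_chunked(void *dst, u64 vram_offset, u64 size)
{
	while (size) {
		u64 chunk = min(size, radeon_gtt_window_size());

		if (!chunk)
			return -ENOSPC;	/* no GTT window available at all */

		/* Stage the chunk through GTT and DMA it to system memory. */
		radeon_copy_chunk(dst, vram_offset, chunk);

		dst += chunk;
		vram_offset += chunk;
		size -= chunk;
	}

	return 0;
}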

Alex

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-24 17:35                       ` Alex Deucher
@ 2014-07-24 18:47                         ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-24 18:47 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Oded Gabbay, Bridgman, John, Jesse Barnes, dri-devel,
	Christian König, Lewycky, Andrew, David Airlie,
	linux-kernel

On Thu, Jul 24, 2014 at 01:35:53PM -0400, Alex Deucher wrote:
> On Thu, Jul 24, 2014 at 11:44 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Thu, Jul 24, 2014 at 01:01:41AM +0300, Oded Gabbay wrote:
> >> On 24/07/14 00:46, Bridgman, John wrote:
> >> >
> >> >> -----Original Message----- From: dri-devel
> >> >> [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf Of Jesse
> >> >> Barnes Sent: Wednesday, July 23, 2014 5:00 PM To:
> >> >> dri-devel@lists.freedesktop.org Subject: Re: [PATCH v2 00/25]
> >> >> AMDKFD kernel driver
> >> >>
> >> >> On Mon, 21 Jul 2014 19:05:46 +0200 daniel at ffwll.ch (Daniel
> >> >> Vetter) wrote:
> >> >>
> >> >>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
> >> >>>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
> >> >>>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König
> >> >>>>> wrote:
> >> >>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >> >>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >> >>
> >> >> [snip!!]
> >> > My BlackBerry thumb thanks you ;)
> >> >>
> >> >>>>>>
> >> >>>>>> The main questions here are if it's avoid able to pin down
> >> >>>>>> the memory and if the memory is pinned down at driver load,
> >> >>>>>> by request from userspace or by anything else.
> >> >>>>>>
> >> >>>>>> As far as I can see only the "mqd per userspace queue"
> >> >>>>>> might be a bit questionable, everything else sounds
> >> >>>>>> reasonable.
> >> >>>>>
> >> >>>>> Aside, i915 perspective again (i.e. how we solved this):
> >> >>>>> When scheduling away from contexts we unpin them and put them
> >> >>>>> into the lru. And in the shrinker we have a last-ditch
> >> >>>>> callback to switch to a default context (since you can't ever
> >> >>>>> have no context once you've started) which means we can evict
> >> >>>>> any context object if it's
> >> >> getting in the way.
> >> >>>>
> >> >>>> So Intel hardware report through some interrupt or some channel
> >> >>>> when it is not using a context ? ie kernel side get
> >> >>>> notification when some user context is done executing ?
> >> >>>
> >> >>> Yes, as long as we do the scheduling with the cpu we get
> >> >>> interrupts for context switches. The mechanic is already
> >> >>> published in the execlist patches currently floating around. We
> >> >>> get a special context switch interrupt.
> >> >>>
> >> >>> But we have this unpin logic already on the current code where
> >> >>> we switch contexts through in-line cs commands from the kernel.
> >> >>> There we obviously use the normal batch completion events.
> >> >>
> >> >> Yeah and we can continue that going forward.  And of course if your
> >> >> hw can do page faulting, you don't need to pin the normal data
> >> >> buffers.
> >> >>
> >> >> Usually there are some special buffers that need to be pinned for
> >> >> longer periods though, anytime the context could be active.  Sounds
> >> >> like in this case the userland queues, which makes some sense.  But
> >> >> maybe for smaller systems the size limit could be clamped to
> >> >> something smaller than 128M.  Or tie it into the rlimit somehow,
> >> >> just like we do for mlock() stuff.
> >> >>
> >> > Yeah, even the queues are in pageable memory, it's just a ~256 byte
> >> > structure per queue (the Memory Queue Descriptor) that describes the
> >> > queue to hardware, plus a couple of pages for each process using HSA
> >> > to hold things like doorbells. Current thinking is to limit #
> >> > processes using HSA to ~256 and #queues per process to ~1024 by
> >> > default in the initial code, although my guess is that we could take
> >> > the #queues per process default limit even lower.
> >> >
> >>
> >> So my mistake. struct cik_mqd is actually 604 bytes, and it is allocated
> >> on 256 boundary.
> >> I had in mind to reserve 64MB of gart by default, which translates to
> >> 512 queues per process, with 128 processes. Add 2 kernel module
> >> parameters, # of max-queues-per-process and # of max-processes (default
> >> is, as I said, 512 and 128) for better control of system admin.
> >>
> >
> > So as i said somewhere else in this thread, this should not be reserved
> > but use a special allocation. Any HSA GPU use virtual address space for
> > userspace so only issue is for kernel side GTT.
> >
> > What i would like is seeing radeon pinned GTT allocation at bottom of
> > GTT space (ie all ring buffer and the ib pool buffer). Then have an
> > allocator that allocate new queue from top of GTT address space and
> > grow to the bottom.
> >
> > It should not staticly reserved 64M or anything. When doing allocation
> > it should move any ttm buffer that are in the region it want to allocate
> > to a different location.
> >
> >
> > As this needs some work, i am not against reserving some small amount
> > (couple MB) as a first stage but anything more would need a proper solution
> > like the one i just described.
> 
> It's still a trade off.  Even if we reserve a couple of megs it'll be
> wasted if we are not running HSA apps. And even today if we run a
> compute job using the current interfaces we could end up in the same
> case.  So while I think it's definitely a good goal to come up with
> some solution for fragmentation, I don't think it should be a
> show-stopper right now.
> 

It seems I am having a hard time expressing myself. I am not saying it is a
showstopper; I am saying that until a proper solution is implemented, KFD
should limit its number of queues to consume at most a couple of MB, i.e.
not 64MB or more, but 2MB or 4MB, something in that ballpark.

> A better solution to deal with fragmentation of GTT and provide a
> better way to allocate larger buffers in vram would be to break up
> vram <-> system pool transfers into multiple transfers depending on
> the available GTT size.  Or use GPUVM dynamically for  vram <-> system
> transfers.

Isn't the UVD engine still using the main GTT? I have not looked much at
UVD in a while.

Yes, there are ways to fix buffer migration, but I would also like to keep
address space fragmentation to a minimum, which is the main reason I
utterly hate any design that forbids the kernel from taking over and doing
its thing.

Buffer pinning should really be only for the front buffer and things like
rings, i.e. buffers that have a lifetime bound to the driver lifetime.

Cheers,
Jérôme

> 
> Alex

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-24 18:47                         ` Jerome Glisse
@ 2014-07-24 18:57                           ` Oded Gabbay
  -1 siblings, 0 replies; 148+ messages in thread
From: Oded Gabbay @ 2014-07-24 18:57 UTC (permalink / raw)
  To: Jerome Glisse, Alex Deucher
  Cc: Bridgman, John, Jesse Barnes, dri-devel, Christian König,
	Lewycky, Andrew, David Airlie, linux-kernel

On 24/07/14 21:47, Jerome Glisse wrote:
> On Thu, Jul 24, 2014 at 01:35:53PM -0400, Alex Deucher wrote:
>> On Thu, Jul 24, 2014 at 11:44 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>> On Thu, Jul 24, 2014 at 01:01:41AM +0300, Oded Gabbay wrote:
>>>> On 24/07/14 00:46, Bridgman, John wrote:
>>>>>
>>>>>> -----Original Message----- From: dri-devel
>>>>>> [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf Of Jesse
>>>>>> Barnes Sent: Wednesday, July 23, 2014 5:00 PM To:
>>>>>> dri-devel@lists.freedesktop.org Subject: Re: [PATCH v2 00/25]
>>>>>> AMDKFD kernel driver
>>>>>>
>>>>>> On Mon, 21 Jul 2014 19:05:46 +0200 daniel at ffwll.ch (Daniel
>>>>>> Vetter) wrote:
>>>>>>
>>>>>>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
>>>>>>>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
>>>>>>>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König
>>>>>>>>> wrote:
>>>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
>>>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
>>>>>>
>>>>>> [snip!!]
>>>>> My BlackBerry thumb thanks you ;)
>>>>>>
>>>>>>>>>>
>>>>>>>>>> The main questions here are if it's avoid able to pin down
>>>>>>>>>> the memory and if the memory is pinned down at driver load,
>>>>>>>>>> by request from userspace or by anything else.
>>>>>>>>>>
>>>>>>>>>> As far as I can see only the "mqd per userspace queue"
>>>>>>>>>> might be a bit questionable, everything else sounds
>>>>>>>>>> reasonable.
>>>>>>>>>
>>>>>>>>> Aside, i915 perspective again (i.e. how we solved this):
>>>>>>>>> When scheduling away from contexts we unpin them and put them
>>>>>>>>> into the lru. And in the shrinker we have a last-ditch
>>>>>>>>> callback to switch to a default context (since you can't ever
>>>>>>>>> have no context once you've started) which means we can evict
>>>>>>>>> any context object if it's
>>>>>> getting in the way.
>>>>>>>>
>>>>>>>> So Intel hardware report through some interrupt or some channel
>>>>>>>> when it is not using a context ? ie kernel side get
>>>>>>>> notification when some user context is done executing ?
>>>>>>>
>>>>>>> Yes, as long as we do the scheduling with the cpu we get
>>>>>>> interrupts for context switches. The mechanic is already
>>>>>>> published in the execlist patches currently floating around. We
>>>>>>> get a special context switch interrupt.
>>>>>>>
>>>>>>> But we have this unpin logic already on the current code where
>>>>>>> we switch contexts through in-line cs commands from the kernel.
>>>>>>> There we obviously use the normal batch completion events.
>>>>>>
>>>>>> Yeah and we can continue that going forward.  And of course if your
>>>>>> hw can do page faulting, you don't need to pin the normal data
>>>>>> buffers.
>>>>>>
>>>>>> Usually there are some special buffers that need to be pinned for
>>>>>> longer periods though, anytime the context could be active.  Sounds
>>>>>> like in this case the userland queues, which makes some sense.  But
>>>>>> maybe for smaller systems the size limit could be clamped to
>>>>>> something smaller than 128M.  Or tie it into the rlimit somehow,
>>>>>> just like we do for mlock() stuff.
>>>>>>
>>>>> Yeah, even the queues are in pageable memory, it's just a ~256 byte
>>>>> structure per queue (the Memory Queue Descriptor) that describes the
>>>>> queue to hardware, plus a couple of pages for each process using HSA
>>>>> to hold things like doorbells. Current thinking is to limit #
>>>>> processes using HSA to ~256 and #queues per process to ~1024 by
>>>>> default in the initial code, although my guess is that we could take
>>>>> the #queues per process default limit even lower.
>>>>>
>>>>
>>>> So my mistake. struct cik_mqd is actually 604 bytes, and it is allocated
>>>> on 256 boundary.
>>>> I had in mind to reserve 64MB of gart by default, which translates to
>>>> 512 queues per process, with 128 processes. Add 2 kernel module
>>>> parameters, # of max-queues-per-process and # of max-processes (default
>>>> is, as I said, 512 and 128) for better control of system admin.
>>>>
>>>
>>> So as i said somewhere else in this thread, this should not be reserved
>>> but use a special allocation. Any HSA GPU use virtual address space for
>>> userspace so only issue is for kernel side GTT.
>>>
>>> What i would like is seeing radeon pinned GTT allocation at bottom of
>>> GTT space (ie all ring buffer and the ib pool buffer). Then have an
>>> allocator that allocate new queue from top of GTT address space and
>>> grow to the bottom.
>>>
>>> It should not staticly reserved 64M or anything. When doing allocation
>>> it should move any ttm buffer that are in the region it want to allocate
>>> to a different location.
>>>
>>>
>>> As this needs some work, i am not against reserving some small amount
>>> (couple MB) as a first stage but anything more would need a proper solution
>>> like the one i just described.
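
As a rough illustration of the scheme described just above -- driver-lifetime
pinned buffers packed from the bottom of the GTT range, user queue objects
carved from the top growing down, with no fixed reservation in between -- a
toy allocator might look like the sketch below. All names are hypothetical;
this is not radeon/TTM code, and it ignores freeing, alignment and buffer moves.

/*
 * Toy model: pinned driver-lifetime allocations grow up from the
 * bottom of the GTT range, user queue allocations grow down from
 * the top, so neither side needs a static reservation.
 */
#include <stdint.h>
#include <stdio.h>

struct gtt_range {
	uint64_t bottom;   /* next free byte from the bottom (grows up) */
	uint64_t top;      /* one past the last free byte (grows down)  */
};

/* Pinned, driver-lifetime allocations: rings, IB pool, ... */
static int gtt_alloc_bottom(struct gtt_range *g, uint64_t size, uint64_t *addr)
{
	if (g->top - g->bottom < size)
		return -1;               /* would collide with the top side */
	*addr = g->bottom;
	g->bottom += size;
	return 0;
}

/* User queue allocations (MQDs etc.), taken from the top going down. */
static int gtt_alloc_top(struct gtt_range *g, uint64_t size, uint64_t *addr)
{
	if (g->top - g->bottom < size)
		return -1;
	g->top -= size;
	*addr = g->top;
	return 0;
}

int main(void)
{
	struct gtt_range gtt = { .bottom = 0, .top = 256ULL << 20 }; /* 256MB */
	uint64_t ring, mqd;

	gtt_alloc_bottom(&gtt, 64 << 10, &ring);   /* 64KB ring, pinned */
	gtt_alloc_top(&gtt, 768, &mqd);            /* one 768-byte MQD  */
	printf("ring at %#llx, mqd at %#llx\n",
	       (unsigned long long)ring, (unsigned long long)mqd);
	return 0;
}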
>>
>> It's still a trade off.  Even if we reserve a couple of megs it'll be
>> wasted if we are not running HSA apps. And even today if we run a
>> compute job using the current interfaces we could end up in the same
>> case.  So while I think it's definitely a good goal to come up with
>> some solution for fragmentation, I don't think it should be a
>> show-stopper right now.
>>
> 
> Seems i am having a hard time to express myself. I am not saying it is a
> showstopper i am saying until proper solution is implemented KFD should
> limit its number of queue to consume at most couple MB ie not 64MB or more
> but 2MB, 4MB something in that water.
So we thought internally about limiting ourselves through two kernel
module parameters: # of queues per process and # of processes. The default
values would be 128 queues per process and 32 processes. An mqd takes at
most 768 bytes, so that gives us a maximum of 3MB.

For the absolute maximum, I think we would use the H/W limits, which are
1024 queues per process and 512 processes. That gives us 384MB.

Would that be acceptable?
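
A minimal sketch of how two such limits and the resulting worst-case MQD
budget might be expressed; the parameter names are assumptions for
illustration, not necessarily the final amdkfd interface.

/*
 * Sketch of the proposed limits as module parameters; names and
 * values are illustrative, not the final amdkfd interface.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>

#define KFD_MQD_SIZE_BYTES	768	/* worst-case MQD size quoted above */

static int max_num_of_processes = 32;
module_param(max_num_of_processes, int, 0444);
MODULE_PARM_DESC(max_num_of_processes,
	"Maximum number of HSA processes (default 32, HW limit 512)");

static int max_num_of_queues_per_process = 128;
module_param(max_num_of_queues_per_process, int, 0444);
MODULE_PARM_DESC(max_num_of_queues_per_process,
	"Maximum number of queues per HSA process (default 128, HW limit 1024)");

/*
 * Worst-case pinned GTT needed for MQDs under the defaults:
 * 32 * 128 * 768 bytes = 3 MB; at the HW limits (512 * 1024)
 * it would be 384 MB, hence the much smaller defaults.
 */
static size_t kfd_mqd_gtt_budget(void)
{
	return (size_t)max_num_of_processes *
	       max_num_of_queues_per_process * KFD_MQD_SIZE_BYTES;
}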
> 
>> A better solution to deal with fragmentation of GTT and provide a
>> better way to allocate larger buffers in vram would be to break up
>> vram <-> system pool transfers into multiple transfers depending on
>> the available GTT size.  Or use GPUVM dynamically for  vram <-> system
>> transfers.
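
Purely as an illustration of the first option above -- splitting a large
vram <-> system copy into pieces bounded by the currently free GTT window --
a hypothetical helper might look like this (copy_one is an assumed callback,
not an existing radeon function):

/*
 * Hypothetical sketch: perform a large vram <-> system copy as a
 * series of smaller DMA copies, each no bigger than the GTT window
 * currently available, instead of requiring one huge GTT mapping.
 */
#include <stddef.h>
#include <stdint.h>

/* Assumed callback that copies one piece; not a real radeon API. */
typedef int (*dma_copy_fn)(uint64_t src, uint64_t dst, size_t len);

static int copy_in_gtt_chunks(uint64_t src, uint64_t dst, size_t size,
			      size_t gtt_window, dma_copy_fn copy_one)
{
	size_t done = 0;

	if (gtt_window == 0)
		return -1;

	while (done < size) {
		size_t n = size - done;
		int r;

		if (n > gtt_window)
			n = gtt_window;	/* clamp to what GTT can map now */

		r = copy_one(src + done, dst + done, n);
		if (r)
			return r;
		done += n;
	}
	return 0;
}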
> 
> Isn't the UVD engine still using the main GTT ? I have not look much at
> UVD in a while.
> 
> Yes there is way to fix buffer migration but i would also like to see
> address space fragmentation to a minimum which is the main reason i
> uterly hate any design that forbid kernel to take over and do its thing.
> 
> Buffer pining should really be only for front buffer and thing like ring
> ie buffer that have a lifetime bound to the driver lifetime.
> 
> Cheers,
> Jérôme
> 
>>
>> Alex


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v2 00/25] AMDKFD kernel driver
  2014-07-24 18:57                           ` Oded Gabbay
@ 2014-07-24 20:26                             ` Jerome Glisse
  -1 siblings, 0 replies; 148+ messages in thread
From: Jerome Glisse @ 2014-07-24 20:26 UTC (permalink / raw)
  To: Oded Gabbay; +Cc: Alex Deucher, Lewycky, Andrew, linux-kernel, dri-devel

On Thu, Jul 24, 2014 at 09:57:16PM +0300, Oded Gabbay wrote:
> On 24/07/14 21:47, Jerome Glisse wrote:
> > On Thu, Jul 24, 2014 at 01:35:53PM -0400, Alex Deucher wrote:
> >> On Thu, Jul 24, 2014 at 11:44 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >>> On Thu, Jul 24, 2014 at 01:01:41AM +0300, Oded Gabbay wrote:
> >>>> On 24/07/14 00:46, Bridgman, John wrote:
> >>>>>
> >>>>>> -----Original Message----- From: dri-devel
> >>>>>> [mailto:dri-devel-bounces@lists.freedesktop.org] On Behalf Of Jesse
> >>>>>> Barnes Sent: Wednesday, July 23, 2014 5:00 PM To:
> >>>>>> dri-devel@lists.freedesktop.org Subject: Re: [PATCH v2 00/25]
> >>>>>> AMDKFD kernel driver
> >>>>>>
> >>>>>> On Mon, 21 Jul 2014 19:05:46 +0200 daniel@ffwll.ch (Daniel
> >>>>>> Vetter) wrote:
> >>>>>>
> >>>>>>> On Mon, Jul 21, 2014 at 11:58:52AM -0400, Jerome Glisse wrote:
> >>>>>>>> On Mon, Jul 21, 2014 at 05:25:11PM +0200, Daniel Vetter wrote:
> >>>>>>>>> On Mon, Jul 21, 2014 at 03:39:09PM +0200, Christian König
> >>>>>>>>> wrote:
> >>>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay:
> >>>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote:
> >>>>>>
> >>>>>> [snip!!]
> >>>>> My BlackBerry thumb thanks you ;)
> >>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The main questions here are if it's avoid able to pin down
> >>>>>>>>>> the memory and if the memory is pinned down at driver load,
> >>>>>>>>>> by request from userspace or by anything else.
> >>>>>>>>>>
> >>>>>>>>>> As far as I can see only the "mqd per userspace queue"
> >>>>>>>>>> might be a bit questionable, everything else sounds
> >>>>>>>>>> reasonable.
> >>>>>>>>>
> >>>>>>>>> Aside, i915 perspective again (i.e. how we solved this):
> >>>>>>>>> When scheduling away from contexts we unpin them and put them
> >>>>>>>>> into the lru. And in the shrinker we have a last-ditch
> >>>>>>>>> callback to switch to a default context (since you can't ever
> >>>>>>>>> have no context once you've started) which means we can evict
> >>>>>>>>> any context object if it's
> >>>>>> getting in the way.
> >>>>>>>>
> >>>>>>>> So Intel hardware report through some interrupt or some channel
> >>>>>>>> when it is not using a context ? ie kernel side get
> >>>>>>>> notification when some user context is done executing ?
> >>>>>>>
> >>>>>>> Yes, as long as we do the scheduling with the cpu we get
> >>>>>>> interrupts for context switches. The mechanic is already
> >>>>>>> published in the execlist patches currently floating around. We
> >>>>>>> get a special context switch interrupt.
> >>>>>>>
> >>>>>>> But we have this unpin logic already on the current code where
> >>>>>>> we switch contexts through in-line cs commands from the kernel.
> >>>>>>> There we obviously use the normal batch completion events.
> >>>>>>
> >>>>>> Yeah and we can continue that going forward.  And of course if your
> >>>>>> hw can do page faulting, you don't need to pin the normal data
> >>>>>> buffers.
> >>>>>>
> >>>>>> Usually there are some special buffers that need to be pinned for
> >>>>>> longer periods though, anytime the context could be active.  Sounds
> >>>>>> like in this case the userland queues, which makes some sense.  But
> >>>>>> maybe for smaller systems the size limit could be clamped to
> >>>>>> something smaller than 128M.  Or tie it into the rlimit somehow,
> >>>>>> just like we do for mlock() stuff.
> >>>>>>
> >>>>> Yeah, even the queues are in pageable memory, it's just a ~256 byte
> >>>>> structure per queue (the Memory Queue Descriptor) that describes the
> >>>>> queue to hardware, plus a couple of pages for each process using HSA
> >>>>> to hold things like doorbells. Current thinking is to limit #
> >>>>> processes using HSA to ~256 and #queues per process to ~1024 by
> >>>>> default in the initial code, although my guess is that we could take
> >>>>> the #queues per process default limit even lower.
> >>>>>
> >>>>
> >>>> So my mistake. struct cik_mqd is actually 604 bytes, and it is allocated
> >>>> on 256 boundary.
> >>>> I had in mind to reserve 64MB of gart by default, which translates to
> >>>> 512 queues per process, with 128 processes. Add 2 kernel module
> >>>> parameters, # of max-queues-per-process and # of max-processes (default
> >>>> is, as I said, 512 and 128) for better control of system admin.
> >>>>
> >>>
> >>> So as i said somewhere else in this thread, this should not be reserved
> >>> but use a special allocation. Any HSA GPU use virtual address space for
> >>> userspace so only issue is for kernel side GTT.
> >>>
> >>> What i would like is seeing radeon pinned GTT allocation at bottom of
> >>> GTT space (ie all ring buffer and the ib pool buffer). Then have an
> >>> allocator that allocate new queue from top of GTT address space and
> >>> grow to the bottom.
> >>>
> >>> It should not staticly reserved 64M or anything. When doing allocation
> >>> it should move any ttm buffer that are in the region it want to allocate
> >>> to a different location.
> >>>
> >>>
> >>> As this needs some work, i am not against reserving some small amount
> >>> (couple MB) as a first stage but anything more would need a proper solution
> >>> like the one i just described.
> >>
> >> It's still a trade off.  Even if we reserve a couple of megs it'll be
> >> wasted if we are not running HSA apps. And even today if we run a
> >> compute job using the current interfaces we could end up in the same
> >> case.  So while I think it's definitely a good goal to come up with
> >> some solution for fragmentation, I don't think it should be a
> >> show-stopper right now.
> >>
> > 
> > Seems i am having a hard time to express myself. I am not saying it is a
> > showstopper i am saying until proper solution is implemented KFD should
> > limit its number of queue to consume at most couple MB ie not 64MB or more
> > but 2MB, 4MB something in that water.
> So we thought internally about limiting ourselves through two kernel
> module parameters, # of queues per process and # of processes. Default
> values will be 128 queues per process and 32 processes. mqd takes 768
> bytes at most, so that gives us a maximum of 3MB.
> 
> For absolute maximum, I think using H/W limits which are 1024 queues per
> process and 512 processes. That gives us 384MB.
> 
> Would that be acceptable ?

Yes and no: yes as a _temporary_ solution, i.e. a proper solution must
still be implemented.

Moreover, I sincerely hope that CZ will allow an easy way to unpin any
buffer and move it.

Cheers,
Jérôme

> > 
> >> A better solution to deal with fragmentation of GTT and provide a
> >> better way to allocate larger buffers in vram would be to break up
> >> vram <-> system pool transfers into multiple transfers depending on
> >> the available GTT size.  Or use GPUVM dynamically for  vram <-> system
> >> transfers.
> > 
> > Isn't the UVD engine still using the main GTT ? I have not look much at
> > UVD in a while.
> > 
> > Yes there is way to fix buffer migration but i would also like to see
> > address space fragmentation to a minimum which is the main reason i
> > uterly hate any design that forbid kernel to take over and do its thing.
> > 
> > Buffer pining should really be only for front buffer and thing like ring
> > ie buffer that have a lifetime bound to the driver lifetime.
> > 
> > Cheers,
> > Jérôme
> > 
> >>
> >> Alex
> 

^ permalink raw reply	[flat|nested] 148+ messages in thread

end of thread, other threads:[~2014-07-24 20:27 UTC | newest]

Thread overview: 148+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-17 13:57 [PATCH v2 00/25] AMDKFD kernel driver Oded Gabbay
2014-07-17 13:57 ` Oded Gabbay
2014-07-17 13:57 ` Oded Gabbay
2014-07-20 17:46 ` Jerome Glisse
2014-07-20 17:46   ` Jerome Glisse
2014-07-20 17:46   ` Jerome Glisse
2014-07-21  3:03   ` Jerome Glisse
2014-07-21  3:03     ` Jerome Glisse
2014-07-21  3:03     ` Jerome Glisse
2014-07-21  7:01   ` Daniel Vetter
2014-07-21  7:01     ` Daniel Vetter
2014-07-21  9:34     ` Christian König
2014-07-21  9:34       ` Christian König
2014-07-21 12:36   ` Oded Gabbay
2014-07-21 12:36     ` Oded Gabbay
2014-07-21 12:36     ` Oded Gabbay
2014-07-21 13:39     ` Christian König
2014-07-21 13:39       ` Christian König
2014-07-21 13:39       ` Christian König
2014-07-21 14:12       ` Oded Gabbay
2014-07-21 14:12         ` Oded Gabbay
2014-07-21 14:12         ` Oded Gabbay
2014-07-21 15:54         ` Jerome Glisse
2014-07-21 15:54           ` Jerome Glisse
2014-07-21 15:54           ` Jerome Glisse
2014-07-21 17:42           ` Oded Gabbay
2014-07-21 17:42             ` Oded Gabbay
2014-07-21 17:42             ` Oded Gabbay
2014-07-21 18:14             ` Jerome Glisse
2014-07-21 18:14               ` Jerome Glisse
2014-07-21 18:14               ` Jerome Glisse
2014-07-21 18:36               ` Oded Gabbay
2014-07-21 18:36                 ` Oded Gabbay
2014-07-21 18:36                 ` Oded Gabbay
2014-07-21 18:59                 ` Jerome Glisse
2014-07-21 18:59                   ` Jerome Glisse
2014-07-21 18:59                   ` Jerome Glisse
2014-07-21 19:23                   ` Oded Gabbay
2014-07-21 19:23                     ` Oded Gabbay
2014-07-21 19:23                     ` Oded Gabbay
2014-07-21 19:28                     ` Jerome Glisse
2014-07-21 19:28                       ` Jerome Glisse
2014-07-21 19:28                       ` Jerome Glisse
2014-07-21 21:56                       ` Oded Gabbay
2014-07-21 21:56                         ` Oded Gabbay
2014-07-21 21:56                         ` Oded Gabbay
2014-07-21 23:05                         ` Jerome Glisse
2014-07-21 23:05                           ` Jerome Glisse
2014-07-21 23:05                           ` Jerome Glisse
2014-07-21 23:29                           ` Bridgman, John
2014-07-21 23:29                             ` Bridgman, John
2014-07-21 23:36                             ` Jerome Glisse
2014-07-21 23:36                               ` Jerome Glisse
2014-07-21 23:36                               ` Jerome Glisse
2014-07-22  8:05                           ` Oded Gabbay
2014-07-22  8:05                             ` Oded Gabbay
2014-07-22  8:05                             ` Oded Gabbay
2014-07-22  7:23                     ` Daniel Vetter
2014-07-22  7:23                       ` Daniel Vetter
2014-07-22  7:23                       ` Daniel Vetter
2014-07-22  8:10                       ` Oded Gabbay
2014-07-22  8:10                         ` Oded Gabbay
2014-07-21 15:25       ` Daniel Vetter
2014-07-21 15:25         ` Daniel Vetter
2014-07-21 15:25         ` Daniel Vetter
2014-07-21 15:58         ` Jerome Glisse
2014-07-21 15:58           ` Jerome Glisse
2014-07-21 15:58           ` Jerome Glisse
2014-07-21 17:05           ` Daniel Vetter
2014-07-21 17:05             ` Daniel Vetter
2014-07-21 17:05             ` Daniel Vetter
2014-07-21 17:28             ` Oded Gabbay
2014-07-21 17:28               ` Oded Gabbay
2014-07-21 17:28               ` Oded Gabbay
2014-07-21 18:22               ` Daniel Vetter
2014-07-21 18:22                 ` Daniel Vetter
2014-07-21 18:22                 ` Daniel Vetter
2014-07-21 18:41                 ` Oded Gabbay
2014-07-21 18:41                   ` Oded Gabbay
2014-07-21 18:41                   ` Oded Gabbay
2014-07-21 19:03                   ` Jerome Glisse
2014-07-21 19:03                     ` Jerome Glisse
2014-07-21 19:03                     ` Jerome Glisse
2014-07-22  7:28                     ` Daniel Vetter
2014-07-22  7:28                       ` Daniel Vetter
2014-07-22  7:28                       ` Daniel Vetter
2014-07-22  7:40                       ` Daniel Vetter
2014-07-22  7:40                         ` Daniel Vetter
2014-07-22  8:21                         ` Oded Gabbay
2014-07-22  8:21                           ` Oded Gabbay
2014-07-22  8:19                       ` Oded Gabbay
2014-07-22  8:19                         ` Oded Gabbay
2014-07-22  9:21                         ` Daniel Vetter
2014-07-22  9:21                           ` Daniel Vetter
2014-07-22  9:21                           ` Daniel Vetter
2014-07-22  9:24                           ` Daniel Vetter
2014-07-22  9:24                             ` Daniel Vetter
2014-07-22  9:24                             ` Daniel Vetter
2014-07-22  9:52                           ` Oded Gabbay
2014-07-22  9:52                             ` Oded Gabbay
2014-07-22  9:52                             ` Oded Gabbay
2014-07-22 11:15                             ` Daniel Vetter
2014-07-22 11:15                               ` Daniel Vetter
2014-07-23  6:50                               ` Oded Gabbay
2014-07-23  6:50                                 ` Oded Gabbay
2014-07-23  7:04                                 ` Christian König
2014-07-23  7:04                                   ` Christian König
2014-07-23 13:39                                   ` Bridgman, John
2014-07-23 13:39                                     ` Bridgman, John
2014-07-23 14:56                                   ` Jerome Glisse
2014-07-23 14:56                                     ` Jerome Glisse
2014-07-23 14:56                                     ` Jerome Glisse
2014-07-23 19:49                                     ` Alex Deucher
2014-07-23 19:49                                       ` Alex Deucher
2014-07-23 20:25                                       ` Jerome Glisse
2014-07-23 20:25                                         ` Jerome Glisse
2014-07-23 20:25                                         ` Jerome Glisse
2014-07-23  7:05                                 ` Daniel Vetter
2014-07-23  7:05                                   ` Daniel Vetter
2014-07-23  7:05                                   ` Daniel Vetter
2014-07-23  8:35                                   ` Oded Gabbay
2014-07-23  8:35                                     ` Oded Gabbay
2014-07-23  8:35                                     ` Oded Gabbay
2014-07-23 13:33                                   ` Bridgman, John
2014-07-23 13:33                                     ` Bridgman, John
2014-07-23 13:33                                     ` Bridgman, John
2014-07-23 14:41                                     ` Daniel Vetter
2014-07-23 14:41                                       ` Daniel Vetter
2014-07-23 14:41                                       ` Daniel Vetter
2014-07-23 15:06                                       ` Bridgman, John
2014-07-23 15:06                                         ` Bridgman, John
2014-07-23 15:06                                         ` Bridgman, John
2014-07-23 15:12                                         ` Bridgman, John
2014-07-23 15:12                                           ` Bridgman, John
2014-07-23 15:12                                           ` Bridgman, John
2014-07-23 20:59             ` Jesse Barnes
2014-07-23 21:46               ` Bridgman, John
2014-07-23 22:01                 ` Oded Gabbay
2014-07-24 15:44                   ` Jerome Glisse
2014-07-24 15:44                     ` Jerome Glisse
2014-07-24 17:35                     ` Alex Deucher
2014-07-24 17:35                       ` Alex Deucher
2014-07-24 18:47                       ` Jerome Glisse
2014-07-24 18:47                         ` Jerome Glisse
2014-07-24 18:57                         ` Oded Gabbay
2014-07-24 18:57                           ` Oded Gabbay
2014-07-24 20:26                           ` Jerome Glisse
2014-07-24 20:26                             ` Jerome Glisse
