* [PATCH v7 0/6] Proposal for a GPU cgroup controller
@ 2022-05-10 23:56 ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-10 23:56 UTC (permalink / raw)
  To: tjmercier, Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan
  Cc: daniel, jstultz, cmllamas, kaleshsingh, Kenny.Ho, mkoutny, skhan,
	kernel-team, cgroups, linux-doc, linux-kernel, linux-media,
	dri-devel, linaro-mm-sig, linux-kselftest

This patch series revisits the proposal for a GPU cgroup controller to
track and limit memory allocations by various device/allocator
subsystems. The patch series also contains a simple prototype to
illustrate how Android intends to implement DMA-BUF allocator
attribution using the GPU cgroup controller. The prototype does not
include resource limit enforcements.

Changelog:
v7:
Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
This means gpucg_register_bucket now returns an internally allocated
struct gpucg_bucket.

Move all public function documentation to the cgroup_gpu.h header.

Remove comment in documentation about duplicate name rejection which
is not relevant to cgroups users per Michal Koutný.

v6:
Move documentation into cgroup-v2.rst per Tejun Heo.

Rename BINDER_FD{A}_FLAG_SENDER_NO_NEED ->
BINDER_FD{A}_FLAG_XFER_CHARGE per Carlos Llamas.

Return error on transfer failure per Carlos Llamas.

v5:
Rebase on top of v5.18-rc3

Drop the global GPU cgroup "total" (sum of all device totals) portion
of the design since there is no currently known use for this per
Tejun Heo.

Fix commit message which still contained the old name for
dma_buf_transfer_charge per Michal Koutný.

Remove all GPU cgroup code except what's necessary to support charge transfer
from dma_buf. Previously charging was done in export, but for non-Android
graphics use-cases this is not ideal since there may be a delay between
allocation and export, during which time there is no accounting.

Merge the "dmabuf: Use the GPU cgroup charge/uncharge APIs" patch into
"dmabuf: heaps: export system_heap buffers with GPU cgroup charging" as a
result of the above.

Put the charge and uncharge code in the same file (system_heap_allocate,
system_heap_dma_buf_release) instead of splitting them between the heap and
the dma_buf_release. This avoids asymmetric management of the gpucg charges.

Modify the dma_buf_transfer_charge API to accept a task_struct instead
of a gpucg. This avoids requiring the caller to manage the refcount
of the gpucg upon failure and confusing ownership transfer logic.

Support all strings for gpucg_register_bucket instead of just string
literals.

Enforce globally unique gpucg_bucket names.

Constrain gpucg_bucket name lengths to 64 bytes.

Append "-heap" to gpucg_bucket names from dmabuf-heaps.

Drop patch 7 from the series, which changed the types of
binder_transaction_data's sender_pid and sender_euid fields. This was
done in another commit here:
https://lore.kernel.org/all/20220210021129.3386083-4-masahiroy@kernel.org/

Rename:
  gpucg_try_charge -> gpucg_charge
  find_cg_rpool_locked -> cg_rpool_find_locked
  init_cg_rpool -> cg_rpool_init
  get_cg_rpool_locked -> cg_rpool_get_locked
  "gpu cgroup controller" -> "GPU controller"
  gpucg_device -> gpucg_bucket
  usage -> size

Tests:
  Support both binder_fd_array_object and binder_fd_object. This is
  necessary because new versions of Android will use binder_fd_object
  instead of binder_fd_array_object, and we need to support both.

  Tests for both binder_fd_array_object and binder_fd_object.

  For binder_utils return error codes instead of
  struct binder{fs}_ctx.

  Use ifdef __ANDROID__ to choose platform-dependent temp path instead
  of a runtime fallback.

  Ensure binderfs_mntpt ends with a trailing '/' character instead of
  prepending it where used.

v4:
Skip test if not run as root per Shuah Khan

Add better test logging for abnormal child termination per Shuah Khan

Adjust ordering of charge/uncharge during transfer to avoid potentially
hitting cgroup limit per Michal Koutný

Adjust gpucg_try_charge critical section for charge transfer functionality

Fix uninitialized return code error for dmabuf_try_charge error case

v3:
Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz

Use more common dual author commit message format per John Stultz

Remove android from binder changes title per Todd Kjos

Add a kselftest for this new behavior per Greg Kroah-Hartman

Include details on behavior for all combinations of kernel/userspace
versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.

Fix pid and uid types in binder UAPI header

v2:
See the previous revision of this change submitted by Hridya Valsaraju
at: https://lore.kernel.org/all/20220115010622.3185921-1-hridya@google.com/

Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian König. Pointers to struct gpucg and struct gpucg_device
tracking the current associations were added to the dma_buf struct to
achieve this.

Fix incorrect Kconfig help section indentation per Randy Dunlap.

History of the GPU cgroup controller
====================================
The GPU/DRM cgroup controller came into being when a consensus[1]
was reached that the resources it tracked were unsuitable to be integrated
into memcg. Originally, the proposed controller was specific to the DRM
subsystem and was intended to track GEM buffers and GPU-specific
resources[2]. To help establish a unified memory accounting model for the
GPU and all related subsystems, Daniel Vetter suggested moving it out of
the DRM subsystem so that other DMA-BUF exporters can use it as well[3].
This RFC proposes an interface that does the same.

[1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
[2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
[3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/

Hridya Valsaraju (3):
  gpu: rfc: Proposal for a GPU cgroup controller
  cgroup: gpu: Add a cgroup controller for allocator attribution of GPU
    memory
  binder: Add flags to relinquish ownership of fds

T.J. Mercier (3):
  dmabuf: heaps: export system_heap buffers with GPU cgroup charging
  dmabuf: Add gpu cgroup charge transfer function
  selftests: Add binder cgroup gpu memory transfer tests

 Documentation/admin-guide/cgroup-v2.rst       |  23 +
 drivers/android/binder.c                      |  31 +-
 drivers/dma-buf/dma-buf.c                     |  80 ++-
 drivers/dma-buf/dma-heap.c                    |  38 ++
 drivers/dma-buf/heaps/system_heap.c           |  28 +-
 include/linux/cgroup_gpu.h                    | 146 +++++
 include/linux/cgroup_subsys.h                 |   4 +
 include/linux/dma-buf.h                       |  49 +-
 include/linux/dma-heap.h                      |  15 +
 include/uapi/linux/android/binder.h           |  23 +-
 init/Kconfig                                  |   7 +
 kernel/cgroup/Makefile                        |   1 +
 kernel/cgroup/gpu.c                           | 390 +++++++++++++
 .../selftests/drivers/android/binder/Makefile |   8 +
 .../drivers/android/binder/binder_util.c      | 250 +++++++++
 .../drivers/android/binder/binder_util.h      |  32 ++
 .../selftests/drivers/android/binder/config   |   4 +
 .../binder/test_dmabuf_cgroup_transfer.c      | 526 ++++++++++++++++++
 18 files changed, 1632 insertions(+), 23 deletions(-)
 create mode 100644 include/linux/cgroup_gpu.h
 create mode 100644 kernel/cgroup/gpu.c
 create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
 create mode 100644 tools/testing/selftests/drivers/android/binder/config
 create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c

-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v7 1/6] gpu: rfc: Proposal for a GPU cgroup controller
  2022-05-10 23:56 ` T.J. Mercier
@ 2022-05-10 23:56   ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-10 23:56 UTC (permalink / raw)
  To: tjmercier, Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet
  Cc: daniel, hridya, christian.koenig, jstultz, tkjos, cmllamas,
	surenb, kaleshsingh, Kenny.Ho, mkoutny, skhan, kernel-team,
	cgroups, linux-doc, linux-kernel

From: Hridya Valsaraju <hridya@google.com>

This patch adds a proposal for a new GPU cgroup controller for
accounting/limiting GPU and GPU-related memory allocations.
The proposed controller is based on the DRM cgroup controller[1] and
follows the design of the RDMA cgroup controller.

The new cgroup controller would:
* Allow setting per-device limits on the total size of buffers
  allocated by device within a cgroup.
* Expose a per-device/allocator breakdown of the buffers charged to a
  cgroup.

The prototype in the following patches is only for memory accounting
using the GPU cgroup controller and does not implement limit setting.

[1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v7 changes
Remove comment about duplicate name rejection which is not relevant to
cgroups users per Michal Koutný.

v6 changes
Move documentation into cgroup-v2.rst per Tejun Heo.

v5 changes
Drop the global GPU cgroup "total" (sum of all device totals) portion
of the design since there is no currently known use for this per
Tejun Heo.

Update for renamed functions/variables.

v3 changes
Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz.

Use more common dual author commit message format per John Stultz.
---
 Documentation/admin-guide/cgroup-v2.rst | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69d7a6983f78..2e1d26e327c7 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2352,6 +2352,29 @@ first, and stays charged to that cgroup until that resource is freed. Migrating
 a process to a different cgroup does not move the charge to the destination
 cgroup where the process has moved.
 
+
+GPU
+---
+
+The GPU controller accounts for device and system memory allocated by the GPU
+and related subsystems for graphics use. Resource limits are not currently
+supported.
+
+GPU Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+  gpu.memory.current
+	A read-only file containing memory allocations in flat-keyed format. The key
+	is a string representing the device name. The value is the size of the memory
+	charged to the device in bytes. The device names are globally unique.::
+
+	  $ cat /sys/kernel/fs/cgroup1/gpu.memory.current
+	  dev1 4194304
+	  dev2 104857600
+
+	The device name string is set by a device driver when it registers with the
+	GPU cgroup controller to participate in resource accounting.
+
 Others
 ------
 
-- 
2.36.0.512.ge40c2bad7a-goog
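The flat-keyed gpu.memory.current format documented in the patch above is
easy to consume with standard tools. A hypothetical example follows; the
sample values are copied from the documentation snippet, not from real
output, and on a kernel with this series the file would instead be read
from the cgroup directory (the exact mount path is an assumption):

```shell
# Sum the per-bucket sizes from gpu.memory.current-style output.
# On a real system this might be:
#   awk '{ total += $2 } END { print total }' /sys/fs/cgroup/<group>/gpu.memory.current
sample='dev1 4194304
dev2 104857600'
printf '%s\n' "$sample" | awk '{ total += $2 } END { print total }'
```

Because the file is flat-keyed (one "name value" pair per line), per-bucket
values can also be selected with a plain `grep '^dev1 '`.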



* [PATCH v7 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory
@ 2022-05-10 23:56   ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-10 23:56 UTC (permalink / raw)
  To: tjmercier, Tejun Heo, Zefan Li, Johannes Weiner
  Cc: daniel, hridya, christian.koenig, jstultz, tkjos, cmllamas,
	surenb, kaleshsingh, Kenny.Ho, mkoutny, skhan, kernel-team,
	linux-kernel, cgroups

From: Hridya Valsaraju <hridya@google.com>

The cgroup controller provides accounting for GPU and GPU-related
memory allocations. The memory being accounted can be device memory or
memory allocated from pools dedicated to serve GPU-related tasks.

This patch adds APIs to:
-allow a device to register for memory accounting using the GPU cgroup
controller.
-charge and uncharge allocated memory to a cgroup.

When the cgroup controller is enabled, it exposes, for each cgroup, the
amount of memory allocated by each device registered for GPU cgroup
memory accounting.

The API/UAPI can be extended to set per-device/total allocation limits
in the future.

The cgroup controller has been named following the discussion in [1].

[1]: https://lore.kernel.org/amd-gfx/YCJp%2F%2FkMC7YjVMXv@phenom.ffwll.local/

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v7 changes
Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
This means gpucg_register_bucket now returns an internally allocated
struct gpucg_bucket.

Move all public function documentation to the cgroup_gpu.h header.

v5 changes
Support all strings for gpucg_register_device instead of just string
literals.

Enforce globally unique gpucg_bucket names.

Constrain gpucg_bucket name lengths to 64 bytes.

Obtain just a single css refcount instead of nr_pages for each
charge.

Rename:
gpucg_try_charge -> gpucg_charge
find_cg_rpool_locked -> cg_rpool_find_locked
init_cg_rpool -> cg_rpool_init
get_cg_rpool_locked -> cg_rpool_get_locked
"gpu cgroup controller" -> "GPU controller"
gpucg_device -> gpucg_bucket
usage -> size

v4 changes
Adjust gpucg_try_charge critical section for future charge transfer
functionality.

v3 changes
Use more common dual author commit message format per John Stultz.

v2 changes
Fix incorrect Kconfig help section indentation per Randy Dunlap.
---
 include/linux/cgroup_gpu.h    | 122 ++++++++++++
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |   7 +
 kernel/cgroup/Makefile        |   1 +
 kernel/cgroup/gpu.c           | 339 ++++++++++++++++++++++++++++++++++
 5 files changed, 473 insertions(+)
 create mode 100644 include/linux/cgroup_gpu.h
 create mode 100644 kernel/cgroup/gpu.c

diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
new file mode 100644
index 000000000000..cb228a16aa1f
--- /dev/null
+++ b/include/linux/cgroup_gpu.h
@@ -0,0 +1,122 @@
+/* SPDX-License-Identifier: MIT
+ * Copyright 2019 Advanced Micro Devices, Inc.
+ * Copyright (C) 2022 Google LLC.
+ */
+#ifndef _CGROUP_GPU_H
+#define _CGROUP_GPU_H
+
+#include <linux/cgroup.h>
+
+#define GPUCG_BUCKET_NAME_MAX_LEN 64
+
+struct gpucg;
+struct gpucg_bucket;
+
+#ifdef CONFIG_CGROUP_GPU
+
+/**
+ * css_to_gpucg - get the corresponding gpucg ref from a cgroup_subsys_state
+ * @css: the target cgroup_subsys_state
+ *
+ * Returns: gpu cgroup that contains the @css
+ */
+struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css);
+
+/**
+ * gpucg_get - get the gpucg reference that a task belongs to
+ * @task: the target task
+ *
+ * This increases the reference count of the css that the @task belongs to.
+ *
+ * Returns: reference to the gpu cgroup the task belongs to.
+ */
+struct gpucg *gpucg_get(struct task_struct *task);
+
+/**
+ * gpucg_put - put a gpucg reference
+ * @gpucg: the target gpucg
+ *
+ * Put a reference obtained via gpucg_get
+ */
+void gpucg_put(struct gpucg *gpucg);
+
+/**
+ * gpucg_parent - find the parent of a gpu cgroup
+ * @cg: the target gpucg
+ *
+ * This does not increase the reference count of the parent cgroup
+ *
+ * Returns: parent gpu cgroup of @cg
+ */
+struct gpucg *gpucg_parent(struct gpucg *cg);
+
+/**
+ * gpucg_charge - charge memory to the specified gpucg and gpucg_bucket.
+ * The caller must hold a reference to @gpucg obtained through gpucg_get(); the
+ * charged size is rounded up to a whole multiple of the page size.
+ *
+ * @gpucg: The gpu cgroup to charge the memory to.
+ * @bucket: The bucket to charge the memory to.
+ * @size: The size of memory to charge in bytes.
+ *        This size will be rounded up to the nearest page size.
+ *
+ * Return: 0 on success, or a negative error code otherwise.
+ */
+int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+/**
+ * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_bucket.
+ * The caller must hold a reference to @gpucg obtained through gpucg_get().
+ *
+ * @gpucg: The gpu cgroup to uncharge the memory from.
+ * @bucket: The bucket to uncharge the memory from.
+ * @size: The size of memory to uncharge in bytes.
+ *        This size will be rounded up to the nearest page size.
+ */
+void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+/**
+ * gpucg_register_bucket - Registers a bucket for memory accounting using the GPU cgroup controller.
+ *
+ * @name: Pointer to a null-terminated string naming the bucket. The name must be
+ *        globally unique and must not exceed @GPUCG_BUCKET_NAME_MAX_LEN bytes.
+ *
+ * @name is copied internally, so the caller's string need not remain valid.
+ *
+ * Returns a pointer to a newly allocated bucket on success, or an ERR_PTR-encoded
+ * error otherwise. As buckets cannot be unregistered, the returned bucket is never freed.
+ */
+struct gpucg_bucket *gpucg_register_bucket(const char *name);
+#else /* CONFIG_CGROUP_GPU */
+
+static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
+{
+	return NULL;
+}
+
+static inline struct gpucg *gpucg_get(struct task_struct *task)
+{
+	return NULL;
+}
+
+static inline void gpucg_put(struct gpucg *gpucg) {}
+
+static inline struct gpucg *gpucg_parent(struct gpucg *cg)
+{
+	return NULL;
+}
+
+static inline int gpucg_charge(struct gpucg *gpucg,
+			       struct gpucg_bucket *bucket,
+			       u64 size)
+{
+	return 0;
+}
+
+static inline void gpucg_uncharge(struct gpucg *gpucg,
+				  struct gpucg_bucket *bucket,
+				  u64 size) {}
+
+static inline struct gpucg_bucket *gpucg_register_bucket(const char *name) { return NULL; }
+#endif /* CONFIG_CGROUP_GPU */
+#endif /* _CGROUP_GPU_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 445235487230..46a2a7b93c41 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
 SUBSYS(misc)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_GPU)
+SUBSYS(gpu)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index ddcbefe535e9..2e00a190e170 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -984,6 +984,13 @@ config BLK_CGROUP
 
 	See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
 
+config CGROUP_GPU
+	bool "GPU controller (EXPERIMENTAL)"
+	select PAGE_COUNTER
+	help
+	  Provides accounting of memory allocations by the GPU and GPU-related
+	  subsystems. Setting resource limits is planned but not yet supported.
+
 config CGROUP_WRITEBACK
 	bool
 	depends on MEMCG && BLK_CGROUP
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 12f8457ad1f9..be95a5a532fc 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_CGROUP_MISC) += misc.o
 obj-$(CONFIG_CGROUP_DEBUG) += debug.o
+obj-$(CONFIG_CGROUP_GPU) += gpu.o
diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
new file mode 100644
index 000000000000..ad16ea15d427
--- /dev/null
+++ b/kernel/cgroup/gpu.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: MIT
+// Copyright 2019 Advanced Micro Devices, Inc.
+// Copyright (C) 2022 Google LLC.
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_gpu.h>
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/page_counter.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+static struct gpucg *root_gpucg __read_mostly;
+
+/*
+ * Protects list of resource pools maintained on per cgroup basis and list
+ * of buckets registered for memory accounting using the GPU cgroup controller.
+ */
+static DEFINE_MUTEX(gpucg_mutex);
+static LIST_HEAD(gpucg_buckets);
+
+/* The GPU cgroup controller data structure */
+struct gpucg {
+	struct cgroup_subsys_state css;
+
+	/* list of all resource pools that belong to this cgroup */
+	struct list_head rpools;
+};
+
+/* A named entity representing bucket of tracked memory. */
+struct gpucg_bucket {
+	/* list of various resource pools in various cgroups that the bucket is part of */
+	struct list_head rpools;
+
+	/* list of all buckets registered for GPU cgroup accounting */
+	struct list_head bucket_node;
+
+	/* string to be used as identifier for accounting and limit setting */
+	const char *name;
+};
+
+struct gpucg_resource_pool {
+	/* The bucket whose resource usage is tracked by this resource pool */
+	struct gpucg_bucket *bucket;
+
+	/* list of all resource pools for the cgroup */
+	struct list_head cg_node;
+
+	/* list maintained by the gpucg_bucket to keep track of its resource pools */
+	struct list_head bucket_node;
+
+	/* tracks memory usage of the resource pool */
+	struct page_counter total;
+};
+
+static void free_cg_rpool_locked(struct gpucg_resource_pool *rpool)
+{
+	lockdep_assert_held(&gpucg_mutex);
+
+	list_del(&rpool->cg_node);
+	list_del(&rpool->bucket_node);
+	kfree(rpool);
+}
+
+static void gpucg_css_free(struct cgroup_subsys_state *css)
+{
+	struct gpucg_resource_pool *rpool, *tmp;
+	struct gpucg *gpucg = css_to_gpucg(css);
+
+	/* delete all resource pools */
+	mutex_lock(&gpucg_mutex);
+	list_for_each_entry_safe(rpool, tmp, &gpucg->rpools, cg_node)
+		free_cg_rpool_locked(rpool);
+	mutex_unlock(&gpucg_mutex);
+
+	kfree(gpucg);
+}
+
+static struct cgroup_subsys_state *
+gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct gpucg *gpucg, *parent;
+
+	gpucg = kzalloc(sizeof(struct gpucg), GFP_KERNEL);
+	if (!gpucg)
+		return ERR_PTR(-ENOMEM);
+
+	parent = css_to_gpucg(parent_css);
+	if (!parent)
+		root_gpucg = gpucg;
+
+	INIT_LIST_HEAD(&gpucg->rpools);
+
+	return &gpucg->css;
+}
+
+static struct gpucg_resource_pool *cg_rpool_find_locked(
+	struct gpucg *cg,
+	struct gpucg_bucket *bucket)
+{
+	struct gpucg_resource_pool *rpool;
+
+	lockdep_assert_held(&gpucg_mutex);
+
+	list_for_each_entry(rpool, &cg->rpools, cg_node)
+		if (rpool->bucket == bucket)
+			return rpool;
+
+	return NULL;
+}
+
+static struct gpucg_resource_pool *cg_rpool_init(struct gpucg *cg,
+						 struct gpucg_bucket *bucket)
+{
+	struct gpucg_resource_pool *rpool = kzalloc(sizeof(*rpool),
+							GFP_KERNEL);
+	if (!rpool)
+		return ERR_PTR(-ENOMEM);
+
+	rpool->bucket = bucket;
+
+	page_counter_init(&rpool->total, NULL);
+	INIT_LIST_HEAD(&rpool->cg_node);
+	INIT_LIST_HEAD(&rpool->bucket_node);
+	list_add_tail(&rpool->cg_node, &cg->rpools);
+	list_add_tail(&rpool->bucket_node, &bucket->rpools);
+
+	return rpool;
+}
+
+/**
+ * get_cg_rpool_locked - find the resource pool for the specified bucket and
+ * specified cgroup. If the resource pool does not exist for the cg, it is
+ * created in a hierarchical manner in the cgroup and its ancestor cgroups who
+ * do not already have a resource pool entry for the bucket.
+ *
+ * @cg: The cgroup to find the resource pool for.
+ * @bucket: The bucket associated with the returned resource pool.
+ *
+ * Return: return resource pool entry corresponding to the specified bucket in
+ * the specified cgroup (hierarchically creating them if not existing already).
+ *
+ */
+static struct gpucg_resource_pool *
+cg_rpool_get_locked(struct gpucg *cg, struct gpucg_bucket *bucket)
+{
+	struct gpucg *parent_cg, *p, *stop_cg;
+	struct gpucg_resource_pool *rpool, *tmp_rpool;
+	struct gpucg_resource_pool *parent_rpool = NULL, *leaf_rpool = NULL;
+
+	rpool = cg_rpool_find_locked(cg, bucket);
+	if (rpool)
+		return rpool;
+
+	stop_cg = cg;
+	do {
+		rpool = cg_rpool_init(stop_cg, bucket);
+		if (IS_ERR(rpool))
+			goto err;
+
+		if (!leaf_rpool)
+			leaf_rpool = rpool;
+
+		stop_cg = gpucg_parent(stop_cg);
+		if (!stop_cg)
+			break;
+
+		rpool = cg_rpool_find_locked(stop_cg, bucket);
+	} while (!rpool);
+
+	/*
+	 * Re-initialize page counters of all rpools created in this invocation
+	 * to enable hierarchical charging.
+	 * stop_cg is the first ancestor cg who already had a resource pool for
+	 * the bucket. It can also be NULL if no ancestors had a pre-existing
+	 * resource pool for the bucket before this invocation.
+	 */
+	rpool = leaf_rpool;
+	for (p = cg; p != stop_cg; p = parent_cg) {
+		parent_cg = gpucg_parent(p);
+		if (!parent_cg)
+			break;
+		parent_rpool = cg_rpool_find_locked(parent_cg, bucket);
+		page_counter_init(&rpool->total, &parent_rpool->total);
+
+		rpool = parent_rpool;
+	}
+
+	return leaf_rpool;
+err:
+	for (p = cg; p != stop_cg; p = gpucg_parent(p)) {
+		tmp_rpool = cg_rpool_find_locked(p, bucket);
+		free_cg_rpool_locked(tmp_rpool);
+	}
+	return rpool;
+}
+
+struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct gpucg, css) : NULL;
+}
+
+struct gpucg *gpucg_get(struct task_struct *task)
+{
+	if (!cgroup_subsys_enabled(gpu_cgrp_subsys))
+		return NULL;
+	return css_to_gpucg(task_get_css(task, gpu_cgrp_id));
+}
+
+void gpucg_put(struct gpucg *gpucg)
+{
+	if (gpucg)
+		css_put(&gpucg->css);
+}
+
+struct gpucg *gpucg_parent(struct gpucg *cg)
+{
+	return css_to_gpucg(cg->css.parent);
+}
+
+int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
+{
+	struct page_counter *counter;
+	u64 nr_pages;
+	struct gpucg_resource_pool *rp;
+	int ret = 0;
+
+	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+	mutex_lock(&gpucg_mutex);
+	rp = cg_rpool_get_locked(gpucg, bucket);
+	/*
+	 * Continue to hold gpucg_mutex because we use it to block charges while transfers are in
+	 * progress to avoid potentially exceeding a limit.
+	 */
+	if (IS_ERR(rp)) {
+		mutex_unlock(&gpucg_mutex);
+		return PTR_ERR(rp);
+	}
+
+	if (page_counter_try_charge(&rp->total, nr_pages, &counter))
+		css_get(&gpucg->css);
+	else
+		ret = -ENOMEM;
+	mutex_unlock(&gpucg_mutex);
+
+	return ret;
+}
+
+void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
+{
+	u64 nr_pages;
+	struct gpucg_resource_pool *rp;
+
+	mutex_lock(&gpucg_mutex);
+	rp = cg_rpool_find_locked(gpucg, bucket);
+	/*
+	 * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is freed and there are
+	 * active refs on gpucg. Uncharges are fine while transfers are in progress since there is
+	 * no potential to exceed a limit while uncharging and transferring.
+	 */
+	mutex_unlock(&gpucg_mutex);
+
+	if (unlikely(!rp)) {
+		pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n");
+		return;
+	}
+
+	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	page_counter_uncharge(&rp->total, nr_pages);
+	css_put(&gpucg->css);
+}
+
+struct gpucg_bucket *gpucg_register_bucket(const char *name)
+{
+	struct gpucg_bucket *bucket, *b;
+
+	if (!name)
+		return ERR_PTR(-EINVAL);
+
+	if (strlen(name) >= GPUCG_BUCKET_NAME_MAX_LEN)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	bucket = kzalloc(sizeof(struct gpucg_bucket), GFP_KERNEL);
+	if (!bucket)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&bucket->bucket_node);
+	INIT_LIST_HEAD(&bucket->rpools);
+	bucket->name = kstrdup_const(name, GFP_KERNEL);
+
+	mutex_lock(&gpucg_mutex);
+	list_for_each_entry(b, &gpucg_buckets, bucket_node) {
+		if (strncmp(b->name, bucket->name, GPUCG_BUCKET_NAME_MAX_LEN) == 0) {
+			mutex_unlock(&gpucg_mutex);
+			kfree_const(bucket->name);
+			kfree(bucket);
+			return ERR_PTR(-EEXIST);
+		}
+	}
+	list_add_tail(&bucket->bucket_node, &gpucg_buckets);
+	mutex_unlock(&gpucg_mutex);
+
+	return bucket;
+}
+
+static int gpucg_resource_show(struct seq_file *sf, void *v)
+{
+	struct gpucg_resource_pool *rpool;
+	struct gpucg *cg = css_to_gpucg(seq_css(sf));
+
+	mutex_lock(&gpucg_mutex);
+	list_for_each_entry(rpool, &cg->rpools, cg_node) {
+		seq_printf(sf, "%s %lu\n", rpool->bucket->name,
+			   page_counter_read(&rpool->total) * PAGE_SIZE);
+	}
+	mutex_unlock(&gpucg_mutex);
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name = "memory.current",
+		.seq_show = gpucg_resource_show,
+	},
+	{ }     /* terminate */
+};
+
+struct cgroup_subsys gpu_cgrp_subsys = {
+	.css_alloc      = gpucg_css_alloc,
+	.css_free       = gpucg_css_free,
+	.early_init     = false,
+	.legacy_cftypes = files,
+	.dfl_cftypes    = files,
+};
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v7 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory
@ 2022-05-10 23:56   ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-10 23:56 UTC (permalink / raw)
  To: tjmercier-hpIqsD4AKlfQT0dZR+AlfA, Tejun Heo, Zefan Li, Johannes Weiner
  Cc: daniel-/w4YWyX8dFk, hridya-hpIqsD4AKlfQT0dZR+AlfA,
	christian.koenig-5C7GfCeVMHo, jstultz-hpIqsD4AKlfQT0dZR+AlfA,
	tkjos-z5hGa2qSFaRBDgjK7y7TUQ, cmllamas-hpIqsD4AKlfQT0dZR+AlfA,
	surenb-hpIqsD4AKlfQT0dZR+AlfA,
	kaleshsingh-hpIqsD4AKlfQT0dZR+AlfA, Kenny.Ho-5C7GfCeVMHo,
	mkoutny-IBi9RG/b67k, skhan-hQyY1W1yCW8ekmWlsbkhG0B+6BGkLq7r,
	kernel-team-z5hGa2qSFaRBDgjK7y7TUQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	cgroups-u79uwXL29TY76Z2rM5mHXA

From: Hridya Valsaraju <hridya-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

The cgroup controller provides accounting for GPU and GPU-related
memory allocations. The memory being accounted can be device memory or
memory allocated from pools dedicated to serve GPU-related tasks.

This patch adds APIs to:
-allow a device to register for memory accounting using the GPU cgroup
controller.
-charge and uncharge allocated memory to a cgroup.

When the cgroup controller is enabled, it would expose information about
the memory allocated by each device(registered for GPU cgroup memory
accounting) for each cgroup.

The API/UAPI can be extended to set per-device/total allocation limits
in the future.

The cgroup controller has been named following the discussion in [1].

[1]: https://lore.kernel.org/amd-gfx/YCJp%2F%2FkMC7YjVMXv-dv86pmgwkMBes7Z6vYuT8fd9D2ou9A/h@public.gmane.orgl/

Signed-off-by: Hridya Valsaraju <hridya-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Signed-off-by: T.J. Mercier <tjmercier-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

---
v7 changes
Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
This means gpucg_register_bucket now returns an internally allocated
struct gpucg_bucket.

Move all public function documentation to the cgroup_gpu.h header.

v5 changes
Support all strings for gpucg_register_device instead of just string
literals.

Enforce globally unique gpucg_bucket names.

Constrain gpucg_bucket name lengths to 64 bytes.

Obtain just a single css refcount instead of nr_pages for each
charge.

Rename:
gpucg_try_charge -> gpucg_charge
find_cg_rpool_locked -> cg_rpool_find_locked
init_cg_rpool -> cg_rpool_init
get_cg_rpool_locked -> cg_rpool_get_locked
"gpu cgroup controller" -> "GPU controller"
gpucg_device -> gpucg_bucket
usage -> size

v4 changes
Adjust gpucg_try_charge critical section for future charge transfer
functionality.

v3 changes
Use more common dual author commit message format per John Stultz.

v2 changes
Fix incorrect Kconfig help section indentation per Randy Dunlap.
---
 include/linux/cgroup_gpu.h    | 122 ++++++++++++
 include/linux/cgroup_subsys.h |   4 +
 init/Kconfig                  |   7 +
 kernel/cgroup/Makefile        |   1 +
 kernel/cgroup/gpu.c           | 339 ++++++++++++++++++++++++++++++++++
 5 files changed, 473 insertions(+)
 create mode 100644 include/linux/cgroup_gpu.h
 create mode 100644 kernel/cgroup/gpu.c

diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
new file mode 100644
index 000000000000..cb228a16aa1f
--- /dev/null
+++ b/include/linux/cgroup_gpu.h
@@ -0,0 +1,122 @@
+/* SPDX-License-Identifier: MIT
+ * Copyright 2019 Advanced Micro Devices, Inc.
+ * Copyright (C) 2022 Google LLC.
+ */
+#ifndef _CGROUP_GPU_H
+#define _CGROUP_GPU_H
+
+#include <linux/cgroup.h>
+
+#define GPUCG_BUCKET_NAME_MAX_LEN 64
+
+struct gpucg;
+struct gpucg_bucket;
+
+#ifdef CONFIG_CGROUP_GPU
+
+/**
+ * css_to_gpucg - get the corresponding gpucg ref from a cgroup_subsys_state
+ * @css: the target cgroup_subsys_state
+ *
+ * Returns: gpu cgroup that contains the @css
+ */
+struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css);
+
+/**
+ * gpucg_get - get the gpucg reference that a task belongs to
+ * @task: the target task
+ *
+ * This increases the reference count of the css that the @task belongs to.
+ *
+ * Returns: reference to the gpu cgroup the task belongs to.
+ */
+struct gpucg *gpucg_get(struct task_struct *task);
+
+/**
+ * gpucg_put - put a gpucg reference
+ * @gpucg: the target gpucg
+ *
+ * Put a reference obtained via gpucg_get
+ */
+void gpucg_put(struct gpucg *gpucg);
+
+/**
+ * gpucg_parent - find the parent of a gpu cgroup
+ * @cg: the target gpucg
+ *
+ * This does not increase the reference count of the parent cgroup
+ *
+ * Returns: parent gpu cgroup of @cg
+ */
+struct gpucg *gpucg_parent(struct gpucg *cg);
+
+/**
+ * gpucg_charge - charge memory to the specified gpucg and gpucg_bucket.
+ * Caller must hold a reference to @gpucg obtained through gpucg_get(). The size of the memory is
+ * rounded up to be a multiple of the page size.
+ *
+ * @gpucg: The gpu cgroup to charge the memory to.
+ * @bucket: The bucket to charge the memory to.
+ * @size: The size of memory to charge in bytes.
+ *        This size will be rounded up to the nearest page size.
+ *
+ * Return: returns 0 if the charging is successful and otherwise returns an error code.
+ */
+int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+/**
+ * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_bucket.
+ * The caller must hold a reference to @gpucg obtained through gpucg_get().
+ *
+ * @gpucg: The gpu cgroup to uncharge the memory from.
+ * @bucket: The bucket to uncharge the memory from.
+ * @size: The size of memory to uncharge in bytes.
+ *        This size will be rounded up to the nearest page size.
+ */
+void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
+
+/**
+ * gpucg_register_bucket - Registers a bucket for memory accounting using the GPU cgroup controller.
+ *
+ * @name: Pointer to a null-terminated string to denote the name of the bucket. This name should be
+ *        globally unique, and should not exceed @GPUCG_BUCKET_NAME_MAX_LEN bytes.
+ *
+ * @bucket must remain valid. @name will be copied.
+ *
+ * Returns a pointer to a newly allocated bucket on success, or an errno code otherwise. As buckets
+ * cannot be unregistered, this can never be freed.
+ */
+struct gpucg_bucket *gpucg_register_bucket(const char *name);
+#else /* CONFIG_CGROUP_GPU */
+
+static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
+{
+	return NULL;
+}
+
+static inline struct gpucg *gpucg_get(struct task_struct *task)
+{
+	return NULL;
+}
+
+static inline void gpucg_put(struct gpucg *gpucg) {}
+
+static inline struct gpucg *gpucg_parent(struct gpucg *cg)
+{
+	return NULL;
+}
+
+static inline int gpucg_charge(struct gpucg *gpucg,
+			       struct gpucg_bucket *bucket,
+			       u64 size)
+{
+	return 0;
+}
+
+static inline void gpucg_uncharge(struct gpucg *gpucg,
+				  struct gpucg_bucket *bucket,
+				  u64 size) {}
+
+static inline struct gpucg_bucket *gpucg_register_bucket(const char *name) {}
+#endif /* CONFIG_CGROUP_GPU */
+#endif /* _CGROUP_GPU_H */
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 445235487230..46a2a7b93c41 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -65,6 +65,10 @@ SUBSYS(rdma)
 SUBSYS(misc)
 #endif
 
+#if IS_ENABLED(CONFIG_CGROUP_GPU)
+SUBSYS(gpu)
+#endif
+
 /*
  * The following subsystems are not supported on the default hierarchy.
  */
diff --git a/init/Kconfig b/init/Kconfig
index ddcbefe535e9..2e00a190e170 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -984,6 +984,13 @@ config BLK_CGROUP
 
 	See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
 
+config CGROUP_GPU
+	bool "GPU controller (EXPERIMENTAL)"
+	select PAGE_COUNTER
+	help
+	  Provides accounting and limit setting for memory allocations by the GPU and
+	  GPU-related subsystems.
+
 config CGROUP_WRITEBACK
 	bool
 	depends on MEMCG && BLK_CGROUP
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 12f8457ad1f9..be95a5a532fc 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_CGROUP_RDMA) += rdma.o
 obj-$(CONFIG_CPUSETS) += cpuset.o
 obj-$(CONFIG_CGROUP_MISC) += misc.o
 obj-$(CONFIG_CGROUP_DEBUG) += debug.o
+obj-$(CONFIG_CGROUP_GPU) += gpu.o
diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
new file mode 100644
index 000000000000..ad16ea15d427
--- /dev/null
+++ b/kernel/cgroup/gpu.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: MIT
+// Copyright 2019 Advanced Micro Devices, Inc.
+// Copyright (C) 2022 Google LLC.
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_gpu.h>
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/list.h>
+#include <linux/mm.h>
+#include <linux/page_counter.h>
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+
+static struct gpucg *root_gpucg __read_mostly;
+
+/*
+ * Protects list of resource pools maintained on per cgroup basis and list
+ * of buckets registered for memory accounting using the GPU cgroup controller.
+ */
+static DEFINE_MUTEX(gpucg_mutex);
+static LIST_HEAD(gpucg_buckets);
+
+/* The GPU cgroup controller data structure */
+struct gpucg {
+	struct cgroup_subsys_state css;
+
+	/* list of all resource pools that belong to this cgroup */
+	struct list_head rpools;
+};
+
+/* A named entity representing bucket of tracked memory. */
+struct gpucg_bucket {
+	/* list of various resource pools in various cgroups that the bucket is part of */
+	struct list_head rpools;
+
+	/* list of all buckets registered for GPU cgroup accounting */
+	struct list_head bucket_node;
+
+	/* string to be used as identifier for accounting and limit setting */
+	const char *name;
+};
+
+struct gpucg_resource_pool {
+	/* The bucket whose resource usage is tracked by this resource pool */
+	struct gpucg_bucket *bucket;
+
+	/* list of all resource pools for the cgroup */
+	struct list_head cg_node;
+
+	/* list maintained by the gpucg_bucket to keep track of its resource pools */
+	struct list_head bucket_node;
+
+	/* tracks memory usage of the resource pool */
+	struct page_counter total;
+};
+
+static void free_cg_rpool_locked(struct gpucg_resource_pool *rpool)
+{
+	lockdep_assert_held(&gpucg_mutex);
+
+	list_del(&rpool->cg_node);
+	list_del(&rpool->bucket_node);
+	kfree(rpool);
+}
+
+static void gpucg_css_free(struct cgroup_subsys_state *css)
+{
+	struct gpucg_resource_pool *rpool, *tmp;
+	struct gpucg *gpucg = css_to_gpucg(css);
+
+	// delete all resource pools
+	mutex_lock(&gpucg_mutex);
+	list_for_each_entry_safe(rpool, tmp, &gpucg->rpools, cg_node)
+		free_cg_rpool_locked(rpool);
+	mutex_unlock(&gpucg_mutex);
+
+	kfree(gpucg);
+}
+
+static struct cgroup_subsys_state *
+gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
+{
+	struct gpucg *gpucg, *parent;
+
+	gpucg = kzalloc(sizeof(struct gpucg), GFP_KERNEL);
+	if (!gpucg)
+		return ERR_PTR(-ENOMEM);
+
+	parent = css_to_gpucg(parent_css);
+	if (!parent)
+		root_gpucg = gpucg;
+
+	INIT_LIST_HEAD(&gpucg->rpools);
+
+	return &gpucg->css;
+}
+
+static struct gpucg_resource_pool *cg_rpool_find_locked(
+	struct gpucg *cg,
+	struct gpucg_bucket *bucket)
+{
+	struct gpucg_resource_pool *rpool;
+
+	lockdep_assert_held(&gpucg_mutex);
+
+	list_for_each_entry(rpool, &cg->rpools, cg_node)
+		if (rpool->bucket == bucket)
+			return rpool;
+
+	return NULL;
+}
+
+static struct gpucg_resource_pool *cg_rpool_init(struct gpucg *cg,
+						 struct gpucg_bucket *bucket)
+{
+	struct gpucg_resource_pool *rpool = kzalloc(sizeof(*rpool),
+							GFP_KERNEL);
+	if (!rpool)
+		return ERR_PTR(-ENOMEM);
+
+	rpool->bucket = bucket;
+
+	page_counter_init(&rpool->total, NULL);
+	INIT_LIST_HEAD(&rpool->cg_node);
+	INIT_LIST_HEAD(&rpool->bucket_node);
+	list_add_tail(&rpool->cg_node, &cg->rpools);
+	list_add_tail(&rpool->bucket_node, &bucket->rpools);
+
+	return rpool;
+}
+
+/**
+ * get_cg_rpool_locked - find the resource pool for the specified bucket and
+ * specified cgroup. If the resource pool does not exist for the cg, it is
+ * created in a hierarchical manner in the cgroup and its ancestor cgroups who
+ * do not already have a resource pool entry for the bucket.
+ *
+ * @cg: The cgroup to find the resource pool for.
+ * @bucket: The bucket associated with the returned resource pool.
+ *
+ * Return: return resource pool entry corresponding to the specified bucket in
+ * the specified cgroup (hierarchically creating them if not existing already).
+ *
+ */
+static struct gpucg_resource_pool *
+cg_rpool_get_locked(struct gpucg *cg, struct gpucg_bucket *bucket)
+{
+	struct gpucg *parent_cg, *p, *stop_cg;
+	struct gpucg_resource_pool *rpool, *tmp_rpool;
+	struct gpucg_resource_pool *parent_rpool = NULL, *leaf_rpool = NULL;
+
+	rpool = cg_rpool_find_locked(cg, bucket);
+	if (rpool)
+		return rpool;
+
+	stop_cg = cg;
+	do {
+		rpool = cg_rpool_init(stop_cg, bucket);
+		if (IS_ERR(rpool))
+			goto err;
+
+		if (!leaf_rpool)
+			leaf_rpool = rpool;
+
+		stop_cg = gpucg_parent(stop_cg);
+		if (!stop_cg)
+			break;
+
+		rpool = cg_rpool_find_locked(stop_cg, bucket);
+	} while (!rpool);
+
+	/*
+	 * Re-initialize page counters of all rpools created in this invocation
+	 * to enable hierarchical charging.
+	 * stop_cg is the first ancestor cg who already had a resource pool for
+	 * the bucket. It can also be NULL if no ancestors had a pre-existing
+	 * resource pool for the bucket before this invocation.
+	 */
+	rpool = leaf_rpool;
+	for (p = cg; p != stop_cg; p = parent_cg) {
+		parent_cg = gpucg_parent(p);
+		if (!parent_cg)
+			break;
+		parent_rpool = cg_rpool_find_locked(parent_cg, bucket);
+		page_counter_init(&rpool->total, &parent_rpool->total);
+
+		rpool = parent_rpool;
+	}
+
+	return leaf_rpool;
+err:
+	for (p = cg; p != stop_cg; p = gpucg_parent(p)) {
+		tmp_rpool = cg_rpool_find_locked(p, bucket);
+		free_cg_rpool_locked(tmp_rpool);
+	}
+	return rpool;
+}
+
+struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct gpucg, css) : NULL;
+}
+
+struct gpucg *gpucg_get(struct task_struct *task)
+{
+	if (!cgroup_subsys_enabled(gpu_cgrp_subsys))
+		return NULL;
+	return css_to_gpucg(task_get_css(task, gpu_cgrp_id));
+}
+
+void gpucg_put(struct gpucg *gpucg)
+{
+	if (gpucg)
+		css_put(&gpucg->css);
+}
+
+struct gpucg *gpucg_parent(struct gpucg *cg)
+{
+	return css_to_gpucg(cg->css.parent);
+}
+
+int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
+{
+	struct page_counter *counter;
+	u64 nr_pages;
+	struct gpucg_resource_pool *rp;
+	int ret = 0;
+
+	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+	mutex_lock(&gpucg_mutex);
+	rp = cg_rpool_get_locked(gpucg, bucket);
+	/*
+	 * Continue to hold gpucg_mutex because we use it to block charges while transfers are in
+	 * progress to avoid potentially exceeding a limit.
+	 */
+	if (IS_ERR(rp)) {
+		mutex_unlock(&gpucg_mutex);
+		return PTR_ERR(rp);
+	}
+
+	if (page_counter_try_charge(&rp->total, nr_pages, &counter))
+		css_get(&gpucg->css);
+	else
+		ret = -ENOMEM;
+	mutex_unlock(&gpucg_mutex);
+
+	return ret;
+}
+
+void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
+{
+	u64 nr_pages;
+	struct gpucg_resource_pool *rp;
+
+	mutex_lock(&gpucg_mutex);
+	rp = cg_rpool_find_locked(gpucg, bucket);
+	/*
+	 * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is freed and there are
+	 * active refs on gpucg. Uncharges are fine while transfers are in progress since there is
+	 * no potential to exceed a limit while uncharging and transferring.
+	 */
+	mutex_unlock(&gpucg_mutex);
+
+	if (unlikely(!rp)) {
+		pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n");
+		return;
+	}
+
+	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+	page_counter_uncharge(&rp->total, nr_pages);
+	css_put(&gpucg->css);
+}
+
+struct gpucg_bucket *gpucg_register_bucket(const char *name)
+{
+	struct gpucg_bucket *bucket, *b;
+
+	if (!name)
+		return ERR_PTR(-EINVAL);
+
+	if (strlen(name) >= GPUCG_BUCKET_NAME_MAX_LEN)
+		return ERR_PTR(-ENAMETOOLONG);
+
+	bucket = kzalloc(sizeof(struct gpucg_bucket), GFP_KERNEL);
+	if (!bucket)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&bucket->bucket_node);
+	INIT_LIST_HEAD(&bucket->rpools);
+	bucket->name = kstrdup_const(name, GFP_KERNEL);
+	if (!bucket->name) {
+		kfree(bucket);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	mutex_lock(&gpucg_mutex);
+	list_for_each_entry(b, &gpucg_buckets, bucket_node) {
+		if (strncmp(b->name, bucket->name, GPUCG_BUCKET_NAME_MAX_LEN) == 0) {
+			mutex_unlock(&gpucg_mutex);
+			kfree_const(bucket->name);
+			kfree(bucket);
+			return ERR_PTR(-EEXIST);
+		}
+	}
+	list_add_tail(&bucket->bucket_node, &gpucg_buckets);
+	mutex_unlock(&gpucg_mutex);
+
+	return bucket;
+}
+
+static int gpucg_resource_show(struct seq_file *sf, void *v)
+{
+	struct gpucg_resource_pool *rpool;
+	struct gpucg *cg = css_to_gpucg(seq_css(sf));
+
+	mutex_lock(&gpucg_mutex);
+	list_for_each_entry(rpool, &cg->rpools, cg_node) {
+		seq_printf(sf, "%s %lu\n", rpool->bucket->name,
+			   page_counter_read(&rpool->total) * PAGE_SIZE);
+	}
+	mutex_unlock(&gpucg_mutex);
+
+	return 0;
+}
+
+static struct cftype files[] = {
+	{
+		.name = "memory.current",
+		.seq_show = gpucg_resource_show,
+	},
+	{ }     /* terminate */
+};
+
+struct cgroup_subsys gpu_cgrp_subsys = {
+	.css_alloc      = gpucg_css_alloc,
+	.css_free       = gpucg_css_free,
+	.early_init     = false,
+	.legacy_cftypes = files,
+	.dfl_cftypes    = files,
+};
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread
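Both gpucg_charge() and gpucg_uncharge() in the controller above convert the byte size with PAGE_ALIGN(size) >> PAGE_SHIFT, so every buffer is accounted in whole pages. A minimal sketch of that rounding, assuming 4 KiB pages (PAGE_SIZE/PAGE_SHIFT are architecture-dependent):

```python
PAGE_SHIFT = 12              # assumption: 4 KiB pages, as on x86-64 and most arm64 configs
PAGE_SIZE = 1 << PAGE_SHIFT

def charged_pages(size):
    # Mirrors PAGE_ALIGN(size) >> PAGE_SHIFT from gpucg_charge()/gpucg_uncharge():
    # round the byte count up to a page boundary, then convert it to a page count.
    return ((size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)) >> PAGE_SHIFT

print(charged_pages(1))      # a 1-byte buffer is still charged one full page -> 1
print(charged_pages(4096))   # exactly one page -> 1
print(charged_pages(4097))   # one byte over a page boundary -> 2
```

Because charge and uncharge round the same way, the page_counter stays balanced even when callers pass unaligned sizes.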

* [PATCH v7 3/6] dmabuf: heaps: export system_heap buffers with GPU cgroup charging
  2022-05-10 23:56 ` T.J. Mercier
@ 2022-05-10 23:56   ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-10 23:56 UTC (permalink / raw)
  To: tjmercier, Sumit Semwal, Christian König, Benjamin Gaignard,
	Liam Mark, Laura Abbott, Brian Starkey, John Stultz
  Cc: daniel, tj, hridya, jstultz, tkjos, cmllamas, surenb,
	kaleshsingh, Kenny.Ho, mkoutny, skhan, kernel-team, linux-media,
	dri-devel, linaro-mm-sig, linux-kernel

All DMA heaps now register a new GPU cgroup bucket upon creation, and the
system_heap now exports buffers associated with its GPU cgroup bucket for
tracking purposes.

In order to support GPU cgroup charge transfer on a dma-buf, the current
GPU cgroup information must be stored inside the dma-buf struct. For
tracked buffers, exporters include the struct gpucg and struct
gpucg_bucket pointers in the export info which can later be modified if
the charge is migrated to another cgroup.

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
v7 changes
Adapt to new gpucg_register_bucket API.

v5 changes
Merge dmabuf: Use the GPU cgroup charge/uncharge APIs into this patch.

Remove all GPU cgroup code from dma-buf except what's necessary to support
charge transfer. Previously charging was done in export, but for
non-Android graphics use-cases this is not ideal since there may be a
delay between allocation and export, during which time there is no
accounting.

Append "-heap" to gpucg_bucket names.

Charge on allocation instead of export. This should more closely mirror
non-Android use-cases where there is potentially a delay between allocation
and export.

Put the charge and uncharge code in the same file (system_heap_allocate,
system_heap_dma_buf_release) instead of splitting them between the heap and
the dma_buf_release.

Move no-op code to header file to match other files in the series.

v3 changes
Use more common dual author commit message format per John Stultz.

v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian König.
---
 drivers/dma-buf/dma-buf.c           | 19 +++++++++++++
 drivers/dma-buf/dma-heap.c          | 38 ++++++++++++++++++++++++++
 drivers/dma-buf/heaps/system_heap.c | 28 +++++++++++++++++---
 include/linux/dma-buf.h             | 41 +++++++++++++++++++++++------
 include/linux/dma-heap.h            | 15 +++++++++++
 5 files changed, 129 insertions(+), 12 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index df23239b04fc..bc89c44bd9b9 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -462,6 +462,24 @@ static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
  * &dma_buf_ops.
  */
 
+#ifdef CONFIG_CGROUP_GPU
+static void dma_buf_set_gpucg(struct dma_buf *dmabuf, const struct dma_buf_export_info *exp)
+{
+	dmabuf->gpucg = exp->gpucg;
+	dmabuf->gpucg_bucket = exp->gpucg_bucket;
+}
+
+void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+				struct gpucg *gpucg,
+				struct gpucg_bucket *gpucg_bucket)
+{
+	exp_info->gpucg = gpucg;
+	exp_info->gpucg_bucket = gpucg_bucket;
+}
+#else
+static void dma_buf_set_gpucg(struct dma_buf *dmabuf, const struct dma_buf_export_info *exp) {}
+#endif
+
 /**
  * dma_buf_export - Creates a new dma_buf, and associates an anon file
  * with this buffer, so it can be exported.
@@ -527,6 +545,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info)
 	init_waitqueue_head(&dmabuf->poll);
 	dmabuf->cb_in.poll = dmabuf->cb_out.poll = &dmabuf->poll;
 	dmabuf->cb_in.active = dmabuf->cb_out.active = 0;
+	dma_buf_set_gpucg(dmabuf, exp_info);
 
 	if (!resv) {
 		resv = (struct dma_resv *)&dmabuf[1];
diff --git a/drivers/dma-buf/dma-heap.c b/drivers/dma-buf/dma-heap.c
index 8f5848aa144f..48173a66d70d 100644
--- a/drivers/dma-buf/dma-heap.c
+++ b/drivers/dma-buf/dma-heap.c
@@ -7,10 +7,12 @@
  */
 
 #include <linux/cdev.h>
+#include <linux/cgroup_gpu.h>
 #include <linux/debugfs.h>
 #include <linux/device.h>
 #include <linux/dma-buf.h>
 #include <linux/err.h>
+#include <linux/kconfig.h>
 #include <linux/xarray.h>
 #include <linux/list.h>
 #include <linux/slab.h>
@@ -21,6 +23,7 @@
 #include <uapi/linux/dma-heap.h>
 
 #define DEVNAME "dma_heap"
+#define HEAP_NAME_SUFFIX "-heap"
 
 #define NUM_HEAP_MINORS 128
 
@@ -31,6 +34,7 @@
  * @heap_devt		heap device node
  * @list		list head connecting to list of heaps
  * @heap_cdev		heap char device
+ * @gpucg_bucket	gpu cgroup bucket for memory accounting
  *
  * Represents a heap of memory from which buffers can be made.
  */
@@ -41,6 +45,9 @@ struct dma_heap {
 	dev_t heap_devt;
 	struct list_head list;
 	struct cdev heap_cdev;
+#ifdef CONFIG_CGROUP_GPU
+	struct gpucg_bucket *gpucg_bucket;
+#endif
 };
 
 static LIST_HEAD(heap_list);
@@ -216,6 +223,20 @@ const char *dma_heap_get_name(struct dma_heap *heap)
 	return heap->name;
 }
 
+#ifdef CONFIG_CGROUP_GPU
+/**
+ * dma_heap_get_gpucg_bucket() - get struct gpucg_bucket pointer for the heap.
+ * @heap: DMA-Heap to get the gpucg_bucket struct for.
+ *
+ * Returns:
+ * The gpucg_bucket struct pointer for the heap. NULL if the GPU cgroup controller is not enabled.
+ */
+struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap)
+{
+	return heap->gpucg_bucket;
+}
+#endif
+
 struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
 {
 	struct dma_heap *heap, *h, *err_ret;
@@ -228,6 +247,12 @@ struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
 		return ERR_PTR(-EINVAL);
 	}
 
+	if (IS_ENABLED(CONFIG_CGROUP_GPU) && strlen(exp_info->name) + strlen(HEAP_NAME_SUFFIX) >=
+		GPUCG_BUCKET_NAME_MAX_LEN) {
+		pr_err("dma_heap: Name is too long for GPU cgroup\n");
+		return ERR_PTR(-ENAMETOOLONG);
+	}
+
 	if (!exp_info->ops || !exp_info->ops->allocate) {
 		pr_err("dma_heap: Cannot add heap with invalid ops struct\n");
 		return ERR_PTR(-EINVAL);
@@ -253,6 +278,19 @@ struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info)
 	heap->ops = exp_info->ops;
 	heap->priv = exp_info->priv;
 
+	if (IS_ENABLED(CONFIG_CGROUP_GPU)) {
+		char gpucg_bucket_name[GPUCG_BUCKET_NAME_MAX_LEN];
+
+		snprintf(gpucg_bucket_name, sizeof(gpucg_bucket_name), "%s%s",
+			 exp_info->name, HEAP_NAME_SUFFIX);
+
+		heap->gpucg_bucket = gpucg_register_bucket(gpucg_bucket_name);
+		if (IS_ERR(heap->gpucg_bucket)) {
+			err_ret = ERR_CAST(heap->gpucg_bucket);
+			goto err0;
+		}
+	}
+
 	/* Find unused minor number */
 	ret = xa_alloc(&dma_heap_minors, &minor, heap,
 		       XA_LIMIT(0, NUM_HEAP_MINORS - 1), GFP_KERNEL);
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index fcf836ba9c1f..27f686faef00 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -297,6 +297,11 @@ static void system_heap_dma_buf_release(struct dma_buf *dmabuf)
 	}
 	sg_free_table(table);
 	kfree(buffer);
+
+	if (dmabuf->gpucg && dmabuf->gpucg_bucket) {
+		gpucg_uncharge(dmabuf->gpucg, dmabuf->gpucg_bucket, dmabuf->size);
+		gpucg_put(dmabuf->gpucg);
+	}
 }
 
 static const struct dma_buf_ops system_heap_buf_ops = {
@@ -346,11 +351,21 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
 	struct scatterlist *sg;
 	struct list_head pages;
 	struct page *page, *tmp_page;
-	int i, ret = -ENOMEM;
+	struct gpucg *gpucg;
+	struct gpucg_bucket *gpucg_bucket;
+	int i, ret;
+
+	gpucg = gpucg_get(current);
+	gpucg_bucket = dma_heap_get_gpucg_bucket(heap);
+	ret = gpucg_charge(gpucg, gpucg_bucket, len);
+	if (ret)
+		goto put_gpucg;
 
 	buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
-	if (!buffer)
-		return ERR_PTR(-ENOMEM);
+	if (!buffer) {
+		ret = -ENOMEM;
+		goto uncharge_gpucg;
+	}
 
 	INIT_LIST_HEAD(&buffer->attachments);
 	mutex_init(&buffer->lock);
@@ -396,6 +411,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
 	exp_info.size = buffer->len;
 	exp_info.flags = fd_flags;
 	exp_info.priv = buffer;
+	dma_buf_exp_info_set_gpucg(&exp_info, gpucg, gpucg_bucket);
+
 	dmabuf = dma_buf_export(&exp_info);
 	if (IS_ERR(dmabuf)) {
 		ret = PTR_ERR(dmabuf);
@@ -414,7 +431,10 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
 	list_for_each_entry_safe(page, tmp_page, &pages, lru)
 		__free_pages(page, compound_order(page));
 	kfree(buffer);
-
+uncharge_gpucg:
+	gpucg_uncharge(gpucg, gpucg_bucket, len);
+put_gpucg:
+	gpucg_put(gpucg);
 	return ERR_PTR(ret);
 }
 
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 2097760e8e95..8e7c55c830b3 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -13,6 +13,7 @@
 #ifndef __DMA_BUF_H__
 #define __DMA_BUF_H__
 
+#include <linux/cgroup_gpu.h>
 #include <linux/iosys-map.h>
 #include <linux/file.h>
 #include <linux/err.h>
@@ -303,7 +304,7 @@ struct dma_buf {
 	/**
 	 * @size:
 	 *
-	 * Size of the buffer; invariant over the lifetime of the buffer.
+	 * Size of the buffer in bytes; invariant over the lifetime of the buffer.
 	 */
 	size_t size;
 
@@ -453,6 +454,14 @@ struct dma_buf {
 		struct dma_buf *dmabuf;
 	} *sysfs_entry;
 #endif
+
+#ifdef CONFIG_CGROUP_GPU
+	/** @gpucg: Pointer to the GPU cgroup this buffer currently belongs to. */
+	struct gpucg *gpucg;
+
+	/** @gpucg_bucket: Pointer to the GPU cgroup bucket whence this buffer originates. */
+	struct gpucg_bucket *gpucg_bucket;
+#endif
 };
 
 /**
@@ -526,13 +535,15 @@ struct dma_buf_attachment {
 
 /**
  * struct dma_buf_export_info - holds information needed to export a dma_buf
- * @exp_name:	name of the exporter - useful for debugging.
- * @owner:	pointer to exporter module - used for refcounting kernel module
- * @ops:	Attach allocator-defined dma buf ops to the new buffer
- * @size:	Size of the buffer - invariant over the lifetime of the buffer
- * @flags:	mode flags for the file
- * @resv:	reservation-object, NULL to allocate default one
- * @priv:	Attach private data of allocator to this buffer
+ * @exp_name:		name of the exporter - useful for debugging.
+ * @owner:		pointer to exporter module - used for refcounting kernel module
+ * @ops:		Attach allocator-defined dma buf ops to the new buffer
+ * @size:		Size of the buffer in bytes - invariant over the lifetime of the buffer
+ * @flags:		mode flags for the file
+ * @resv:		reservation-object, NULL to allocate default one
+ * @priv:		Attach private data of allocator to this buffer
+ * @gpucg:		Pointer to GPU cgroup this buffer is charged to, or NULL if not charged
+ * @gpucg_bucket:	Pointer to GPU cgroup bucket this buffer comes from, or NULL if not charged
  *
  * This structure holds the information required to export the buffer. Used
  * with dma_buf_export() only.
@@ -545,6 +556,10 @@ struct dma_buf_export_info {
 	int flags;
 	struct dma_resv *resv;
 	void *priv;
+#ifdef CONFIG_CGROUP_GPU
+	struct gpucg *gpucg;
+	struct gpucg_bucket *gpucg_bucket;
+#endif
 };
 
 /**
@@ -630,4 +645,14 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct *,
 		 unsigned long);
 int dma_buf_vmap(struct dma_buf *dmabuf, struct iosys_map *map);
 void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map);
+
+#ifdef CONFIG_CGROUP_GPU
+void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+				struct gpucg *gpucg,
+				struct gpucg_bucket *gpucg_bucket);
+#else /* CONFIG_CGROUP_GPU */
+static inline void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
+					      struct gpucg *gpucg,
+					      struct gpucg_bucket *gpucg_bucket) {}
+#endif /* CONFIG_CGROUP_GPU */
 #endif /* __DMA_BUF_H__ */
diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
index 0c05561cad6e..6321e7636538 100644
--- a/include/linux/dma-heap.h
+++ b/include/linux/dma-heap.h
@@ -10,6 +10,7 @@
 #define _DMA_HEAPS_H
 
 #include <linux/cdev.h>
+#include <linux/cgroup_gpu.h>
 #include <linux/types.h>
 
 struct dma_heap;
@@ -59,6 +60,20 @@ void *dma_heap_get_drvdata(struct dma_heap *heap);
  */
 const char *dma_heap_get_name(struct dma_heap *heap);
 
+#ifdef CONFIG_CGROUP_GPU
+/**
+ * dma_heap_get_gpucg_bucket() - get a pointer to the struct gpucg_bucket for the heap.
+ * @heap: DMA-Heap to retrieve gpucg_bucket for
+ *
+ * Returns:
+ * The gpucg_bucket struct for the heap.
+ */
+struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap);
+#else /* CONFIG_CGROUP_GPU */
+static inline struct gpucg_bucket *dma_heap_get_gpucg_bucket(struct dma_heap *heap)
+{ return NULL; }
+#endif /* CONFIG_CGROUP_GPU */
+
 /**
  * dma_heap_add - adds a heap to dmabuf heaps
  * @exp_info:		information needed to register this heap
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread
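dma_heap_add() above rejects heap names whose suffixed form would not fit in a bucket name, and otherwise registers the bucket as "<heap-name>-heap". A sketch of that naming rule (GPUCG_BUCKET_NAME_MAX_LEN is defined in cgroup_gpu.h and is not shown in this patch; the value 64 below is only an illustrative assumption):

```python
GPUCG_BUCKET_NAME_MAX_LEN = 64   # illustrative assumption; the real limit lives in cgroup_gpu.h
HEAP_NAME_SUFFIX = "-heap"

def gpucg_bucket_name(heap_name):
    # Mirrors the length check and snprintf() in dma_heap_add(): the suffixed
    # name must fit within GPUCG_BUCKET_NAME_MAX_LEN, leaving room for the
    # terminating NUL (hence >= rather than >).
    if len(heap_name) + len(HEAP_NAME_SUFFIX) >= GPUCG_BUCKET_NAME_MAX_LEN:
        raise ValueError("ENAMETOOLONG: name too long for GPU cgroup")
    return heap_name + HEAP_NAME_SUFFIX

print(gpucg_bucket_name("system"))   # -> system-heap
```

With this rule, the "system" DMA heap shows up in gpu.memory.current as the "system-heap" bucket.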

* [PATCH v7 4/6] dmabuf: Add gpu cgroup charge transfer function
  2022-05-10 23:56 ` T.J. Mercier
  (?)
@ 2022-05-10 23:56   ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-10 23:56 UTC (permalink / raw)
  To: tjmercier, Sumit Semwal, Christian König, Tejun Heo,
	Zefan Li, Johannes Weiner
  Cc: daniel, hridya, jstultz, tkjos, cmllamas, surenb, kaleshsingh,
	Kenny.Ho, mkoutny, skhan, kernel-team, linux-media, dri-devel,
	linaro-mm-sig, linux-kernel, cgroups

The dma_buf_transfer_charge function provides a way for processes to
transfer charge of a buffer to a different process. This is essential
for cases where a central allocator process performs allocations for
various subsystems, hands the fd over to the client that requested the
memory, and then drops all of its references to the allocated memory.

Originally-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v5 changes
Fix commit message which still contained the old name for
dma_buf_transfer_charge per Michal Koutný.

Modify the dma_buf_transfer_charge API to accept a task_struct instead
of a gpucg. This avoids requiring the caller to manage the refcount
of the gpucg upon failure and confusing ownership transfer logic.

v4 changes
Adjust ordering of charge/uncharge during transfer to avoid potentially
hitting cgroup limit per Michal Koutný.

v3 changes
Use more common dual author commit message format per John Stultz.

v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian König.
---
 drivers/dma-buf/dma-buf.c  | 57 ++++++++++++++++++++++++++++++++++++++
 include/linux/cgroup_gpu.h | 24 ++++++++++++++++
 include/linux/dma-buf.h    |  6 ++++
 kernel/cgroup/gpu.c        | 51 ++++++++++++++++++++++++++++++++++
 4 files changed, 138 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index bc89c44bd9b9..f3fb844925e2 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1341,6 +1341,63 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map)
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap, DMA_BUF);
 
+/**
+ * dma_buf_transfer_charge - Change the GPU cgroup to which the provided dma_buf is charged.
+ * @dmabuf:	[in]	buffer whose charge will be migrated to a different GPU cgroup
+ * @target:	[in]	the task_struct of the destination process for the GPU cgroup charge
+ *
+ * Only tasks that belong to the same cgroup the buffer is currently charged to
+ * may call this function, otherwise it will return -EPERM.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *target)
+{
+	struct gpucg *current_gpucg, *target_gpucg, *to_release;
+	int ret;
+
+	if (!dmabuf->gpucg || !dmabuf->gpucg_bucket) {
+		/* This dmabuf is not tracked under GPU cgroup accounting */
+		return 0;
+	}
+
+	current_gpucg = gpucg_get(current);
+	target_gpucg = gpucg_get(target);
+	to_release = target_gpucg;
+
+	/* If the source and destination cgroups are the same, don't do anything. */
+	if (current_gpucg == target_gpucg) {
+		ret = 0;
+		goto skip_transfer;
+	}
+
+	/*
+	 * Verify that the cgroup of the process requesting the transfer
+	 * is the same as the one the buffer is currently charged to.
+	 */
+	mutex_lock(&dmabuf->lock);
+	if (current_gpucg != dmabuf->gpucg) {
+		ret = -EPERM;
+		goto err;
+	}
+
+	ret = gpucg_transfer_charge(
+		dmabuf->gpucg, target_gpucg, dmabuf->gpucg_bucket, dmabuf->size);
+	if (ret)
+		goto err;
+
+	to_release = dmabuf->gpucg;
+	dmabuf->gpucg = target_gpucg;
+
+err:
+	mutex_unlock(&dmabuf->lock);
+skip_transfer:
+	gpucg_put(current_gpucg);
+	gpucg_put(to_release);
+	return ret;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_transfer_charge, DMA_BUF);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
index cb228a16aa1f..7eb68f1507fb 100644
--- a/include/linux/cgroup_gpu.h
+++ b/include/linux/cgroup_gpu.h
@@ -75,6 +75,22 @@ int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
  */
 void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
 
+/**
+ * gpucg_transfer_charge - Transfer a GPU charge from one cgroup to another.
+ *
+ * @source:	[in]	The GPU cgroup the charge will be transferred from.
+ * @dest:	[in]	The GPU cgroup the charge will be transferred to.
+ * @bucket:	[in]	The GPU cgroup bucket corresponding to the charge.
+ * @size:	[in]	The size of the memory in bytes.
+ *                      This size will be rounded up to the nearest page size.
+ *
+ * Returns 0 on success, or a negative errno code otherwise.
+ */
+int gpucg_transfer_charge(struct gpucg *source,
+			  struct gpucg *dest,
+			  struct gpucg_bucket *bucket,
+			  u64 size);
+
 /**
  * gpucg_register_bucket - Registers a bucket for memory accounting using the GPU cgroup controller.
  *
@@ -117,6 +133,14 @@ static inline void gpucg_uncharge(struct gpucg *gpucg,
 				  struct gpucg_bucket *bucket,
 				  u64 size) {}
 
+static inline int gpucg_transfer_charge(struct gpucg *source,
+					struct gpucg *dest,
+					struct gpucg_bucket *bucket,
+					u64 size)
+{
+	return 0;
+}
+
 static inline struct gpucg_bucket *gpucg_register_bucket(const char *name) { return NULL; }
 #endif /* CONFIG_CGROUP_GPU */
 #endif /* _CGROUP_GPU_H */
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8e7c55c830b3..438ad8577b76 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -18,6 +18,7 @@
 #include <linux/file.h>
 #include <linux/err.h>
 #include <linux/scatterlist.h>
+#include <linux/sched.h>
 #include <linux/list.h>
 #include <linux/dma-mapping.h>
 #include <linux/fs.h>
@@ -650,9 +651,14 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct iosys_map *map);
 void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
 				struct gpucg *gpucg,
 				struct gpucg_bucket *gpucg_bucket);
+
+int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *target);
 #else/* CONFIG_CGROUP_GPU */
 static inline void dma_buf_exp_info_set_gpucg(struct dma_buf_export_info *exp_info,
 					      struct gpucg *gpucg,
 					      struct gpucg_bucket *gpucg_bucket) {}
+
+static inline int dma_buf_transfer_charge(struct dma_buf *dmabuf, struct task_struct *target)
+{ return 0; }
 #endif /* CONFIG_CGROUP_GPU */
 #endif /* __DMA_BUF_H__ */
diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
index ad16ea15d427..038ea873a9d3 100644
--- a/kernel/cgroup/gpu.c
+++ b/kernel/cgroup/gpu.c
@@ -274,6 +274,57 @@ void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
 	css_put(&gpucg->css);
 }
 
+int gpucg_transfer_charge(struct gpucg *source,
+			  struct gpucg *dest,
+			  struct gpucg_bucket *bucket,
+			  u64 size)
+{
+	struct page_counter *counter;
+	u64 nr_pages;
+	struct gpucg_resource_pool *rp_source, *rp_dest;
+	int ret = 0;
+
+	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+	mutex_lock(&gpucg_mutex);
+	rp_source = cg_rpool_find_locked(source, bucket);
+	if (unlikely(!rp_source)) {
+		ret = -ENOENT;
+		goto exit_early;
+	}
+
+	rp_dest = cg_rpool_get_locked(dest, bucket);
+	if (IS_ERR(rp_dest)) {
+		ret = PTR_ERR(rp_dest);
+		goto exit_early;
+	}
+
+	/*
+	 * First uncharge from the pool it's currently charged to. This ordering avoids double
+	 * charging while the transfer is in progress, which could cause us to hit a limit.
+	 * If the try_charge fails for this transfer, we need to be able to reverse this uncharge,
+	 * so we continue to hold the gpucg_mutex here.
+	 */
+	page_counter_uncharge(&rp_source->total, nr_pages);
+	css_put(&source->css);
+
+	/* Now attempt the new charge */
+	if (page_counter_try_charge(&rp_dest->total, nr_pages, &counter)) {
+		css_get(&dest->css);
+	} else {
+		/*
+		 * The new charge failed, so reverse the uncharge from above. This should always
+		 * succeed since charges on source are blocked by gpucg_mutex.
+		 */
+		WARN_ON(!page_counter_try_charge(&rp_source->total, nr_pages, &counter));
+		css_get(&source->css);
+		ret = -ENOMEM;
+	}
+exit_early:
+	mutex_unlock(&gpucg_mutex);
+	return ret;
+}
+
 struct gpucg_bucket *gpucg_register_bucket(const char *name)
 {
 	struct gpucg_bucket *bucket, *b;
-- 
2.36.0.512.ge40c2bad7a-goog



* [PATCH v7 5/6] binder: Add flags to relinquish ownership of fds
  2022-05-10 23:56 ` T.J. Mercier
@ 2022-05-10 23:56   ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-10 23:56 UTC (permalink / raw)
  To: tjmercier, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König
  Cc: daniel, tj, jstultz, cmllamas, kaleshsingh, Kenny.Ho, mkoutny,
	skhan, kernel-team, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig

From: Hridya Valsaraju <hridya@google.com>

This patch introduces the flags BINDER_FD_FLAG_XFER_CHARGE and
BINDER_FDA_FLAG_XFER_CHARGE, which a process sending an individual fd or
fd array to another process over binder IPC can set to relinquish
ownership of the fds being sent for memory accounting purposes. If the
flag is found to be set during the fd or fd array translation and the
fd is for a DMA-BUF, the buffer is uncharged from the sender's cgroup
and charged to the receiving process's cgroup instead.

It is up to the sending process to ensure that it closes the fds
regardless of whether the transfer failed or succeeded.

Most graphics shared memory allocations in Android are done by the
graphics allocator HAL process. On requests from clients, the HAL process
allocates memory and sends the fds to the clients over binder IPC.
The graphics allocator HAL will not retain any references to the
buffers. When the HAL sets *_FLAG_XFER_CHARGE for fd arrays holding
DMA-BUF fds, or individual fd objects, the gpu cgroup controller will
be able to correctly charge the buffers to the client processes instead
of the graphics allocator HAL.

Since this is a new feature exposed to userspace, the kernel and userspace
must be compatible for the accounting to work for transfers. In all cases
the allocation and transport of DMA buffers via binder will succeed, but
the transfer accounting works only when both the kernel supports the
feature and userspace makes use of it. The possible scenarios are detailed
below:

1. new kernel + old userspace
The kernel supports the feature but userspace does not use it. The old
userspace won't mount the new cgroup controller, accounting is not
performed, charge is not transferred.

2. old kernel + new userspace
The new cgroup controller is not supported by the kernel, accounting is
not performed, charge is not transferred.

3. old kernel + old userspace
Same as #2

4. new kernel + new userspace
Cgroup is mounted, feature is supported and used.

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v6 changes
Rename BINDER_FD{A}_FLAG_SENDER_NO_NEED ->
BINDER_FD{A}_FLAG_XFER_CHARGE per Carlos Llamas.

Return error on transfer failure per Carlos Llamas.

v5 changes
Support both binder_fd_array_object and binder_fd_object. This is
necessary because new versions of Android will use binder_fd_object
instead of binder_fd_array_object, and we need to support both.

Use the new, simpler dma_buf_transfer_charge API.

v3 changes
Remove android from title per Todd Kjos.

Use more common dual author commit message format per John Stultz.

Include details on behavior for all combinations of kernel/userspace
versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.

v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian König.
---
 drivers/android/binder.c            | 31 +++++++++++++++++++++++++----
 drivers/dma-buf/dma-buf.c           |  4 ++--
 include/linux/dma-buf.h             |  2 +-
 include/uapi/linux/android/binder.h | 23 +++++++++++++++++----
 4 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 8351c5638880..1f39b24498f1 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -42,6 +42,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/dma-buf.h>
 #include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
@@ -2170,7 +2171,7 @@ static int binder_translate_handle(struct flat_binder_object *fp,
 	return ret;
 }
 
-static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
+static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
 			       struct binder_transaction *t,
 			       struct binder_thread *thread,
 			       struct binder_transaction *in_reply_to)
@@ -2208,6 +2209,26 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
 		goto err_security;
 	}
 
+	if (IS_ENABLED(CONFIG_CGROUP_GPU) && (flags & BINDER_FD_FLAG_XFER_CHARGE)) {
+		struct dma_buf *dmabuf;
+
+		if (!is_dma_buf_file(file)) {
+			binder_user_error(
+				"%d:%d got transaction with XFER_CHARGE for non-dmabuf fd, %d\n",
+				proc->pid, thread->pid, fd);
+			ret = -EINVAL;
+			goto err_dmabuf;
+		}
+
+		dmabuf = file->private_data;
+		ret = dma_buf_transfer_charge(dmabuf, target_proc->tsk);
+		if (ret) {
+			pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
+				proc->pid, thread->pid, target_proc->pid);
+			goto err_xfer;
+		}
+	}
+
 	/*
 	 * Add fixup record for this transaction. The allocation
 	 * of the fd in the target needs to be done from a
@@ -2226,6 +2247,8 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
 	return ret;
 
 err_alloc:
+err_xfer:
+err_dmabuf:
 err_security:
 	fput(file);
 err_fget:
@@ -2528,7 +2551,7 @@ static int binder_translate_fd_array(struct list_head *pf_head,
 
 		ret = copy_from_user(&fd, sender_ufda_base + sender_uoffset, sizeof(fd));
 		if (!ret)
-			ret = binder_translate_fd(fd, offset, t, thread,
+			ret = binder_translate_fd(fd, offset, fda->flags, t, thread,
 						  in_reply_to);
 		if (ret)
 			return ret > 0 ? -EINVAL : ret;
@@ -3179,8 +3202,8 @@ static void binder_transaction(struct binder_proc *proc,
 			struct binder_fd_object *fp = to_binder_fd_object(hdr);
 			binder_size_t fd_offset = object_offset +
 				(uintptr_t)&fp->fd - (uintptr_t)fp;
-			int ret = binder_translate_fd(fp->fd, fd_offset, t,
-						      thread, in_reply_to);
+			int ret = binder_translate_fd(fp->fd, fd_offset, fp->flags,
+						      t, thread, in_reply_to);
 
 			fp->pad_binder = 0;
 			if (ret < 0 ||
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f3fb844925e2..36ed6cd4ddcc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -31,7 +31,6 @@
 
 #include "dma-buf-sysfs-stats.h"
 
-static inline int is_dma_buf_file(struct file *);
 
 struct dma_buf_list {
 	struct list_head head;
@@ -400,10 +399,11 @@ static const struct file_operations dma_buf_fops = {
 /*
  * is_dma_buf_file - Check if struct file* is associated with dma_buf
  */
-static inline int is_dma_buf_file(struct file *file)
+int is_dma_buf_file(struct file *file)
 {
 	return file->f_op == &dma_buf_fops;
 }
+EXPORT_SYMBOL_NS_GPL(is_dma_buf_file, DMA_BUF);
 
 static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
 {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 438ad8577b76..2b9812758fee 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -614,7 +614,7 @@ dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
 {
 	return !!attach->importer_ops;
 }
-
+int is_dma_buf_file(struct file *file);
 struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 					  struct device *dev);
 struct dma_buf_attachment *
diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
index 11157fae8a8e..d17e791b38ab 100644
--- a/include/uapi/linux/android/binder.h
+++ b/include/uapi/linux/android/binder.h
@@ -91,14 +91,14 @@ struct flat_binder_object {
 /**
  * struct binder_fd_object - describes a filedescriptor to be fixed up.
  * @hdr:	common header structure
- * @pad_flags:	padding to remain compatible with old userspace code
+ * @flags:	One or more BINDER_FD_FLAG_* flags
  * @pad_binder:	padding to remain compatible with old userspace code
  * @fd:		file descriptor
  * @cookie:	opaque data, used by user-space
  */
 struct binder_fd_object {
 	struct binder_object_header	hdr;
-	__u32				pad_flags;
+	__u32				flags;
 	union {
 		binder_uintptr_t	pad_binder;
 		__u32			fd;
@@ -107,6 +107,17 @@ struct binder_fd_object {
 	binder_uintptr_t		cookie;
 };
 
+enum {
+	/**
+	 * @BINDER_FD_FLAG_XFER_CHARGE
+	 *
+	 * When set, the sender of a binder_fd_object wishes to relinquish ownership of the fd for
+	 * memory accounting purposes. If the fd is for a DMA-BUF, the buffer is uncharged from the
+	 * sender's cgroup and charged to the receiving process's cgroup instead.
+	 */
+	BINDER_FD_FLAG_XFER_CHARGE = 0x2000,
+};
+
 /* struct binder_buffer_object - object describing a userspace buffer
  * @hdr:		common header structure
  * @flags:		one or more BINDER_BUFFER_* flags
@@ -141,7 +152,7 @@ enum {
 
 /* struct binder_fd_array_object - object describing an array of fds in a buffer
  * @hdr:		common header structure
- * @pad:		padding to ensure correct alignment
+ * @flags:		One or more BINDER_FDA_FLAG_* flags
  * @num_fds:		number of file descriptors in the buffer
  * @parent:		index in offset array to buffer holding the fd array
  * @parent_offset:	start offset of fd array in the buffer
@@ -162,12 +173,16 @@ enum {
  */
 struct binder_fd_array_object {
 	struct binder_object_header	hdr;
-	__u32				pad;
+	__u32				flags;
 	binder_size_t			num_fds;
 	binder_size_t			parent;
 	binder_size_t			parent_offset;
 };
 
+enum {
+	BINDER_FDA_FLAG_XFER_CHARGE = BINDER_FD_FLAG_XFER_CHARGE,
+};
+
 /*
  * On 64-bit platforms where user code may run in 32-bits the driver must
  * translate the buffer (and local binder) addresses appropriately.
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v7 5/6] binder: Add flags to relinquish ownership of fds
@ 2022-05-10 23:56   ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-10 23:56 UTC (permalink / raw)
  To: tjmercier, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König
  Cc: Kenny.Ho, skhan, cmllamas, dri-devel, linux-kernel,
	linaro-mm-sig, jstultz, kaleshsingh, tj, mkoutny, kernel-team,
	linux-media

From: Hridya Valsaraju <hridya@google.com>

This patch introduces the flags BINDER_FD_FLAG_XFER_CHARGE and
BINDER_FDA_FLAG_XFER_CHARGE that a process sending an individual fd or
an fd array to another process over binder IPC can set to relinquish
ownership of the fds being sent for memory accounting purposes. If the
flag is found to be set during the fd or fd array translation and the
fd is for a DMA-BUF, the buffer is uncharged from the sender's cgroup
and charged to the receiving process's cgroup instead.

It is up to the sending process to ensure that it closes the fds
regardless of whether the transfer failed or succeeded.

Most graphics shared memory allocations in Android are done by the
graphics allocator HAL process. On requests from clients, the HAL process
allocates memory and sends the fds to the clients over binder IPC.
The graphics allocator HAL will not retain any references to the
buffers. When the HAL sets *_FLAG_XFER_CHARGE on fd arrays holding
DMA-BUF fds, or on individual fd objects, the GPU cgroup controller will
be able to correctly charge the buffers to the client processes instead
of the graphics allocator HAL.

Since this is a new feature exposed to userspace, the kernel and userspace
must be compatible for the transfer accounting to work. In all cases the
allocation and transport of DMA buffers via binder will succeed, but the
charge is transferred only when the kernel supports this feature and
userspace makes use of it. The possible scenarios are detailed below:

1. new kernel + old userspace
The kernel supports the feature but userspace does not use it. The old
userspace won't mount the new cgroup controller, accounting is not
performed, charge is not transferred.

2. old kernel + new userspace
The new cgroup controller is not supported by the kernel, accounting is
not performed, charge is not transferred.

3. old kernel + old userspace
Same as #2

4. new kernel + new userspace
Cgroup is mounted, feature is supported and used.

Signed-off-by: Hridya Valsaraju <hridya@google.com>
Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v6 changes
Rename BINDER_FD{A}_FLAG_SENDER_NO_NEED ->
BINDER_FD{A}_FLAG_XFER_CHARGE per Carlos Llamas.

Return error on transfer failure per Carlos Llamas.

v5 changes
Support both binder_fd_array_object and binder_fd_object. This is
necessary because new versions of Android will use binder_fd_object
instead of binder_fd_array_object, and we need to support both.

Use the new, simpler dma_buf_transfer_charge API.

v3 changes
Remove android from title per Todd Kjos.

Use more common dual author commit message format per John Stultz.

Include details on behavior for all combinations of kernel/userspace
versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.

v2 changes
Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
heap to a single dma-buf function for all heaps per Daniel Vetter and
Christian König.
---
 drivers/android/binder.c            | 31 +++++++++++++++++++++++++----
 drivers/dma-buf/dma-buf.c           |  4 ++--
 include/linux/dma-buf.h             |  2 +-
 include/uapi/linux/android/binder.h | 23 +++++++++++++++++----
 4 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 8351c5638880..1f39b24498f1 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -42,6 +42,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/dma-buf.h>
 #include <linux/fdtable.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
@@ -2170,7 +2171,7 @@ static int binder_translate_handle(struct flat_binder_object *fp,
 	return ret;
 }
 
-static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
+static int binder_translate_fd(u32 fd, binder_size_t fd_offset, __u32 flags,
 			       struct binder_transaction *t,
 			       struct binder_thread *thread,
 			       struct binder_transaction *in_reply_to)
@@ -2208,6 +2209,26 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
 		goto err_security;
 	}
 
+	if (IS_ENABLED(CONFIG_CGROUP_GPU) && (flags & BINDER_FD_FLAG_XFER_CHARGE)) {
+		struct dma_buf *dmabuf;
+
+		if (!is_dma_buf_file(file)) {
+			binder_user_error(
+				"%d:%d got transaction with XFER_CHARGE for non-dmabuf fd, %d\n",
+				proc->pid, thread->pid, fd);
+			ret = -EINVAL;
+			goto err_dmabuf;
+		}
+
+		dmabuf = file->private_data;
+		ret = dma_buf_transfer_charge(dmabuf, target_proc->tsk);
+		if (ret) {
+			pr_warn("%d:%d Unable to transfer DMA-BUF fd charge to %d\n",
+				proc->pid, thread->pid, target_proc->pid);
+			goto err_xfer;
+		}
+	}
+
 	/*
 	 * Add fixup record for this transaction. The allocation
 	 * of the fd in the target needs to be done from a
@@ -2226,6 +2247,8 @@ static int binder_translate_fd(u32 fd, binder_size_t fd_offset,
 	return ret;
 
 err_alloc:
+err_xfer:
+err_dmabuf:
 err_security:
 	fput(file);
 err_fget:
@@ -2528,7 +2551,7 @@ static int binder_translate_fd_array(struct list_head *pf_head,
 
 		ret = copy_from_user(&fd, sender_ufda_base + sender_uoffset, sizeof(fd));
 		if (!ret)
-			ret = binder_translate_fd(fd, offset, t, thread,
+			ret = binder_translate_fd(fd, offset, fda->flags, t, thread,
 						  in_reply_to);
 		if (ret)
 			return ret > 0 ? -EINVAL : ret;
@@ -3179,8 +3202,8 @@ static void binder_transaction(struct binder_proc *proc,
 			struct binder_fd_object *fp = to_binder_fd_object(hdr);
 			binder_size_t fd_offset = object_offset +
 				(uintptr_t)&fp->fd - (uintptr_t)fp;
-			int ret = binder_translate_fd(fp->fd, fd_offset, t,
-						      thread, in_reply_to);
+			int ret = binder_translate_fd(fp->fd, fd_offset, fp->flags,
+						      t, thread, in_reply_to);
 
 			fp->pad_binder = 0;
 			if (ret < 0 ||
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index f3fb844925e2..36ed6cd4ddcc 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -31,7 +31,6 @@
 
 #include "dma-buf-sysfs-stats.h"
 
-static inline int is_dma_buf_file(struct file *);
 
 struct dma_buf_list {
 	struct list_head head;
@@ -400,10 +399,11 @@ static const struct file_operations dma_buf_fops = {
 /*
  * is_dma_buf_file - Check if struct file* is associated with dma_buf
  */
-static inline int is_dma_buf_file(struct file *file)
+int is_dma_buf_file(struct file *file)
 {
 	return file->f_op == &dma_buf_fops;
 }
+EXPORT_SYMBOL_NS_GPL(is_dma_buf_file, DMA_BUF);
 
 static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
 {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 438ad8577b76..2b9812758fee 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -614,7 +614,7 @@ dma_buf_attachment_is_dynamic(struct dma_buf_attachment *attach)
 {
 	return !!attach->importer_ops;
 }
-
+int is_dma_buf_file(struct file *file);
 struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
 					  struct device *dev);
 struct dma_buf_attachment *
diff --git a/include/uapi/linux/android/binder.h b/include/uapi/linux/android/binder.h
index 11157fae8a8e..d17e791b38ab 100644
--- a/include/uapi/linux/android/binder.h
+++ b/include/uapi/linux/android/binder.h
@@ -91,14 +91,14 @@ struct flat_binder_object {
 /**
  * struct binder_fd_object - describes a filedescriptor to be fixed up.
  * @hdr:	common header structure
- * @pad_flags:	padding to remain compatible with old userspace code
+ * @flags:	One or more BINDER_FD_FLAG_* flags
  * @pad_binder:	padding to remain compatible with old userspace code
  * @fd:		file descriptor
  * @cookie:	opaque data, used by user-space
  */
 struct binder_fd_object {
 	struct binder_object_header	hdr;
-	__u32				pad_flags;
+	__u32				flags;
 	union {
 		binder_uintptr_t	pad_binder;
 		__u32			fd;
@@ -107,6 +107,17 @@ struct binder_fd_object {
 	binder_uintptr_t		cookie;
 };
 
+enum {
+	/**
+	 * @BINDER_FD_FLAG_XFER_CHARGE
+	 *
+	 * When set, the sender of a binder_fd_object wishes to relinquish ownership of the fd for
+	 * memory accounting purposes. If the fd is for a DMA-BUF, the buffer is uncharged from the
+	 * sender's cgroup and charged to the receiving process's cgroup instead.
+	 */
+	BINDER_FD_FLAG_XFER_CHARGE = 0x2000,
+};
+
 /* struct binder_buffer_object - object describing a userspace buffer
  * @hdr:		common header structure
  * @flags:		one or more BINDER_BUFFER_* flags
@@ -141,7 +152,7 @@ enum {
 
 /* struct binder_fd_array_object - object describing an array of fds in a buffer
  * @hdr:		common header structure
- * @pad:		padding to ensure correct alignment
+ * @flags:		One or more BINDER_FDA_FLAG_* flags
  * @num_fds:		number of file descriptors in the buffer
  * @parent:		index in offset array to buffer holding the fd array
  * @parent_offset:	start offset of fd array in the buffer
@@ -162,12 +173,16 @@ enum {
  */
 struct binder_fd_array_object {
 	struct binder_object_header	hdr;
-	__u32				pad;
+	__u32				flags;
 	binder_size_t			num_fds;
 	binder_size_t			parent;
 	binder_size_t			parent_offset;
 };
 
+enum {
+	BINDER_FDA_FLAG_XFER_CHARGE = BINDER_FD_FLAG_XFER_CHARGE,
+};
+
 /*
  * On 64-bit platforms where user code may run in 32-bits the driver must
  * translate the buffer (and local binder) addresses appropriately.
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v7 6/6] selftests: Add binder cgroup gpu memory transfer tests
  2022-05-10 23:56 ` T.J. Mercier
                   ` (6 preceding siblings ...)
  (?)
@ 2022-05-10 23:56 ` T.J. Mercier
  2022-05-21 10:15   ` Muhammad Usama Anjum
  -1 siblings, 1 reply; 67+ messages in thread
From: T.J. Mercier @ 2022-05-10 23:56 UTC (permalink / raw)
  To: tjmercier, Shuah Khan
  Cc: daniel, tj, hridya, christian.koenig, jstultz, tkjos, cmllamas,
	surenb, kaleshsingh, Kenny.Ho, mkoutny, skhan, kernel-team,
	linux-kernel, linux-kselftest

These tests verify that the cgroup GPU memory charge is transferred
correctly when a dmabuf is passed between processes in two different
cgroups and the sender specifies BINDER_FD_FLAG_XFER_CHARGE or
BINDER_FDA_FLAG_XFER_CHARGE in the binder transaction data
containing the dmabuf file descriptor.

Signed-off-by: T.J. Mercier <tjmercier@google.com>

---
v6 changes
Rename BINDER_FD{A}_FLAG_SENDER_NO_NEED ->
BINDER_FD{A}_FLAG_XFER_CHARGE per Carlos Llamas.

v5 changes
Tests for both binder_fd_array_object and binder_fd_object.

Return error code instead of struct binder{fs}_ctx.

Use ifdef __ANDROID__ to choose platform-dependent temp path instead of a
runtime fallback.

Ensure binderfs_mntpt ends with a trailing '/' character instead of
prepending it where used.

v4 changes
Skip test if not run as root per Shuah Khan.

Add better logging for abnormal child termination per Shuah Khan.
---
 .../selftests/drivers/android/binder/Makefile |   8 +
 .../drivers/android/binder/binder_util.c      | 250 +++++++++
 .../drivers/android/binder/binder_util.h      |  32 ++
 .../selftests/drivers/android/binder/config   |   4 +
 .../binder/test_dmabuf_cgroup_transfer.c      | 526 ++++++++++++++++++
 5 files changed, 820 insertions(+)
 create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
 create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
 create mode 100644 tools/testing/selftests/drivers/android/binder/config
 create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c

diff --git a/tools/testing/selftests/drivers/android/binder/Makefile b/tools/testing/selftests/drivers/android/binder/Makefile
new file mode 100644
index 000000000000..726439d10675
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -Wall
+
+TEST_GEN_PROGS = test_dmabuf_cgroup_transfer
+
+include ../../../lib.mk
+
+$(OUTPUT)/test_dmabuf_cgroup_transfer: ../../../cgroup/cgroup_util.c binder_util.c
diff --git a/tools/testing/selftests/drivers/android/binder/binder_util.c b/tools/testing/selftests/drivers/android/binder/binder_util.c
new file mode 100644
index 000000000000..cdd97cb0bb60
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/binder_util.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "binder_util.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/mount.h>
+
+#include <linux/limits.h>
+#include <linux/android/binder.h>
+#include <linux/android/binderfs.h>
+
+static const size_t BINDER_MMAP_SIZE = 64 * 1024;
+
+static void binderfs_unmount(const char *mountpoint)
+{
+	if (umount2(mountpoint, MNT_DETACH))
+		fprintf(stderr, "Failed to unmount binderfs at %s: %s\n",
+			mountpoint, strerror(errno));
+	else
+		fprintf(stderr, "Binderfs unmounted: %s\n", mountpoint);
+
+	if (rmdir(mountpoint))
+		fprintf(stderr, "Failed to remove binderfs mount %s: %s\n",
+			mountpoint, strerror(errno));
+	else
+		fprintf(stderr, "Binderfs mountpoint destroyed: %s\n", mountpoint);
+}
+
+int create_binderfs(struct binderfs_ctx *ctx, const char *name)
+{
+	int fd, ret, saved_errno;
+	struct binderfs_device device = { 0 };
+
+	/*
+	 * P_tmpdir is set to "/tmp/" on Android platforms where Binder is most commonly used, but
+	 * this path does not actually exist on Android. For Android we'll try using
+	 * "/data/local/tmp" and P_tmpdir for non-Android platforms.
+	 *
+	 * This mount point should have a trailing '/' character, but mkdtemp requires that the last
+	 * six characters (before the first null terminator) must be "XXXXXX". Manually append an
+	 * additional null character in the string literal to allocate a character array of the
+	 * correct final size, which we will replace with a '/' after successful completion of the
+	 * mkdtemp call.
+	 */
+#ifdef __ANDROID__
+	char binderfs_mntpt[] = "/data/local/tmp/binderfs_XXXXXX\0";
+#else
+	/* P_tmpdir may or may not contain a trailing '/' separator. We always append one here. */
+	char binderfs_mntpt[] = P_tmpdir "/binderfs_XXXXXX\0";
+#endif
+	static const char BINDER_CONTROL_NAME[] = "binder-control";
+	char device_path[strlen(binderfs_mntpt) + 1 + strlen(BINDER_CONTROL_NAME) + 1];
+
+	if (mkdtemp(binderfs_mntpt) == NULL) {
+		fprintf(stderr, "Failed to create binderfs mountpoint at %s: %s.\n",
+			binderfs_mntpt, strerror(errno));
+		return -1;
+	}
+	binderfs_mntpt[strlen(binderfs_mntpt)] = '/';
+	fprintf(stderr, "Binderfs mountpoint created at %s\n", binderfs_mntpt);
+
+	if (mount(NULL, binderfs_mntpt, "binder", 0, 0)) {
+		perror("Could not mount binderfs");
+		rmdir(binderfs_mntpt);
+		return -1;
+	}
+	fprintf(stderr, "Binderfs mounted at %s\n", binderfs_mntpt);
+
+	strncpy(device.name, name, sizeof(device.name));
+	snprintf(device_path, sizeof(device_path), "%s%s", binderfs_mntpt, BINDER_CONTROL_NAME);
+	fd = open(device_path, O_RDONLY | O_CLOEXEC);
+	if (fd < 0) {
+		fprintf(stderr, "Failed to open %s device\n", BINDER_CONTROL_NAME);
+		binderfs_unmount(binderfs_mntpt);
+		return -1;
+	}
+
+	ret = ioctl(fd, BINDER_CTL_ADD, &device);
+	saved_errno = errno;
+	close(fd);
+	errno = saved_errno;
+	if (ret) {
+		perror("Failed to allocate new binder device");
+		binderfs_unmount(binderfs_mntpt);
+		return -1;
+	}
+
+	fprintf(stderr, "Allocated new binder device with major %d, minor %d, and name %s at %s\n",
+		device.major, device.minor, device.name, binderfs_mntpt);
+
+	ctx->name = strdup(name);
+	ctx->mountpoint = strdup(binderfs_mntpt);
+
+	return 0;
+}
+
+void destroy_binderfs(struct binderfs_ctx *ctx)
+{
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), "%s%s", ctx->mountpoint, ctx->name);
+
+	if (unlink(path))
+		fprintf(stderr, "Failed to unlink binder device %s: %s\n", path, strerror(errno));
+	else
+		fprintf(stderr, "Destroyed binder %s at %s\n", ctx->name, ctx->mountpoint);
+
+	binderfs_unmount(ctx->mountpoint);
+
+	free(ctx->name);
+	free(ctx->mountpoint);
+}
+
+int open_binder(const struct binderfs_ctx *bfs_ctx, struct binder_ctx *ctx)
+{
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), "%s%s", bfs_ctx->mountpoint, bfs_ctx->name);
+	ctx->fd = open(path, O_RDWR | O_NONBLOCK | O_CLOEXEC);
+	if (ctx->fd < 0) {
+		fprintf(stderr, "Error opening binder device %s: %s\n", path, strerror(errno));
+		return -1;
+	}
+
+	ctx->memory = mmap(NULL, BINDER_MMAP_SIZE, PROT_READ, MAP_SHARED, ctx->fd, 0);
+	if (ctx->memory == MAP_FAILED) {
+		perror("Error mapping binder memory");
+		close(ctx->fd);
+		ctx->fd = -1;
+		return -1;
+	}
+
+	return 0;
+}
+
+void close_binder(struct binder_ctx *ctx)
+{
+	if (munmap(ctx->memory, BINDER_MMAP_SIZE))
+		perror("Failed to unmap binder memory");
+	ctx->memory = NULL;
+
+	if (close(ctx->fd))
+		perror("Failed to close binder");
+	ctx->fd = -1;
+}
+
+int become_binder_context_manager(int binder_fd)
+{
+	return ioctl(binder_fd, BINDER_SET_CONTEXT_MGR, 0);
+}
+
+int do_binder_write_read(int binder_fd, void *writebuf, binder_size_t writesize,
+			 void *readbuf, binder_size_t readsize)
+{
+	int err;
+	struct binder_write_read bwr = {
+		.write_buffer = (binder_uintptr_t)writebuf,
+		.write_size = writesize,
+		.read_buffer = (binder_uintptr_t)readbuf,
+		.read_size = readsize
+	};
+
+	do {
+		if (ioctl(binder_fd, BINDER_WRITE_READ, &bwr) >= 0)
+			err = 0;
+		else
+			err = -errno;
+	} while (err == -EINTR);
+
+	if (err < 0) {
+		perror("BINDER_WRITE_READ");
+		return -1;
+	}
+
+	if (bwr.write_consumed < writesize) {
+		fprintf(stderr, "Binder did not consume full write buffer %llu %llu\n",
+			bwr.write_consumed, writesize);
+		return -1;
+	}
+
+	return bwr.read_consumed;
+}
+
+static const char *reply_string(int cmd)
+{
+	switch (cmd) {
+	case BR_ERROR:
+		return "BR_ERROR";
+	case BR_OK:
+		return "BR_OK";
+	case BR_TRANSACTION_SEC_CTX:
+		return "BR_TRANSACTION_SEC_CTX";
+	case BR_TRANSACTION:
+		return "BR_TRANSACTION";
+	case BR_REPLY:
+		return "BR_REPLY";
+	case BR_ACQUIRE_RESULT:
+		return "BR_ACQUIRE_RESULT";
+	case BR_DEAD_REPLY:
+		return "BR_DEAD_REPLY";
+	case BR_TRANSACTION_COMPLETE:
+		return "BR_TRANSACTION_COMPLETE";
+	case BR_INCREFS:
+		return "BR_INCREFS";
+	case BR_ACQUIRE:
+		return "BR_ACQUIRE";
+	case BR_RELEASE:
+		return "BR_RELEASE";
+	case BR_DECREFS:
+		return "BR_DECREFS";
+	case BR_ATTEMPT_ACQUIRE:
+		return "BR_ATTEMPT_ACQUIRE";
+	case BR_NOOP:
+		return "BR_NOOP";
+	case BR_SPAWN_LOOPER:
+		return "BR_SPAWN_LOOPER";
+	case BR_FINISHED:
+		return "BR_FINISHED";
+	case BR_DEAD_BINDER:
+		return "BR_DEAD_BINDER";
+	case BR_CLEAR_DEATH_NOTIFICATION_DONE:
+		return "BR_CLEAR_DEATH_NOTIFICATION_DONE";
+	case BR_FAILED_REPLY:
+		return "BR_FAILED_REPLY";
+	case BR_FROZEN_REPLY:
+		return "BR_FROZEN_REPLY";
+	case BR_ONEWAY_SPAM_SUSPECT:
+		return "BR_ONEWAY_SPAM_SUSPECT";
+	default:
+		return "Unknown";
+	};
+}
+
+int expect_binder_reply(int32_t actual, int32_t expected)
+{
+	if (actual != expected) {
+		fprintf(stderr, "Expected %s but received %s\n",
+			reply_string(expected), reply_string(actual));
+		return -1;
+	}
+	return 0;
+}
+
diff --git a/tools/testing/selftests/drivers/android/binder/binder_util.h b/tools/testing/selftests/drivers/android/binder/binder_util.h
new file mode 100644
index 000000000000..adc2b20e8d0a
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/binder_util.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef SELFTEST_BINDER_UTIL_H
+#define SELFTEST_BINDER_UTIL_H
+
+#include <stdint.h>
+
+#include <linux/android/binder.h>
+
+struct binderfs_ctx {
+	char *name;
+	char *mountpoint;
+};
+
+struct binder_ctx {
+	int fd;
+	void *memory;
+};
+
+int create_binderfs(struct binderfs_ctx *ctx, const char *name);
+void destroy_binderfs(struct binderfs_ctx *ctx);
+
+int open_binder(const struct binderfs_ctx *bfs_ctx, struct binder_ctx *ctx);
+void close_binder(struct binder_ctx *ctx);
+
+int become_binder_context_manager(int binder_fd);
+
+int do_binder_write_read(int binder_fd, void *writebuf, binder_size_t writesize,
+			 void *readbuf, binder_size_t readsize);
+
+int expect_binder_reply(int32_t actual, int32_t expected);
+#endif
diff --git a/tools/testing/selftests/drivers/android/binder/config b/tools/testing/selftests/drivers/android/binder/config
new file mode 100644
index 000000000000..fcc5f8f693b3
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/config
@@ -0,0 +1,4 @@
+CONFIG_CGROUP_GPU=y
+CONFIG_ANDROID=y
+CONFIG_ANDROID_BINDERFS=y
+CONFIG_ANDROID_BINDER_IPC=y
diff --git a/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c b/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
new file mode 100644
index 000000000000..4d468c1dc4e3
--- /dev/null
+++ b/tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
@@ -0,0 +1,526 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * These tests verify that the cgroup GPU memory charge is transferred correctly when a dmabuf is
+ * passed between processes in two different cgroups and the sender specifies
+ * BINDER_FD_FLAG_XFER_CHARGE or BINDER_FDA_FLAG_XFER_CHARGE in the binder transaction data
+ * containing the dmabuf file descriptor.
+ *
+ * The parent test process becomes the binder context manager, then forks a child who initiates a
+ * transaction with the context manager by specifying a target of 0. The context manager reply
+ * contains a dmabuf file descriptor (or an array of one file descriptor) which was allocated by the
+ * parent, but should be charged to the child cgroup after the binder transaction.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/epoll.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#include "binder_util.h"
+#include "../../../cgroup/cgroup_util.h"
+#include "../../../kselftest.h"
+#include "../../../kselftest_harness.h"
+
+#include <linux/limits.h>
+#include <linux/dma-heap.h>
+#include <linux/android/binder.h>
+
+#define UNUSED(x) ((void)(x))
+
+static const unsigned int BINDER_CODE = 8675309; /* Any number will work here */
+
+struct cgroup_ctx {
+	char *root;
+	char *source;
+	char *dest;
+};
+
+void destroy_cgroups(struct __test_metadata *_metadata, struct cgroup_ctx *ctx)
+{
+	if (ctx->source != NULL) {
+		TH_LOG("Destroying cgroup: %s", ctx->source);
+		rmdir(ctx->source);
+		free(ctx->source);
+	}
+
+	if (ctx->dest != NULL) {
+		TH_LOG("Destroying cgroup: %s", ctx->dest);
+		rmdir(ctx->dest);
+		free(ctx->dest);
+	}
+
+	free(ctx->root);
+	ctx->root = ctx->source = ctx->dest = NULL;
+}
+
+struct cgroup_ctx create_cgroups(struct __test_metadata *_metadata)
+{
+	struct cgroup_ctx ctx = {0};
+	char root[PATH_MAX], *tmp;
+	static const char template[] = "/gpucg_XXXXXX";
+
+	if (cg_find_unified_root(root, sizeof(root))) {
+		TH_LOG("Could not find cgroups root");
+		return ctx;
+	}
+
+	if (cg_read_strstr(root, "cgroup.controllers", "gpu")) {
+		TH_LOG("Could not find GPU controller");
+		return ctx;
+	}
+
+	if (cg_write(root, "cgroup.subtree_control", "+gpu")) {
+		TH_LOG("Could not enable GPU controller");
+		return ctx;
+	}
+
+	ctx.root = strdup(root);
+
+	snprintf(root, sizeof(root), "%s/%s", ctx.root, template);
+	tmp = mkdtemp(root);
+	if (tmp == NULL) {
+		TH_LOG("%s - Could not create source cgroup", strerror(errno));
+		destroy_cgroups(_metadata, &ctx);
+		return ctx;
+	}
+	ctx.source = strdup(tmp);
+
+	snprintf(root, sizeof(root), "%s/%s", ctx.root, template);
+	tmp = mkdtemp(root);
+	if (tmp == NULL) {
+		TH_LOG("%s - Could not create destination cgroup", strerror(errno));
+		destroy_cgroups(_metadata, &ctx);
+		return ctx;
+	}
+	ctx.dest = strdup(tmp);
+
+	TH_LOG("Created cgroups: %s %s", ctx.source, ctx.dest);
+
+	return ctx;
+}
+
+int dmabuf_heap_alloc(int fd, size_t len, int *dmabuf_fd)
+{
+	struct dma_heap_allocation_data data = {
+		.len = len,
+		.fd = 0,
+		.fd_flags = O_RDONLY | O_CLOEXEC,
+		.heap_flags = 0,
+	};
+	int ret;
+
+	if (!dmabuf_fd)
+		return -EINVAL;
+
+	ret = ioctl(fd, DMA_HEAP_IOCTL_ALLOC, &data);
+	if (ret < 0)
+		return ret;
+	*dmabuf_fd = (int)data.fd;
+	return ret;
+}
+
+/* The system heap is known to export dmabufs with support for cgroup tracking */
+int alloc_dmabuf_from_system_heap(struct __test_metadata *_metadata, size_t bytes)
+{
+	int heap_fd = -1, dmabuf_fd = -1;
+	static const char * const heap_path = "/dev/dma_heap/system";
+
+	heap_fd = open(heap_path, O_RDONLY);
+	if (heap_fd < 0) {
+		TH_LOG("%s - open %s failed!\n", strerror(errno), heap_path);
+		return -1;
+	}
+
+	if (dmabuf_heap_alloc(heap_fd, bytes, &dmabuf_fd))
+		TH_LOG("dmabuf allocation failed! - %s", strerror(errno));
+	close(heap_fd);
+
+	return dmabuf_fd;
+}
+
+int binder_request_dmabuf(int binder_fd)
+{
+	int ret;
+
+	/*
+	 * We just send an empty binder_buffer_object to initiate a transaction
+	 * with the context manager, who should respond with a single dmabuf
+	 * inside a binder_fd_array_object or a binder_fd_object.
+	 */
+
+	struct binder_buffer_object bbo = {
+		.hdr.type = BINDER_TYPE_PTR,
+		.flags = 0,
+		.buffer = 0,
+		.length = 0,
+		.parent = 0, /* No parent */
+		.parent_offset = 0 /* No parent */
+	};
+
+	binder_size_t offsets[] = {0};
+
+	struct {
+		int32_t cmd;
+		struct binder_transaction_data btd;
+	} __attribute__((packed)) bc = {
+		.cmd = BC_TRANSACTION,
+		.btd = {
+			.target = { 0 },
+			.cookie = 0,
+			.code = BINDER_CODE,
+			.flags = TF_ACCEPT_FDS, /* We expect a FD/FDA in the reply */
+			.data_size = sizeof(bbo),
+			.offsets_size = sizeof(offsets),
+			.data.ptr = {
+				(binder_uintptr_t)&bbo,
+				(binder_uintptr_t)offsets
+			}
+		},
+	};
+
+	struct {
+		int32_t reply_noop;
+	} __attribute__((packed)) br;
+
+	ret = do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
+	if (ret >= (int)sizeof(br) && expect_binder_reply(br.reply_noop, BR_NOOP)) {
+		return -1;
+	} else if (ret < (int)sizeof(br)) {
+		fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
+		return -1;
+	}
+	return 0;
+}
+
+int send_dmabuf_reply_fda(int binder_fd, struct binder_transaction_data *tr, int dmabuf_fd)
+{
+	int ret;
+	/*
+	 * The trailing 0 is to achieve the necessary alignment for the binder
+	 * buffer_size.
+	 */
+	int fdarray[] = { dmabuf_fd, 0 };
+
+	struct binder_buffer_object bbo = {
+		.hdr.type = BINDER_TYPE_PTR,
+		.flags = 0,
+		.buffer = (binder_uintptr_t)fdarray,
+		.length = sizeof(fdarray),
+		.parent = 0, /* No parent */
+		.parent_offset = 0 /* No parent */
+	};
+
+	struct binder_fd_array_object bfdao = {
+		.hdr.type = BINDER_TYPE_FDA,
+		.flags = BINDER_FDA_FLAG_XFER_CHARGE,
+		.num_fds = 1,
+		.parent = 0, /* The binder_buffer_object */
+		.parent_offset = 0 /* FDs follow immediately */
+	};
+
+	uint64_t sz = sizeof(fdarray);
+	uint8_t data[sizeof(sz) + sizeof(bbo) + sizeof(bfdao)];
+	binder_size_t offsets[] = {sizeof(sz), sizeof(sz)+sizeof(bbo)};
+
+	memcpy(data,                            &sz, sizeof(sz));
+	memcpy(data + sizeof(sz),               &bbo, sizeof(bbo));
+	memcpy(data + sizeof(sz) + sizeof(bbo), &bfdao, sizeof(bfdao));
+
+	struct {
+		int32_t cmd;
+		struct binder_transaction_data_sg btd;
+	} __attribute__((packed)) bc = {
+		.cmd = BC_REPLY_SG,
+		.btd.transaction_data = {
+			.target = { tr->target.handle },
+			.cookie = tr->cookie,
+			.code = BINDER_CODE,
+			.flags = 0,
+			.data_size = sizeof(data),
+			.offsets_size = sizeof(offsets),
+			.data.ptr = {
+				(binder_uintptr_t)data,
+				(binder_uintptr_t)offsets
+			}
+		},
+		.btd.buffers_size = sizeof(fdarray)
+	};
+
+	struct {
+		int32_t reply_noop;
+	} __attribute__((packed)) br;
+
+	ret = do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
+	if (ret >= (int)sizeof(br) && expect_binder_reply(br.reply_noop, BR_NOOP)) {
+		return -1;
+	} else if (ret < (int)sizeof(br)) {
+		fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
+		return -1;
+	}
+	return 0;
+}
+
+int send_dmabuf_reply_fd(int binder_fd, struct binder_transaction_data *tr, int dmabuf_fd)
+{
+	int ret;
+
+	struct binder_fd_object bfdo = {
+		.hdr.type = BINDER_TYPE_FD,
+		.flags = BINDER_FD_FLAG_XFER_CHARGE,
+		.fd = dmabuf_fd
+	};
+
+	binder_size_t offset = 0;
+
+	struct {
+		int32_t cmd;
+		struct binder_transaction_data btd;
+	} __attribute__((packed)) bc = {
+		.cmd = BC_REPLY,
+		.btd = {
+			.target = { tr->target.handle },
+			.cookie = tr->cookie,
+			.code = BINDER_CODE,
+			.flags = 0,
+			.data_size = sizeof(bfdo),
+			.offsets_size = sizeof(offset),
+			.data.ptr = {
+				(binder_uintptr_t)&bfdo,
+				(binder_uintptr_t)&offset
+			}
+		}
+	};
+
+	struct {
+		int32_t reply_noop;
+	} __attribute__((packed)) br;
+
+	ret = do_binder_write_read(binder_fd, &bc, sizeof(bc), &br, sizeof(br));
+	if (ret >= (int)sizeof(br) && expect_binder_reply(br.reply_noop, BR_NOOP)) {
+		return -1;
+	} else if (ret < (int)sizeof(br)) {
+		fprintf(stderr, "Not enough bytes in binder reply %d\n", ret);
+		return -1;
+	}
+	return 0;
+}
+
+struct binder_transaction_data *binder_wait_for_transaction(int binder_fd,
+							    uint32_t *readbuf,
+							    size_t readsize)
+{
+	static const int MAX_EVENTS = 1, EPOLL_WAIT_TIME_MS = 3 * 1000;
+	struct binder_reply {
+		int32_t reply0;
+		int32_t reply1;
+		struct binder_transaction_data btd;
+	} *br;
+	struct binder_transaction_data *ret = NULL;
+	struct epoll_event events[MAX_EVENTS];
+	int epoll_fd, num_events, readcount;
+	uint32_t bc[] = { BC_ENTER_LOOPER };
+
+	do_binder_write_read(binder_fd, &bc, sizeof(bc), NULL, 0);
+
+	epoll_fd = epoll_create1(EPOLL_CLOEXEC);
+	if (epoll_fd == -1) {
+		perror("epoll_create");
+		return NULL;
+	}
+
+	events[0].events = EPOLLIN;
+	if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, binder_fd, &events[0])) {
+		perror("epoll_ctl add");
+		goto err_close;
+	}
+
+	num_events = epoll_wait(epoll_fd, events, MAX_EVENTS, EPOLL_WAIT_TIME_MS);
+	if (num_events < 0) {
+		perror("epoll_wait");
+		goto err_ctl;
+	} else if (num_events == 0) {
+		fprintf(stderr, "No events\n");
+		goto err_ctl;
+	}
+
+	readcount = do_binder_write_read(binder_fd, NULL, 0, readbuf, readsize);
+	fprintf(stderr, "Read %d bytes from binder\n", readcount);
+
+	if (readcount < (int)sizeof(struct binder_reply)) {
+		fprintf(stderr, "read_consumed not large enough\n");
+		goto err_ctl;
+	}
+
+	br = (struct binder_reply *)readbuf;
+	if (expect_binder_reply(br->reply0, BR_NOOP))
+		goto err_ctl;
+
+	if (br->reply1 == BR_TRANSACTION) {
+		if (br->btd.code == BINDER_CODE)
+			ret = &br->btd;
+		else
+			fprintf(stderr, "Received transaction with unexpected code: %u\n",
+				br->btd.code);
+	} else {
+		expect_binder_reply(br->reply1, BR_TRANSACTION_COMPLETE);
+	}
+
+err_ctl:
+	if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, binder_fd, NULL))
+		perror("epoll_ctl del");
+err_close:
+	close(epoll_fd);
+	return ret;
+}
+
+static int child_request_dmabuf_transfer(const char *cgroup, void *arg)
+{
+	UNUSED(cgroup);
+	int ret = -1;
+	uint32_t readbuf[32];
+	struct binderfs_ctx bfs_ctx = *(struct binderfs_ctx *)arg;
+	struct binder_ctx b_ctx;
+
+	fprintf(stderr, "Child PID: %d\n", getpid());
+
+	if (open_binder(&bfs_ctx, &b_ctx)) {
+		fprintf(stderr, "Child unable to open binder\n");
+		return -1;
+	}
+
+	if (binder_request_dmabuf(b_ctx.fd))
+		goto err;
+
+	/* The child must stay alive until the binder reply is received */
+	if (binder_wait_for_transaction(b_ctx.fd, readbuf, sizeof(readbuf)) == NULL)
+		ret = 0;
+
+	/*
+	 * We don't close the received dmabuf here so that the parent can
+	 * inspect the cgroup gpu memory charges to verify the charge transfer
+	 * completed successfully.
+	 */
+err:
+	close_binder(&b_ctx);
+	fprintf(stderr, "Child done\n");
+	return ret;
+}
+
+static const char * const GPUMEM_FILENAME = "gpu.memory.current";
+static const size_t ONE_MiB = 1024 * 1024;
+
+FIXTURE(fix) {
+	int dmabuf_fd;
+	struct binderfs_ctx bfs_ctx;
+	struct binder_ctx b_ctx;
+	struct cgroup_ctx cg_ctx;
+	struct binder_transaction_data *tr;
+	pid_t child_pid;
+};
+
+FIXTURE_SETUP(fix)
+{
+	long memsize;
+	uint32_t readbuf[32];
+	struct flat_binder_object *fbo;
+	struct binder_buffer_object *bbo;
+
+	if (geteuid() != 0)
+		ksft_exit_skip("Need to be root to mount binderfs\n");
+
+	if (create_binderfs(&self->bfs_ctx, "testbinder"))
+		ksft_exit_skip("The Android binderfs filesystem is not available\n");
+
+	self->cg_ctx = create_cgroups(_metadata);
+	if (self->cg_ctx.root == NULL) {
+		destroy_binderfs(&self->bfs_ctx);
+		ksft_exit_skip("cgroup v2 isn't mounted\n");
+	}
+
+	ASSERT_EQ(cg_enter_current(self->cg_ctx.source), 0) {
+		TH_LOG("Could not move parent to cgroup: %s", self->cg_ctx.source);
+	}
+
+	self->dmabuf_fd = alloc_dmabuf_from_system_heap(_metadata, ONE_MiB);
+	ASSERT_GE(self->dmabuf_fd, 0);
+	TH_LOG("Allocated dmabuf");
+
+	memsize = cg_read_key_long(self->cg_ctx.source, GPUMEM_FILENAME, "system-heap");
+	ASSERT_EQ(memsize, ONE_MiB) {
+		TH_LOG("GPU memory used after allocation: %ld but it should be %lu",
+		       memsize, (unsigned long)ONE_MiB);
+	}
+
+	ASSERT_EQ(open_binder(&self->bfs_ctx, &self->b_ctx), 0) {
+		TH_LOG("Parent unable to open binder");
+	}
+	TH_LOG("Opened binder at %s/%s", self->bfs_ctx.mountpoint, self->bfs_ctx.name);
+
+	ASSERT_EQ(become_binder_context_manager(self->b_ctx.fd), 0) {
+		TH_LOG("Cannot become context manager: %s", strerror(errno));
+	}
+
+	self->child_pid = cg_run_nowait(
+		self->cg_ctx.dest, child_request_dmabuf_transfer, &self->bfs_ctx);
+	ASSERT_GT(self->child_pid, 0) {
+		TH_LOG("Error forking: %s", strerror(errno));
+	}
+
+	self->tr = binder_wait_for_transaction(self->b_ctx.fd, readbuf, sizeof(readbuf));
+	ASSERT_NE(self->tr, NULL) {
+		TH_LOG("Error receiving transaction request from child");
+	}
+	fbo = (struct flat_binder_object *)self->tr->data.ptr.buffer;
+	ASSERT_EQ(fbo->hdr.type, BINDER_TYPE_PTR) {
+		TH_LOG("Did not receive a buffer object from child");
+	}
+	bbo = (struct binder_buffer_object *)fbo;
+	ASSERT_EQ(bbo->length, 0) {
+		TH_LOG("Did not receive an empty buffer object from child");
+	}
+
+	TH_LOG("Received transaction from child");
+}
+
+FIXTURE_TEARDOWN(fix)
+{
+	close_binder(&self->b_ctx);
+	close(self->dmabuf_fd);
+	destroy_cgroups(_metadata, &self->cg_ctx);
+	destroy_binderfs(&self->bfs_ctx);
+}
+
+
+void verify_transfer_success(struct _test_data_fix *self, struct __test_metadata *_metadata)
+{
+	ASSERT_EQ(cg_read_key_long(self->cg_ctx.dest, GPUMEM_FILENAME, "system-heap"), ONE_MiB) {
+		TH_LOG("Destination cgroup does not have system-heap charge!");
+	}
+	ASSERT_EQ(cg_read_key_long(self->cg_ctx.source, GPUMEM_FILENAME, "system-heap"), 0) {
+		TH_LOG("Source cgroup still has system-heap charge!");
+	}
+	TH_LOG("Charge transfer succeeded!");
+}
+
+TEST_F(fix, individual_fd)
+{
+	send_dmabuf_reply_fd(self->b_ctx.fd, self->tr, self->dmabuf_fd);
+	verify_transfer_success(self, _metadata);
+}
+
+TEST_F(fix, fd_array)
+{
+	send_dmabuf_reply_fda(self->b_ctx.fd, self->tr, self->dmabuf_fd);
+	verify_transfer_success(self, _metadata);
+}
+
+TEST_HARNESS_MAIN
-- 
2.36.0.512.ge40c2bad7a-goog


^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-05-10 23:56 ` T.J. Mercier
  (?)
@ 2022-05-11 13:21   ` Nicolas Dufresne
  -1 siblings, 0 replies; 67+ messages in thread
From: Nicolas Dufresne @ 2022-05-11 13:21 UTC (permalink / raw)
  To: T.J. Mercier, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan
  Cc: linux-kselftest, linux-doc, Kenny.Ho, cgroups, skhan, cmllamas,
	dri-devel, linux-kernel, linaro-mm-sig, mkoutny, kaleshsingh,
	jstultz, kernel-team, linux-media

Hi,

Le mardi 10 mai 2022 à 23:56 +0000, T.J. Mercier a écrit :
> This patch series revisits the proposal for a GPU cgroup controller to
> track and limit memory allocations by various device/allocator
> subsystems. The patch series also contains a simple prototype to
> illustrate how Android intends to implement DMA-BUF allocator
> attribution using the GPU cgroup controller. The prototype does not
> include resource limit enforcements.

I'm sorry, I'm not deeply involved technically. But from reading the subject I
don't understand the link this creates between DMABuf heaps and the GPU. Is
this an attempt to actually track DMABufs allocated by userland, or just
something for the GPU? What about V4L2 devices? Can this be clarified,
especially what other subsystems would need in order to get cgroup DMABuf
allocation controller support?

> 
> Changelog:
> v7:
> Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
> This means gpucg_register_bucket now returns an internally allocated
> struct gpucg_bucket.
> 
> Move all public function documentation to the cgroup_gpu.h header.
> 
> Remove comment in documentation about duplicate name rejection which
> is not relevant to cgroups users per Michal Koutný.
> 
> v6:
> Move documentation into cgroup-v2.rst per Tejun Heo.
> 
> Rename BINDER_FD{A}_FLAG_SENDER_NO_NEED ->
> BINDER_FD{A}_FLAG_XFER_CHARGE per Carlos Llamas.
> 
> Return error on transfer failure per Carlos Llamas.
> 
> v5:
> Rebase on top of v5.18-rc3
> 
> Drop the global GPU cgroup "total" (sum of all device totals) portion
> of the design since there is no currently known use for this per
> Tejun Heo.
> 
> Fix commit message which still contained the old name for
> dma_buf_transfer_charge per Michal Koutný.
> 
> Remove all GPU cgroup code except what's necessary to support charge transfer
> from dma_buf. Previously charging was done in export, but for non-Android
> graphics use-cases this is not ideal since there may be a delay between
> allocation and export, during which time there is no accounting.
> 
> Merge dmabuf: Use the GPU cgroup charge/uncharge APIs patch into
> dmabuf: heaps: export system_heap buffers with GPU cgroup charging as a
> result of above.
> 
> Put the charge and uncharge code in the same file (system_heap_allocate,
> system_heap_dma_buf_release) instead of splitting them between the heap and
> the dma_buf_release. This avoids asymmetric management of the gpucg charges.
> 
> Modify the dma_buf_transfer_charge API to accept a task_struct instead
> of a gpucg. This avoids requiring the caller to manage the refcount
> of the gpucg upon failure and confusing ownership transfer logic.
> 
> Support all strings for gpucg_register_bucket instead of just string
> literals.
> 
> Enforce globally unique gpucg_bucket names.
> 
> Constrain gpucg_bucket name lengths to 64 bytes.
> 
> Append "-heap" to gpucg_bucket names from dmabuf-heaps.
> 
> Drop patch 7 from the series, which changed the types of
> binder_transaction_data's sender_pid and sender_euid fields. This was
> done in another commit here:
> https://lore.kernel.org/all/20220210021129.3386083-4-masahiroy@kernel.org/
> 
> Rename:
>   gpucg_try_charge -> gpucg_charge
>   find_cg_rpool_locked -> cg_rpool_find_locked
>   init_cg_rpool -> cg_rpool_init
>   get_cg_rpool_locked -> cg_rpool_get_locked
>   "gpu cgroup controller" -> "GPU controller"
>   gpucg_device -> gpucg_bucket
>   usage -> size
> 
> Tests:
>   Support both binder_fd_array_object and binder_fd_object. This is
>   necessary because new versions of Android will use binder_fd_object
>   instead of binder_fd_array_object, and we need to support both.
> 
>   Tests for both binder_fd_array_object and binder_fd_object.
> 
>   For binder_utils return error codes instead of
>   struct binder{fs}_ctx.
> 
>   Use ifdef __ANDROID__ to choose platform-dependent temp path instead
>   of a runtime fallback.
> 
>   Ensure binderfs_mntpt ends with a trailing '/' character instead of
>   prepending it where used.
> 
> v4:
> Skip test if not run as root per Shuah Khan
> 
> Add better test logging for abnormal child termination per Shuah Khan
> 
> Adjust ordering of charge/uncharge during transfer to avoid potentially
> hitting cgroup limit per Michal Koutný
> 
> Adjust gpucg_try_charge critical section for charge transfer functionality
> 
> Fix uninitialized return code error for dmabuf_try_charge error case
> 
> v3:
> Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz
> 
> Use more common dual author commit message format per John Stultz
> 
> Remove android from binder changes title per Todd Kjos
> 
> Add a kselftest for this new behavior per Greg Kroah-Hartman
> 
> Include details on behavior for all combinations of kernel/userspace
> versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.
> 
> Fix pid and uid types in binder UAPI header
> 
> v2:
> See the previous revision of this change submitted by Hridya Valsaraju
> at: https://lore.kernel.org/all/20220115010622.3185921-1-hridya@google.com/
> 
> Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
> heap to a single dma-buf function for all heaps per Daniel Vetter and
> Christian König. Pointers to struct gpucg and struct gpucg_device
> tracking the current associations were added to the dma_buf struct to
> achieve this.
> 
> Fix incorrect Kconfig help section indentation per Randy Dunlap.
> 
> History of the GPU cgroup controller
> ====================================
> The GPU/DRM cgroup controller came into being when a consensus[1]
> was reached that the resources it tracked were unsuitable to be integrated
> into memcg. Originally, the proposed controller was specific to the DRM
> subsystem and was intended to track GEM buffers and GPU-specific
> resources[2]. In order to help establish a unified memory accounting model
> for all GPU and all related subsystems, Daniel Vetter put forth a
> suggestion to move it out of the DRM subsystem so that it can be used by
> other DMA-BUF exporters as well[3]. This RFC proposes an interface that
> does the same.
> 
> [1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
> [2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
> [3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/
> 
> Hridya Valsaraju (3):
>   gpu: rfc: Proposal for a GPU cgroup controller
>   cgroup: gpu: Add a cgroup controller for allocator attribution of GPU
>     memory
>   binder: Add flags to relinquish ownership of fds
> 
> T.J. Mercier (3):
>   dmabuf: heaps: export system_heap buffers with GPU cgroup charging
>   dmabuf: Add gpu cgroup charge transfer function
>   selftests: Add binder cgroup gpu memory transfer tests
> 
>  Documentation/admin-guide/cgroup-v2.rst       |  23 +
>  drivers/android/binder.c                      |  31 +-
>  drivers/dma-buf/dma-buf.c                     |  80 ++-
>  drivers/dma-buf/dma-heap.c                    |  38 ++
>  drivers/dma-buf/heaps/system_heap.c           |  28 +-
>  include/linux/cgroup_gpu.h                    | 146 +++++
>  include/linux/cgroup_subsys.h                 |   4 +
>  include/linux/dma-buf.h                       |  49 +-
>  include/linux/dma-heap.h                      |  15 +
>  include/uapi/linux/android/binder.h           |  23 +-
>  init/Kconfig                                  |   7 +
>  kernel/cgroup/Makefile                        |   1 +
>  kernel/cgroup/gpu.c                           | 390 +++++++++++++
>  .../selftests/drivers/android/binder/Makefile |   8 +
>  .../drivers/android/binder/binder_util.c      | 250 +++++++++
>  .../drivers/android/binder/binder_util.h      |  32 ++
>  .../selftests/drivers/android/binder/config   |   4 +
>  .../binder/test_dmabuf_cgroup_transfer.c      | 526 ++++++++++++++++++
>  18 files changed, 1632 insertions(+), 23 deletions(-)
>  create mode 100644 include/linux/cgroup_gpu.h
>  create mode 100644 kernel/cgroup/gpu.c
>  create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
>  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
>  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
>  create mode 100644 tools/testing/selftests/drivers/android/binder/config
>  create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> 


* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-05-11 13:21   ` Nicolas Dufresne
  (?)
@ 2022-05-11 20:31     ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-11 20:31 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan, Daniel Vetter,
	John Stultz, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest

On Wed, May 11, 2022 at 6:21 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>
> Hi,
>
> Le mardi 10 mai 2022 à 23:56 +0000, T.J. Mercier a écrit :
> > This patch series revisits the proposal for a GPU cgroup controller to
> > track and limit memory allocations by various device/allocator
> > subsystems. The patch series also contains a simple prototype to
> > illustrate how Android intends to implement DMA-BUF allocator
> > attribution using the GPU cgroup controller. The prototype does not
> > include resource limit enforcements.
>
> I'm sorry, as I'm not deeply involved on the technical side. But from reading
> the topic I don't understand the link this creates between DMA-BUF heaps and
> the GPU. Is this an attempt to really track the DMA-BUFs allocated by
> userland, or just something for the GPU? What about V4L2 devices? Can this be
> clarified, especially regarding what other subsystems would need in order to
> have cgroup DMA-BUF allocation controller support?
>
Hi Nicolas,

The link between dmabufs, dmabuf heaps, and "GPU memory" is maybe
somewhat of an Androidism. However this change aims to be usable for
tracking all GPU related allocations. It's just that this initial
series only adds support for tracking dmabufs allocated from dmabuf
heaps.

In Android most graphics buffers are dma buffers allocated from a
dmabuf heap, so that is why these dmabuf heap allocations are being
tracked under the GPU cgroup. Other dmabuf exporters like V4L2 might
also want to track their buffers, but would probably want to do so
under a bucket name of something like "v4l2". Same goes for GEM
dmabufs. The naming scheme for this has yet to be decided. It
would be cool to be able to attribute memory at the driver level, or
even different types of memory at the driver level, but I imagine
there is a point of diminishing returns for fine-grained
naming/bucketing.

So far, I haven't tried to create a strict definition of what is and
is not "GPU memory" for the purpose of this accounting, so I don't
think we should be restricted to tracking just dmabufs. I don't see
why this couldn't be anything a driver wants to consider as GPU memory
as long as it is named/bucketed appropriately, such as on-package
graphics card memory as well as CPU memory dedicated to graphics use,
e.g. for host/device transfers.

Is that helpful?

Best,
T.J.

> >
> > Changelog:
> > v7:
> > Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
> > This means gpucg_register_bucket now returns an internally allocated
> > struct gpucg_bucket.
> >
> > Move all public function documentation to the cgroup_gpu.h header.
> >
> > Remove comment in documentation about duplicate name rejection which
> > is not relevant to cgroups users per Michal Koutný.
> >
> > v6:
> > Move documentation into cgroup-v2.rst per Tejun Heo.
> >
> > Rename BINDER_FD{A}_FLAG_SENDER_NO_NEED ->
> > BINDER_FD{A}_FLAG_XFER_CHARGE per Carlos Llamas.
> >
> > Return error on transfer failure per Carlos Llamas.
> >
> > v5:
> > Rebase on top of v5.18-rc3
> >
> > Drop the global GPU cgroup "total" (sum of all device totals) portion
> > of the design since there is no currently known use for this per
> > Tejun Heo.
> >
> > Fix commit message which still contained the old name for
> > dma_buf_transfer_charge per Michal Koutný.
> >
> > Remove all GPU cgroup code except what's necessary to support charge transfer
> > from dma_buf. Previously charging was done in export, but for non-Android
> > graphics use-cases this is not ideal since there may be a delay between
> > allocation and export, during which time there is no accounting.
> >
> > Merge dmabuf: Use the GPU cgroup charge/uncharge APIs patch into
> > dmabuf: heaps: export system_heap buffers with GPU cgroup charging as a
> > result of above.
> >
> > Put the charge and uncharge code in the same file (system_heap_allocate,
> > system_heap_dma_buf_release) instead of splitting them between the heap and
> > the dma_buf_release. This avoids asymmetric management of the gpucg charges.
> >
> > Modify the dma_buf_transfer_charge API to accept a task_struct instead
> > of a gpucg. This avoids requiring the caller to manage the refcount
> > of the gpucg upon failure and confusing ownership transfer logic.
> >
> > Support all strings for gpucg_register_bucket instead of just string
> > literals.
> >
> > Enforce globally unique gpucg_bucket names.
> >
> > Constrain gpucg_bucket name lengths to 64 bytes.
> >
> > Append "-heap" to gpucg_bucket names from dmabuf-heaps.
> >
> > Drop patch 7 from the series, which changed the types of
> > binder_transaction_data's sender_pid and sender_euid fields. This was
> > done in another commit here:
> > https://lore.kernel.org/all/20220210021129.3386083-4-masahiroy@kernel.org/
> >
> > Rename:
> >   gpucg_try_charge -> gpucg_charge
> >   find_cg_rpool_locked -> cg_rpool_find_locked
> >   init_cg_rpool -> cg_rpool_init
> >   get_cg_rpool_locked -> cg_rpool_get_locked
> >   "gpu cgroup controller" -> "GPU controller"
> >   gpucg_device -> gpucg_bucket
> >   usage -> size
> >
> > Tests:
> >   Support both binder_fd_array_object and binder_fd_object. This is
> >   necessary because new versions of Android will use binder_fd_object
> >   instead of binder_fd_array_object, and we need to support both.
> >
> >   Tests for both binder_fd_array_object and binder_fd_object.
> >
> >   For binder_utils return error codes instead of
> >   struct binder{fs}_ctx.
> >
> >   Use ifdef __ANDROID__ to choose platform-dependent temp path instead
> >   of a runtime fallback.
> >
> >   Ensure binderfs_mntpt ends with a trailing '/' character instead of
> >   prepending it where used.
> >
> > v4:
> > Skip test if not run as root per Shuah Khan
> >
> > Add better test logging for abnormal child termination per Shuah Khan
> >
> > Adjust ordering of charge/uncharge during transfer to avoid potentially
> > hitting cgroup limit per Michal Koutný
> >
> > Adjust gpucg_try_charge critical section for charge transfer functionality
> >
> > Fix uninitialized return code error for dmabuf_try_charge error case
> >
> > v3:
> > Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz
> >
> > Use more common dual author commit message format per John Stultz
> >
> > Remove android from binder changes title per Todd Kjos
> >
> > Add a kselftest for this new behavior per Greg Kroah-Hartman
> >
> > Include details on behavior for all combinations of kernel/userspace
> > versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.
> >
> > Fix pid and uid types in binder UAPI header
> >
> > v2:
> > See the previous revision of this change submitted by Hridya Valsaraju
> > at: https://lore.kernel.org/all/20220115010622.3185921-1-hridya@google.com/
> >
> > Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
> > heap to a single dma-buf function for all heaps per Daniel Vetter and
> > Christian König. Pointers to struct gpucg and struct gpucg_device
> > tracking the current associations were added to the dma_buf struct to
> > achieve this.
> >
> > Fix incorrect Kconfig help section indentation per Randy Dunlap.
> >
> > History of the GPU cgroup controller
> > ====================================
> > The GPU/DRM cgroup controller came into being when a consensus[1]
> > was reached that the resources it tracked were unsuitable to be integrated
> > into memcg. Originally, the proposed controller was specific to the DRM
> > subsystem and was intended to track GEM buffers and GPU-specific
> > resources[2]. In order to help establish a unified memory accounting model
> > for all GPU and all related subsystems, Daniel Vetter put forth a
> > suggestion to move it out of the DRM subsystem so that it can be used by
> > other DMA-BUF exporters as well[3]. This RFC proposes an interface that
> > does the same.
> >
> > [1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
> > [2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
> > [3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/
> >
> > Hridya Valsaraju (3):
> >   gpu: rfc: Proposal for a GPU cgroup controller
> >   cgroup: gpu: Add a cgroup controller for allocator attribution of GPU
> >     memory
> >   binder: Add flags to relinquish ownership of fds
> >
> > T.J. Mercier (3):
> >   dmabuf: heaps: export system_heap buffers with GPU cgroup charging
> >   dmabuf: Add gpu cgroup charge transfer function
> >   selftests: Add binder cgroup gpu memory transfer tests
> >
> >  Documentation/admin-guide/cgroup-v2.rst       |  23 +
> >  drivers/android/binder.c                      |  31 +-
> >  drivers/dma-buf/dma-buf.c                     |  80 ++-
> >  drivers/dma-buf/dma-heap.c                    |  38 ++
> >  drivers/dma-buf/heaps/system_heap.c           |  28 +-
> >  include/linux/cgroup_gpu.h                    | 146 +++++
> >  include/linux/cgroup_subsys.h                 |   4 +
> >  include/linux/dma-buf.h                       |  49 +-
> >  include/linux/dma-heap.h                      |  15 +
> >  include/uapi/linux/android/binder.h           |  23 +-
> >  init/Kconfig                                  |   7 +
> >  kernel/cgroup/Makefile                        |   1 +
> >  kernel/cgroup/gpu.c                           | 390 +++++++++++++
> >  .../selftests/drivers/android/binder/Makefile |   8 +
> >  .../drivers/android/binder/binder_util.c      | 250 +++++++++
> >  .../drivers/android/binder/binder_util.h      |  32 ++
> >  .../selftests/drivers/android/binder/config   |   4 +
> >  .../binder/test_dmabuf_cgroup_transfer.c      | 526 ++++++++++++++++++
> >  18 files changed, 1632 insertions(+), 23 deletions(-)
> >  create mode 100644 include/linux/cgroup_gpu.h
> >  create mode 100644 kernel/cgroup/gpu.c
> >  create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
> >  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
> >  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
> >  create mode 100644 tools/testing/selftests/drivers/android/binder/config
> >  create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> >
>


* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-05-11 20:31     ` T.J. Mercier
  (?)
@ 2022-05-12 13:09       ` Nicolas Dufresne
  -1 siblings, 0 replies; 67+ messages in thread
From: Nicolas Dufresne @ 2022-05-12 13:09 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan, Daniel Vetter,
	John Stultz, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest

On Wednesday, May 11, 2022 at 13:31 -0700, T.J. Mercier wrote:
> On Wed, May 11, 2022 at 6:21 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > 
> > Hi,
> > 
> > On Tuesday, May 10, 2022 at 23:56 +0000, T.J. Mercier wrote:
> > > This patch series revisits the proposal for a GPU cgroup controller to
> > > track and limit memory allocations by various device/allocator
> > > subsystems. The patch series also contains a simple prototype to
> > > illustrate how Android intends to implement DMA-BUF allocator
> > > attribution using the GPU cgroup controller. The prototype does not
> > > include resource limit enforcements.
> > 
> > I'm sorry, I'm not technically involved in depth here. But from reading the
> > topic I don't understand the link this creates between DMABuf heaps and the
> > GPU. Is this an attempt to really track the DMABufs allocated by userland, or
> > just something for the GPU? What about V4L2 devices? Is there any way this can
> > be clarified, especially what other subsystems would need in order to have
> > cgroup DMABuf allocation controller support?
> > 
> Hi Nicolas,
> 
> The link between dmabufs, dmabuf heaps, and "GPU memory" is maybe
> somewhat of an Androidism. However this change aims to be usable for
> tracking all GPU related allocations. It's just that this initial
> series only adds support for tracking dmabufs allocated from dmabuf
> heaps.
> 
> In Android most graphics buffers are dma buffers allocated from a
> dmabuf heap, so that is why these dmabuf heap allocations are being
> tracked under the GPU cgroup. Other dmabuf exporters like V4L2 might
> also want to track their buffers, but would probably want to do so
> under a bucket name of something like "v4l2". Same goes for GEM
> dmabufs. The naming scheme for this is yet to be decided. It
> would be cool to be able to attribute memory at the driver level, or
> even different types of memory at the driver level, but I imagine
> there is a point of diminishing returns for fine-grained
> naming/bucketing.
> 
> So far, I haven't tried to create a strict definition of what is and
> is not "GPU memory" for the purpose of this accounting, so I don't
> think we should be restricted to tracking just dmabufs. I don't see
> why this couldn't be anything a driver wants to consider as GPU memory
> as long as it is named/bucketed appropriately, such as both on-package
> graphics card memory use and CPU memory dedicated for graphics use
> like for host/device transfers.
> 
> Is that helpful?

I'm actually happy I asked this question; it wasn't silly after all. I think the
problem here is a naming issue. What you are really monitoring is "video memory",
which consists of memory segments allocated to store data used to render images
(it's not always images of course; GPUs and VPUs have specialized buffers for
their own purposes).

Whether this should be split between what is used specifically by the GPU
drivers, the display drivers, the VPU (CODEC and pre/post-processor) or camera
drivers is something that should be discussed. But in the current approach, you
really mean video memory as a superset of all of the above. Personally, I think
that generically (to de-Androidize your work), covering all video memory is
sufficient. What I fail to understand is how you will manage to distinguish
DMABuf heap allocations (which are used outside of Android, btw) from video
allocations or other types of usage. I'm sure non-video usage will exist in the
future (think of machine learning, compute, or other high-bandwidth streaming
things...)
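
[Editor's note: for readers following the API discussion, the per-exporter
bucket flow T.J. describes above can be sketched roughly as below. This is a
pseudocode sketch based only on names mentioned in this thread
(gpucg_register_bucket(), gpucg_charge(), the 64-byte unique bucket names);
the helpers gpucg_get()/gpucg_put() and all signatures are assumptions, not
the actual code from the series.]

```c
/* Pseudocode sketch: a hypothetical "v4l2" exporter registering its own
 * bucket and charging an allocation to the allocating task's GPU cgroup.
 */
static struct gpucg_bucket *v4l2_bucket;

static int v4l2_gpucg_init(void)
{
	/* Bucket names are globally unique and limited to 64 bytes (v5). */
	v4l2_bucket = gpucg_register_bucket("v4l2");
	return PTR_ERR_OR_ZERO(v4l2_bucket);
}

static int v4l2_buffer_charge(struct task_struct *task, size_t size)
{
	struct gpucg *gpucg = gpucg_get(task);	/* assumed helper */
	int ret;

	/* gpucg_charge() (formerly gpucg_try_charge) may fail at the limit. */
	ret = gpucg_charge(gpucg, v4l2_bucket, size);
	if (ret)
		gpucg_put(gpucg);	/* assumed helper */
	return ret;
}
```

Under this shape, a V4L2 allocation would show up under a "v4l2" bucket the
same way dmabuf-heap allocations show up under "<heap-name>-heap".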

> 
> Best,
> T.J.
> 
> > > 
> > > Changelog:
> > > v7:
> > > Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
> > > This means gpucg_register_bucket now returns an internally allocated
> > > struct gpucg_bucket.
> > > 
> > > Move all public function documentation to the cgroup_gpu.h header.
> > > 
> > > Remove comment in documentation about duplicate name rejection which
> > > is not relevant to cgroups users per Michal Koutný.
> > > 
> > > v6:
> > > Move documentation into cgroup-v2.rst per Tejun Heo.
> > > 
> > > Rename BINDER_FD{A}_FLAG_SENDER_NO_NEED ->
> > > BINDER_FD{A}_FLAG_XFER_CHARGE per Carlos Llamas.
> > > 
> > > Return error on transfer failure per Carlos Llamas.
> > > 
> > > v5:
> > > Rebase on top of v5.18-rc3
> > > 
> > > Drop the global GPU cgroup "total" (sum of all device totals) portion
> > > of the design since there is no currently known use for this per
> > > Tejun Heo.
> > > 
> > > Fix commit message which still contained the old name for
> > > dma_buf_transfer_charge per Michal Koutný.
> > > 
> > > Remove all GPU cgroup code except what's necessary to support charge transfer
> > > from dma_buf. Previously charging was done in export, but for non-Android
> > > graphics use-cases this is not ideal since there may be a delay between
> > > allocation and export, during which time there is no accounting.
> > > 
> > > Merge dmabuf: Use the GPU cgroup charge/uncharge APIs patch into
> > > dmabuf: heaps: export system_heap buffers with GPU cgroup charging as a
> > > result of above.
> > > 
> > > Put the charge and uncharge code in the same file (system_heap_allocate,
> > > system_heap_dma_buf_release) instead of splitting them between the heap and
> > > the dma_buf_release. This avoids asymmetric management of the gpucg charges.
> > > 
> > > Modify the dma_buf_transfer_charge API to accept a task_struct instead
> > > of a gpucg. This avoids requiring the caller to manage the refcount
> > > of the gpucg upon failure and confusing ownership transfer logic.
> > > 
> > > Support all strings for gpucg_register_bucket instead of just string
> > > literals.
> > > 
> > > Enforce globally unique gpucg_bucket names.
> > > 
> > > Constrain gpucg_bucket name lengths to 64 bytes.
> > > 
> > > Append "-heap" to gpucg_bucket names from dmabuf-heaps.
> > > 
> > > Drop patch 7 from the series, which changed the types of
> > > binder_transaction_data's sender_pid and sender_euid fields. This was
> > > done in another commit here:
> > > https://lore.kernel.org/all/20220210021129.3386083-4-masahiroy@kernel.org/
> > > 
> > > Rename:
> > >   gpucg_try_charge -> gpucg_charge
> > >   find_cg_rpool_locked -> cg_rpool_find_locked
> > >   init_cg_rpool -> cg_rpool_init
> > >   get_cg_rpool_locked -> cg_rpool_get_locked
> > >   "gpu cgroup controller" -> "GPU controller"
> > >   gpucg_device -> gpucg_bucket
> > >   usage -> size
> > > 
> > > Tests:
> > >   Support both binder_fd_array_object and binder_fd_object. This is
> > >   necessary because new versions of Android will use binder_fd_object
> > >   instead of binder_fd_array_object, and we need to support both.
> > > 
> > >   Tests for both binder_fd_array_object and binder_fd_object.
> > > 
> > >   For binder_utils return error codes instead of
> > >   struct binder{fs}_ctx.
> > > 
> > >   Use ifdef __ANDROID__ to choose platform-dependent temp path instead
> > >   of a runtime fallback.
> > > 
> > >   Ensure binderfs_mntpt ends with a trailing '/' character instead of
> > >   prepending it where used.
> > > 
> > > v4:
> > > Skip test if not run as root per Shuah Khan
> > > 
> > > Add better test logging for abnormal child termination per Shuah Khan
> > > 
> > > Adjust ordering of charge/uncharge during transfer to avoid potentially
> > > hitting cgroup limit per Michal Koutný
> > > 
> > > Adjust gpucg_try_charge critical section for charge transfer functionality
> > > 
> > > Fix uninitialized return code error for dmabuf_try_charge error case
> > > 
> > > v3:
> > > Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz
> > > 
> > > Use more common dual author commit message format per John Stultz
> > > 
> > > Remove android from binder changes title per Todd Kjos
> > > 
> > > Add a kselftest for this new behavior per Greg Kroah-Hartman
> > > 
> > > Include details on behavior for all combinations of kernel/userspace
> > > versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.
> > > 
> > > Fix pid and uid types in binder UAPI header
> > > 
> > > v2:
> > > See the previous revision of this change submitted by Hridya Valsaraju
> > > at: https://lore.kernel.org/all/20220115010622.3185921-1-hridya@google.com/
> > > 
> > > Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
> > > heap to a single dma-buf function for all heaps per Daniel Vetter and
> > > Christian König. Pointers to struct gpucg and struct gpucg_device
> > > tracking the current associations were added to the dma_buf struct to
> > > achieve this.
> > > 
> > > Fix incorrect Kconfig help section indentation per Randy Dunlap.
> > > 
> > > History of the GPU cgroup controller
> > > ====================================
> > > The GPU/DRM cgroup controller came into being when a consensus[1]
> > > was reached that the resources it tracked were unsuitable to be integrated
> > > into memcg. Originally, the proposed controller was specific to the DRM
> > > subsystem and was intended to track GEM buffers and GPU-specific
> > > resources[2]. In order to help establish a unified memory accounting model
> > > for all GPU and all related subsystems, Daniel Vetter put forth a
> > > suggestion to move it out of the DRM subsystem so that it can be used by
> > > other DMA-BUF exporters as well[3]. This RFC proposes an interface that
> > > does the same.
> > > 
> > > [1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
> > > [2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
> > > [3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/
> > > 
> > > Hridya Valsaraju (3):
> > >   gpu: rfc: Proposal for a GPU cgroup controller
> > >   cgroup: gpu: Add a cgroup controller for allocator attribution of GPU
> > >     memory
> > >   binder: Add flags to relinquish ownership of fds
> > > 
> > > T.J. Mercier (3):
> > >   dmabuf: heaps: export system_heap buffers with GPU cgroup charging
> > >   dmabuf: Add gpu cgroup charge transfer function
> > >   selftests: Add binder cgroup gpu memory transfer tests
> > > 
> > >  Documentation/admin-guide/cgroup-v2.rst       |  23 +
> > >  drivers/android/binder.c                      |  31 +-
> > >  drivers/dma-buf/dma-buf.c                     |  80 ++-
> > >  drivers/dma-buf/dma-heap.c                    |  38 ++
> > >  drivers/dma-buf/heaps/system_heap.c           |  28 +-
> > >  include/linux/cgroup_gpu.h                    | 146 +++++
> > >  include/linux/cgroup_subsys.h                 |   4 +
> > >  include/linux/dma-buf.h                       |  49 +-
> > >  include/linux/dma-heap.h                      |  15 +
> > >  include/uapi/linux/android/binder.h           |  23 +-
> > >  init/Kconfig                                  |   7 +
> > >  kernel/cgroup/Makefile                        |   1 +
> > >  kernel/cgroup/gpu.c                           | 390 +++++++++++++
> > >  .../selftests/drivers/android/binder/Makefile |   8 +
> > >  .../drivers/android/binder/binder_util.c      | 250 +++++++++
> > >  .../drivers/android/binder/binder_util.h      |  32 ++
> > >  .../selftests/drivers/android/binder/config   |   4 +
> > >  .../binder/test_dmabuf_cgroup_transfer.c      | 526 ++++++++++++++++++
> > >  18 files changed, 1632 insertions(+), 23 deletions(-)
> > >  create mode 100644 include/linux/cgroup_gpu.h
> > >  create mode 100644 kernel/cgroup/gpu.c
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/config
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> > > 
> > 


^ permalink raw reply	[flat|nested] 67+ messages in thread


> > > heap to a single dma-buf function for all heaps per Daniel Vetter and
> > > Christian König. Pointers to struct gpucg and struct gpucg_device
> > > tracking the current associations were added to the dma_buf struct to
> > > achieve this.
> > > 
> > > Fix incorrect Kconfig help section indentation per Randy Dunlap.
> > > 
> > > History of the GPU cgroup controller
> > > ====================================
> > > The GPU/DRM cgroup controller came into being when a consensus[1]
> > > was reached that the resources it tracked were unsuitable to be integrated
> > > into memcg. Originally, the proposed controller was specific to the DRM
> > > subsystem and was intended to track GEM buffers and GPU-specific
> > > resources[2]. In order to help establish a unified memory accounting model
> > > for all GPU and all related subsystems, Daniel Vetter put forth a
> > > suggestion to move it out of the DRM subsystem so that it can be used by
> > > other DMA-BUF exporters as well[3]. This RFC proposes an interface that
> > > does the same.
> > > 
> > > [1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
> > > [2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
> > > [3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/
> > > 
> > > Hridya Valsaraju (3):
> > >   gpu: rfc: Proposal for a GPU cgroup controller
> > >   cgroup: gpu: Add a cgroup controller for allocator attribution of GPU
> > >     memory
> > >   binder: Add flags to relinquish ownership of fds
> > > 
> > > T.J. Mercier (3):
> > >   dmabuf: heaps: export system_heap buffers with GPU cgroup charging
> > >   dmabuf: Add gpu cgroup charge transfer function
> > >   selftests: Add binder cgroup gpu memory transfer tests
> > > 
> > >  Documentation/admin-guide/cgroup-v2.rst       |  23 +
> > >  drivers/android/binder.c                      |  31 +-
> > >  drivers/dma-buf/dma-buf.c                     |  80 ++-
> > >  drivers/dma-buf/dma-heap.c                    |  38 ++
> > >  drivers/dma-buf/heaps/system_heap.c           |  28 +-
> > >  include/linux/cgroup_gpu.h                    | 146 +++++
> > >  include/linux/cgroup_subsys.h                 |   4 +
> > >  include/linux/dma-buf.h                       |  49 +-
> > >  include/linux/dma-heap.h                      |  15 +
> > >  include/uapi/linux/android/binder.h           |  23 +-
> > >  init/Kconfig                                  |   7 +
> > >  kernel/cgroup/Makefile                        |   1 +
> > >  kernel/cgroup/gpu.c                           | 390 +++++++++++++
> > >  .../selftests/drivers/android/binder/Makefile |   8 +
> > >  .../drivers/android/binder/binder_util.c      | 250 +++++++++
> > >  .../drivers/android/binder/binder_util.h      |  32 ++
> > >  .../selftests/drivers/android/binder/config   |   4 +
> > >  .../binder/test_dmabuf_cgroup_transfer.c      | 526 ++++++++++++++++++
> > >  18 files changed, 1632 insertions(+), 23 deletions(-)
> > >  create mode 100644 include/linux/cgroup_gpu.h
> > >  create mode 100644 kernel/cgroup/gpu.c
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/config
> > >  create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> > > 
> > 


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-05-12 13:09       ` Nicolas Dufresne
  (?)
@ 2022-05-13  3:43         ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-13  3:43 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan, Daniel Vetter,
	John Stultz, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest

On Thu, May 12, 2022 at 6:10 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>
> Le mercredi 11 mai 2022 à 13:31 -0700, T.J. Mercier a écrit :
> > On Wed, May 11, 2022 at 6:21 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > >
> > > Hi,
> > >
> > > Le mardi 10 mai 2022 à 23:56 +0000, T.J. Mercier a écrit :
> > > > This patch series revisits the proposal for a GPU cgroup controller to
> > > > track and limit memory allocations by various device/allocator
> > > > subsystems. The patch series also contains a simple prototype to
> > > > illustrate how Android intends to implement DMA-BUF allocator
> > > > attribution using the GPU cgroup controller. The prototype does not
> > > > include resource limit enforcements.
> > >
> > > I'm sorry, since I'm not deeply involved technically. But from reading the
> > > topic I don't understand the bond this creates between DMABuf heaps and the GPU. Is
> > > this an attempt to really track the DMABufs allocated by userland, or just
> > > something for the GPU? What about V4L2 devices? Can this be clarified,
> > > especially regarding what other subsystems would need in order to have cgroup
> > > DMABuf allocation controller support?
> > >
> > Hi Nicolas,
> >
> > The link between dmabufs, dmabuf heaps, and "GPU memory" is maybe
> > somewhat of an Androidism. However this change aims to be usable for
> > tracking all GPU related allocations. It's just that this initial
> > series only adds support for tracking dmabufs allocated from dmabuf
> > heaps.
> >
> > In Android most graphics buffers are dma buffers allocated from a
> > dmabuf heap, so that is why these dmabuf heap allocations are being
> > tracked under the GPU cgroup. Other dmabuf exporters like V4L2 might
> > also want to track their buffers, but would probably want to do so
> > under a bucket name like "v4l2". The same goes for GEM
> > dmabufs. The naming scheme for this is yet to be decided. It
> > would be cool to be able to attribute memory at the driver level, or
> > even different types of memory at the driver level, but I imagine
> > there is a point of diminishing returns for fine-grained
> > naming/bucketing.
> >
> > So far, I haven't tried to create a strict definition of what is and
> > is not "GPU memory" for the purpose of this accounting, so I don't
> > think we should be restricted to tracking just dmabufs. I don't see
> > why this couldn't be anything a driver wants to consider as GPU memory
> > as long as it is named/bucketed appropriately, such as both on-package
> > graphics card memory use and CPU memory dedicated for graphics use
> > like for host/device transfers.
> >
> > Is that helpful?
>
> I'm actually happy I asked this question; it wasn't silly after all. I think the
> problem here is a naming issue. What you are really monitoring is "video memory",
> which consists of memory segments allocated to store data used to render images
> (it's not always images of course; GPUs and VPUs have specialized buffers for
> their purposes).
>
> Whether this should be split between what is used specifically by the GPU
> drivers, the display drivers, the VPU (CODEC and pre/post-processor) or camera
> drivers is something that should be discussed. But in the current approach, you
> really mean video memory as a superset of the above. Personally, I think that
> generically (to de-Androidize your work), grouping all video memory together is
> sufficient. What I fail to understand is how you will manage to distinguish
> DMABuf heap allocations (which are used outside of Android, btw) from video
> allocations or other types of usage. I'm sure non-video usage will exist in the
> future (think of machine learning, compute, or other high-bandwidth streaming
> things...)
>
Ok, thank you for pointing out the naming issue. The naming is a
consequence of the initial use case, but I guess it's too specific.
What I want out of this change is that Android can track dmabufs that
come out of heaps, and DRM can track GPU memory. But other drivers
could track different resources under different names. Imagine this
were called a buffer cgroup controller instead of a GPU cgroup
controller. Then the use component ("video memory") isn't tied up with
the name of the controller; it's determined by the name of the bucket the
resource is tracked under. I think this meets the needs of the two use
cases I'm aware of now, while leaving the door open to other future
needs. Really the controller just enables abstract named buckets
for tracking, and eventually limiting, a type of resource.

P.S. I will be unavailable starting tomorrow, but I'll be back on Monday.

> >
> > Best,
> > T.J.
> >
> > > >
> > > > Changelog:
> > > > v7:
> > > > Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
> > > > This means gpucg_register_bucket now returns an internally allocated
> > > > struct gpucg_bucket.
> > > >
> > > > Move all public function documentation to the cgroup_gpu.h header.
> > > >
> > > > Remove comment in documentation about duplicate name rejection which
> > > > is not relevant to cgroups users per Michal Koutný.
> > > >
> > > > v6:
> > > > Move documentation into cgroup-v2.rst per Tejun Heo.
> > > >
> > > > Rename BINDER_FD{A}_FLAG_SENDER_NO_NEED ->
> > > > BINDER_FD{A}_FLAG_XFER_CHARGE per Carlos Llamas.
> > > >
> > > > Return error on transfer failure per Carlos Llamas.
> > > >
> > > > v5:
> > > > Rebase on top of v5.18-rc3
> > > >
> > > > Drop the global GPU cgroup "total" (sum of all device totals) portion
> > > > of the design since there is no currently known use for this per
> > > > Tejun Heo.
> > > >
> > > > Fix commit message which still contained the old name for
> > > > dma_buf_transfer_charge per Michal Koutný.
> > > >
> > > > Remove all GPU cgroup code except what's necessary to support charge transfer
> > > > from dma_buf. Previously charging was done in export, but for non-Android
> > > > graphics use-cases this is not ideal since there may be a delay between
> > > > allocation and export, during which time there is no accounting.
> > > >
> > > > Merge dmabuf: Use the GPU cgroup charge/uncharge APIs patch into
> > > > dmabuf: heaps: export system_heap buffers with GPU cgroup charging as a
> > > > result of above.
> > > >
> > > > Put the charge and uncharge code in the same file (system_heap_allocate,
> > > > system_heap_dma_buf_release) instead of splitting them between the heap and
> > > > the dma_buf_release. This avoids asymmetric management of the gpucg charges.
> > > >
> > > > Modify the dma_buf_transfer_charge API to accept a task_struct instead
> > > > of a gpucg. This avoids requiring the caller to manage the refcount
> > > > of the gpucg upon failure and confusing ownership transfer logic.
> > > >
> > > > Support all strings for gpucg_register_bucket instead of just string
> > > > literals.
> > > >
> > > > Enforce globally unique gpucg_bucket names.
> > > >
> > > > Constrain gpucg_bucket name lengths to 64 bytes.
> > > >
> > > > Append "-heap" to gpucg_bucket names from dmabuf-heaps.
> > > >
> > > > Drop patch 7 from the series, which changed the types of
> > > > binder_transaction_data's sender_pid and sender_euid fields. This was
> > > > done in another commit here:
> > > > https://lore.kernel.org/all/20220210021129.3386083-4-masahiroy@kernel.org/
> > > >
> > > > Rename:
> > > >   gpucg_try_charge -> gpucg_charge
> > > >   find_cg_rpool_locked -> cg_rpool_find_locked
> > > >   init_cg_rpool -> cg_rpool_init
> > > >   get_cg_rpool_locked -> cg_rpool_get_locked
> > > >   "gpu cgroup controller" -> "GPU controller"
> > > >   gpucg_device -> gpucg_bucket
> > > >   usage -> size
> > > >
> > > > Tests:
> > > >   Support both binder_fd_array_object and binder_fd_object. This is
> > > >   necessary because new versions of Android will use binder_fd_object
> > > >   instead of binder_fd_array_object, and we need to support both.
> > > >
> > > >   Tests for both binder_fd_array_object and binder_fd_object.
> > > >
> > > >   For binder_utils return error codes instead of
> > > >   struct binder{fs}_ctx.
> > > >
> > > >   Use ifdef __ANDROID__ to choose platform-dependent temp path instead
> > > >   of a runtime fallback.
> > > >
> > > >   Ensure binderfs_mntpt ends with a trailing '/' character instead of
> > > >   prepending it where used.
> > > >
> > > > v4:
> > > > Skip test if not run as root per Shuah Khan
> > > >
> > > > Add better test logging for abnormal child termination per Shuah Khan
> > > >
> > > > Adjust ordering of charge/uncharge during transfer to avoid potentially
> > > > hitting cgroup limit per Michal Koutný
> > > >
> > > > Adjust gpucg_try_charge critical section for charge transfer functionality
> > > >
> > > > Fix uninitialized return code error for dmabuf_try_charge error case
> > > >
> > > > v3:
> > > > Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz
> > > >
> > > > Use more common dual author commit message format per John Stultz
> > > >
> > > > Remove android from binder changes title per Todd Kjos
> > > >
> > > > Add a kselftest for this new behavior per Greg Kroah-Hartman
> > > >
> > > > Include details on behavior for all combinations of kernel/userspace
> > > > versions in changelog (thanks Suren Baghdasaryan) per Greg Kroah-Hartman.
> > > >
> > > > Fix pid and uid types in binder UAPI header
> > > >
> > > > v2:
> > > > See the previous revision of this change submitted by Hridya Valsaraju
> > > > at: https://lore.kernel.org/all/20220115010622.3185921-1-hridya@google.com/
> > > >
> > > > Move dma-buf cgroup charge transfer from a dma_buf_op defined by every
> > > > heap to a single dma-buf function for all heaps per Daniel Vetter and
> > > > Christian König. Pointers to struct gpucg and struct gpucg_device
> > > > tracking the current associations were added to the dma_buf struct to
> > > > achieve this.
> > > >
> > > > Fix incorrect Kconfig help section indentation per Randy Dunlap.
> > > >
> > > > History of the GPU cgroup controller
> > > > ====================================
> > > > The GPU/DRM cgroup controller came into being when a consensus[1]
> > > > was reached that the resources it tracked were unsuitable to be integrated
> > > > into memcg. Originally, the proposed controller was specific to the DRM
> > > > subsystem and was intended to track GEM buffers and GPU-specific
> > > > resources[2]. In order to help establish a unified memory accounting model
> > > > for all GPU and all related subsystems, Daniel Vetter put forth a
> > > > suggestion to move it out of the DRM subsystem so that it can be used by
> > > > other DMA-BUF exporters as well[3]. This RFC proposes an interface that
> > > > does the same.
> > > >
> > > > [1]: https://patchwork.kernel.org/project/dri-devel/cover/20190501140438.9506-1-brian.welty@intel.com/#22624705
> > > > [2]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
> > > > [3]: https://lore.kernel.org/amd-gfx/YCVOl8%2F87bqRSQei@phenom.ffwll.local/
> > > >
> > > > Hridya Valsaraju (3):
> > > >   gpu: rfc: Proposal for a GPU cgroup controller
> > > >   cgroup: gpu: Add a cgroup controller for allocator attribution of GPU
> > > >     memory
> > > >   binder: Add flags to relinquish ownership of fds
> > > >
> > > > T.J. Mercier (3):
> > > >   dmabuf: heaps: export system_heap buffers with GPU cgroup charging
> > > >   dmabuf: Add gpu cgroup charge transfer function
> > > >   selftests: Add binder cgroup gpu memory transfer tests
> > > >
> > > >  Documentation/admin-guide/cgroup-v2.rst       |  23 +
> > > >  drivers/android/binder.c                      |  31 +-
> > > >  drivers/dma-buf/dma-buf.c                     |  80 ++-
> > > >  drivers/dma-buf/dma-heap.c                    |  38 ++
> > > >  drivers/dma-buf/heaps/system_heap.c           |  28 +-
> > > >  include/linux/cgroup_gpu.h                    | 146 +++++
> > > >  include/linux/cgroup_subsys.h                 |   4 +
> > > >  include/linux/dma-buf.h                       |  49 +-
> > > >  include/linux/dma-heap.h                      |  15 +
> > > >  include/uapi/linux/android/binder.h           |  23 +-
> > > >  init/Kconfig                                  |   7 +
> > > >  kernel/cgroup/Makefile                        |   1 +
> > > >  kernel/cgroup/gpu.c                           | 390 +++++++++++++
> > > >  .../selftests/drivers/android/binder/Makefile |   8 +
> > > >  .../drivers/android/binder/binder_util.c      | 250 +++++++++
> > > >  .../drivers/android/binder/binder_util.h      |  32 ++
> > > >  .../selftests/drivers/android/binder/config   |   4 +
> > > >  .../binder/test_dmabuf_cgroup_transfer.c      | 526 ++++++++++++++++++
> > > >  18 files changed, 1632 insertions(+), 23 deletions(-)
> > > >  create mode 100644 include/linux/cgroup_gpu.h
> > > >  create mode 100644 kernel/cgroup/gpu.c
> > > >  create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
> > > >  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
> > > >  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
> > > >  create mode 100644 tools/testing/selftests/drivers/android/binder/config
> > > >  create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> > > >
> > >
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
@ 2022-05-13  3:43         ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-13  3:43 UTC (permalink / raw)
  To: Nicolas Dufresne
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz

On Thu, May 12, 2022 at 6:10 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>
> Le mercredi 11 mai 2022 à 13:31 -0700, T.J. Mercier a écrit :
> > On Wed, May 11, 2022 at 6:21 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > >
> > > Hi,
> > >
> > > Le mardi 10 mai 2022 à 23:56 +0000, T.J. Mercier a écrit :
> > > > This patch series revisits the proposal for a GPU cgroup controller to
> > > > track and limit memory allocations by various device/allocator
> > > > subsystems. The patch series also contains a simple prototype to
> > > > illustrate how Android intends to implement DMA-BUF allocator
> > > > attribution using the GPU cgroup controller. The prototype does not
> > > > include resource limit enforcements.
> > >
> > > I'm sorry, since I'm not in-depth technically involve. But from reading the
> > > topic I don't understand the bound this creates between DMABuf Heaps and GPU. Is
> > > this an attempt to really track the DMABuf allocated by userland, or just
> > > something for GPU ? What about V4L2 devices ? Any way this can be clarified,
> > > specially what would other subsystem needs to have cgroup DMABuf allocation
> > > controller support ?
> > >
> > Hi Nicolas,
> >
> > The link between dmabufs, dmabuf heaps, and "GPU memory" is maybe
> > somewhat of an Androidism. However this change aims to be usable for
> > tracking all GPU related allocations. It's just that this initial
> > series only adds support for tracking dmabufs allocated from dmabuf
> > heaps.
> >
> > In Android most graphics buffers are dma buffers allocated from a
> > dmabuf heap, so that is why these dmabuf heap allocations are being
> > tracked under the GPU cgroup. Other dmabuf exporters like V4L2 might
> > also want to track their buffers, but would probably want to do so
> > under a bucket name of something like "v4l2". Same goes for GEM
> > dmabufs. The naming scheme for this is still yet to be decided. It
> > would be cool to be able to attribute memory at the driver level, or
> > even different types of memory at the driver level, but I imagine
> > there is a point of diminishing returns for fine-grained
> > naming/bucketing.
> >
> > So far, I haven't tried to create a strict definition of what is and
> > is not "GPU memory" for the purpose of this accounting, so I don't
> > think we should be restricted to tracking just dmabufs. I don't see
> > why this couldn't be anything a driver wants to consider as GPU memory
> > as long as it is named/bucketed appropriately, such as both on-package
> > graphics card memory use and CPU memory dedicated for graphics use
> > like for host/device transfers.
> >
> > Is that helpful?
>
> I'm actually happy I asked this question; it wasn't silly after all. I think the
> problem here is a naming issue. What you are really monitoring is "video memory",
> which consists of memory segments allocated to store data used to render images
> (it's not always images of course; GPUs and VPUs have specialized buffers for
> their purposes).
>
> Whether this should be split between what is used specifically by the GPU
> drivers, the display drivers, the VPU (CODEC and pre/post-processor) or camera
> drivers is something that should be discussed. But in the current approach, you
> really mean video memory as a superset of the above. Personally, I think that
> generically (to de-Androidize your work), encompassing all video memory is
> sufficient. What I fail to understand is how you will manage to distinguish
> DMABuf heap allocations (which are used outside of Android, btw) from video
> allocations or other types of usage. I'm sure non-video usage will exist in the
> future (think of machine learning, compute, and other high-bandwidth streaming
> things...)
>
Ok, thank you for pointing out the naming issue. The naming is a
consequence of the initial use case, but I guess it's too specific.
What I want out of this change is that Android can track dmabufs that
come out of heaps, and DRM can track GPU memory. But other drivers
could track different resources under different names. Imagine this
were called a buffer cgroup controller instead of a GPU cgroup
controller. Then the use ("video memory") isn't tied to the name of
the controller; it's expressed by the name of the bucket the resource
is tracked under. I think this meets the needs of the two use cases
I'm aware of now, while leaving the door open to other future needs.
Really, the controller just enables abstract named buckets for
tracking, and eventually limiting, a type of resource.

P.S. I will be unavailable starting tomorrow, but I'll be back on Monday.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-05-13  3:43         ` T.J. Mercier
  (?)
@ 2022-05-13 16:13           ` Tejun Heo
  -1 siblings, 0 replies; 67+ messages in thread
From: Tejun Heo @ 2022-05-13 16:13 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Nicolas Dufresne, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan, Daniel Vetter,
	John Stultz, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest

Hello,

On Thu, May 12, 2022 at 08:43:52PM -0700, T.J. Mercier wrote:
> > I'm actually happy I've asked this question, wasn't silly after all. I think the
> > problem here is a naming issue. What you really are monitor is "video memory",
> > which consist of a memory segment allocated to store data used to render images
> > (its not always images of course, GPU an VPU have specialized buffers for their
> > purpose).
> >
> > Whether this should be split between what is used specifically by the GPU
> > drivers, the display drivers, the VPU (CODEC and pre/post-processor) or camera
> > drivers is something that should be discussed. But in the current approach, you
> > really meant Video memory as a superset of the above. Personally, I think
> > generically (to de-Andronized your work), en-globing all video memory is
> > sufficient. What I fail to understand is how you will manage to distinguished
> > DMABuf Heap allocation (which are used outside of Android btw), from Video
> > allocation or other type of usage. I'm sure non-video usage will exist in the
> > future (think of machine learning, compute, other high bandwidth streaming
> > thingy ...)
> >
> Ok thank you for pointing out the naming issue. The naming is a
> consequence of the initial use case, but I guess it's too specific.
> What I want out of this change is that android can track dmabufs that
> come out of heaps, and drm can track gpu memory. But other drivers
> could track different resources under different names. Imagine this
> were called a buffer cgroup controller instead of a GPU cgroup
> controller. Then the use component ("video memory") isn't tied up with
> the name of the controller, but it's up to the name of the bucket the
> resource is tracked under. I think this meets the needs of the two use
> cases I'm aware of now, while leaving the door open to other future
> needs. Really the controller is just enabling abstract named buckets
> for tracking and eventually limiting a type of resource.

So, there hasn't been a whole lot of discussion with other GPU folks, and
what comes up still seems to indicate that we're a long way away from having
a meaningful gpu controller. For your use case, would it make sense to just
add dmabuf as a key to the misc controller? I'm not sure it makes sense to
push a "gpu controller" forward if there's no conceptual consensus around
what the resources are.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-05-13 16:13           ` Tejun Heo
  (?)
@ 2022-05-17 23:30             ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-17 23:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nicolas Dufresne, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan, Daniel Vetter,
	John Stultz, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest

On Fri, May 13, 2022 at 9:13 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, May 12, 2022 at 08:43:52PM -0700, T.J. Mercier wrote:
> > > I'm actually happy I've asked this question, wasn't silly after all. I think the
> > > problem here is a naming issue. What you really are monitor is "video memory",
> > > which consist of a memory segment allocated to store data used to render images
> > > (its not always images of course, GPU an VPU have specialized buffers for their
> > > purpose).
> > >
> > > Whether this should be split between what is used specifically by the GPU
> > > drivers, the display drivers, the VPU (CODEC and pre/post-processor) or camera
> > > drivers is something that should be discussed. But in the current approach, you
> > > really meant Video memory as a superset of the above. Personally, I think
> > > generically (to de-Andronized your work), en-globing all video memory is
> > > sufficient. What I fail to understand is how you will manage to distinguished
> > > DMABuf Heap allocation (which are used outside of Android btw), from Video
> > > allocation or other type of usage. I'm sure non-video usage will exist in the
> > > future (think of machine learning, compute, other high bandwidth streaming
> > > thingy ...)
> > >
> > Ok thank you for pointing out the naming issue. The naming is a
> > consequence of the initial use case, but I guess it's too specific.
> > What I want out of this change is that android can track dmabufs that
> > come out of heaps, and drm can track gpu memory. But other drivers
> > could track different resources under different names. Imagine this
> > were called a buffer cgroup controller instead of a GPU cgroup
> > controller. Then the use component ("video memory") isn't tied up with
> > the name of the controller, but it's up to the name of the bucket the
> > resource is tracked under. I think this meets the needs of the two use
> > cases I'm aware of now, while leaving the door open to other future
> > needs. Really the controller is just enabling abstract named buckets
> > for tracking and eventually limiting a type of resource.
>
> So, there hasn't been whole lot of discussion w/ other GPU folks and what
> comes up still seems to indicate that we're still long way away from having
> a meaningful gpu controller.
>
Yes, and I would still be happy to collaborate.

> For your use case, would it make sense to just
> add dmabuf as a key to the misc controller?
>
Thanks for your suggestion. This almost works. "dmabuf" as a key could
work, but I'd actually like to account for each heap separately. Since
heaps can be dynamically added, I can't accommodate every potential heap
name by hardcoding registrations in the misc controller.

> I'm not sure it makes sense to
> push "gpu controller" forward if there's no conceptual consensus around what
> resources are.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
@ 2022-05-17 23:30             ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-17 23:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kselftest, linux-doc, Carlos Llamas, dri-devel,
	John Stultz, Zefan Li, Kalesh Singh, Joel Fernandes, Shuah Khan,
	Sumit Semwal, Kenny.Ho, Benjamin Gaignard, Jonathan Corbet,
	Martijn Coenen, Nicolas Dufresne, Laura Abbott, kernel-team,
	linux-media, Todd Kjos, linaro-mm-sig, Shuah Khan, cgroups,
	Suren Baghdasaryan, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Liam Mark, Christian König,
	Arve Hjønnevåg, Michal Koutný,
	Johannes Weiner, Hridya Valsaraju

On Fri, May 13, 2022 at 9:13 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, May 12, 2022 at 08:43:52PM -0700, T.J. Mercier wrote:
> > > I'm actually happy I've asked this question, wasn't silly after all. I think the
> > > problem here is a naming issue. What you really are monitor is "video memory",
> > > which consist of a memory segment allocated to store data used to render images
> > > (its not always images of course, GPU an VPU have specialized buffers for their
> > > purpose).
> > >
> > > Whether this should be split between what is used specifically by the GPU
> > > drivers, the display drivers, the VPU (CODEC and pre/post-processor) or camera
> > > drivers is something that should be discussed. But in the current approach, you
> > > really meant Video memory as a superset of the above. Personally, I think
> > > generically (to de-Andronized your work), en-globing all video memory is
> > > sufficient. What I fail to understand is how you will manage to distinguished
> > > DMABuf Heap allocation (which are used outside of Android btw), from Video
> > > allocation or other type of usage. I'm sure non-video usage will exist in the
> > > future (think of machine learning, compute, other high bandwidth streaming
> > > thingy ...)
> > >
> > Ok thank you for pointing out the naming issue. The naming is a
> > consequence of the initial use case, but I guess it's too specific.
> > What I want out of this change is that android can track dmabufs that
> > come out of heaps, and drm can track gpu memory. But other drivers
> > could track different resources under different names. Imagine this
> > were called a buffer cgroup controller instead of a GPU cgroup
> > controller. Then the use component ("video memory") isn't tied up with
> > the name of the controller, but it's up to the name of the bucket the
> > resource is tracked under. I think this meets the needs of the two use
> > cases I'm aware of now, while leaving the door open to other future
> > needs. Really the controller is just enabling abstract named buckets
> > for tracking and eventually limiting a type of resource.
>
> So, there hasn't been whole lot of discussion w/ other GPU folks and what
> comes up still seems to indicate that we're still long way away from having
> a meaningful gpu controller.
>
Yes, and I would still be happy to collaborate.

> For your use case, would it make sense to just
> add dmabuf as a key to the misc controller?
>
Thanks for your suggestion. This almost works. "dmabuf" as a key could
work, but I'd actually like to account for each heap. Since heaps can
be dynamically added, I can't accommodate every potential heap name by
hardcoding registrations in the misc controller.

> I'm not sure it makes sense to
> push a "gpu controller" forward if there's no conceptual consensus around
> what the resources are.
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 67+ messages in thread


* Re: [PATCH v7 1/6] gpu: rfc: Proposal for a GPU cgroup controller
  2022-05-10 23:56 ` T.J. Mercier
  (?)
@ 2022-05-19  9:30   ` eballetbo
  -1 siblings, 0 replies; 67+ messages in thread
From: eballetbo @ 2022-05-19  9:30 UTC (permalink / raw)
  To: lizefan.x, corbet, joel, arve, tjmercier, maco,
	benjamin.gaignard, tj, brauner, sumit.semwal, tkjos, surenb,
	hannes, Brian.Starkey, christian.koenig, gregkh, lmark,
	john.stultz, hridya, shuah, labbott
  Cc: Enric Balletbo i Serra, cgroups, kernel-team, linux-media,
	dri-devel, linaro-mm-sig, cmllamas, daniel, Kenny.Ho,
	linux-kselftest, kaleshsingh, mkoutny, jstultz, linux-doc,
	linux-kernel, skhan

From: Enric Balletbo i Serra <eballetbo@kernel.org>

On Tue, 10 May 2022 23:56:45 +0000, T.J. Mercier wrote:
> From: Hridya Valsaraju <hridya@google.com>
>

Hi T.J. Mercier,

Many thanks for this effort. It caught my attention because we might have a use
case where this feature could be useful for us, hence I'd like to jump in and
be part of the discussion. I'd really appreciate it if you could cc me on
future versions.

While reading the full patchset I was a bit confused about the status of this
proposal. In fact, the 'rfc' in the subject combined with the number of
iterations (already seven) confused me. So I'm wondering if this is an RFC or a
'real' proposal that you already want to land.

If this is still an RFC, I'd remove the 'rfc: Proposal' and use the more
canonical way, which is to put RFC inside the brackets, i.e. [PATCH RFC v7]
cgroup: Add a GPU cgroup controller.

If it is not, I'd just remove the RFC and file the subject under the cgroup
subsystem instead of gpu, i.e. [PATCH v7] cgroup: Add a GPU cgroup controller.

I don't want to nitpick, but IMO that helps new people join the history of the
patchset.

> This patch adds a proposal for a new GPU cgroup controller for
> accounting/limiting GPU and GPU-related memory allocations.

As far as I can see, the only thing being added here is the accounting, so I'd
remove any reference to limiting and just explain what the patch really
introduces, not the future; otherwise it is confusing and you expect more than
the patch really does.

It is important to keep the commit message in sync with what the patch really
does.

> The proposed controller is based on the DRM cgroup controller[1] and
> follows the design of the RDMA cgroup controller.
> 
> The new cgroup controller would:
> * Allow setting per-device limits on the total size of buffers
>   allocated by device within a cgroup.
> * Expose a per-device/allocator breakdown of the buffers charged to a
>   cgroup.
> 
> The prototype in the following patches is only for memory accounting
> using the GPU cgroup controller and does not implement limit setting.
> 
> [1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
> 

I think this is material for the cover letter more than the commit message.
When I read this I was expecting all of it in this patch.

> Signed-off-by: Hridya Valsaraju <hridya@google.com>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>
> ---
> v7 changes
> Remove comment about duplicate name rejection which is not relevant to
> cgroups users per Michal Koutný.
> 
> v6 changes
> Move documentation into cgroup-v2.rst per Tejun Heo.
> 
> v5 changes
> Drop the global GPU cgroup "total" (sum of all device totals) portion
> of the design since there is no currently known use for this per
> Tejun Heo.
> 
> Update for renamed functions/variables.
> 
> v3 changes
> Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz.
> 
> Use more common dual author commit message format per John Stultz.
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 69d7a6983f78..2e1d26e327c7 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2352,6 +2352,29 @@ first, and stays charged to that cgroup until that resource is freed. Migrating
>  a process to a different cgroup does not move the charge to the destination
>  cgroup where the process has moved.
>  
> +
> +GPU
> +---
> +
> +The GPU controller accounts for device and system memory allocated by the GPU
> +and related subsystems for graphics use. Resource limits are not currently
> +supported.
> +
> +GPU Interface Files
> +~~~~~~~~~~~~~~~~~~~~
> +
> +  gpu.memory.current
> +	A read-only file containing memory allocations in flat-keyed format. The key
> +	is a string representing the device name. The value is the size of the memory
> +	charged to the device in bytes. The device names are globally unique.::
> +
> +	  $ cat /sys/kernel/fs/cgroup1/gpu.memory.current

I think this is outdated, you are using cgroup v2, right?

> +	  dev1 4194304
> +	  dev2 104857600
> +

When I applied the full series I was expecting to see the memory allocated by
GPU devices, or by users of the GPU, in this file. But after some experiments,
what I saw is the memory allocated by any process that uses the dma-buf heap
API (not necessarily GPU users). For example, if you create a small program
that allocates some memory via the dma-buf heap API and then cat the
gpu.memory.current file, you see that the memory accounted is not related to
the GPU.

This is really confusing; it looks to me like the patches evolved to account
memory that is not really related to the GPU but is allocated via the dma-buf
heap API. IMO the name of the file should reflect what it really does, to
avoid confusion.

So, is this patchset meant to be GPU specific? If the answer is yes, that's
good, but that's not what I experienced. Am I missing something?

If the answer is that it evolved to track dma-buf heap allocations, I think all
the patches need some rework to adapt the wording, as right now the GPU
wording seems confusing to me.

> +	The device name string is set by a device driver when it registers with the
> +	GPU cgroup controller to participate in resource accounting.
> +
>  Others
>  ------
>
>
Thanks,
 Enric
 

^ permalink raw reply	[flat|nested] 67+ messages in thread


* Re: [PATCH v7 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory
  2022-05-10 23:56 ` T.J. Mercier
  (?)
@ 2022-05-19 10:52   ` eballetbo
  -1 siblings, 0 replies; 67+ messages in thread
From: eballetbo @ 2022-05-19 10:52 UTC (permalink / raw)
  To: lizefan.x, corbet, joel, arve, tjmercier, maco,
	benjamin.gaignard, tj, brauner, sumit.semwal, tkjos, surenb,
	hannes, Brian.Starkey, christian.koenig, gregkh, lmark,
	john.stultz, hridya, shuah, labbott
  Cc: Enric Balletbo i Serra, cgroups, kernel-team, linux-media,
	dri-devel, linaro-mm-sig, cmllamas, daniel, Kenny.Ho,
	linux-kselftest, kaleshsingh, mkoutny, jstultz, linux-doc,
	linux-kernel, skhan

From: Enric Balletbo i Serra <eballetbo@kernel.org>

On Tue, 10 May 2022 23:56:46 +0000, T.J. Mercier wrote:
> From: Hridya Valsaraju <hridya@google.com>
> 
> The cgroup controller provides accounting for GPU and GPU-related
> memory allocations. The memory being accounted can be device memory or
> memory allocated from pools dedicated to serve GPU-related tasks.
> 
> This patch adds APIs to:
> -allow a device to register for memory accounting using the GPU cgroup
> controller.
> -charge and uncharge allocated memory to a cgroup.
> 
> When the cgroup controller is enabled, it would expose information about
> the memory allocated by each device(registered for GPU cgroup memory
> accounting) for each cgroup.
> 
> The API/UAPI can be extended to set per-device/total allocation limits
> in the future.
> 
> The cgroup controller has been named following the discussion in [1].
> 
> [1]: https://lore.kernel.org/amd-gfx/YCJp%2F%2FkMC7YjVMXv@phenom.ffwll.local/
> 
> Signed-off-by: Hridya Valsaraju <hridya@google.com>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>
> ---
> v7 changes
> Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
> This means gpucg_register_bucket now returns an internally allocated
> struct gpucg_bucket.
> 
> Move all public function documentation to the cgroup_gpu.h header.
> 
> v5 changes
> Support all strings for gpucg_register_device instead of just string
> literals.
> 
> Enforce globally unique gpucg_bucket names.
> 
> Constrain gpucg_bucket name lengths to 64 bytes.
> 
> Obtain just a single css refcount instead of nr_pages for each
> charge.
> 
> Rename:
> gpucg_try_charge -> gpucg_charge
> find_cg_rpool_locked -> cg_rpool_find_locked
> init_cg_rpool -> cg_rpool_init
> get_cg_rpool_locked -> cg_rpool_get_locked
> "gpu cgroup controller" -> "GPU controller"
> gpucg_device -> gpucg_bucket
> usage -> size
> 
> v4 changes
> Adjust gpucg_try_charge critical section for future charge transfer
> functionality.
> 
> v3 changes
> Use more common dual author commit message format per John Stultz.
> 
> v2 changes
> Fix incorrect Kconfig help section indentation per Randy Dunlap.
> ---
>  include/linux/cgroup_gpu.h    | 122 ++++++++++++
>  include/linux/cgroup_subsys.h |   4 +
>  init/Kconfig                  |   7 +
>  kernel/cgroup/Makefile        |   1 +
>  kernel/cgroup/gpu.c           | 339 ++++++++++++++++++++++++++++++++++
>  5 files changed, 473 insertions(+)
>  create mode 100644 include/linux/cgroup_gpu.h
>  create mode 100644 kernel/cgroup/gpu.c
> 
> diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
> new file mode 100644
> index 000000000000..cb228a16aa1f
> --- /dev/null
> +++ b/include/linux/cgroup_gpu.h
> @@ -0,0 +1,122 @@
> +/* SPDX-License-Identifier: MIT
> + * Copyright 2019 Advanced Micro Devices, Inc.
> + * Copyright (C) 2022 Google LLC.
> + */
> +#ifndef _CGROUP_GPU_H
> +#define _CGROUP_GPU_H
> +
> +#include <linux/cgroup.h>
> +
> +#define GPUCG_BUCKET_NAME_MAX_LEN 64
> +
> +struct gpucg;
> +struct gpucg_bucket;
> +
> +#ifdef CONFIG_CGROUP_GPU
> +
> +/**
> + * css_to_gpucg - get the corresponding gpucg ref from a cgroup_subsys_state
> + * @css: the target cgroup_subsys_state
> + *
> + * Returns: gpu cgroup that contains the @css
> + */
> +struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css);
> +
> +/**
> + * gpucg_get - get the gpucg reference that a task belongs to
> + * @task: the target task
> + *
> + * This increases the reference count of the css that the @task belongs to.
> + *
> + * Returns: reference to the gpu cgroup the task belongs to.
> + */
> +struct gpucg *gpucg_get(struct task_struct *task);
> +
> +/**
> + * gpucg_put - put a gpucg reference
> + * @gpucg: the target gpucg
> + *
> + * Put a reference obtained via gpucg_get
> + */
> +void gpucg_put(struct gpucg *gpucg);
> +
> +/**
> + * gpucg_parent - find the parent of a gpu cgroup
> + * @cg: the target gpucg
> + *
> + * This does not increase the reference count of the parent cgroup
> + *
> + * Returns: parent gpu cgroup of @cg
> + */
> +struct gpucg *gpucg_parent(struct gpucg *cg);
> +
> +/**
> + * gpucg_charge - charge memory to the specified gpucg and gpucg_bucket.
> + * Caller must hold a reference to @gpucg obtained through gpucg_get(). The size of the memory is
> + * rounded up to be a multiple of the page size.
> + *
> + * @gpucg: The gpu cgroup to charge the memory to.
> + * @bucket: The bucket to charge the memory to.
> + * @size: The size of memory to charge in bytes.
> + *        This size will be rounded up to the nearest page size.
> + *
> + * Return: returns 0 if the charging is successful and otherwise returns an error code.
> + */
> +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
> +
> +/**
> + * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_bucket.
> + * The caller must hold a reference to @gpucg obtained through gpucg_get().
> + *
> + * @gpucg: The gpu cgroup to uncharge the memory from.
> + * @bucket: The bucket to uncharge the memory from.
> + * @size: The size of memory to uncharge in bytes.
> + *        This size will be rounded up to the nearest page size.
> + */
> +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
> +
> +/**
> + * gpucg_register_bucket - Registers a bucket for memory accounting using the GPU cgroup controller.
> + *
> + * @name: Pointer to a null-terminated string to denote the name of the bucket. This name should be
> + *        globally unique, and should not exceed @GPUCG_BUCKET_NAME_MAX_LEN bytes.
> + *
> + * @bucket must remain valid. @name will be copied.
> + *
> + * Returns a pointer to a newly allocated bucket on success, or an errno code otherwise. As buckets
> + * cannot be unregistered, this can never be freed.
> + */
> +struct gpucg_bucket *gpucg_register_bucket(const char *name);
> +#else /* CONFIG_CGROUP_GPU */
> +
> +static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> +{
> +	return NULL;
> +}
> +
> +static inline struct gpucg *gpucg_get(struct task_struct *task)
> +{
> +	return NULL;
> +}
> +
> +static inline void gpucg_put(struct gpucg *gpucg) {}
> +
> +static inline struct gpucg *gpucg_parent(struct gpucg *cg)
> +{
> +	return NULL;
> +}
> +
> +static inline int gpucg_charge(struct gpucg *gpucg,
> +			       struct gpucg_bucket *bucket,
> +			       u64 size)
> +{
> +	return 0;
> +}
> +
> +static inline void gpucg_uncharge(struct gpucg *gpucg,
> +				  struct gpucg_bucket *bucket,
> +				  u64 size) {}
> +
> +static inline struct gpucg_bucket *gpucg_register_bucket(const char *name) {}

I think this needs to return NULL; otherwise you'll get a compiler error when
CONFIG_CGROUP_GPU is not set.

I found other build errors when CONFIG_CGROUP_GPU is not set; please fix them
in the next version.

Thanks,
  Enric

> +#endif /* CONFIG_CGROUP_GPU */
> +#endif /* _CGROUP_GPU_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 445235487230..46a2a7b93c41 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -65,6 +65,10 @@ SUBSYS(rdma)
>  SUBSYS(misc)
>  #endif
>  
> +#if IS_ENABLED(CONFIG_CGROUP_GPU)
> +SUBSYS(gpu)
> +#endif
> +
>  /*
>   * The following subsystems are not supported on the default hierarchy.
>   */
> diff --git a/init/Kconfig b/init/Kconfig
> index ddcbefe535e9..2e00a190e170 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -984,6 +984,13 @@ config BLK_CGROUP
>  
>  	See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
>  
> +config CGROUP_GPU
> +	bool "GPU controller (EXPERIMENTAL)"
> +	select PAGE_COUNTER
> +	help
> +	  Provides accounting and limit setting for memory allocations by the GPU and
> +	  GPU-related subsystems.
> +
>  config CGROUP_WRITEBACK
>  	bool
>  	depends on MEMCG && BLK_CGROUP
> diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> index 12f8457ad1f9..be95a5a532fc 100644
> --- a/kernel/cgroup/Makefile
> +++ b/kernel/cgroup/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_CGROUP_RDMA) += rdma.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_CGROUP_MISC) += misc.o
>  obj-$(CONFIG_CGROUP_DEBUG) += debug.o
> +obj-$(CONFIG_CGROUP_GPU) += gpu.o
> diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
> new file mode 100644
> index 000000000000..ad16ea15d427
> --- /dev/null
> +++ b/kernel/cgroup/gpu.c
> @@ -0,0 +1,339 @@
> +// SPDX-License-Identifier: MIT
> +// Copyright 2019 Advanced Micro Devices, Inc.
> +// Copyright (C) 2022 Google LLC.
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_gpu.h>
> +#include <linux/err.h>
> +#include <linux/gfp.h>
> +#include <linux/list.h>
> +#include <linux/mm.h>
> +#include <linux/page_counter.h>
> +#include <linux/seq_file.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +
> +static struct gpucg *root_gpucg __read_mostly;
> +
> +/*
> + * Protects list of resource pools maintained on per cgroup basis and list
> + * of buckets registered for memory accounting using the GPU cgroup controller.
> + */
> +static DEFINE_MUTEX(gpucg_mutex);
> +static LIST_HEAD(gpucg_buckets);
> +
> +/* The GPU cgroup controller data structure */
> +struct gpucg {
> +	struct cgroup_subsys_state css;
> +
> +	/* list of all resource pools that belong to this cgroup */
> +	struct list_head rpools;
> +};
> +
> +/* A named entity representing a bucket of tracked memory. */
> +struct gpucg_bucket {
> +	/* list of various resource pools in various cgroups that the bucket is part of */
> +	struct list_head rpools;
> +
> +	/* list of all buckets registered for GPU cgroup accounting */
> +	struct list_head bucket_node;
> +
> +	/* string to be used as identifier for accounting and limit setting */
> +	const char *name;
> +};
> +
> +struct gpucg_resource_pool {
> +	/* The bucket whose resource usage is tracked by this resource pool */
> +	struct gpucg_bucket *bucket;
> +
> +	/* list of all resource pools for the cgroup */
> +	struct list_head cg_node;
> +
> +	/* list maintained by the gpucg_bucket to keep track of its resource pools */
> +	struct list_head bucket_node;
> +
> +	/* tracks memory usage of the resource pool */
> +	struct page_counter total;
> +};
> +
> +static void free_cg_rpool_locked(struct gpucg_resource_pool *rpool)
> +{
> +	lockdep_assert_held(&gpucg_mutex);
> +
> +	list_del(&rpool->cg_node);
> +	list_del(&rpool->bucket_node);
> +	kfree(rpool);
> +}
> +
> +static void gpucg_css_free(struct cgroup_subsys_state *css)
> +{
> +	struct gpucg_resource_pool *rpool, *tmp;
> +	struct gpucg *gpucg = css_to_gpucg(css);
> +
> +	// delete all resource pools
> +	mutex_lock(&gpucg_mutex);
> +	list_for_each_entry_safe(rpool, tmp, &gpucg->rpools, cg_node)
> +		free_cg_rpool_locked(rpool);
> +	mutex_unlock(&gpucg_mutex);
> +
> +	kfree(gpucg);
> +}
> +
> +static struct cgroup_subsys_state *
> +gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
> +{
> +	struct gpucg *gpucg, *parent;
> +
> +	gpucg = kzalloc(sizeof(struct gpucg), GFP_KERNEL);
> +	if (!gpucg)
> +		return ERR_PTR(-ENOMEM);
> +
> +	parent = css_to_gpucg(parent_css);
> +	if (!parent)
> +		root_gpucg = gpucg;
> +
> +	INIT_LIST_HEAD(&gpucg->rpools);
> +
> +	return &gpucg->css;
> +}
> +
> +static struct gpucg_resource_pool *cg_rpool_find_locked(
> +	struct gpucg *cg,
> +	struct gpucg_bucket *bucket)
> +{
> +	struct gpucg_resource_pool *rpool;
> +
> +	lockdep_assert_held(&gpucg_mutex);
> +
> +	list_for_each_entry(rpool, &cg->rpools, cg_node)
> +		if (rpool->bucket == bucket)
> +			return rpool;
> +
> +	return NULL;
> +}
> +
> +static struct gpucg_resource_pool *cg_rpool_init(struct gpucg *cg,
> +						 struct gpucg_bucket *bucket)
> +{
> +	struct gpucg_resource_pool *rpool = kzalloc(sizeof(*rpool),
> +							GFP_KERNEL);
> +	if (!rpool)
> +		return ERR_PTR(-ENOMEM);
> +
> +	rpool->bucket = bucket;
> +
> +	page_counter_init(&rpool->total, NULL);
> +	INIT_LIST_HEAD(&rpool->cg_node);
> +	INIT_LIST_HEAD(&rpool->bucket_node);
> +	list_add_tail(&rpool->cg_node, &cg->rpools);
> +	list_add_tail(&rpool->bucket_node, &bucket->rpools);
> +
> +	return rpool;
> +}
> +
> +/**
> + * cg_rpool_get_locked - find the resource pool for the specified bucket and
> + * specified cgroup. If the resource pool does not exist for the cg, it is
> + * created hierarchically in the cgroup and in its ancestor cgroups that
> + * do not already have a resource pool entry for the bucket.
> + *
> + * @cg: The cgroup to find the resource pool for.
> + * @bucket: The bucket associated with the returned resource pool.
> + *
> + * Return: the resource pool entry corresponding to the specified bucket in
> + * the specified cgroup (hierarchically creating pools if not already present).
> + *
> + */
> +static struct gpucg_resource_pool *
> +cg_rpool_get_locked(struct gpucg *cg, struct gpucg_bucket *bucket)
> +{
> +	struct gpucg *parent_cg, *p, *stop_cg;
> +	struct gpucg_resource_pool *rpool, *tmp_rpool;
> +	struct gpucg_resource_pool *parent_rpool = NULL, *leaf_rpool = NULL;
> +
> +	rpool = cg_rpool_find_locked(cg, bucket);
> +	if (rpool)
> +		return rpool;
> +
> +	stop_cg = cg;
> +	do {
> +		rpool = cg_rpool_init(stop_cg, bucket);
> +		if (IS_ERR(rpool))
> +			goto err;
> +
> +		if (!leaf_rpool)
> +			leaf_rpool = rpool;
> +
> +		stop_cg = gpucg_parent(stop_cg);
> +		if (!stop_cg)
> +			break;
> +
> +		rpool = cg_rpool_find_locked(stop_cg, bucket);
> +	} while (!rpool);
> +
> +	/*
> +	 * Re-initialize page counters of all rpools created in this invocation
> +	 * to enable hierarchical charging.
> +	 * stop_cg is the first ancestor cg who already had a resource pool for
> +	 * the bucket. It can also be NULL if no ancestors had a pre-existing
> +	 * resource pool for the bucket before this invocation.
> +	 */
> +	rpool = leaf_rpool;
> +	for (p = cg; p != stop_cg; p = parent_cg) {
> +		parent_cg = gpucg_parent(p);
> +		if (!parent_cg)
> +			break;
> +		parent_rpool = cg_rpool_find_locked(parent_cg, bucket);
> +		page_counter_init(&rpool->total, &parent_rpool->total);
> +
> +		rpool = parent_rpool;
> +	}
> +
> +	return leaf_rpool;
> +err:
> +	for (p = cg; p != stop_cg; p = gpucg_parent(p)) {
> +		tmp_rpool = cg_rpool_find_locked(p, bucket);
> +		free_cg_rpool_locked(tmp_rpool);
> +	}
> +	return rpool;
> +}
> +
> +struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> +{
> +	return css ? container_of(css, struct gpucg, css) : NULL;
> +}
> +
> +struct gpucg *gpucg_get(struct task_struct *task)
> +{
> +	if (!cgroup_subsys_enabled(gpu_cgrp_subsys))
> +		return NULL;
> +	return css_to_gpucg(task_get_css(task, gpu_cgrp_id));
> +}
> +
> +void gpucg_put(struct gpucg *gpucg)
> +{
> +	if (gpucg)
> +		css_put(&gpucg->css);
> +}
> +
> +struct gpucg *gpucg_parent(struct gpucg *cg)
> +{
> +	return css_to_gpucg(cg->css.parent);
> +}
> +
> +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> +{
> +	struct page_counter *counter;
> +	u64 nr_pages;
> +	struct gpucg_resource_pool *rp;
> +	int ret = 0;
> +
> +	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> +
> +	mutex_lock(&gpucg_mutex);
> +	rp = cg_rpool_get_locked(gpucg, bucket);
> +	/*
> +	 * Continue to hold gpucg_mutex because we use it to block charges while transfers are in
> +	 * progress to avoid potentially exceeding a limit.
> +	 */
> +	if (IS_ERR(rp)) {
> +		mutex_unlock(&gpucg_mutex);
> +		return PTR_ERR(rp);
> +	}
> +
> +	if (page_counter_try_charge(&rp->total, nr_pages, &counter))
> +		css_get(&gpucg->css);
> +	else
> +		ret = -ENOMEM;
> +	mutex_unlock(&gpucg_mutex);
> +
> +	return ret;
> +}
> +
> +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> +{
> +	u64 nr_pages;
> +	struct gpucg_resource_pool *rp;
> +
> +	mutex_lock(&gpucg_mutex);
> +	rp = cg_rpool_find_locked(gpucg, bucket);
> +	/*
> +	 * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is freed and there are
> +	 * active refs on gpucg. Uncharges are fine while transfers are in progress since there is
> +	 * no potential to exceed a limit while uncharging and transferring.
> +	 */
> +	mutex_unlock(&gpucg_mutex);
> +
> +	if (unlikely(!rp)) {
> +		pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n");
> +		return;
> +	}
> +
> +	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> +	page_counter_uncharge(&rp->total, nr_pages);
> +	css_put(&gpucg->css);
> +}
> +
> +struct gpucg_bucket *gpucg_register_bucket(const char *name)
> +{
> +	struct gpucg_bucket *bucket, *b;
> +
> +	if (!name)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (strlen(name) >= GPUCG_BUCKET_NAME_MAX_LEN)
> +		return ERR_PTR(-ENAMETOOLONG);
> +
> +	bucket = kzalloc(sizeof(struct gpucg_bucket), GFP_KERNEL);
> +	if (!bucket)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&bucket->bucket_node);
> +	INIT_LIST_HEAD(&bucket->rpools);
> +	bucket->name = kstrdup_const(name, GFP_KERNEL);
> +
> +	mutex_lock(&gpucg_mutex);
> +	list_for_each_entry(b, &gpucg_buckets, bucket_node) {
> +		if (strncmp(b->name, bucket->name, GPUCG_BUCKET_NAME_MAX_LEN) == 0) {
> +			mutex_unlock(&gpucg_mutex);
> +			kfree_const(bucket->name);
> +			kfree(bucket);
> +			return ERR_PTR(-EEXIST);
> +		}
> +	}
> +	list_add_tail(&bucket->bucket_node, &gpucg_buckets);
> +	mutex_unlock(&gpucg_mutex);
> +
> +	return bucket;
> +}
> +
> +static int gpucg_resource_show(struct seq_file *sf, void *v)
> +{
> +	struct gpucg_resource_pool *rpool;
> +	struct gpucg *cg = css_to_gpucg(seq_css(sf));
> +
> +	mutex_lock(&gpucg_mutex);
> +	list_for_each_entry(rpool, &cg->rpools, cg_node) {
> +		seq_printf(sf, "%s %lu\n", rpool->bucket->name,
> +			   page_counter_read(&rpool->total) * PAGE_SIZE);
> +	}
> +	mutex_unlock(&gpucg_mutex);
> +
> +	return 0;
> +}
> +
> +struct cftype files[] = {
> +	{
> +		.name = "memory.current",
> +		.seq_show = gpucg_resource_show,
> +	},
> +	{ }     /* terminate */
> +};
> +
> +struct cgroup_subsys gpu_cgrp_subsys = {
> +	.css_alloc      = gpucg_css_alloc,
> +	.css_free       = gpucg_css_free,
> +	.early_init     = false,
> +	.legacy_cftypes = files,
> +	.dfl_cftypes    = files,
> +};
> 
> -- 
> 2.36.0.512.ge40c2bad7a-goog
> 
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory
@ 2022-05-19 10:52   ` eballetbo
  0 siblings, 0 replies; 67+ messages in thread
From: eballetbo @ 2022-05-19 10:52 UTC (permalink / raw)
  To: lizefan.x, corbet, joel, arve, tjmercier, maco,
	benjamin.gaignard, tj, brauner, sumit.semwal, tkjos, surenb,
	hannes, Brian.Starkey, christian.koenig, gregkh, lmark,
	john.stultz, hridya, shuah, labbott
  Cc: linux-kselftest, linux-doc, Kenny.Ho, skhan, cmllamas, dri-devel,
	linux-kernel, linaro-mm-sig, jstultz, mkoutny, kaleshsingh,
	cgroups, Enric Balletbo i Serra, kernel-team, linux-media

From: Enric Balletbo i Serra <eballetbo@kernel.org>

On Tue, 10 May 2022 23:56:46 +0000, T.J. Mercier wrote
> From: Hridya Valsaraju <hridya@google.com>
> 
> The cgroup controller provides accounting for GPU and GPU-related
> memory allocations. The memory being accounted can be device memory or
> memory allocated from pools dedicated to serve GPU-related tasks.
> 
> This patch adds APIs to:
> -allow a device to register for memory accounting using the GPU cgroup
> controller.
> -charge and uncharge allocated memory to a cgroup.
> 
> When the cgroup controller is enabled, it exposes information about
> the memory allocated by each device (registered for GPU cgroup memory
> accounting) for each cgroup.
> 
> The API/UAPI can be extended to set per-device/total allocation limits
> in the future.
> 
> The cgroup controller has been named following the discussion in [1].
> 
> [1]: https://lore.kernel.org/amd-gfx/YCJp%2F%2FkMC7YjVMXv@phenom.ffwll.local/
> 
> Signed-off-by: Hridya Valsaraju <hridya@google.com>
> Signed-off-by: T.J. Mercier <tjmercier@google.com>
> ---
> v7 changes
> Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
> This means gpucg_register_bucket now returns an internally allocated
> struct gpucg_bucket.
> 
> Move all public function documentation to the cgroup_gpu.h header.
> 
> v5 changes
> Support all strings for gpucg_register_device instead of just string
> literals.
> 
> Enforce globally unique gpucg_bucket names.
> 
> Constrain gpucg_bucket name lengths to 64 bytes.
> 
> Obtain just a single css refcount instead of nr_pages for each
> charge.
> 
> Rename:
> gpucg_try_charge -> gpucg_charge
> find_cg_rpool_locked -> cg_rpool_find_locked
> init_cg_rpool -> cg_rpool_init
> get_cg_rpool_locked -> cg_rpool_get_locked
> "gpu cgroup controller" -> "GPU controller"
> gpucg_device -> gpucg_bucket
> usage -> size
> 
> v4 changes
> Adjust gpucg_try_charge critical section for future charge transfer
> functionality.
> 
> v3 changes
> Use more common dual author commit message format per John Stultz.
> 
> v2 changes
> Fix incorrect Kconfig help section indentation per Randy Dunlap.
> ---
>  include/linux/cgroup_gpu.h    | 122 ++++++++++++
>  include/linux/cgroup_subsys.h |   4 +
>  init/Kconfig                  |   7 +
>  kernel/cgroup/Makefile        |   1 +
>  kernel/cgroup/gpu.c           | 339 ++++++++++++++++++++++++++++++++++
>  5 files changed, 473 insertions(+)
>  create mode 100644 include/linux/cgroup_gpu.h
>  create mode 100644 kernel/cgroup/gpu.c
> 
> diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
> new file mode 100644
> index 000000000000..cb228a16aa1f
> --- /dev/null
> +++ b/include/linux/cgroup_gpu.h
> @@ -0,0 +1,122 @@
> +/* SPDX-License-Identifier: MIT
> + * Copyright 2019 Advanced Micro Devices, Inc.
> + * Copyright (C) 2022 Google LLC.
> + */
> +#ifndef _CGROUP_GPU_H
> +#define _CGROUP_GPU_H
> +
> +#include <linux/cgroup.h>
> +
> +#define GPUCG_BUCKET_NAME_MAX_LEN 64
> +
> +struct gpucg;
> +struct gpucg_bucket;
> +
> +#ifdef CONFIG_CGROUP_GPU
> +
> +/**
> + * css_to_gpucg - get the corresponding gpucg ref from a cgroup_subsys_state
> + * @css: the target cgroup_subsys_state
> + *
> + * Returns: gpu cgroup that contains the @css
> + */
> +struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css);
> +
> +/**
> + * gpucg_get - get the gpucg reference that a task belongs to
> + * @task: the target task
> + *
> + * This increases the reference count of the css that the @task belongs to.
> + *
> + * Returns: reference to the gpu cgroup the task belongs to.
> + */
> +struct gpucg *gpucg_get(struct task_struct *task);
> +
> +/**
> + * gpucg_put - put a gpucg reference
> + * @gpucg: the target gpucg
> + *
> + * Put a reference obtained via gpucg_get
> + */
> +void gpucg_put(struct gpucg *gpucg);
> +
> +/**
> + * gpucg_parent - find the parent of a gpu cgroup
> + * @cg: the target gpucg
> + *
> + * This does not increase the reference count of the parent cgroup
> + *
> + * Returns: parent gpu cgroup of @cg
> + */
> +struct gpucg *gpucg_parent(struct gpucg *cg);
> +
> +/**
> + * gpucg_charge - charge memory to the specified gpucg and gpucg_bucket.
> + * Caller must hold a reference to @gpucg obtained through gpucg_get(). The size of the memory is
> + * rounded up to be a multiple of the page size.
> + *
> + * @gpucg: The gpu cgroup to charge the memory to.
> + * @bucket: The bucket to charge the memory to.
> + * @size: The size of memory to charge in bytes.
> + *        This size will be rounded up to the nearest page size.
> + *
> + * Return: returns 0 if the charging is successful and otherwise returns an error code.
> + */
> +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
> +
> +/**
> + * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_bucket.
> + * The caller must hold a reference to @gpucg obtained through gpucg_get().
> + *
> + * @gpucg: The gpu cgroup to uncharge the memory from.
> + * @bucket: The bucket to uncharge the memory from.
> + * @size: The size of memory to uncharge in bytes.
> + *        This size will be rounded up to the nearest page size.
> + */
> +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
> +
> +/**
> + * gpucg_register_bucket - Registers a bucket for memory accounting using the GPU cgroup controller.
> + *
> + * @name: Pointer to a null-terminated string to denote the name of the bucket. This name should be
> + *        globally unique, and should not exceed @GPUCG_BUCKET_NAME_MAX_LEN bytes.
> + *
> + * @name will be copied, so it need not remain valid after the call.
> + *
> + * Returns a pointer to a newly allocated bucket on success, or an ERR_PTR-encoded errno
> + * otherwise. As buckets cannot be unregistered, the returned bucket is never freed.
> + */
> +struct gpucg_bucket *gpucg_register_bucket(const char *name);
> +#else /* CONFIG_CGROUP_GPU */
> +
> +static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> +{
> +	return NULL;
> +}
> +
> +static inline struct gpucg *gpucg_get(struct task_struct *task)
> +{
> +	return NULL;
> +}
> +
> +static inline void gpucg_put(struct gpucg *gpucg) {}
> +
> +static inline struct gpucg *gpucg_parent(struct gpucg *cg)
> +{
> +	return NULL;
> +}
> +
> +static inline int gpucg_charge(struct gpucg *gpucg,
> +			       struct gpucg_bucket *bucket,
> +			       u64 size)
> +{
> +	return 0;
> +}
> +
> +static inline void gpucg_uncharge(struct gpucg *gpucg,
> +				  struct gpucg_bucket *bucket,
> +				  u64 size) {}
> +
> +static inline struct gpucg_bucket *gpucg_register_bucket(const char *name) {}

I think this needs to return NULL, otherwise you'll get a compiler error when
CONFIG_CGROUP_GPU is not set. 

I found other build errors when CONFIG_CGROUP_GPU is not set; please fix them in
the next version.

Thanks,
  Enric

> +#endif /* CONFIG_CGROUP_GPU */
> +#endif /* _CGROUP_GPU_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 445235487230..46a2a7b93c41 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -65,6 +65,10 @@ SUBSYS(rdma)
>  SUBSYS(misc)
>  #endif
>  
> +#if IS_ENABLED(CONFIG_CGROUP_GPU)
> +SUBSYS(gpu)
> +#endif
> +
>  /*
>   * The following subsystems are not supported on the default hierarchy.
>   */
> diff --git a/init/Kconfig b/init/Kconfig
> index ddcbefe535e9..2e00a190e170 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -984,6 +984,13 @@ config BLK_CGROUP
>  
>  	See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
>  
> +config CGROUP_GPU
> +	bool "GPU controller (EXPERIMENTAL)"
> +	select PAGE_COUNTER
> +	help
> +	  Provides accounting and limit setting for memory allocations by the GPU and
> +	  GPU-related subsystems.
> +
>  config CGROUP_WRITEBACK
>  	bool
>  	depends on MEMCG && BLK_CGROUP
> diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> index 12f8457ad1f9..be95a5a532fc 100644
> --- a/kernel/cgroup/Makefile
> +++ b/kernel/cgroup/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_CGROUP_RDMA) += rdma.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_CGROUP_MISC) += misc.o
>  obj-$(CONFIG_CGROUP_DEBUG) += debug.o
> +obj-$(CONFIG_CGROUP_GPU) += gpu.o
> diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
> new file mode 100644
> index 000000000000..ad16ea15d427
> --- /dev/null
> +++ b/kernel/cgroup/gpu.c
> @@ -0,0 +1,339 @@
> +// SPDX-License-Identifier: MIT
> +// Copyright 2019 Advanced Micro Devices, Inc.
> +// Copyright (C) 2022 Google LLC.
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_gpu.h>
> +#include <linux/err.h>
> +#include <linux/gfp.h>
> +#include <linux/list.h>
> +#include <linux/mm.h>
> +#include <linux/page_counter.h>
> +#include <linux/seq_file.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +
> +static struct gpucg *root_gpucg __read_mostly;
> +
> +/*
> + * Protects list of resource pools maintained on per cgroup basis and list
> + * of buckets registered for memory accounting using the GPU cgroup controller.
> + */
> +static DEFINE_MUTEX(gpucg_mutex);
> +static LIST_HEAD(gpucg_buckets);
> +
> +/* The GPU cgroup controller data structure */
> +struct gpucg {
> +	struct cgroup_subsys_state css;
> +
> +	/* list of all resource pools that belong to this cgroup */
> +	struct list_head rpools;
> +};
> +
> +/* A named entity representing a bucket of tracked memory. */
> +struct gpucg_bucket {
> +	/* list of various resource pools in various cgroups that the bucket is part of */
> +	struct list_head rpools;
> +
> +	/* list of all buckets registered for GPU cgroup accounting */
> +	struct list_head bucket_node;
> +
> +	/* string to be used as identifier for accounting and limit setting */
> +	const char *name;
> +};
> +
> +struct gpucg_resource_pool {
> +	/* The bucket whose resource usage is tracked by this resource pool */
> +	struct gpucg_bucket *bucket;
> +
> +	/* list of all resource pools for the cgroup */
> +	struct list_head cg_node;
> +
> +	/* list maintained by the gpucg_bucket to keep track of its resource pools */
> +	struct list_head bucket_node;
> +
> +	/* tracks memory usage of the resource pool */
> +	struct page_counter total;
> +};
> +
> +static void free_cg_rpool_locked(struct gpucg_resource_pool *rpool)
> +{
> +	lockdep_assert_held(&gpucg_mutex);
> +
> +	list_del(&rpool->cg_node);
> +	list_del(&rpool->bucket_node);
> +	kfree(rpool);
> +}
> +
> +static void gpucg_css_free(struct cgroup_subsys_state *css)
> +{
> +	struct gpucg_resource_pool *rpool, *tmp;
> +	struct gpucg *gpucg = css_to_gpucg(css);
> +
> +	// delete all resource pools
> +	mutex_lock(&gpucg_mutex);
> +	list_for_each_entry_safe(rpool, tmp, &gpucg->rpools, cg_node)
> +		free_cg_rpool_locked(rpool);
> +	mutex_unlock(&gpucg_mutex);
> +
> +	kfree(gpucg);
> +}
> +
> +static struct cgroup_subsys_state *
> +gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
> +{
> +	struct gpucg *gpucg, *parent;
> +
> +	gpucg = kzalloc(sizeof(struct gpucg), GFP_KERNEL);
> +	if (!gpucg)
> +		return ERR_PTR(-ENOMEM);
> +
> +	parent = css_to_gpucg(parent_css);
> +	if (!parent)
> +		root_gpucg = gpucg;
> +
> +	INIT_LIST_HEAD(&gpucg->rpools);
> +
> +	return &gpucg->css;
> +}
> +
> +static struct gpucg_resource_pool *cg_rpool_find_locked(
> +	struct gpucg *cg,
> +	struct gpucg_bucket *bucket)
> +{
> +	struct gpucg_resource_pool *rpool;
> +
> +	lockdep_assert_held(&gpucg_mutex);
> +
> +	list_for_each_entry(rpool, &cg->rpools, cg_node)
> +		if (rpool->bucket == bucket)
> +			return rpool;
> +
> +	return NULL;
> +}
> +
> +static struct gpucg_resource_pool *cg_rpool_init(struct gpucg *cg,
> +						 struct gpucg_bucket *bucket)
> +{
> +	struct gpucg_resource_pool *rpool = kzalloc(sizeof(*rpool),
> +							GFP_KERNEL);
> +	if (!rpool)
> +		return ERR_PTR(-ENOMEM);
> +
> +	rpool->bucket = bucket;
> +
> +	page_counter_init(&rpool->total, NULL);
> +	INIT_LIST_HEAD(&rpool->cg_node);
> +	INIT_LIST_HEAD(&rpool->bucket_node);
> +	list_add_tail(&rpool->cg_node, &cg->rpools);
> +	list_add_tail(&rpool->bucket_node, &bucket->rpools);
> +
> +	return rpool;
> +}
> +
> +/**
> + * cg_rpool_get_locked - find the resource pool for the specified bucket and
> + * specified cgroup. If the resource pool does not exist for the cg, it is
> + * created hierarchically in the cgroup and in its ancestor cgroups that
> + * do not already have a resource pool entry for the bucket.
> + *
> + * @cg: The cgroup to find the resource pool for.
> + * @bucket: The bucket associated with the returned resource pool.
> + *
> + * Return: the resource pool entry corresponding to the specified bucket in
> + * the specified cgroup (hierarchically creating pools if not already present).
> + *
> + */
> +static struct gpucg_resource_pool *
> +cg_rpool_get_locked(struct gpucg *cg, struct gpucg_bucket *bucket)
> +{
> +	struct gpucg *parent_cg, *p, *stop_cg;
> +	struct gpucg_resource_pool *rpool, *tmp_rpool;
> +	struct gpucg_resource_pool *parent_rpool = NULL, *leaf_rpool = NULL;
> +
> +	rpool = cg_rpool_find_locked(cg, bucket);
> +	if (rpool)
> +		return rpool;
> +
> +	stop_cg = cg;
> +	do {
> +		rpool = cg_rpool_init(stop_cg, bucket);
> +		if (IS_ERR(rpool))
> +			goto err;
> +
> +		if (!leaf_rpool)
> +			leaf_rpool = rpool;
> +
> +		stop_cg = gpucg_parent(stop_cg);
> +		if (!stop_cg)
> +			break;
> +
> +		rpool = cg_rpool_find_locked(stop_cg, bucket);
> +	} while (!rpool);
> +
> +	/*
> +	 * Re-initialize page counters of all rpools created in this invocation
> +	 * to enable hierarchical charging.
> +	 * stop_cg is the first ancestor cg who already had a resource pool for
> +	 * the bucket. It can also be NULL if no ancestors had a pre-existing
> +	 * resource pool for the bucket before this invocation.
> +	 */
> +	rpool = leaf_rpool;
> +	for (p = cg; p != stop_cg; p = parent_cg) {
> +		parent_cg = gpucg_parent(p);
> +		if (!parent_cg)
> +			break;
> +		parent_rpool = cg_rpool_find_locked(parent_cg, bucket);
> +		page_counter_init(&rpool->total, &parent_rpool->total);
> +
> +		rpool = parent_rpool;
> +	}
> +
> +	return leaf_rpool;
> +err:
> +	for (p = cg; p != stop_cg; p = gpucg_parent(p)) {
> +		tmp_rpool = cg_rpool_find_locked(p, bucket);
> +		free_cg_rpool_locked(tmp_rpool);
> +	}
> +	return rpool;
> +}
> +
> +struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> +{
> +	return css ? container_of(css, struct gpucg, css) : NULL;
> +}
> +
> +struct gpucg *gpucg_get(struct task_struct *task)
> +{
> +	if (!cgroup_subsys_enabled(gpu_cgrp_subsys))
> +		return NULL;
> +	return css_to_gpucg(task_get_css(task, gpu_cgrp_id));
> +}
> +
> +void gpucg_put(struct gpucg *gpucg)
> +{
> +	if (gpucg)
> +		css_put(&gpucg->css);
> +}
> +
> +struct gpucg *gpucg_parent(struct gpucg *cg)
> +{
> +	return css_to_gpucg(cg->css.parent);
> +}
> +
> +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> +{
> +	struct page_counter *counter;
> +	u64 nr_pages;
> +	struct gpucg_resource_pool *rp;
> +	int ret = 0;
> +
> +	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> +
> +	mutex_lock(&gpucg_mutex);
> +	rp = cg_rpool_get_locked(gpucg, bucket);
> +	/*
> +	 * Continue to hold gpucg_mutex because we use it to block charges while transfers are in
> +	 * progress to avoid potentially exceeding a limit.
> +	 */
> +	if (IS_ERR(rp)) {
> +		mutex_unlock(&gpucg_mutex);
> +		return PTR_ERR(rp);
> +	}
> +
> +	if (page_counter_try_charge(&rp->total, nr_pages, &counter))
> +		css_get(&gpucg->css);
> +	else
> +		ret = -ENOMEM;
> +	mutex_unlock(&gpucg_mutex);
> +
> +	return ret;
> +}
> +
> +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> +{
> +	u64 nr_pages;
> +	struct gpucg_resource_pool *rp;
> +
> +	mutex_lock(&gpucg_mutex);
> +	rp = cg_rpool_find_locked(gpucg, bucket);
> +	/*
> +	 * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is freed and there are
> +	 * active refs on gpucg. Uncharges are fine while transfers are in progress since there is
> +	 * no potential to exceed a limit while uncharging and transferring.
> +	 */
> +	mutex_unlock(&gpucg_mutex);
> +
> +	if (unlikely(!rp)) {
> +		pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n");
> +		return;
> +	}
> +
> +	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> +	page_counter_uncharge(&rp->total, nr_pages);
> +	css_put(&gpucg->css);
> +}
> +
> +struct gpucg_bucket *gpucg_register_bucket(const char *name)
> +{
> +	struct gpucg_bucket *bucket, *b;
> +
> +	if (!name)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (strlen(name) >= GPUCG_BUCKET_NAME_MAX_LEN)
> +		return ERR_PTR(-ENAMETOOLONG);
> +
> +	bucket = kzalloc(sizeof(struct gpucg_bucket), GFP_KERNEL);
> +	if (!bucket)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&bucket->bucket_node);
> +	INIT_LIST_HEAD(&bucket->rpools);
> +	bucket->name = kstrdup_const(name, GFP_KERNEL);
> +
> +	mutex_lock(&gpucg_mutex);
> +	list_for_each_entry(b, &gpucg_buckets, bucket_node) {
> +		if (strncmp(b->name, bucket->name, GPUCG_BUCKET_NAME_MAX_LEN) == 0) {
> +			mutex_unlock(&gpucg_mutex);
> +			kfree_const(bucket->name);
> +			kfree(bucket);
> +			return ERR_PTR(-EEXIST);
> +		}
> +	}
> +	list_add_tail(&bucket->bucket_node, &gpucg_buckets);
> +	mutex_unlock(&gpucg_mutex);
> +
> +	return bucket;
> +}
> +
> +static int gpucg_resource_show(struct seq_file *sf, void *v)
> +{
> +	struct gpucg_resource_pool *rpool;
> +	struct gpucg *cg = css_to_gpucg(seq_css(sf));
> +
> +	mutex_lock(&gpucg_mutex);
> +	list_for_each_entry(rpool, &cg->rpools, cg_node) {
> +		seq_printf(sf, "%s %lu\n", rpool->bucket->name,
> +			   page_counter_read(&rpool->total) * PAGE_SIZE);
> +	}
> +	mutex_unlock(&gpucg_mutex);
> +
> +	return 0;
> +}
> +
> +struct cftype files[] = {
> +	{
> +		.name = "memory.current",
> +		.seq_show = gpucg_resource_show,
> +	},
> +	{ }     /* terminate */
> +};
> +
> +struct cgroup_subsys gpu_cgrp_subsys = {
> +	.css_alloc      = gpucg_css_alloc,
> +	.css_free       = gpucg_css_free,
> +	.early_init     = false,
> +	.legacy_cftypes = files,
> +	.dfl_cftypes    = files,
> +};
> 
> -- 
> 2.36.0.512.ge40c2bad7a-goog
> 
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

> + *
> + * @name: Pointer to a null-terminated string to denote the name of the bucket. This name should be
> + *        globally unique, and should not exceed @GPUCG_BUCKET_NAME_MAX_LEN bytes.
> + *
> + * @bucket must remain valid. @name will be copied.
> + *
> + * Returns a pointer to a newly allocated bucket on success, or an errno code otherwise. As buckets
> + * cannot be unregistered, this can never be freed.
> + */
> +struct gpucg_bucket *gpucg_register_bucket(const char *name);
> +#else /* CONFIG_CGROUP_GPU */
> +
> +static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> +{
> +	return NULL;
> +}
> +
> +static inline struct gpucg *gpucg_get(struct task_struct *task)
> +{
> +	return NULL;
> +}
> +
> +static inline void gpucg_put(struct gpucg *gpucg) {}
> +
> +static inline struct gpucg *gpucg_parent(struct gpucg *cg)
> +{
> +	return NULL;
> +}
> +
> +static inline int gpucg_charge(struct gpucg *gpucg,
> +			       struct gpucg_bucket *bucket,
> +			       u64 size)
> +{
> +	return 0;
> +}
> +
> +static inline void gpucg_uncharge(struct gpucg *gpucg,
> +				  struct gpucg_bucket *bucket,
> +				  u64 size) {}
> +
> +static inline struct gpucg_bucket *gpucg_register_bucket(const char *name) {}

I think this needs to return NULL, otherwise you'll get a compiler error when
CONFIG_CGROUP_GPU is not set. 

I found other build errors when CONFIG_CGROUP_GPU is not set, please fix them in
the next version.

Thanks,
  Enric

> +#endif /* CONFIG_CGROUP_GPU */
> +#endif /* _CGROUP_GPU_H */
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 445235487230..46a2a7b93c41 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -65,6 +65,10 @@ SUBSYS(rdma)
>  SUBSYS(misc)
>  #endif
>  
> +#if IS_ENABLED(CONFIG_CGROUP_GPU)
> +SUBSYS(gpu)
> +#endif
> +
>  /*
>   * The following subsystems are not supported on the default hierarchy.
>   */
> diff --git a/init/Kconfig b/init/Kconfig
> index ddcbefe535e9..2e00a190e170 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -984,6 +984,13 @@ config BLK_CGROUP
>  
>  	See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
>  
> +config CGROUP_GPU
> +	bool "GPU controller (EXPERIMENTAL)"
> +	select PAGE_COUNTER
> +	help
> +	  Provides accounting and limit setting for memory allocations by the GPU and
> +	  GPU-related subsystems.
> +
>  config CGROUP_WRITEBACK
>  	bool
>  	depends on MEMCG && BLK_CGROUP
> diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> index 12f8457ad1f9..be95a5a532fc 100644
> --- a/kernel/cgroup/Makefile
> +++ b/kernel/cgroup/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_CGROUP_RDMA) += rdma.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_CGROUP_MISC) += misc.o
>  obj-$(CONFIG_CGROUP_DEBUG) += debug.o
> +obj-$(CONFIG_CGROUP_GPU) += gpu.o
> diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
> new file mode 100644
> index 000000000000..ad16ea15d427
> --- /dev/null
> +++ b/kernel/cgroup/gpu.c
> @@ -0,0 +1,339 @@
> +// SPDX-License-Identifier: MIT
> +// Copyright 2019 Advanced Micro Devices, Inc.
> +// Copyright (C) 2022 Google LLC.
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_gpu.h>
> +#include <linux/err.h>
> +#include <linux/gfp.h>
> +#include <linux/list.h>
> +#include <linux/mm.h>
> +#include <linux/page_counter.h>
> +#include <linux/seq_file.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +
> +static struct gpucg *root_gpucg __read_mostly;
> +
> +/*
> + * Protects list of resource pools maintained on per cgroup basis and list
> + * of buckets registered for memory accounting using the GPU cgroup controller.
> + */
> +static DEFINE_MUTEX(gpucg_mutex);
> +static LIST_HEAD(gpucg_buckets);
> +
> +/* The GPU cgroup controller data structure */
> +struct gpucg {
> +	struct cgroup_subsys_state css;
> +
> +	/* list of all resource pools that belong to this cgroup */
> +	struct list_head rpools;
> +};
> +
> +/* A named entity representing bucket of tracked memory. */
> +struct gpucg_bucket {
> +	/* list of various resource pools in various cgroups that the bucket is part of */
> +	struct list_head rpools;
> +
> +	/* list of all buckets registered for GPU cgroup accounting */
> +	struct list_head bucket_node;
> +
> +	/* string to be used as identifier for accounting and limit setting */
> +	const char *name;
> +};
> +
> +struct gpucg_resource_pool {
> +	/* The bucket whose resource usage is tracked by this resource pool */
> +	struct gpucg_bucket *bucket;
> +
> +	/* list of all resource pools for the cgroup */
> +	struct list_head cg_node;
> +
> +	/* list maintained by the gpucg_bucket to keep track of its resource pools */
> +	struct list_head bucket_node;
> +
> +	/* tracks memory usage of the resource pool */
> +	struct page_counter total;
> +};
> +
> +static void free_cg_rpool_locked(struct gpucg_resource_pool *rpool)
> +{
> +	lockdep_assert_held(&gpucg_mutex);
> +
> +	list_del(&rpool->cg_node);
> +	list_del(&rpool->bucket_node);
> +	kfree(rpool);
> +}
> +
> +static void gpucg_css_free(struct cgroup_subsys_state *css)
> +{
> +	struct gpucg_resource_pool *rpool, *tmp;
> +	struct gpucg *gpucg = css_to_gpucg(css);
> +
> +	// delete all resource pools
> +	mutex_lock(&gpucg_mutex);
> +	list_for_each_entry_safe(rpool, tmp, &gpucg->rpools, cg_node)
> +		free_cg_rpool_locked(rpool);
> +	mutex_unlock(&gpucg_mutex);
> +
> +	kfree(gpucg);
> +}
> +
> +static struct cgroup_subsys_state *
> +gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
> +{
> +	struct gpucg *gpucg, *parent;
> +
> +	gpucg = kzalloc(sizeof(struct gpucg), GFP_KERNEL);
> +	if (!gpucg)
> +		return ERR_PTR(-ENOMEM);
> +
> +	parent = css_to_gpucg(parent_css);
> +	if (!parent)
> +		root_gpucg = gpucg;
> +
> +	INIT_LIST_HEAD(&gpucg->rpools);
> +
> +	return &gpucg->css;
> +}
> +
> +static struct gpucg_resource_pool *cg_rpool_find_locked(
> +	struct gpucg *cg,
> +	struct gpucg_bucket *bucket)
> +{
> +	struct gpucg_resource_pool *rpool;
> +
> +	lockdep_assert_held(&gpucg_mutex);
> +
> +	list_for_each_entry(rpool, &cg->rpools, cg_node)
> +		if (rpool->bucket == bucket)
> +			return rpool;
> +
> +	return NULL;
> +}
> +
> +static struct gpucg_resource_pool *cg_rpool_init(struct gpucg *cg,
> +						 struct gpucg_bucket *bucket)
> +{
> +	struct gpucg_resource_pool *rpool = kzalloc(sizeof(*rpool),
> +							GFP_KERNEL);
> +	if (!rpool)
> +		return ERR_PTR(-ENOMEM);
> +
> +	rpool->bucket = bucket;
> +
> +	page_counter_init(&rpool->total, NULL);
> +	INIT_LIST_HEAD(&rpool->cg_node);
> +	INIT_LIST_HEAD(&rpool->bucket_node);
> +	list_add_tail(&rpool->cg_node, &cg->rpools);
> +	list_add_tail(&rpool->bucket_node, &bucket->rpools);
> +
> +	return rpool;
> +}
> +
> +/**
> + * get_cg_rpool_locked - find the resource pool for the specified bucket and
> + * specified cgroup. If the resource pool does not exist for the cg, it is
> + * created in a hierarchical manner in the cgroup and its ancestor cgroups who
> + * do not already have a resource pool entry for the bucket.
> + *
> + * @cg: The cgroup to find the resource pool for.
> + * @bucket: The bucket associated with the returned resource pool.
> + *
> + * Return: return resource pool entry corresponding to the specified bucket in
> + * the specified cgroup (hierarchically creating them if not existing already).
> + *
> + */
> +static struct gpucg_resource_pool *
> +cg_rpool_get_locked(struct gpucg *cg, struct gpucg_bucket *bucket)
> +{
> +	struct gpucg *parent_cg, *p, *stop_cg;
> +	struct gpucg_resource_pool *rpool, *tmp_rpool;
> +	struct gpucg_resource_pool *parent_rpool = NULL, *leaf_rpool = NULL;
> +
> +	rpool = cg_rpool_find_locked(cg, bucket);
> +	if (rpool)
> +		return rpool;
> +
> +	stop_cg = cg;
> +	do {
> +		rpool = cg_rpool_init(stop_cg, bucket);
> +		if (IS_ERR(rpool))
> +			goto err;
> +
> +		if (!leaf_rpool)
> +			leaf_rpool = rpool;
> +
> +		stop_cg = gpucg_parent(stop_cg);
> +		if (!stop_cg)
> +			break;
> +
> +		rpool = cg_rpool_find_locked(stop_cg, bucket);
> +	} while (!rpool);
> +
> +	/*
> +	 * Re-initialize page counters of all rpools created in this invocation
> +	 * to enable hierarchical charging.
> +	 * stop_cg is the first ancestor cg who already had a resource pool for
> +	 * the bucket. It can also be NULL if no ancestors had a pre-existing
> +	 * resource pool for the bucket before this invocation.
> +	 */
> +	rpool = leaf_rpool;
> +	for (p = cg; p != stop_cg; p = parent_cg) {
> +		parent_cg = gpucg_parent(p);
> +		if (!parent_cg)
> +			break;
> +		parent_rpool = cg_rpool_find_locked(parent_cg, bucket);
> +		page_counter_init(&rpool->total, &parent_rpool->total);
> +
> +		rpool = parent_rpool;
> +	}
> +
> +	return leaf_rpool;
> +err:
> +	for (p = cg; p != stop_cg; p = gpucg_parent(p)) {
> +		tmp_rpool = cg_rpool_find_locked(p, bucket);
> +		free_cg_rpool_locked(tmp_rpool);
> +	}
> +	return rpool;
> +}
> +
> +struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> +{
> +	return css ? container_of(css, struct gpucg, css) : NULL;
> +}
> +
> +struct gpucg *gpucg_get(struct task_struct *task)
> +{
> +	if (!cgroup_subsys_enabled(gpu_cgrp_subsys))
> +		return NULL;
> +	return css_to_gpucg(task_get_css(task, gpu_cgrp_id));
> +}
> +
> +void gpucg_put(struct gpucg *gpucg)
> +{
> +	if (gpucg)
> +		css_put(&gpucg->css);
> +}
> +
> +struct gpucg *gpucg_parent(struct gpucg *cg)
> +{
> +	return css_to_gpucg(cg->css.parent);
> +}
> +
> +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> +{
> +	struct page_counter *counter;
> +	u64 nr_pages;
> +	struct gpucg_resource_pool *rp;
> +	int ret = 0;
> +
> +	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> +
> +	mutex_lock(&gpucg_mutex);
> +	rp = cg_rpool_get_locked(gpucg, bucket);
> +	/*
> +	 * Continue to hold gpucg_mutex because we use it to block charges while transfers are in
> +	 * progress to avoid potentially exceeding a limit.
> +	 */
> +	if (IS_ERR(rp)) {
> +		mutex_unlock(&gpucg_mutex);
> +		return PTR_ERR(rp);
> +	}
> +
> +	if (page_counter_try_charge(&rp->total, nr_pages, &counter))
> +		css_get(&gpucg->css);
> +	else
> +		ret = -ENOMEM;
> +	mutex_unlock(&gpucg_mutex);
> +
> +	return ret;
> +}
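[Editorial note: the charge path above relies on the page_counter parenting that cg_rpool_get_locked() sets up, so a try-charge against a leaf counter is applied to every ancestor counter and rolled back everywhere on failure. A minimal user-space model of that behavior, using stand-in types rather than the kernel's page_counter implementation:]

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal stand-in for struct page_counter: a usage count, a limit,
 * and an optional parent set up when the pool hierarchy is created. */
struct counter {
	uint64_t usage;
	uint64_t max;
	struct counter *parent;
};

/* Try to charge @n pages against @c and every ancestor; if any level
 * would exceed its limit, undo the partial charges and return 0.
 * This mirrors what page_counter_try_charge() does for gpucg_charge(). */
static int try_charge(struct counter *c, uint64_t n)
{
	struct counter *p;

	for (p = c; p; p = p->parent) {
		if (p->usage + n > p->max) {
			/* roll back the counters already charged */
			for (struct counter *q = c; q != p; q = q->parent)
				q->usage -= n;
			return 0;
		}
		p->usage += n;
	}
	return 1;
}
```

Note that a charge can fail because of an ancestor's limit even when the leaf cgroup's own limit would allow it, which is exactly the hierarchical enforcement the re-parenting loop in cg_rpool_get_locked() exists to enable.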
> +
> +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> +{
> +	u64 nr_pages;
> +	struct gpucg_resource_pool *rp;
> +
> +	mutex_lock(&gpucg_mutex);
> +	rp = cg_rpool_find_locked(gpucg, bucket);
> +	/*
> +	 * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is freed and there are
> +	 * active refs on gpucg. Uncharges are fine while transfers are in progress since there is
> +	 * no potential to exceed a limit while uncharging and transferring.
> +	 */
> +	mutex_unlock(&gpucg_mutex);
> +
> +	if (unlikely(!rp)) {
> +		pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n");
> +		return;
> +	}
> +
> +	nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> +	page_counter_uncharge(&rp->total, nr_pages);
> +	css_put(&gpucg->css);
> +}
> +
> +struct gpucg_bucket *gpucg_register_bucket(const char *name)
> +{
> +	struct gpucg_bucket *bucket, *b;
> +
> +	if (!name)
> +		return ERR_PTR(-EINVAL);
> +
> +	if (strlen(name) >= GPUCG_BUCKET_NAME_MAX_LEN)
> +		return ERR_PTR(-ENAMETOOLONG);
> +
> +	bucket = kzalloc(sizeof(struct gpucg_bucket), GFP_KERNEL);
> +	if (!bucket)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&bucket->bucket_node);
> +	INIT_LIST_HEAD(&bucket->rpools);
> +	bucket->name = kstrdup_const(name, GFP_KERNEL);
> +
> +	mutex_lock(&gpucg_mutex);
> +	list_for_each_entry(b, &gpucg_buckets, bucket_node) {
> +		if (strncmp(b->name, bucket->name, GPUCG_BUCKET_NAME_MAX_LEN) == 0) {
> +			mutex_unlock(&gpucg_mutex);
> +			kfree_const(bucket->name);
> +			kfree(bucket);
> +			return ERR_PTR(-EEXIST);
> +		}
> +	}
> +	list_add_tail(&bucket->bucket_node, &gpucg_buckets);
> +	mutex_unlock(&gpucg_mutex);
> +
> +	return bucket;
> +}
> +
> +static int gpucg_resource_show(struct seq_file *sf, void *v)
> +{
> +	struct gpucg_resource_pool *rpool;
> +	struct gpucg *cg = css_to_gpucg(seq_css(sf));
> +
> +	mutex_lock(&gpucg_mutex);
> +	list_for_each_entry(rpool, &cg->rpools, cg_node) {
> +		seq_printf(sf, "%s %lu\n", rpool->bucket->name,
> +			   page_counter_read(&rpool->total) * PAGE_SIZE);
> +	}
> +	mutex_unlock(&gpucg_mutex);
> +
> +	return 0;
> +}
> +
> +struct cftype files[] = {
> +	{
> +		.name = "memory.current",
> +		.seq_show = gpucg_resource_show,
> +	},
> +	{ }     /* terminate */
> +};
> +
> +struct cgroup_subsys gpu_cgrp_subsys = {
> +	.css_alloc      = gpucg_css_alloc,
> +	.css_free       = gpucg_css_free,
> +	.early_init     = false,
> +	.legacy_cftypes = files,
> +	.dfl_cftypes    = files,
> +};
> 
> -- 
> 2.36.0.512.ge40c2bad7a-goog
> 
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-05-17 23:30             ` T.J. Mercier
  (?)
@ 2022-05-20  7:47               ` Tejun Heo
  -1 siblings, 0 replies; 67+ messages in thread
From: Tejun Heo @ 2022-05-20  7:47 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Nicolas Dufresne, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan, Daniel Vetter,
	John Stultz, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest

Hello,

On Tue, May 17, 2022 at 04:30:29PM -0700, T.J. Mercier wrote:
> Thanks for your suggestion. This almost works. "dmabuf" as a key could
> work, but I'd actually like to account for each heap. Since heaps can
> be dynamically added, I can't accommodate every potential heap name by
> hardcoding registrations in the misc controller.

On its own, that's a pretty weak reason to be adding a separate gpu
controller especially given that it doesn't really seem to be one with
proper abstractions for gpu resources. We don't want to keep adding random
keys to misc controller but can definitely add limited flexibility. What
kind of keys do you need?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-05-20  7:47               ` Tejun Heo
  (?)
@ 2022-05-20 16:25                 ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-20 16:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Nicolas Dufresne, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Greg Kroah-Hartman, Arve Hjønnevåg, Todd Kjos,
	Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan, Daniel Vetter,
	John Stultz, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest

On Fri, May 20, 2022 at 12:47 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, May 17, 2022 at 04:30:29PM -0700, T.J. Mercier wrote:
> > Thanks for your suggestion. This almost works. "dmabuf" as a key could
> > work, but I'd actually like to account for each heap. Since heaps can
> > be dynamically added, I can't accommodate every potential heap name by
> > hardcoding registrations in the misc controller.
>
> On its own, that's a pretty weak reason to be adding a separate gpu
> controller especially given that it doesn't really seem to be one with
> proper abstractions for gpu resources. We don't want to keep adding random
> keys to misc controller but can definitely add limited flexibility. What
> kind of keys do you need?
>
Well the dmabuf-from-heaps component of this is the initial use case.
I was envisioning we'd have additional keys as discussed here:
https://lore.kernel.org/lkml/20220328035951.1817417-1-tjmercier@google.com/T/#m82e5fe9d8674bb60160701e52dae4356fea2ddfa
So we'd end up with a well-defined core set of keys like "system", and
then drivers would be free to use their own keys for their own unique
purposes which could be complementary or orthogonal to the core set.
Yesterday I was talking with someone who is interested in limiting gpu
cores and bus IDs in addition to gpu memory. How to define core keys
is the part where it looks like there's trouble.

For my use case it would be sufficient to have current and maximum
values for an arbitrary number of keys - one per heap. So the only
part missing from the misc controller (for my use case) is the ability
to register a new key at runtime as heaps are added. Instead of
keeping track of resources with enum misc_res_type, requesting a
resource handle/ID from the misc controller at runtime is what I think
would be required instead.
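To illustrate, the runtime registration described above could look roughly like the following user-space sketch. This is purely hypothetical: a function such as res_register() does not exist in the misc controller, and the fixed-size table is just a stand-in for whatever data structure the kernel would actually use.

```c
#include <assert.h>
#include <string.h>
#include <stdint.h>

#define MAX_DYN_RES  16
#define RES_NAME_LEN 64

struct dyn_res {
	char name[RES_NAME_LEN];
	uint64_t current;
	uint64_t max;
};

static struct dyn_res registry[MAX_DYN_RES];
static int nr_res;

/* Hypothetical runtime registration: return a resource ID for @name,
 * or -1 if the table is full or the name is already taken (globally
 * unique names, as gpucg_register_bucket enforces). */
static int res_register(const char *name)
{
	if (nr_res == MAX_DYN_RES || strlen(name) >= RES_NAME_LEN)
		return -1;
	for (int i = 0; i < nr_res; i++)
		if (!strcmp(registry[i].name, name))
			return -1;
	strcpy(registry[nr_res].name, name);
	registry[nr_res].current = 0;
	registry[nr_res].max = UINT64_MAX;
	return nr_res++;
}

/* Charge against a registered resource ID, respecting its max. */
static int res_charge(int id, uint64_t amount)
{
	if (registry[id].current + amount > registry[id].max)
		return -1;
	registry[id].current += amount;
	return 0;
}
```

A heap added at runtime would call res_register() once with its name and then charge and uncharge against the returned ID, which is the handle/ID scheme suggested above in place of enum misc_res_type.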

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory
  2022-05-19 10:52   ` eballetbo
  (?)
@ 2022-05-20 16:33     ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-20 16:33 UTC (permalink / raw)
  To: eballetbo
  Cc: Zefan Li, Jonathan Corbet, Joel Fernandes,
	Arve Hjønnevåg, Martijn Coenen, Benjamin Gaignard,
	Tejun Heo, Christian Brauner, Sumit Semwal, Todd Kjos,
	Suren Baghdasaryan, Johannes Weiner, Brian Starkey,
	Christian König, Greg Kroah-Hartman, Liam Mark, John Stultz,
	Hridya Valsaraju, Shuah Khan, Laura Abbott, cgroups, kernel-team,
	linux-media, dri-devel, linaro-mm-sig, Carlos Llamas,
	Daniel Vetter, Kenny.Ho, linux-kselftest, Kalesh Singh,
	Michal Koutný,
	John Stultz, linux-doc, linux-kernel, Shuah Khan

On Thu, May 19, 2022 at 3:53 AM <eballetbo@kernel.org> wrote:
>
> From: Enric Balletbo i Serra <eballetbo@kernel.org>
>
> On Tue, 10 May 2022 23:56:46 +0000, T.J. Mercier wrote
> > From: Hridya Valsaraju <hridya@google.com>
> >
> > The cgroup controller provides accounting for GPU and GPU-related
> > memory allocations. The memory being accounted can be device memory or
> > memory allocated from pools dedicated to serve GPU-related tasks.
> >
> > This patch adds APIs to:
> > -allow a device to register for memory accounting using the GPU cgroup
> > controller.
> > -charge and uncharge allocated memory to a cgroup.
> >
> > When the cgroup controller is enabled, it would expose information about
> > the memory allocated by each device(registered for GPU cgroup memory
> > accounting) for each cgroup.
> >
> > The API/UAPI can be extended to set per-device/total allocation limits
> > in the future.
> >
> > The cgroup controller has been named following the discussion in [1].
> >
> > [1]: https://lore.kernel.org/amd-gfx/YCJp%2F%2FkMC7YjVMXv@phenom.ffwll.local/
> >
> > Signed-off-by: Hridya Valsaraju <hridya@google.com>
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > ---
> > v7 changes
> > Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
> > This means gpucg_register_bucket now returns an internally allocated
> > struct gpucg_bucket.
> >
> > Move all public function documentation to the cgroup_gpu.h header.
> >
> > v5 changes
> > Support all strings for gpucg_register_device instead of just string
> > literals.
> >
> > Enforce globally unique gpucg_bucket names.
> >
> > Constrain gpucg_bucket name lengths to 64 bytes.
> >
> > Obtain just a single css refcount instead of nr_pages for each
> > charge.
> >
> > Rename:
> > gpucg_try_charge -> gpucg_charge
> > find_cg_rpool_locked -> cg_rpool_find_locked
> > init_cg_rpool -> cg_rpool_init
> > get_cg_rpool_locked -> cg_rpool_get_locked
> > "gpu cgroup controller" -> "GPU controller"
> > gpucg_device -> gpucg_bucket
> > usage -> size
> >
> > v4 changes
> > Adjust gpucg_try_charge critical section for future charge transfer
> > functionality.
> >
> > v3 changes
> > Use more common dual author commit message format per John Stultz.
> >
> > v2 changes
> > Fix incorrect Kconfig help section indentation per Randy Dunlap.
> > ---
> >  include/linux/cgroup_gpu.h    | 122 ++++++++++++
> >  include/linux/cgroup_subsys.h |   4 +
> >  init/Kconfig                  |   7 +
> >  kernel/cgroup/Makefile        |   1 +
> >  kernel/cgroup/gpu.c           | 339 ++++++++++++++++++++++++++++++++++
> >  5 files changed, 473 insertions(+)
> >  create mode 100644 include/linux/cgroup_gpu.h
> >  create mode 100644 kernel/cgroup/gpu.c
> >
> > diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
> > new file mode 100644
> > index 000000000000..cb228a16aa1f
> > --- /dev/null
> > +++ b/include/linux/cgroup_gpu.h
> > @@ -0,0 +1,122 @@
> > +/* SPDX-License-Identifier: MIT
> > + * Copyright 2019 Advanced Micro Devices, Inc.
> > + * Copyright (C) 2022 Google LLC.
> > + */
> > +#ifndef _CGROUP_GPU_H
> > +#define _CGROUP_GPU_H
> > +
> > +#include <linux/cgroup.h>
> > +
> > +#define GPUCG_BUCKET_NAME_MAX_LEN 64
> > +
> > +struct gpucg;
> > +struct gpucg_bucket;
> > +
> > +#ifdef CONFIG_CGROUP_GPU
> > +
> > +/**
> > + * css_to_gpucg - get the corresponding gpucg ref from a cgroup_subsys_state
> > + * @css: the target cgroup_subsys_state
> > + *
> > + * Returns: gpu cgroup that contains the @css
> > + */
> > +struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css);
> > +
> > +/**
> > + * gpucg_get - get the gpucg reference that a task belongs to
> > + * @task: the target task
> > + *
> > + * This increases the reference count of the css that the @task belongs to.
> > + *
> > + * Returns: reference to the gpu cgroup the task belongs to.
> > + */
> > +struct gpucg *gpucg_get(struct task_struct *task);
> > +
> > +/**
> > + * gpucg_put - put a gpucg reference
> > + * @gpucg: the target gpucg
> > + *
> > + * Put a reference obtained via gpucg_get
> > + */
> > +void gpucg_put(struct gpucg *gpucg);
> > +
> > +/**
> > + * gpucg_parent - find the parent of a gpu cgroup
> > + * @cg: the target gpucg
> > + *
> > + * This does not increase the reference count of the parent cgroup
> > + *
> > + * Returns: parent gpu cgroup of @cg
> > + */
> > +struct gpucg *gpucg_parent(struct gpucg *cg);
> > +
> > +/**
> > + * gpucg_charge - charge memory to the specified gpucg and gpucg_bucket.
> > + * Caller must hold a reference to @gpucg obtained through gpucg_get(). The size of the memory is
> > + * rounded up to be a multiple of the page size.
> > + *
> > + * @gpucg: The gpu cgroup to charge the memory to.
> > + * @bucket: The bucket to charge the memory to.
> > + * @size: The size of memory to charge in bytes.
> > + *        This size will be rounded up to the nearest page size.
> > + *
> > + * Return: 0 on success, or a negative errno on failure.
> > + */
> > +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
> > +
> > +/**
> > + * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_bucket.
> > + * The caller must hold a reference to @gpucg obtained through gpucg_get().
> > + *
> > + * @gpucg: The gpu cgroup to uncharge the memory from.
> > + * @bucket: The bucket to uncharge the memory from.
> > + * @size: The size of memory to uncharge in bytes.
> > + *        This size will be rounded up to the nearest page size.
> > + */
> > +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
> > +
> > +/**
> > + * gpucg_register_bucket - Registers a bucket for memory accounting using the GPU cgroup controller.
> > + *
> > + * @name: Pointer to a null-terminated string to denote the name of the bucket. This name should be
> > + *        globally unique, and should not exceed @GPUCG_BUCKET_NAME_MAX_LEN bytes.
> > + *
> > + * @name will be copied, so it need not remain valid after the call.
> > + *
> > + * Returns a pointer to a newly allocated bucket on success, or an ERR_PTR-encoded errno otherwise.
> > + * As buckets cannot be unregistered, the returned bucket is never freed.
> > + */
> > +struct gpucg_bucket *gpucg_register_bucket(const char *name);
> > +#else /* CONFIG_CGROUP_GPU */
> > +
> > +static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> > +{
> > +     return NULL;
> > +}
> > +
> > +static inline struct gpucg *gpucg_get(struct task_struct *task)
> > +{
> > +     return NULL;
> > +}
> > +
> > +static inline void gpucg_put(struct gpucg *gpucg) {}
> > +
> > +static inline struct gpucg *gpucg_parent(struct gpucg *cg)
> > +{
> > +     return NULL;
> > +}
> > +
> > +static inline int gpucg_charge(struct gpucg *gpucg,
> > +                            struct gpucg_bucket *bucket,
> > +                            u64 size)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline void gpucg_uncharge(struct gpucg *gpucg,
> > +                               struct gpucg_bucket *bucket,
> > +                               u64 size) {}
> > +
> > +static inline struct gpucg_bucket *gpucg_register_bucket(const char *name) {}
>
> I think this needs to return NULL, otherwise you'll get a compiler error when
> CONFIG_CGROUP_GPU is not set.
>
> I found other build errors when CONFIG_CGROUP_GPU is not set, please fix them in
> the next version.
>
> Thanks,
>   Enric
>
Thanks. I have been building each patch with allnoconfig and
allyesconfig before posting, but clearly this was not sufficient. I'll
fix this up.


> > +#endif /* CONFIG_CGROUP_GPU */
> > +#endif /* _CGROUP_GPU_H */
> > diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> > index 445235487230..46a2a7b93c41 100644
> > --- a/include/linux/cgroup_subsys.h
> > +++ b/include/linux/cgroup_subsys.h
> > @@ -65,6 +65,10 @@ SUBSYS(rdma)
> >  SUBSYS(misc)
> >  #endif
> >
> > +#if IS_ENABLED(CONFIG_CGROUP_GPU)
> > +SUBSYS(gpu)
> > +#endif
> > +
> >  /*
> >   * The following subsystems are not supported on the default hierarchy.
> >   */
> > diff --git a/init/Kconfig b/init/Kconfig
> > index ddcbefe535e9..2e00a190e170 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -984,6 +984,13 @@ config BLK_CGROUP
> >
> >       See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
> >
> > +config CGROUP_GPU
> > +     bool "GPU controller (EXPERIMENTAL)"
> > +     select PAGE_COUNTER
> > +     help
> > +       Provides accounting and limit setting for memory allocations by the GPU and
> > +       GPU-related subsystems.
> > +
> >  config CGROUP_WRITEBACK
> >       bool
> >       depends on MEMCG && BLK_CGROUP
> > diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> > index 12f8457ad1f9..be95a5a532fc 100644
> > --- a/kernel/cgroup/Makefile
> > +++ b/kernel/cgroup/Makefile
> > @@ -7,3 +7,4 @@ obj-$(CONFIG_CGROUP_RDMA) += rdma.o
> >  obj-$(CONFIG_CPUSETS) += cpuset.o
> >  obj-$(CONFIG_CGROUP_MISC) += misc.o
> >  obj-$(CONFIG_CGROUP_DEBUG) += debug.o
> > +obj-$(CONFIG_CGROUP_GPU) += gpu.o
> > diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
> > new file mode 100644
> > index 000000000000..ad16ea15d427
> > --- /dev/null
> > +++ b/kernel/cgroup/gpu.c
> > @@ -0,0 +1,339 @@
> > +// SPDX-License-Identifier: MIT
> > +// Copyright 2019 Advanced Micro Devices, Inc.
> > +// Copyright (C) 2022 Google LLC.
> > +
> > +#include <linux/cgroup.h>
> > +#include <linux/cgroup_gpu.h>
> > +#include <linux/err.h>
> > +#include <linux/gfp.h>
> > +#include <linux/list.h>
> > +#include <linux/mm.h>
> > +#include <linux/page_counter.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/slab.h>
> > +#include <linux/string.h>
> > +
> > +static struct gpucg *root_gpucg __read_mostly;
> > +
> > +/*
> > + * Protects list of resource pools maintained on per cgroup basis and list
> > + * of buckets registered for memory accounting using the GPU cgroup controller.
> > + */
> > +static DEFINE_MUTEX(gpucg_mutex);
> > +static LIST_HEAD(gpucg_buckets);
> > +
> > +/* The GPU cgroup controller data structure */
> > +struct gpucg {
> > +     struct cgroup_subsys_state css;
> > +
> > +     /* list of all resource pools that belong to this cgroup */
> > +     struct list_head rpools;
> > +};
> > +
> > +/* A named entity representing bucket of tracked memory. */
> > +struct gpucg_bucket {
> > +     /* list of various resource pools in various cgroups that the bucket is part of */
> > +     struct list_head rpools;
> > +
> > +     /* list of all buckets registered for GPU cgroup accounting */
> > +     struct list_head bucket_node;
> > +
> > +     /* string to be used as identifier for accounting and limit setting */
> > +     const char *name;
> > +};
> > +
> > +struct gpucg_resource_pool {
> > +     /* The bucket whose resource usage is tracked by this resource pool */
> > +     struct gpucg_bucket *bucket;
> > +
> > +     /* list of all resource pools for the cgroup */
> > +     struct list_head cg_node;
> > +
> > +     /* list maintained by the gpucg_bucket to keep track of its resource pools */
> > +     struct list_head bucket_node;
> > +
> > +     /* tracks memory usage of the resource pool */
> > +     struct page_counter total;
> > +};
> > +
> > +static void free_cg_rpool_locked(struct gpucg_resource_pool *rpool)
> > +{
> > +     lockdep_assert_held(&gpucg_mutex);
> > +
> > +     list_del(&rpool->cg_node);
> > +     list_del(&rpool->bucket_node);
> > +     kfree(rpool);
> > +}
> > +
> > +static void gpucg_css_free(struct cgroup_subsys_state *css)
> > +{
> > +     struct gpucg_resource_pool *rpool, *tmp;
> > +     struct gpucg *gpucg = css_to_gpucg(css);
> > +
> > +     // delete all resource pools
> > +     mutex_lock(&gpucg_mutex);
> > +     list_for_each_entry_safe(rpool, tmp, &gpucg->rpools, cg_node)
> > +             free_cg_rpool_locked(rpool);
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     kfree(gpucg);
> > +}
> > +
> > +static struct cgroup_subsys_state *
> > +gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
> > +{
> > +     struct gpucg *gpucg, *parent;
> > +
> > +     gpucg = kzalloc(sizeof(struct gpucg), GFP_KERNEL);
> > +     if (!gpucg)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     parent = css_to_gpucg(parent_css);
> > +     if (!parent)
> > +             root_gpucg = gpucg;
> > +
> > +     INIT_LIST_HEAD(&gpucg->rpools);
> > +
> > +     return &gpucg->css;
> > +}
> > +
> > +static struct gpucg_resource_pool *cg_rpool_find_locked(
> > +     struct gpucg *cg,
> > +     struct gpucg_bucket *bucket)
> > +{
> > +     struct gpucg_resource_pool *rpool;
> > +
> > +     lockdep_assert_held(&gpucg_mutex);
> > +
> > +     list_for_each_entry(rpool, &cg->rpools, cg_node)
> > +             if (rpool->bucket == bucket)
> > +                     return rpool;
> > +
> > +     return NULL;
> > +}
> > +
> > +static struct gpucg_resource_pool *cg_rpool_init(struct gpucg *cg,
> > +                                              struct gpucg_bucket *bucket)
> > +{
> > +     struct gpucg_resource_pool *rpool = kzalloc(sizeof(*rpool),
> > +                                                     GFP_KERNEL);
> > +     if (!rpool)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     rpool->bucket = bucket;
> > +
> > +     page_counter_init(&rpool->total, NULL);
> > +     INIT_LIST_HEAD(&rpool->cg_node);
> > +     INIT_LIST_HEAD(&rpool->bucket_node);
> > +     list_add_tail(&rpool->cg_node, &cg->rpools);
> > +     list_add_tail(&rpool->bucket_node, &bucket->rpools);
> > +
> > +     return rpool;
> > +}
> > +
> > +/**
> > + * cg_rpool_get_locked - find the resource pool for the specified bucket and
> > + * specified cgroup. If the resource pool does not exist for the cg, it is
> > + * created in a hierarchical manner in the cgroup and in its ancestor cgroups
> > + * that do not already have a resource pool entry for the bucket.
> > + *
> > + * @cg: The cgroup to find the resource pool for.
> > + * @bucket: The bucket associated with the returned resource pool.
> > + *
> > + * Return: the resource pool entry corresponding to the specified bucket in
> > + * the specified cgroup (hierarchically creating them if not existing already).
> > + */
> > +static struct gpucg_resource_pool *
> > +cg_rpool_get_locked(struct gpucg *cg, struct gpucg_bucket *bucket)
> > +{
> > +     struct gpucg *parent_cg, *p, *stop_cg;
> > +     struct gpucg_resource_pool *rpool, *tmp_rpool;
> > +     struct gpucg_resource_pool *parent_rpool = NULL, *leaf_rpool = NULL;
> > +
> > +     rpool = cg_rpool_find_locked(cg, bucket);
> > +     if (rpool)
> > +             return rpool;
> > +
> > +     stop_cg = cg;
> > +     do {
> > +             rpool = cg_rpool_init(stop_cg, bucket);
> > +             if (IS_ERR(rpool))
> > +                     goto err;
> > +
> > +             if (!leaf_rpool)
> > +                     leaf_rpool = rpool;
> > +
> > +             stop_cg = gpucg_parent(stop_cg);
> > +             if (!stop_cg)
> > +                     break;
> > +
> > +             rpool = cg_rpool_find_locked(stop_cg, bucket);
> > +     } while (!rpool);
> > +
> > +     /*
> > +      * Re-initialize page counters of all rpools created in this invocation
> > +      * to enable hierarchical charging.
> > +      * stop_cg is the first ancestor cg who already had a resource pool for
> > +      * the bucket. It can also be NULL if no ancestors had a pre-existing
> > +      * resource pool for the bucket before this invocation.
> > +      */
> > +     rpool = leaf_rpool;
> > +     for (p = cg; p != stop_cg; p = parent_cg) {
> > +             parent_cg = gpucg_parent(p);
> > +             if (!parent_cg)
> > +                     break;
> > +             parent_rpool = cg_rpool_find_locked(parent_cg, bucket);
> > +             page_counter_init(&rpool->total, &parent_rpool->total);
> > +
> > +             rpool = parent_rpool;
> > +     }
> > +
> > +     return leaf_rpool;
> > +err:
> > +     for (p = cg; p != stop_cg; p = gpucg_parent(p)) {
> > +             tmp_rpool = cg_rpool_find_locked(p, bucket);
> > +             free_cg_rpool_locked(tmp_rpool);
> > +     }
> > +     return rpool;
> > +}
> > +
> > +struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> > +{
> > +     return css ? container_of(css, struct gpucg, css) : NULL;
> > +}
> > +
> > +struct gpucg *gpucg_get(struct task_struct *task)
> > +{
> > +     if (!cgroup_subsys_enabled(gpu_cgrp_subsys))
> > +             return NULL;
> > +     return css_to_gpucg(task_get_css(task, gpu_cgrp_id));
> > +}
> > +
> > +void gpucg_put(struct gpucg *gpucg)
> > +{
> > +     if (gpucg)
> > +             css_put(&gpucg->css);
> > +}
> > +
> > +struct gpucg *gpucg_parent(struct gpucg *cg)
> > +{
> > +     return css_to_gpucg(cg->css.parent);
> > +}
> > +
> > +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> > +{
> > +     struct page_counter *counter;
> > +     u64 nr_pages;
> > +     struct gpucg_resource_pool *rp;
> > +     int ret = 0;
> > +
> > +     nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     rp = cg_rpool_get_locked(gpucg, bucket);
> > +     /*
> > +      * Continue to hold gpucg_mutex because we use it to block charges while transfers are in
> > +      * progress to avoid potentially exceeding a limit.
> > +      */
> > +     if (IS_ERR(rp)) {
> > +             mutex_unlock(&gpucg_mutex);
> > +             return PTR_ERR(rp);
> > +     }
> > +
> > +     if (page_counter_try_charge(&rp->total, nr_pages, &counter))
> > +             css_get(&gpucg->css);
> > +     else
> > +             ret = -ENOMEM;
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     return ret;
> > +}
> > +
> > +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> > +{
> > +     u64 nr_pages;
> > +     struct gpucg_resource_pool *rp;
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     rp = cg_rpool_find_locked(gpucg, bucket);
> > +     /*
> > +      * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is freed and there are
> > +      * active refs on gpucg. Uncharges are fine while transfers are in progress since there is
> > +      * no potential to exceed a limit while uncharging and transferring.
> > +      */
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     if (unlikely(!rp)) {
> > +             pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n");
> > +             return;
> > +     }
> > +
> > +     nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> > +     page_counter_uncharge(&rp->total, nr_pages);
> > +     css_put(&gpucg->css);
> > +}
> > +
> > +struct gpucg_bucket *gpucg_register_bucket(const char *name)
> > +{
> > +     struct gpucg_bucket *bucket, *b;
> > +
> > +     if (!name)
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     if (strlen(name) >= GPUCG_BUCKET_NAME_MAX_LEN)
> > +             return ERR_PTR(-ENAMETOOLONG);
> > +
> > +     bucket = kzalloc(sizeof(struct gpucg_bucket), GFP_KERNEL);
> > +     if (!bucket)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     INIT_LIST_HEAD(&bucket->bucket_node);
> > +     INIT_LIST_HEAD(&bucket->rpools);
> > +     bucket->name = kstrdup_const(name, GFP_KERNEL);
> > +     if (!bucket->name) {
> > +             kfree(bucket);
> > +             return ERR_PTR(-ENOMEM);
> > +     }
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     list_for_each_entry(b, &gpucg_buckets, bucket_node) {
> > +             if (strncmp(b->name, bucket->name, GPUCG_BUCKET_NAME_MAX_LEN) == 0) {
> > +                     mutex_unlock(&gpucg_mutex);
> > +                     kfree_const(bucket->name);
> > +                     kfree(bucket);
> > +                     return ERR_PTR(-EEXIST);
> > +             }
> > +     }
> > +     list_add_tail(&bucket->bucket_node, &gpucg_buckets);
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     return bucket;
> > +}
> > +
> > +static int gpucg_resource_show(struct seq_file *sf, void *v)
> > +{
> > +     struct gpucg_resource_pool *rpool;
> > +     struct gpucg *cg = css_to_gpucg(seq_css(sf));
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     list_for_each_entry(rpool, &cg->rpools, cg_node) {
> > +             seq_printf(sf, "%s %lu\n", rpool->bucket->name,
> > +                        page_counter_read(&rpool->total) * PAGE_SIZE);
> > +     }
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     return 0;
> > +}
> > +
> > +static struct cftype files[] = {
> > +     {
> > +             .name = "memory.current",
> > +             .seq_show = gpucg_resource_show,
> > +     },
> > +     { }     /* terminate */
> > +};
> > +
> > +struct cgroup_subsys gpu_cgrp_subsys = {
> > +     .css_alloc      = gpucg_css_alloc,
> > +     .css_free       = gpucg_css_free,
> > +     .early_init     = false,
> > +     .legacy_cftypes = files,
> > +     .dfl_cftypes    = files,
> > +};
> >
> > --
> > 2.36.0.512.ge40c2bad7a-goog
> >
> >
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory
@ 2022-05-20 16:33     ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-20 16:33 UTC (permalink / raw)
  To: eballetbo
  Cc: linux-kselftest, linux-doc, Carlos Llamas, dri-devel,
	John Stultz, Zefan Li, Kalesh Singh, Joel Fernandes, Shuah Khan,
	Sumit Semwal, Kenny.Ho, Benjamin Gaignard, Jonathan Corbet,
	Martijn Coenen, Laura Abbott, kernel-team, linux-media,
	Todd Kjos, linaro-mm-sig, Hridya Valsaraju, Shuah Khan, cgroups,
	Suren Baghdasaryan, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Liam Mark, Christian König,
	Arve Hjønnevåg, Michal Koutný,
	Johannes Weiner, Tejun Heo

On Thu, May 19, 2022 at 3:53 AM <eballetbo@kernel.org> wrote:
>
> From: Enric Balletbo i Serra <eballetbo@kernel.org>
>
> On Tue, 10 May 2022 23:56:46 +0000, T.J. Mercier wrote
> > From: Hridya Valsaraju <hridya@google.com>
> >
> > The cgroup controller provides accounting for GPU and GPU-related
> > memory allocations. The memory being accounted can be device memory or
> > memory allocated from pools dedicated to serve GPU-related tasks.
> >
> > This patch adds APIs to:
> > -allow a device to register for memory accounting using the GPU cgroup
> > controller.
> > -charge and uncharge allocated memory to a cgroup.
> >
> > When the cgroup controller is enabled, it would expose information about
> > the memory allocated by each device(registered for GPU cgroup memory
> > accounting) for each cgroup.
> >
> > The API/UAPI can be extended to set per-device/total allocation limits
> > in the future.
> >
> > The cgroup controller has been named following the discussion in [1].
> >
> > [1]: https://lore.kernel.org/amd-gfx/YCJp%2F%2FkMC7YjVMXv@phenom.ffwll.local/
> >
> > Signed-off-by: Hridya Valsaraju <hridya@google.com>
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > ---
> > v7 changes
> > Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
> > This means gpucg_register_bucket now returns an internally allocated
> > struct gpucg_bucket.
> >
> > Move all public function documentation to the cgroup_gpu.h header.
> >
> > v5 changes
> > Support all strings for gpucg_register_device instead of just string
> > literals.
> >
> > Enforce globally unique gpucg_bucket names.
> >
> > Constrain gpucg_bucket name lengths to 64 bytes.
> >
> > Obtain just a single css refcount instead of nr_pages for each
> > charge.
> >
> > Rename:
> > gpucg_try_charge -> gpucg_charge
> > find_cg_rpool_locked -> cg_rpool_find_locked
> > init_cg_rpool -> cg_rpool_init
> > get_cg_rpool_locked -> cg_rpool_get_locked
> > "gpu cgroup controller" -> "GPU controller"
> > gpucg_device -> gpucg_bucket
> > usage -> size
> >
> > v4 changes
> > Adjust gpucg_try_charge critical section for future charge transfer
> > functionality.
> >
> > v3 changes
> > Use more common dual author commit message format per John Stultz.
> >
> > v2 changes
> > Fix incorrect Kconfig help section indentation per Randy Dunlap.
> > ---
> > +}
> > +
> > +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> > +{
> > +     struct page_counter *counter;
> > +     u64 nr_pages;
> > +     struct gpucg_resource_pool *rp;
> > +     int ret = 0;
> > +
> > +     nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     rp = cg_rpool_get_locked(gpucg, bucket);
> > +     /*
> > +      * Continue to hold gpucg_mutex because we use it to block charges while transfers are in
> > +      * progress to avoid potentially exceeding a limit.
> > +      */
> > +     if (IS_ERR(rp)) {
> > +             mutex_unlock(&gpucg_mutex);
> > +             return PTR_ERR(rp);
> > +     }
> > +
> > +     if (page_counter_try_charge(&rp->total, nr_pages, &counter))
> > +             css_get(&gpucg->css);
> > +     else
> > +             ret = -ENOMEM;
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     return ret;
> > +}
> > +
> > +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> > +{
> > +     u64 nr_pages;
> > +     struct gpucg_resource_pool *rp;
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     rp = cg_rpool_find_locked(gpucg, bucket);
> > +     /*
> > +      * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is freed and there are
> > +      * active refs on gpucg. Uncharges are fine while transfers are in progress since there is
> > +      * no potential to exceed a limit while uncharging and transferring.
> > +      */
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     if (unlikely(!rp)) {
> > +             pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n");
> > +             return;
> > +     }
> > +
> > +     nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> > +     page_counter_uncharge(&rp->total, nr_pages);
> > +     css_put(&gpucg->css);
> > +}
> > +
> > +struct gpucg_bucket *gpucg_register_bucket(const char *name)
> > +{
> > +     struct gpucg_bucket *bucket, *b;
> > +
> > +     if (!name)
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     if (strlen(name) >= GPUCG_BUCKET_NAME_MAX_LEN)
> > +             return ERR_PTR(-ENAMETOOLONG);
> > +
> > +     bucket = kzalloc(sizeof(struct gpucg_bucket), GFP_KERNEL);
> > +     if (!bucket)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     INIT_LIST_HEAD(&bucket->bucket_node);
> > +     INIT_LIST_HEAD(&bucket->rpools);
> > +     bucket->name = kstrdup_const(name, GFP_KERNEL);
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     list_for_each_entry(b, &gpucg_buckets, bucket_node) {
> > +             if (strncmp(b->name, bucket->name, GPUCG_BUCKET_NAME_MAX_LEN) == 0) {
> > +                     mutex_unlock(&gpucg_mutex);
> > +                     kfree_const(bucket->name);
> > +                     kfree(bucket);
> > +                     return ERR_PTR(-EEXIST);
> > +             }
> > +     }
> > +     list_add_tail(&bucket->bucket_node, &gpucg_buckets);
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     return bucket;
> > +}
> > +
> > +static int gpucg_resource_show(struct seq_file *sf, void *v)
> > +{
> > +     struct gpucg_resource_pool *rpool;
> > +     struct gpucg *cg = css_to_gpucg(seq_css(sf));
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     list_for_each_entry(rpool, &cg->rpools, cg_node) {
> > +             seq_printf(sf, "%s %lu\n", rpool->bucket->name,
> > +                        page_counter_read(&rpool->total) * PAGE_SIZE);
> > +     }
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     return 0;
> > +}
> > +
> > +struct cftype files[] = {
> > +     {
> > +             .name = "memory.current",
> > +             .seq_show = gpucg_resource_show,
> > +     },
> > +     { }     /* terminate */
> > +};
> > +
> > +struct cgroup_subsys gpu_cgrp_subsys = {
> > +     .css_alloc      = gpucg_css_alloc,
> > +     .css_free       = gpucg_css_free,
> > +     .early_init     = false,
> > +     .legacy_cftypes = files,
> > +     .dfl_cftypes    = files,
> > +};
> >
> > --
> > 2.36.0.512.ge40c2bad7a-goog
> >
> >
>


* Re: [PATCH v7 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory
@ 2022-05-20 16:33     ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-20 16:33 UTC (permalink / raw)
  To: eballetbo
  Cc: Zefan Li, Jonathan Corbet, Joel Fernandes,
	Arve Hjønnevåg, Martijn Coenen, Benjamin Gaignard,
	Tejun Heo, Christian Brauner, Sumit Semwal, Todd Kjos,
	Suren Baghdasaryan, Johannes Weiner, Brian Starkey,
	Christian König, Greg Kroah-Hartman, Liam Mark, John Stultz,
	Hridya Valsaraju, Shuah Khan

On Thu, May 19, 2022 at 3:53 AM <eballetbo@kernel.org> wrote:
>
> From: Enric Balletbo i Serra <eballetbo@kernel.org>
>
> On Tue, 10 May 2022 23:56:46 +0000, T.J. Mercier wrote:
> > From: Hridya Valsaraju <hridya@google.com>
> >
> > The cgroup controller provides accounting for GPU and GPU-related
> > memory allocations. The memory being accounted can be device memory or
> > memory allocated from pools dedicated to serve GPU-related tasks.
> >
> > This patch adds APIs to:
> > -allow a device to register for memory accounting using the GPU cgroup
> > controller.
> > -charge and uncharge allocated memory to a cgroup.
> >
> > When the cgroup controller is enabled, it would expose information about
> > the memory allocated by each device(registered for GPU cgroup memory
> > accounting) for each cgroup.
> >
> > The API/UAPI can be extended to set per-device/total allocation limits
> > in the future.
> >
> > The cgroup controller has been named following the discussion in [1].
> >
> > [1]: https://lore.kernel.org/amd-gfx/YCJp%2F%2FkMC7YjVMXv@phenom.ffwll.local/
> >
> > Signed-off-by: Hridya Valsaraju <hridya@google.com>
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > ---
> > v7 changes
> > Hide gpucg and gpucg_bucket struct definitions per Michal Koutný.
> > This means gpucg_register_bucket now returns an internally allocated
> > struct gpucg_bucket.
> >
> > Move all public function documentation to the cgroup_gpu.h header.
> >
> > v5 changes
> > Support all strings for gpucg_register_device instead of just string
> > literals.
> >
> > Enforce globally unique gpucg_bucket names.
> >
> > Constrain gpucg_bucket name lengths to 64 bytes.
> >
> > Obtain just a single css refcount instead of nr_pages for each
> > charge.
> >
> > Rename:
> > gpucg_try_charge -> gpucg_charge
> > find_cg_rpool_locked -> cg_rpool_find_locked
> > init_cg_rpool -> cg_rpool_init
> > get_cg_rpool_locked -> cg_rpool_get_locked
> > "gpu cgroup controller" -> "GPU controller"
> > gpucg_device -> gpucg_bucket
> > usage -> size
> >
> > v4 changes
> > Adjust gpucg_try_charge critical section for future charge transfer
> > functionality.
> >
> > v3 changes
> > Use more common dual author commit message format per John Stultz.
> >
> > v2 changes
> > Fix incorrect Kconfig help section indentation per Randy Dunlap.
> > ---
> >  include/linux/cgroup_gpu.h    | 122 ++++++++++++
> >  include/linux/cgroup_subsys.h |   4 +
> >  init/Kconfig                  |   7 +
> >  kernel/cgroup/Makefile        |   1 +
> >  kernel/cgroup/gpu.c           | 339 ++++++++++++++++++++++++++++++++++
> >  5 files changed, 473 insertions(+)
> >  create mode 100644 include/linux/cgroup_gpu.h
> >  create mode 100644 kernel/cgroup/gpu.c
> >
> > diff --git a/include/linux/cgroup_gpu.h b/include/linux/cgroup_gpu.h
> > new file mode 100644
> > index 000000000000..cb228a16aa1f
> > --- /dev/null
> > +++ b/include/linux/cgroup_gpu.h
> > @@ -0,0 +1,122 @@
> > +/* SPDX-License-Identifier: MIT
> > + * Copyright 2019 Advanced Micro Devices, Inc.
> > + * Copyright (C) 2022 Google LLC.
> > + */
> > +#ifndef _CGROUP_GPU_H
> > +#define _CGROUP_GPU_H
> > +
> > +#include <linux/cgroup.h>
> > +
> > +#define GPUCG_BUCKET_NAME_MAX_LEN 64
> > +
> > +struct gpucg;
> > +struct gpucg_bucket;
> > +
> > +#ifdef CONFIG_CGROUP_GPU
> > +
> > +/**
> > + * css_to_gpucg - get the corresponding gpucg ref from a cgroup_subsys_state
> > + * @css: the target cgroup_subsys_state
> > + *
> > + * Returns: gpu cgroup that contains the @css
> > + */
> > +struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css);
> > +
> > +/**
> > + * gpucg_get - get the gpucg reference that a task belongs to
> > + * @task: the target task
> > + *
> > + * This increases the reference count of the css that the @task belongs to.
> > + *
> > + * Returns: reference to the gpu cgroup the task belongs to.
> > + */
> > +struct gpucg *gpucg_get(struct task_struct *task);
> > +
> > +/**
> > + * gpucg_put - put a gpucg reference
> > + * @gpucg: the target gpucg
> > + *
> > + * Put a reference obtained via gpucg_get
> > + */
> > +void gpucg_put(struct gpucg *gpucg);
> > +
> > +/**
> > + * gpucg_parent - find the parent of a gpu cgroup
> > + * @cg: the target gpucg
> > + *
> > + * This does not increase the reference count of the parent cgroup
> > + *
> > + * Returns: parent gpu cgroup of @cg
> > + */
> > +struct gpucg *gpucg_parent(struct gpucg *cg);
> > +
> > +/**
> > + * gpucg_charge - charge memory to the specified gpucg and gpucg_bucket.
> > + * Caller must hold a reference to @gpucg obtained through gpucg_get(). The size of the memory is
> > + * rounded up to be a multiple of the page size.
> > + *
> > + * @gpucg: The gpu cgroup to charge the memory to.
> > + * @bucket: The bucket to charge the memory to.
> > + * @size: The size of memory to charge in bytes.
> > + *        This size will be rounded up to the nearest page size.
> > + *
> > + * Return: returns 0 if the charging is successful and otherwise returns an error code.
> > + */
> > +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
> > +
> > +/**
> > + * gpucg_uncharge - uncharge memory from the specified gpucg and gpucg_bucket.
> > + * The caller must hold a reference to @gpucg obtained through gpucg_get().
> > + *
> > + * @gpucg: The gpu cgroup to uncharge the memory from.
> > + * @bucket: The bucket to uncharge the memory from.
> > + * @size: The size of memory to uncharge in bytes.
> > + *        This size will be rounded up to the nearest page size.
> > + */
> > +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size);
> > +
> > +/**
> > + * gpucg_register_bucket - Registers a bucket for memory accounting using the GPU cgroup controller.
> > + *
> > + * @name: Pointer to a null-terminated string to denote the name of the bucket. This name should be
> > + *        globally unique, and should not exceed @GPUCG_BUCKET_NAME_MAX_LEN bytes.
> > + *
> > + * @bucket must remain valid. @name will be copied.
> > + *
> > + * Returns a pointer to a newly allocated bucket on success, or an errno code otherwise. As buckets
> > + * cannot be unregistered, this can never be freed.
> > + */
> > +struct gpucg_bucket *gpucg_register_bucket(const char *name);
> > +#else /* CONFIG_CGROUP_GPU */
> > +
> > +static inline struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> > +{
> > +     return NULL;
> > +}
> > +
> > +static inline struct gpucg *gpucg_get(struct task_struct *task)
> > +{
> > +     return NULL;
> > +}
> > +
> > +static inline void gpucg_put(struct gpucg *gpucg) {}
> > +
> > +static inline struct gpucg *gpucg_parent(struct gpucg *cg)
> > +{
> > +     return NULL;
> > +}
> > +
> > +static inline int gpucg_charge(struct gpucg *gpucg,
> > +                            struct gpucg_bucket *bucket,
> > +                            u64 size)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline void gpucg_uncharge(struct gpucg *gpucg,
> > +                               struct gpucg_bucket *bucket,
> > +                               u64 size) {}
> > +
> > +static inline struct gpucg_bucket *gpucg_register_bucket(const char *name) {}
>
> I think this needs to return NULL, otherwise you'll get a compiler error when
> CONFIG_CGROUP_GPU is not set.
>
> I found other build errors when CONFIG_CGROUP_GPU is not set, please fix them in
> the next version.
>
> Thanks,
>   Enric
>
Thanks. I have been building each patch with allnoconfig and
allyesconfig before posting, but clearly this was not sufficient. I'll
fix this up.


> > +#endif /* CONFIG_CGROUP_GPU */
> > +#endif /* _CGROUP_GPU_H */
> > diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> > index 445235487230..46a2a7b93c41 100644
> > --- a/include/linux/cgroup_subsys.h
> > +++ b/include/linux/cgroup_subsys.h
> > @@ -65,6 +65,10 @@ SUBSYS(rdma)
> >  SUBSYS(misc)
> >  #endif
> >
> > +#if IS_ENABLED(CONFIG_CGROUP_GPU)
> > +SUBSYS(gpu)
> > +#endif
> > +
> >  /*
> >   * The following subsystems are not supported on the default hierarchy.
> >   */
> > diff --git a/init/Kconfig b/init/Kconfig
> > index ddcbefe535e9..2e00a190e170 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -984,6 +984,13 @@ config BLK_CGROUP
> >
> >       See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information.
> >
> > +config CGROUP_GPU
> > +     bool "GPU controller (EXPERIMENTAL)"
> > +     select PAGE_COUNTER
> > +     help
> > +       Provides accounting and limit setting for memory allocations by the GPU and
> > +       GPU-related subsystems.
> > +
> >  config CGROUP_WRITEBACK
> >       bool
> >       depends on MEMCG && BLK_CGROUP
> > diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> > index 12f8457ad1f9..be95a5a532fc 100644
> > --- a/kernel/cgroup/Makefile
> > +++ b/kernel/cgroup/Makefile
> > @@ -7,3 +7,4 @@ obj-$(CONFIG_CGROUP_RDMA) += rdma.o
> >  obj-$(CONFIG_CPUSETS) += cpuset.o
> >  obj-$(CONFIG_CGROUP_MISC) += misc.o
> >  obj-$(CONFIG_CGROUP_DEBUG) += debug.o
> > +obj-$(CONFIG_CGROUP_GPU) += gpu.o
> > diff --git a/kernel/cgroup/gpu.c b/kernel/cgroup/gpu.c
> > new file mode 100644
> > index 000000000000..ad16ea15d427
> > --- /dev/null
> > +++ b/kernel/cgroup/gpu.c
> > @@ -0,0 +1,339 @@
> > +// SPDX-License-Identifier: MIT
> > +// Copyright 2019 Advanced Micro Devices, Inc.
> > +// Copyright (C) 2022 Google LLC.
> > +
> > +#include <linux/cgroup.h>
> > +#include <linux/cgroup_gpu.h>
> > +#include <linux/err.h>
> > +#include <linux/gfp.h>
> > +#include <linux/list.h>
> > +#include <linux/mm.h>
> > +#include <linux/page_counter.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/slab.h>
> > +#include <linux/string.h>
> > +
> > +static struct gpucg *root_gpucg __read_mostly;
> > +
> > +/*
> > + * Protects list of resource pools maintained on per cgroup basis and list
> > + * of buckets registered for memory accounting using the GPU cgroup controller.
> > + */
> > +static DEFINE_MUTEX(gpucg_mutex);
> > +static LIST_HEAD(gpucg_buckets);
> > +
> > +/* The GPU cgroup controller data structure */
> > +struct gpucg {
> > +     struct cgroup_subsys_state css;
> > +
> > +     /* list of all resource pools that belong to this cgroup */
> > +     struct list_head rpools;
> > +};
> > +
> > +/* A named entity representing bucket of tracked memory. */
> > +struct gpucg_bucket {
> > +     /* list of various resource pools in various cgroups that the bucket is part of */
> > +     struct list_head rpools;
> > +
> > +     /* list of all buckets registered for GPU cgroup accounting */
> > +     struct list_head bucket_node;
> > +
> > +     /* string to be used as identifier for accounting and limit setting */
> > +     const char *name;
> > +};
> > +
> > +struct gpucg_resource_pool {
> > +     /* The bucket whose resource usage is tracked by this resource pool */
> > +     struct gpucg_bucket *bucket;
> > +
> > +     /* list of all resource pools for the cgroup */
> > +     struct list_head cg_node;
> > +
> > +     /* list maintained by the gpucg_bucket to keep track of its resource pools */
> > +     struct list_head bucket_node;
> > +
> > +     /* tracks memory usage of the resource pool */
> > +     struct page_counter total;
> > +};
> > +
> > +static void free_cg_rpool_locked(struct gpucg_resource_pool *rpool)
> > +{
> > +     lockdep_assert_held(&gpucg_mutex);
> > +
> > +     list_del(&rpool->cg_node);
> > +     list_del(&rpool->bucket_node);
> > +     kfree(rpool);
> > +}
> > +
> > +static void gpucg_css_free(struct cgroup_subsys_state *css)
> > +{
> > +     struct gpucg_resource_pool *rpool, *tmp;
> > +     struct gpucg *gpucg = css_to_gpucg(css);
> > +
> > +     // delete all resource pools
> > +     mutex_lock(&gpucg_mutex);
> > +     list_for_each_entry_safe(rpool, tmp, &gpucg->rpools, cg_node)
> > +             free_cg_rpool_locked(rpool);
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     kfree(gpucg);
> > +}
> > +
> > +static struct cgroup_subsys_state *
> > +gpucg_css_alloc(struct cgroup_subsys_state *parent_css)
> > +{
> > +     struct gpucg *gpucg, *parent;
> > +
> > +     gpucg = kzalloc(sizeof(struct gpucg), GFP_KERNEL);
> > +     if (!gpucg)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     parent = css_to_gpucg(parent_css);
> > +     if (!parent)
> > +             root_gpucg = gpucg;
> > +
> > +     INIT_LIST_HEAD(&gpucg->rpools);
> > +
> > +     return &gpucg->css;
> > +}
> > +
> > +static struct gpucg_resource_pool *cg_rpool_find_locked(
> > +     struct gpucg *cg,
> > +     struct gpucg_bucket *bucket)
> > +{
> > +     struct gpucg_resource_pool *rpool;
> > +
> > +     lockdep_assert_held(&gpucg_mutex);
> > +
> > +     list_for_each_entry(rpool, &cg->rpools, cg_node)
> > +             if (rpool->bucket == bucket)
> > +                     return rpool;
> > +
> > +     return NULL;
> > +}
> > +
> > +static struct gpucg_resource_pool *cg_rpool_init(struct gpucg *cg,
> > +                                              struct gpucg_bucket *bucket)
> > +{
> > +     struct gpucg_resource_pool *rpool = kzalloc(sizeof(*rpool),
> > +                                                     GFP_KERNEL);
> > +     if (!rpool)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     rpool->bucket = bucket;
> > +
> > +     page_counter_init(&rpool->total, NULL);
> > +     INIT_LIST_HEAD(&rpool->cg_node);
> > +     INIT_LIST_HEAD(&rpool->bucket_node);
> > +     list_add_tail(&rpool->cg_node, &cg->rpools);
> > +     list_add_tail(&rpool->bucket_node, &bucket->rpools);
> > +
> > +     return rpool;
> > +}
> > +
> > +/**
> > + * cg_rpool_get_locked - find the resource pool for the specified bucket and
> > + * specified cgroup. If the resource pool does not exist for the cg, it is
> > + * created in a hierarchical manner in the cgroup and its ancestor cgroups that
> > + * do not already have a resource pool entry for the bucket.
> > + *
> > + * @cg: The cgroup to find the resource pool for.
> > + * @bucket: The bucket associated with the returned resource pool.
> > + *
> > + * Return: the resource pool entry corresponding to the specified bucket in
> > + * the specified cgroup (hierarchically creating them if not already existing).
> > + */
> > +static struct gpucg_resource_pool *
> > +cg_rpool_get_locked(struct gpucg *cg, struct gpucg_bucket *bucket)
> > +{
> > +     struct gpucg *parent_cg, *p, *stop_cg;
> > +     struct gpucg_resource_pool *rpool, *tmp_rpool;
> > +     struct gpucg_resource_pool *parent_rpool = NULL, *leaf_rpool = NULL;
> > +
> > +     rpool = cg_rpool_find_locked(cg, bucket);
> > +     if (rpool)
> > +             return rpool;
> > +
> > +     stop_cg = cg;
> > +     do {
> > +             rpool = cg_rpool_init(stop_cg, bucket);
> > +             if (IS_ERR(rpool))
> > +                     goto err;
> > +
> > +             if (!leaf_rpool)
> > +                     leaf_rpool = rpool;
> > +
> > +             stop_cg = gpucg_parent(stop_cg);
> > +             if (!stop_cg)
> > +                     break;
> > +
> > +             rpool = cg_rpool_find_locked(stop_cg, bucket);
> > +     } while (!rpool);
> > +
> > +     /*
> > +      * Re-initialize page counters of all rpools created in this invocation
> > +      * to enable hierarchical charging.
> > +      * stop_cg is the first ancestor cg who already had a resource pool for
> > +      * the bucket. It can also be NULL if no ancestors had a pre-existing
> > +      * resource pool for the bucket before this invocation.
> > +      */
> > +     rpool = leaf_rpool;
> > +     for (p = cg; p != stop_cg; p = parent_cg) {
> > +             parent_cg = gpucg_parent(p);
> > +             if (!parent_cg)
> > +                     break;
> > +             parent_rpool = cg_rpool_find_locked(parent_cg, bucket);
> > +             page_counter_init(&rpool->total, &parent_rpool->total);
> > +
> > +             rpool = parent_rpool;
> > +     }
> > +
> > +     return leaf_rpool;
> > +err:
> > +     for (p = cg; p != stop_cg; p = gpucg_parent(p)) {
> > +             tmp_rpool = cg_rpool_find_locked(p, bucket);
> > +             free_cg_rpool_locked(tmp_rpool);
> > +     }
> > +     return rpool;
> > +}
> > +
> > +struct gpucg *css_to_gpucg(struct cgroup_subsys_state *css)
> > +{
> > +     return css ? container_of(css, struct gpucg, css) : NULL;
> > +}
> > +
> > +struct gpucg *gpucg_get(struct task_struct *task)
> > +{
> > +     if (!cgroup_subsys_enabled(gpu_cgrp_subsys))
> > +             return NULL;
> > +     return css_to_gpucg(task_get_css(task, gpu_cgrp_id));
> > +}
> > +
> > +void gpucg_put(struct gpucg *gpucg)
> > +{
> > +     if (gpucg)
> > +             css_put(&gpucg->css);
> > +}
> > +
> > +struct gpucg *gpucg_parent(struct gpucg *cg)
> > +{
> > +     return css_to_gpucg(cg->css.parent);
> > +}
> > +
> > +int gpucg_charge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> > +{
> > +     struct page_counter *counter;
> > +     u64 nr_pages;
> > +     struct gpucg_resource_pool *rp;
> > +     int ret = 0;
> > +
> > +     nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     rp = cg_rpool_get_locked(gpucg, bucket);
> > +     /*
> > +      * Continue to hold gpucg_mutex because we use it to block charges while transfers are in
> > +      * progress to avoid potentially exceeding a limit.
> > +      */
> > +     if (IS_ERR(rp)) {
> > +             mutex_unlock(&gpucg_mutex);
> > +             return PTR_ERR(rp);
> > +     }
> > +
> > +     if (page_counter_try_charge(&rp->total, nr_pages, &counter))
> > +             css_get(&gpucg->css);
> > +     else
> > +             ret = -ENOMEM;
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     return ret;
> > +}
> > +
> > +void gpucg_uncharge(struct gpucg *gpucg, struct gpucg_bucket *bucket, u64 size)
> > +{
> > +     u64 nr_pages;
> > +     struct gpucg_resource_pool *rp;
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     rp = cg_rpool_find_locked(gpucg, bucket);
> > +     /*
> > +      * gpucg_mutex can be unlocked here, rp will stay valid until gpucg is freed and there are
> > +      * active refs on gpucg. Uncharges are fine while transfers are in progress since there is
> > +      * no potential to exceed a limit while uncharging and transferring.
> > +      */
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     if (unlikely(!rp)) {
> > +             pr_err("Resource pool not found, incorrect charge/uncharge ordering?\n");
> > +             return;
> > +     }
> > +
> > +     nr_pages = PAGE_ALIGN(size) >> PAGE_SHIFT;
> > +     page_counter_uncharge(&rp->total, nr_pages);
> > +     css_put(&gpucg->css);
> > +}
> > +
> > +struct gpucg_bucket *gpucg_register_bucket(const char *name)
> > +{
> > +     struct gpucg_bucket *bucket, *b;
> > +
> > +     if (!name)
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     if (strlen(name) >= GPUCG_BUCKET_NAME_MAX_LEN)
> > +             return ERR_PTR(-ENAMETOOLONG);
> > +
> > +     bucket = kzalloc(sizeof(struct gpucg_bucket), GFP_KERNEL);
> > +     if (!bucket)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     INIT_LIST_HEAD(&bucket->bucket_node);
> > +     INIT_LIST_HEAD(&bucket->rpools);
> > +     bucket->name = kstrdup_const(name, GFP_KERNEL);
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     list_for_each_entry(b, &gpucg_buckets, bucket_node) {
> > +             if (strncmp(b->name, bucket->name, GPUCG_BUCKET_NAME_MAX_LEN) == 0) {
> > +                     mutex_unlock(&gpucg_mutex);
> > +                     kfree_const(bucket->name);
> > +                     kfree(bucket);
> > +                     return ERR_PTR(-EEXIST);
> > +             }
> > +     }
> > +     list_add_tail(&bucket->bucket_node, &gpucg_buckets);
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     return bucket;
> > +}
> > +
> > +static int gpucg_resource_show(struct seq_file *sf, void *v)
> > +{
> > +     struct gpucg_resource_pool *rpool;
> > +     struct gpucg *cg = css_to_gpucg(seq_css(sf));
> > +
> > +     mutex_lock(&gpucg_mutex);
> > +     list_for_each_entry(rpool, &cg->rpools, cg_node) {
> > +             seq_printf(sf, "%s %lu\n", rpool->bucket->name,
> > +                        page_counter_read(&rpool->total) * PAGE_SIZE);
> > +     }
> > +     mutex_unlock(&gpucg_mutex);
> > +
> > +     return 0;
> > +}
> > +
> > +struct cftype files[] = {
> > +     {
> > +             .name = "memory.current",
> > +             .seq_show = gpucg_resource_show,
> > +     },
> > +     { }     /* terminate */
> > +};
> > +
> > +struct cgroup_subsys gpu_cgrp_subsys = {
> > +     .css_alloc      = gpucg_css_alloc,
> > +     .css_free       = gpucg_css_free,
> > +     .early_init     = false,
> > +     .legacy_cftypes = files,
> > +     .dfl_cftypes    = files,
> > +};
> >
> > --
> > 2.36.0.512.ge40c2bad7a-goog
> >
> >
>


* Re: [PATCH v7 1/6] gpu: rfc: Proposal for a GPU cgroup controller
  2022-05-19  9:30   ` eballetbo
@ 2022-05-21  2:19     ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-21  2:19 UTC (permalink / raw)
  To: eballetbo
  Cc: Zefan Li, Jonathan Corbet, Joel Fernandes,
	Arve Hjønnevåg, Martijn Coenen, Benjamin Gaignard,
	Tejun Heo, Christian Brauner, Sumit Semwal, Todd Kjos,
	Suren Baghdasaryan, Johannes Weiner, Brian Starkey,
	Christian König, Greg Kroah-Hartman, Liam Mark, John Stultz,
	Hridya Valsaraju, Shuah Khan, Laura Abbott, cgroups, kernel-team,
	linux-media, dri-devel, linaro-mm-sig, Carlos Llamas,
	Daniel Vetter, Kenny.Ho, linux-kselftest, Kalesh Singh,
	Michal Koutný,
	John Stultz, linux-doc, linux-kernel, Shuah Khan

On Thu, May 19, 2022 at 2:31 AM <eballetbo@kernel.org> wrote:
>
> From: Enric Balletbo i Serra <eballetbo@kernel.org>
>
> On Tue, 10 May 2022 23:56:45 +0000, T.J. Mercier wrote:
> > From: Hridya Valsaraju <hridya@google.com>
> >
>
> Hi T.J. Mercier,
>
> Many thanks for this effort. It caught my attention because we might have a use
> case where this feature can be useful for us. Hence I'd like to jump in and be
> part of the discussion; I'd really appreciate it if you could cc me on future
> versions.
>
Hi Enric,

Sure thing, thanks for engaging.

> While reading the full patchset I was a bit confused about the status of this
> proposal. In fact, the rfc in the subject combined with the number of iterations
> (already seven) confused me. So I'm wondering if this is an RFC or a 'real'
> proposal that you already want to land.
>
I'm sorry about this. I'm quite new to kernel development (this is my
first set of patches) and the point at which I should have
transitioned from RFC to PATCH was not clear to me. The status now
could be described as adding initial support for accounting that would
be built upon to expand what is tracked (more than just buffers from
heaps) and to add support for limiting. I see you have also commented
about this below.

> If this is still an RFC I'd remove the 'rfc: Proposal' and use the more
> canonical way, which is to put RFC inside the brackets, i.e. [PATCH RFC v7]
> cgroup: Add a GPU cgroup controller.
>
> If it is not, I'd just remove the RFC and put the subject in the cgroup
> subsystem instead of gpu, i.e. [PATCH v7] cgroup: Add a GPU cgroup controller.
>
> I don't want to nitpick, but IMO that helps new people follow the history of
> the patchset.
>
> > This patch adds a proposal for a new GPU cgroup controller for
> > accounting/limiting GPU and GPU-related memory allocations.
>
> As far as I can see, the only thing being added here is the accounting, so I'd
> remove any reference to limiting and just explain what the patch really
> introduces, not the future; otherwise it is confusing and you expect more than
> the patch really does.
>
> It is important to keep the commit message in sync with what the patch really
> does.
>
Acknowledged, thank you.

> > The proposed controller is based on the DRM cgroup controller[1] and
> > follows the design of the RDMA cgroup controller.
> >
> > The new cgroup controller would:
> > * Allow setting per-device limits on the total size of buffers
> >   allocated by device within a cgroup.
> > * Expose a per-device/allocator breakdown of the buffers charged to a
> >   cgroup.
> >
> > The prototype in the following patches is only for memory accounting
> > using the GPU cgroup controller and does not implement limit setting.
> >
> > [1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
> >
>
> I think this is material for the cover letter more than the commit message.
> When I read this I was expecting all of this in this patch.
>
> > Signed-off-by: Hridya Valsaraju <hridya@google.com>
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > ---
> > v7 changes
> > Remove comment about duplicate name rejection which is not relevant to
> > cgroups users per Michal Koutný.
> >
> > v6 changes
> > Move documentation into cgroup-v2.rst per Tejun Heo.
> >
> > v5 changes
> > Drop the global GPU cgroup "total" (sum of all device totals) portion
> > of the design since there is no currently known use for this per
> > Tejun Heo.
> >
> > Update for renamed functions/variables.
> >
> > v3 changes
> > Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz.
> >
> > Use more common dual author commit message format per John Stultz.
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69d7a6983f78..2e1d26e327c7 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -2352,6 +2352,29 @@ first, and stays charged to that cgroup until that resource is freed. Migrating
> >  a process to a different cgroup does not move the charge to the destination
> >  cgroup where the process has moved.
> >
> > +
> > +GPU
> > +---
> > +
> > +The GPU controller accounts for device and system memory allocated by the GPU
> > +and related subsystems for graphics use. Resource limits are not currently
> > +supported.
> > +
> > +GPU Interface Files
> > +~~~~~~~~~~~~~~~~~~~~
> > +
> > +  gpu.memory.current
> > +     A read-only file containing memory allocations in flat-keyed format. The key
> > +     is a string representing the device name. The value is the size of the memory
> > +     charged to the device in bytes. The device names are globally unique.::
> > +
> > +       $ cat /sys/kernel/fs/cgroup1/gpu.memory.current
>
> I think this is outdated, you are using cgroup v2, right?
>
Oh "cgroup1" was meant to refer to the name of a cgroup, not to cgroup
v1. A different name would be better here.

> > +       dev1 4194304
> > +       dev2 104857600
> > +
>
> When I applied the full series I was expecting see the memory allocated by the
> gpu devices or users of the gpu in this file but, after some experiments, what I
> saw is the memory allocated via any process that uses the dma-buf heap API (not
> necessary gpu users). For example, if you create a small program that allocates
> some memory via the dma-buf heap API and then you cat the gpu.memory.current
> file, you see that the memory accounted is not related to the gpu.
>
> This is really confusing, looks to me that the patches evolved to account memory
> that is not really related to the GPU but allocated vi the dma-buf heap API. IMO
> the name of the file should be according to what really does to avoid
> confusions.
>
> So, is this patchset meant to be GPU specific? If the answer is yes that's good
> but that's not what I experienced. I'm missing something?
>
There are two reasons this exists as a GPU controller. The first is
that most graphics buffers in Android come from these heaps, and this
is primarily what we are interested in accounting. However the idea is
to account other graphics memory types more commonly used on desktop
under different resource names with this controller. The second reason
predates my involvement, but my understanding is that Hridya tried to
upstream heap tracking via tracepoints but was asked to try to use GPU
cgroups instead, which led to her initial version of this series. So
this is a starting point. Any commentary on why this controller would
our would not work for any use cases you have in mind (provided the
appropriate charging/uncharging code is plugged in) would be
appreciated!

By the way, discussion around earlier proposals on this topic
suggested the "G" should be for "general" instead of "graphics", I
think in recognition of the breadth of resources that would eventually
be tracked by it.
https://lore.kernel.org/amd-gfx/YBp4ap+1l2KWbqEJ@phenom.ffwll.local/



> If the answer is that evolved to track dma-buf heap allocations I think all the
> patches need some rework to adapt the wording as right now, the gpu wording
> seems confusing to me.
>
> > +     The device name string is set by a device driver when it registers with the
> > +     GPU cgroup controller to participate in resource accounting.
> > +
> >  Others
> >  ------
> >
> >
> Thanks,
>  Enric
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 1/6] gpu: rfc: Proposal for a GPU cgroup controller
@ 2022-05-21  2:19     ` T.J. Mercier
  0 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-05-21  2:19 UTC (permalink / raw)
  To: eballetbo
  Cc: linux-kselftest, linux-doc, Carlos Llamas, dri-devel,
	John Stultz, Zefan Li, Kalesh Singh, Joel Fernandes, Shuah Khan,
	Sumit Semwal, Kenny.Ho, Benjamin Gaignard, Jonathan Corbet,
	Martijn Coenen, Laura Abbott, kernel-team, linux-media,
	Todd Kjos, linaro-mm-sig, Hridya Valsaraju, Shuah Khan, cgroups,
	Suren Baghdasaryan, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Liam Mark, Christian König,
	Arve Hjønnevåg, Michal Koutný,
	Johannes Weiner, Tejun Heo

On Thu, May 19, 2022 at 2:31 AM <eballetbo@kernel.org> wrote:
>
> From: Enric Balletbo i Serra <eballetbo@kernel.org>
>
> On Tue, 10 May 2022 23:56:45 +0000, T.J. Mercier wrote:
> > From: Hridya Valsaraju <hridya@google.com>
> >
>
> Hi T.J. Mercier,
>
> Many thanks for this effort. It caught my attention because we might have a use
> case where this feature can be useful for us. Hence I'd like to jump in and be
> part of the discussion; I'd really appreciate it if you can cc me on the next versions.
>
Hi Enric,

Sure thing, thanks for engaging.

> While reading the full patchset I was a bit confused about the status of this
> proposal. In fact, the rfc in the subject combined with the number of iterations
> (already seven) confused me. So I'm wondering if this is an RFC or a 'real'
> proposal that you already want to land.
>
I'm sorry about this. I'm quite new to kernel development (this is my
first set of patches) and the point at which I should have
transitioned from RFC to PATCH was not clear to me. The status now
could be described as adding initial support for accounting that would
be built upon to expand what is tracked (more than just buffers from
heaps) and to add support for limiting. I see you have also commented
about this below.

> If this is still an RFC I'd remove the 'rfc: Proposal' and use the more canonical
> way, which is to put RFC in the []. I.e. [PATCH RFC v7] cgroup: Add a GPU cgroup
> controller.
>
> If it is not, I'd just remove the RFC and put the subject in the cgroup
> subsystem instead of the gpu. I.e. [PATCH v7] cgroup: Add a GPU cgroup
>
> I don't want to nitpick, but IMO that helps new people join the history of
> the patchset.
>
> > This patch adds a proposal for a new GPU cgroup controller for
> > accounting/limiting GPU and GPU-related memory allocations.
>
> As far as I can see the only thing that is added here is the accounting, so I'd
> remove any reference to limiting and just explain what the patch really
> introduces, not the future; otherwise it is confusing and you expect more than
> the patch really does.
>
> It is important to maintain the commit message in sync with what the patch
> really does.
>
Acknowledged, thank you.

> > The proposed controller is based on the DRM cgroup controller[1] and
> > follows the design of the RDMA cgroup controller.
> >
> > The new cgroup controller would:
> > * Allow setting per-device limits on the total size of buffers
> >   allocated by device within a cgroup.
> > * Expose a per-device/allocator breakdown of the buffers charged to a
> >   cgroup.
> >
> > The prototype in the following patches is only for memory accounting
> > using the GPU cgroup controller and does not implement limit setting.
> >
> > [1]: https://lore.kernel.org/amd-gfx/20210126214626.16260-1-brian.welty@intel.com/
> >
>
> I think this is material for the cover letter more than the commit message. When
> I read this I was expecting all of this in this patch.
>
> > Signed-off-by: Hridya Valsaraju <hridya@google.com>
> > Signed-off-by: T.J. Mercier <tjmercier@google.com>
> > ---
> > v7 changes
> > Remove comment about duplicate name rejection which is not relevant to
> > cgroups users per Michal Koutný.
> >
> > v6 changes
> > Move documentation into cgroup-v2.rst per Tejun Heo.
> >
> > v5 changes
> > Drop the global GPU cgroup "total" (sum of all device totals) portion
> > of the design since there is no currently known use for this per
> > Tejun Heo.
> >
> > Update for renamed functions/variables.
> >
> > v3 changes
> > Remove Upstreaming Plan from gpu-cgroup.rst per John Stultz.
> >
> > Use more common dual author commit message format per John Stultz.
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 69d7a6983f78..2e1d26e327c7 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -2352,6 +2352,29 @@ first, and stays charged to that cgroup until that resource is freed. Migrating
> >  a process to a different cgroup does not move the charge to the destination
> >  cgroup where the process has moved.
> >
> > +
> > +GPU
> > +---
> > +
> > +The GPU controller accounts for device and system memory allocated by the GPU
> > +and related subsystems for graphics use. Resource limits are not currently
> > +supported.
> > +
> > +GPU Interface Files
> > +~~~~~~~~~~~~~~~~~~~~
> > +
> > +  gpu.memory.current
> > +     A read-only file containing memory allocations in flat-keyed format. The key
> > +     is a string representing the device name. The value is the size of the memory
> > +     charged to the device in bytes. The device names are globally unique.::
> > +
> > +       $ cat /sys/kernel/fs/cgroup1/gpu.memory.current
>
> I think this is outdated, you are using cgroup v2, right?
>
Oh "cgroup1" was meant to refer to the name of a cgroup, not to cgroup
v1. A different name would be better here.

> > +       dev1 4194304
> > +       dev2 104857600
> > +
>
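(As an aside, the flat-keyed format in the sample above is easy to post-process
with standard tools. A minimal sketch; the device names and byte counts are the
documentation's sample values, not output from a real system:)

```shell
# Convert the flat-keyed gpu.memory.current sample above into MiB.
# "dev1 4194304" and "dev2 104857600" are the documentation's sample
# values, not output from a real system.
printf 'dev1 4194304\ndev2 104857600\n' |
    awk '{ printf "%s: %d MiB\n", $1, $2 / 1048576 }'
# -> dev1: 4 MiB
# -> dev2: 100 MiB
```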
> When I applied the full series I was expecting to see the memory allocated by the
> gpu devices or users of the gpu in this file but, after some experiments, what I
> saw is the memory allocated via any process that uses the dma-buf heap API (not
> necessarily gpu users). For example, if you create a small program that allocates
> some memory via the dma-buf heap API and then you cat the gpu.memory.current
> file, you see that the memory accounted is not related to the gpu.
>
> This is really confusing; it looks to me like the patches evolved to account memory
> that is not really related to the GPU but allocated via the dma-buf heap API. IMO
> the name of the file should reflect what it really does to avoid
> confusion.
>
> So, is this patchset meant to be GPU specific? If the answer is yes that's good
> but that's not what I experienced. Am I missing something?
>
There are two reasons this exists as a GPU controller. The first is
that most graphics buffers in Android come from these heaps, and this
is primarily what we are interested in accounting. However the idea is
to account other graphics memory types more commonly used on desktop
under different resource names with this controller. The second reason
predates my involvement, but my understanding is that Hridya tried to
upstream heap tracking via tracepoints but was asked to try to use GPU
cgroups instead, which led to her initial version of this series. So
this is a starting point. Any commentary on why this controller would
or would not work for any use cases you have in mind (provided the
appropriate charging/uncharging code is plugged in) would be
appreciated!

By the way, discussion around earlier proposals on this topic
suggested the "G" should be for "general" instead of "graphics", I
think in recognition of the breadth of resources that would eventually
be tracked by it.
https://lore.kernel.org/amd-gfx/YBp4ap+1l2KWbqEJ@phenom.ffwll.local/



> If the answer is that it evolved to track dma-buf heap allocations, I think all
> the patches need some rework to adapt the wording, as right now the gpu wording
> seems confusing to me.
>
> > +     The device name string is set by a device driver when it registers with the
> > +     GPU cgroup controller to participate in resource accounting.
> > +
> >  Others
> >  ------
> >
> >
> Thanks,
>  Enric
>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 6/6] selftests: Add binder cgroup gpu memory transfer tests
  2022-05-10 23:56 ` [PATCH v7 6/6] selftests: Add binder cgroup gpu memory transfer tests T.J. Mercier
@ 2022-05-21 10:15   ` Muhammad Usama Anjum
  0 siblings, 0 replies; 67+ messages in thread
From: Muhammad Usama Anjum @ 2022-05-21 10:15 UTC (permalink / raw)
  To: T.J. Mercier, Shuah Khan
  Cc: usama.anjum, daniel, tj, hridya, christian.koenig, jstultz,
	tkjos, cmllamas, surenb, kaleshsingh, Kenny.Ho, mkoutny, skhan,
	kernel-team, linux-kernel, linux-kselftest

On 5/11/22 4:56 AM, T.J. Mercier wrote:
>  .../selftests/drivers/android/binder/Makefile |   8 +
>  .../drivers/android/binder/binder_util.c      | 250 +++++++++
>  .../drivers/android/binder/binder_util.h      |  32 ++
>  .../selftests/drivers/android/binder/config   |   4 +
>  .../binder/test_dmabuf_cgroup_transfer.c      | 526 ++++++++++++++++++
>  5 files changed, 820 insertions(+)
>  create mode 100644 tools/testing/selftests/drivers/android/binder/Makefile
>  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.c
>  create mode 100644 tools/testing/selftests/drivers/android/binder/binder_util.h
>  create mode 100644 tools/testing/selftests/drivers/android/binder/config
>  create mode 100644 tools/testing/selftests/drivers/android/binder/test_dmabuf_cgroup_transfer.c
> 
> diff --git a/tools/testing/selftests/drivers/android/binder/Makefile b/tools/testing/selftests/drivers/android/binder/Makefile
> new file mode 100644
> index 000000000000..726439d10675
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/android/binder/Makefile
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: GPL-2.0
> +CFLAGS += -Wall
Please add $(KHDR_INCLUDES) here to include the uapi header files from
the source tree.

> +
> +TEST_GEN_PROGS = test_dmabuf_cgroup_transfer
Please create a .gitignore file and add test_dmabuf_cgroup_transfer to it.

> +
> +include ../../../lib.mk
> +
> +$(OUTPUT)/test_dmabuf_cgroup_transfer: ../../../cgroup/cgroup_util.c binder_util.c
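(With both suggestions applied, the Makefile would look roughly like the sketch
below; test_dmabuf_cgroup_transfer would additionally be listed in a new
.gitignore file next to it:)

```make
# SPDX-License-Identifier: GPL-2.0
# KHDR_INCLUDES makes the uapi headers from the source tree available.
CFLAGS += -Wall $(KHDR_INCLUDES)

TEST_GEN_PROGS = test_dmabuf_cgroup_transfer

include ../../../lib.mk

$(OUTPUT)/test_dmabuf_cgroup_transfer: ../../../cgroup/cgroup_util.c binder_util.c
```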

-- 
Muhammad Usama Anjum

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-05-20 16:25                 ` T.J. Mercier
  (?)
@ 2022-06-15 17:31                   ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-06-15 17:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kselftest, linux-doc, Carlos Llamas, dri-devel,
	John Stultz, Zefan Li, Kalesh Singh, Joel Fernandes, Shuah Khan,
	Sumit Semwal, Kenny.Ho, Benjamin Gaignard, Jonathan Corbet,
	Martijn Coenen, Nicolas Dufresne, Laura Abbott, kernel-team,
	linux-media, Todd Kjos, linaro-mm-sig, Shuah Khan, cgroups,
	Suren Baghdasaryan, Christian Brauner, Greg Kroah-Hartman,
	linux-kernel, Liam Mark, Christian König,
	Arve Hjønnevåg, Michal Koutný,
	Johannes Weiner, Hridya Valsaraju

On Fri, May 20, 2022 at 9:25 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Fri, May 20, 2022 at 12:47 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Tue, May 17, 2022 at 04:30:29PM -0700, T.J. Mercier wrote:
> > > Thanks for your suggestion. This almost works. "dmabuf" as a key could
> > > work, but I'd actually like to account for each heap. Since heaps can
> > > be dynamically added, I can't accommodate every potential heap name by
> > > hardcoding registrations in the misc controller.
> >
> > On its own, that's a pretty weak reason to be adding a separate gpu
> > controller especially given that it doesn't really seem to be one with
> > proper abstractions for gpu resources. We don't want to keep adding random
> > keys to misc controller but can definitely add limited flexibility. What
> > kind of keys do you need?
> >
> Well the dmabuf-from-heaps component of this is the initial use case.
> I was envisioning we'd have additional keys as discussed here:
> https://lore.kernel.org/lkml/20220328035951.1817417-1-tjmercier@google.com/T/#m82e5fe9d8674bb60160701e52dae4356fea2ddfa
> So we'd end up with a well-defined core set of keys like "system", and
> then drivers would be free to use their own keys for their own unique
> purposes which could be complementary or orthogonal to the core set.
> Yesterday I was talking with someone who is interested in limiting gpu
> cores and bus IDs in addition to gpu memory. How to define core keys
> is the part where it looks like there's trouble.
>
> For my use case it would be sufficient to have current and maximum
> values for an arbitrary number of keys - one per heap. So the only
> part missing from the misc controller (for my use case) is the ability
> to register a new key at runtime as heaps are added. Instead of
> keeping track of resources with enum misc_res_type, requesting a
> resource handle/ID from the misc controller at runtime is what I think
> would be required instead.
>
Quick update: I'm going to make an attempt to modify the misc
controller to support a limited amount of dynamic resource
registration/tracking in place of the new controller in this series.

Thanks everyone for the feedback.
-T.J.

> > Thanks.
> >
> > --
> > tejun

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-06-15 17:31                   ` T.J. Mercier
  (?)
@ 2022-06-24 20:17                     ` Daniel Vetter
  -1 siblings, 0 replies; 67+ messages in thread
From: Daniel Vetter @ 2022-06-24 20:17 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Tejun Heo, Nicolas Dufresne, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Greg Kroah-Hartman, Arve Hjønnevåg,
	Todd Kjos, Martijn Coenen, Joel Fernandes, Christian Brauner,
	Hridya Valsaraju, Suren Baghdasaryan, Sumit Semwal,
	Christian König, Benjamin Gaignard, Liam Mark, Laura Abbott,
	Brian Starkey, John Stultz, Shuah Khan, Daniel Vetter,
	John Stultz, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest

On Wed, Jun 15, 2022 at 10:31:21AM -0700, T.J. Mercier wrote:
> On Fri, May 20, 2022 at 9:25 AM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > On Fri, May 20, 2022 at 12:47 AM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > On Tue, May 17, 2022 at 04:30:29PM -0700, T.J. Mercier wrote:
> > > > Thanks for your suggestion. This almost works. "dmabuf" as a key could
> > > > work, but I'd actually like to account for each heap. Since heaps can
> > > > be dynamically added, I can't accommodate every potential heap name by
> > > > hardcoding registrations in the misc controller.
> > >
> > > On its own, that's a pretty weak reason to be adding a separate gpu
> > > controller especially given that it doesn't really seem to be one with
> > > proper abstractions for gpu resources. We don't want to keep adding random
> > > keys to misc controller but can definitely add limited flexibility. What
> > > kind of keys do you need?
> > >
> > Well the dmabuf-from-heaps component of this is the initial use case.
> > I was envisioning we'd have additional keys as discussed here:
> > https://lore.kernel.org/lkml/20220328035951.1817417-1-tjmercier@google.com/T/#m82e5fe9d8674bb60160701e52dae4356fea2ddfa
> > So we'd end up with a well-defined core set of keys like "system", and
> > then drivers would be free to use their own keys for their own unique
> > purposes which could be complementary or orthogonal to the core set.
> > Yesterday I was talking with someone who is interested in limiting gpu
> > cores and bus IDs in addition to gpu memory. How to define core keys
> > is the part where it looks like there's trouble.
> >
> > For my use case it would be sufficient to have current and maximum
> > values for an arbitrary number of keys - one per heap. So the only
> > part missing from the misc controller (for my use case) is the ability
> > to register a new key at runtime as heaps are added. Instead of
> > keeping track of resources with enum misc_res_type, requesting a
> > resource handle/ID from the misc controller at runtime is what I think
> > would be required instead.
> >
> Quick update: I'm going to make an attempt to modify the misc
> controller to support a limited amount of dynamic resource
> registration/tracking in place of the new controller in this series.
> 
> Thanks everyone for the feedback.

Somehow I missed this entire chain here.

I'm not a fan, because I'm kinda hoping we could finally unify gpu memory
accounting. Atm everyone just adds their one-off solution in a random corner:
- total tracking in misc cgroup controller
- dma-buf sysfs files (except apparently too slow so it'll get deleted
  again)
- random other stuff on open device files so OOM killer can see it

This doesn't look good.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-06-24 20:17                     ` Daniel Vetter
@ 2022-06-24 20:32                       ` John Stultz
  -1 siblings, 0 replies; 67+ messages in thread
From: John Stultz @ 2022-06-24 20:32 UTC (permalink / raw)
  To: T.J. Mercier, Tejun Heo, Nicolas Dufresne, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen,
	Joel Fernandes, Christian Brauner, Hridya Valsaraju,
	Suren Baghdasaryan, Sumit Semwal, Christian König,
	Benjamin Gaignard, Liam Mark, Laura Abbott, Brian Starkey,
	John Stultz, Shuah Khan, John Stultz, Carlos Llamas,
	Kalesh Singh, Kenny.Ho, Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest
  Cc: Daniel Vetter

On Fri, Jun 24, 2022 at 1:17 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Jun 15, 2022 at 10:31:21AM -0700, T.J. Mercier wrote:
> > On Fri, May 20, 2022 at 9:25 AM T.J. Mercier <tjmercier@google.com> wrote:
> > >
> > > On Fri, May 20, 2022 at 12:47 AM Tejun Heo <tj@kernel.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > On Tue, May 17, 2022 at 04:30:29PM -0700, T.J. Mercier wrote:
> > > > > Thanks for your suggestion. This almost works. "dmabuf" as a key could
> > > > > work, but I'd actually like to account for each heap. Since heaps can
> > > > > be dynamically added, I can't accommodate every potential heap name by
> > > > > hardcoding registrations in the misc controller.
> > > >
> > > > On its own, that's a pretty weak reason to be adding a separate gpu
> > > > controller especially given that it doesn't really seem to be one with
> > > > proper abstractions for gpu resources. We don't want to keep adding random
> > > > keys to misc controller but can definitely add limited flexibility. What
> > > > kind of keys do you need?
> > > >
> > > Well the dmabuf-from-heaps component of this is the initial use case.
> > > I was envisioning we'd have additional keys as discussed here:
> > > https://lore.kernel.org/lkml/20220328035951.1817417-1-tjmercier@google.com/T/#m82e5fe9d8674bb60160701e52dae4356fea2ddfa
> > > So we'd end up with a well-defined core set of keys like "system", and
> > > then drivers would be free to use their own keys for their own unique
> > > purposes which could be complementary or orthogonal to the core set.
> > > Yesterday I was talking with someone who is interested in limiting gpu
> > > cores and bus IDs in addition to gpu memory. How to define core keys
> > > is the part where it looks like there's trouble.
> > >
> > > For my use case it would be sufficient to have current and maximum
> > > values for an arbitrary number of keys - one per heap. So the only
> > > part missing from the misc controller (for my use case) is the ability
> > > to register a new key at runtime as heaps are added. Instead of
> > > keeping track of resources with enum misc_res_type, requesting a
> > > resource handle/ID from the misc controller at runtime is what I think
> > > would be required instead.
> > >
> > Quick update: I'm going to make an attempt to modify the misc
> > controller to support a limited amount of dynamic resource
> > registration/tracking in place of the new controller in this series.
> >
> > Thanks everyone for the feedback.
>
> Somehow I missed this entire chain here.
>
> I'm not a fan, because I'm kinda hoping we could finally unify gpu memory
> accounting. Atm everyone just adds their one-off solution in a random corner:
> - total tracking in misc cgroup controller
> - dma-buf sysfs files (except apparently too slow so it'll get deleted
>   again)
> - random other stuff on open device files so OOM killer can see it
>
> This doesn't look good.

But I also think one could see the drm subsystem's "gpu memory"
accounting as doing the same thing (in that it's artificially narrow
to gpus). It seems we need something to account for buffers allocated
by drivers, no matter which subsystem they live in (drm, v4l2,
networking, or whatever).
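[Editor's sketch] The subsystem-agnostic accounting described here is roughly what the series' charge interface aims at (per the v7 changelog, gpucg_register_bucket() returns an internally allocated bucket to charge against). A userspace mock of that shape follows; apart from the bucket concept itself, every name and signature below is an assumption for illustration, not the kernel interface:

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative mock of a subsystem-agnostic "bucket" charge API in the
 * spirit of this series' gpucg_register_bucket(). Any subsystem (drm,
 * v4l2, networking, ...) registers a bucket once, then attributes its
 * buffer allocations to it. Identifiers are hypothetical. */

struct bucket {
	char name[32];  /* e.g. "system-heap", "v4l2", "net-dmabuf" */
	long charged;   /* bytes currently attributed to this bucket */
};

/* Register a named bucket; the caller charges buffers against it later. */
static struct bucket *bucket_register(const char *name)
{
	struct bucket *b = calloc(1, sizeof(*b));
	if (!b)
		return NULL;
	snprintf(b->name, sizeof(b->name), "%s", name);
	return b;
}

/* Charge an allocation (e.g. a dma-buf's size) to the bucket. */
static void bucket_charge(struct bucket *b, long bytes)
{
	b->charged += bytes;
}

/* Uncharge on free, or on transfer of the charge to another owner. */
static void bucket_uncharge(struct bucket *b, long bytes)
{
	b->charged -= bytes;
}
```

Because the bucket is just "a named pool of charged bytes", nothing about it is gpu-specific, which is the cross-subsystem point being made above.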

thanks
-john

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-06-24 20:32                       ` John Stultz
  (?)
@ 2022-06-24 20:36                         ` Daniel Vetter
  -1 siblings, 0 replies; 67+ messages in thread
From: Daniel Vetter @ 2022-06-24 20:36 UTC (permalink / raw)
  To: John Stultz
  Cc: T.J. Mercier, Tejun Heo, Nicolas Dufresne, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen,
	Joel Fernandes, Christian Brauner, Hridya Valsaraju,
	Suren Baghdasaryan, Sumit Semwal, Christian König,
	Benjamin Gaignard, Liam Mark, Laura Abbott, Brian Starkey,
	John Stultz, Shuah Khan, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest,
	Daniel Vetter

On Fri, Jun 24, 2022 at 01:32:45PM -0700, John Stultz wrote:
> On Fri, Jun 24, 2022 at 1:17 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Wed, Jun 15, 2022 at 10:31:21AM -0700, T.J. Mercier wrote:
> > > On Fri, May 20, 2022 at 9:25 AM T.J. Mercier <tjmercier@google.com> wrote:
> > > >
> > > > On Fri, May 20, 2022 at 12:47 AM Tejun Heo <tj@kernel.org> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Tue, May 17, 2022 at 04:30:29PM -0700, T.J. Mercier wrote:
> > > > > > Thanks for your suggestion. This almost works. "dmabuf" as a key could
> > > > > > work, but I'd actually like to account for each heap. Since heaps can
> > > > > > be dynamically added, I can't accommodate every potential heap name by
> > > > > > hardcoding registrations in the misc controller.
> > > > >
> > > > > On its own, that's a pretty weak reason to be adding a separate gpu
> > > > > controller especially given that it doesn't really seem to be one with
> > > > > proper abstractions for gpu resources. We don't want to keep adding random
> > > > > keys to misc controller but can definitely add limited flexibility. What
> > > > > kind of keys do you need?
> > > > >
> > > > Well the dmabuf-from-heaps component of this is the initial use case.
> > > > I was envisioning we'd have additional keys as discussed here:
> > > > https://lore.kernel.org/lkml/20220328035951.1817417-1-tjmercier@google.com/T/#m82e5fe9d8674bb60160701e52dae4356fea2ddfa
> > > > So we'd end up with a well-defined core set of keys like "system", and
> > > > then drivers would be free to use their own keys for their own unique
> > > > purposes which could be complementary or orthogonal to the core set.
> > > > Yesterday I was talking with someone who is interested in limiting gpu
> > > > cores and bus IDs in addition to gpu memory. How to define core keys
> > > > is the part where it looks like there's trouble.
> > > >
> > > > For my use case it would be sufficient to have current and maximum
> > > > values for an arbitrary number of keys - one per heap. So the only
> > > > part missing from the misc controller (for my use case) is the ability
> > > > to register a new key at runtime as heaps are added. Instead of
> > > > keeping track of resources with enum misc_res_type, requesting a
> > > > resource handle/ID from the misc controller at runtime is what I think
> > > > would be required instead.
> > > >
> > > Quick update: I'm going to make an attempt to modify the misc
> > > controller to support a limited amount of dynamic resource
> > > registration/tracking in place of the new controller in this series.
> > >
> > > Thanks everyone for the feedback.
> >
> > Somehow I missed this entire chain here.
> >
> > I'm not a fan, because I'm kinda hoping we could finally unify gpu memory
> > accounting. Atm everyone just adds their one-off solution in a random corner:
> > - total tracking in misc cgroup controller
> > - dma-buf sysfs files (except apparently too slow so it'll get deleted
> >   again)
> > - random other stuff on open device files so OOM killer can see it
> >
> > This doesn't look good.
> 
> But I also think one could see the drm subsystem's "gpu memory"
> accounting as doing the same thing (in that it's artificially narrow
> to gpus). It seems we need something to account for buffers allocated
> by drivers, no matter which subsystem they live in (drm, v4l2,
> networking, or whatever).

This is what the gpucg was. It wasn't called the dmabuf cg because we want
to also account for memory of other types (e.g. drm gem buffer objects which
aren't exported), and I guess people didn't dare call it an xpu.

But this was absolutely for a lot more than just "gpu drivers in drm".
Better names welcome.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
@ 2022-06-24 20:36                         ` Daniel Vetter
  0 siblings, 0 replies; 67+ messages in thread
From: Daniel Vetter @ 2022-06-24 20:36 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kselftest, linux-doc, Carlos Llamas, dri-devel, Zefan Li,
	Kalesh Singh, Joel Fernandes, Shuah Khan, Sumit Semwal, Kenny.Ho,
	Benjamin Gaignard, Jonathan Corbet, Martijn Coenen,
	Nicolas Dufresne, Laura Abbott, kernel-team, linux-media,
	Todd Kjos, linaro-mm-sig, Hridya Valsaraju, Shuah Khan, cgroups,
	Suren Baghdasaryan, T.J. Mercier, Christian Brauner,
	Greg Kroah-Hartman, linux-kernel, Liam Mark,
	Christian König, Arve Hjønnevåg,
	Michal Koutný,
	Johannes Weiner, Tejun Heo

On Fri, Jun 24, 2022 at 01:32:45PM -0700, John Stultz wrote:
> On Fri, Jun 24, 2022 at 1:17 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Wed, Jun 15, 2022 at 10:31:21AM -0700, T.J. Mercier wrote:
> > > On Fri, May 20, 2022 at 9:25 AM T.J. Mercier <tjmercier@google.com> wrote:
> > > >
> > > > On Fri, May 20, 2022 at 12:47 AM Tejun Heo <tj@kernel.org> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Tue, May 17, 2022 at 04:30:29PM -0700, T.J. Mercier wrote:
> > > > > > Thanks for your suggestion. This almost works. "dmabuf" as a key could
> > > > > > work, but I'd actually like to account for each heap. Since heaps can
> > > > > > be dynamically added, I can't accommodate every potential heap name by
> > > > > > hardcoding registrations in the misc controller.
> > > > >
> > > > > On its own, that's a pretty weak reason to be adding a separate gpu
> > > > > controller especially given that it doesn't really seem to be one with
> > > > > proper abstractions for gpu resources. We don't want to keep adding random
> > > > > keys to misc controller but can definitely add limited flexibility. What
> > > > > kind of keys do you need?
> > > > >
> > > > Well the dmabuf-from-heaps component of this is the initial use case.
> > > > I was envisioning we'd have additional keys as discussed here:
> > > > https://lore.kernel.org/lkml/20220328035951.1817417-1-tjmercier@google.com/T/#m82e5fe9d8674bb60160701e52dae4356fea2ddfa
> > > > So we'd end up with a well-defined core set of keys like "system", and
> > > > then drivers would be free to use their own keys for their own unique
> > > > purposes which could be complementary or orthogonal to the core set.
> > > > Yesterday I was talking with someone who is interested in limiting gpu
> > > > cores and bus IDs in addition to gpu memory. How to define core keys
> > > > is the part where it looks like there's trouble.
> > > >
> > > > For my use case it would be sufficient to have current and maximum
> > > > values for an arbitrary number of keys - one per heap. So the only
> > > > part missing from the misc controller (for my use case) is the ability
> > > > to register a new key at runtime as heaps are added. Instead of
> > > > keeping track of resources with enum misc_res_type, requesting a
> > > > resource handle/ID from the misc controller at runtime is what I think
> > > > would be required instead.
> > > >
> > > Quick update: I'm going to make an attempt to modify the misc
> > > controller to support a limited amount of dynamic resource
> > > registration/tracking in place of the new controller in this series.
> > >
> > > Thanks everyone for the feedback.
> >
> > Somehow I missed this entire chain here.
> >
> > I'm not a fan, because I'm kinda hoping we could finally unify gpu memory
> > account. Atm everyone just adds their one-off solution in a random corner:
> > - total tracking in misc cgroup controller
> > - dma-buf sysfs files (except apparently too slow so it'll get deleted
> >   again)
> > - random other stuff on open device files os OOM killer can see it
> >
> > This doesn't look good.
> 
> But I also think one could see it as "gpu memory" is the drm subsystem
> doing the same thing (in that it's artificially narrow to gpus). It
> seems we need something to account for buffers allocated by drivers,
> no matter which subsystem it was in (drm, v4l2, or networking or
> whatever).

This is what the gpucg was. It wasn't called the dmabuf cg because we want
to account also memory of other types (e.g. drm gem buffer objects which
aren't exported), and I guess people didn't dare call it an xpu.

But this was absolutely for a lot more than just "gpu drivers in drm".
Better names welcome.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v7 0/6] Proposal for a GPU cgroup controller
  2022-06-24 20:36                         ` Daniel Vetter
@ 2022-06-24 21:17                           ` T.J. Mercier
  -1 siblings, 0 replies; 67+ messages in thread
From: T.J. Mercier @ 2022-06-24 21:17 UTC (permalink / raw)
  To: John Stultz, T.J. Mercier, Tejun Heo, Nicolas Dufresne, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Greg Kroah-Hartman,
	Arve Hjønnevåg, Todd Kjos, Martijn Coenen,
	Joel Fernandes, Christian Brauner, Hridya Valsaraju,
	Suren Baghdasaryan, Sumit Semwal, Christian König,
	Benjamin Gaignard, Liam Mark, Laura Abbott, Brian Starkey,
	John Stultz, Shuah Khan, Carlos Llamas, Kalesh Singh, Kenny.Ho,
	Michal Koutný,
	Shuah Khan, kernel-team, cgroups, linux-doc, linux-kernel,
	linux-media, dri-devel, linaro-mm-sig, linux-kselftest
  Cc: Daniel Vetter

On Fri, Jun 24, 2022 at 1:36 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Fri, Jun 24, 2022 at 01:32:45PM -0700, John Stultz wrote:
> > On Fri, Jun 24, 2022 at 1:17 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Wed, Jun 15, 2022 at 10:31:21AM -0700, T.J. Mercier wrote:
> > > > On Fri, May 20, 2022 at 9:25 AM T.J. Mercier <tjmercier@google.com> wrote:
> > > > >
> > > > > On Fri, May 20, 2022 at 12:47 AM Tejun Heo <tj@kernel.org> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > On Tue, May 17, 2022 at 04:30:29PM -0700, T.J. Mercier wrote:
> > > > > > > Thanks for your suggestion. This almost works. "dmabuf" as a key could
> > > > > > > work, but I'd actually like to account for each heap. Since heaps can
> > > > > > > be dynamically added, I can't accommodate every potential heap name by
> > > > > > > hardcoding registrations in the misc controller.
> > > > > >
> > > > > > On its own, that's a pretty weak reason to be adding a separate gpu
> > > > > > controller especially given that it doesn't really seem to be one with
> > > > > > proper abstractions for gpu resources. We don't want to keep adding random
> > > > > > keys to misc controller but can definitely add limited flexibility. What
> > > > > > kind of keys do you need?
> > > > > >
> > > > > Well the dmabuf-from-heaps component of this is the initial use case.
> > > > > I was envisioning we'd have additional keys as discussed here:
> > > > > https://lore.kernel.org/lkml/20220328035951.1817417-1-tjmercier@google.com/T/#m82e5fe9d8674bb60160701e52dae4356fea2ddfa
> > > > > So we'd end up with a well-defined core set of keys like "system", and
> > > > > then drivers would be free to use their own keys for their own unique
> > > > > purposes which could be complementary or orthogonal to the core set.
> > > > > Yesterday I was talking with someone who is interested in limiting gpu
> > > > > cores and bus IDs in addition to gpu memory. How to define core keys
> > > > > is the part where it looks like there's trouble.
> > > > >
> > > > > For my use case it would be sufficient to have current and maximum
> > > > > values for an arbitrary number of keys - one per heap. So the only
> > > > > part missing from the misc controller (for my use case) is the ability
> > > > > to register a new key at runtime as heaps are added. Instead of
> > > > > keeping track of resources with enum misc_res_type, requesting a
> > > > > resource handle/ID from the misc controller at runtime is what I think
> > > > > would be required instead.
> > > > >
> > > > Quick update: I'm going to make an attempt to modify the misc
> > > > controller to support a limited amount of dynamic resource
> > > > registration/tracking in place of the new controller in this series.
> > > >
> > > > Thanks everyone for the feedback.
> > >
> > > Somehow I missed this entire chain here.
> > >
> > > I'm not a fan, because I'm kinda hoping we could finally unify gpu memory
> > > accounting. Atm everyone just adds their one-off solution in a random corner:
> > > - total tracking in misc cgroup controller
> > > - dma-buf sysfs files (except apparently too slow, so it'll get deleted
> > >   again)
> > > - random other stuff on open device files so the OOM killer can see it
> > >
> > > This doesn't look good.
> >
> > But I also think one could see it as "gpu memory" is the drm subsystem
> > doing the same thing (in that it's artificially narrow to gpus). It
> > seems we need something to account for buffers allocated by drivers,
> > no matter which subsystem it was in (drm, v4l2, or networking or
> > whatever).
>
> This is what the gpucg was. It wasn't called the dmabuf cg because we want
> to also account for memory of other types (e.g. drm gem buffer objects which
> aren't exported), and I guess people didn't dare call it an xpu.
>
> But this was absolutely for a lot more than just "gpu drivers in drm".
> Better names welcome.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

From an API perspective the two approaches (misc vs GPU) seem similar
to me. Someone comes up with a name of a resource they want to track,
and it's added as a key in a cgroup interface file as drivers register
and perform accounting on that resource. Considering just the naming,
what do you see as the appeal of a controller named GPU/XPU vs one
named Misc? Folks seem to have assumptions about the type of resources
a "GPU" controller should be tracking, and potentially also how
different resources are grouped under a single resource name. So is
your thought that non-graphics related accounting of the same sort
should be using a differently named controller, even if that
controller could have the same implementation?
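Concretely, either controller would presumably expose the same flat-keyed
files the misc controller already uses (misc.current / misc.max), one line
per registered key; the dmabuf heap key names below are purely illustrative:

```
$ cat /sys/fs/cgroup/<group>/misc.current
dmabuf-system 4194304
dmabuf-system-uncached 1048576
$ echo "dmabuf-system 8388608" > /sys/fs/cgroup/<group>/misc.max
```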

My thought is that the resource names should be as specific as
possible to allow fine-grained accounting, and leave any grouping of
resources to userspace. We can do that under any controller. If you'd
like to see a separate controller for graphics related stuff... well
that's what I was aiming for with the GPU cgroup controller. It's just
that dmabufs from heaps are the first use-case wired up.

I haven't put much time into the misc controller effort yet, and I'd
still be happy to see the GPU controller accepted if we can agree
about how it'd be used going forward. Daniel, I think you're in a
great position to comment about this. :) If there's a place where the
implementation is missing the mark, then let's change it. Are the
controller and resource naming the only issues?
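The semantics under discussion can be sketched as a small userspace model —
this is only an illustration (the KeyedAccounting class and its method names
are invented for this sketch, not kernel API): keys are registered at
runtime rather than via a fixed enum misc_res_type, and a charge against a
key succeeds only while its current value stays under that key's max:

```python
# Illustrative userspace model of runtime-keyed accounting; the class
# and method names are invented for this sketch, not kernel API.
class KeyedAccounting:
    def __init__(self):
        self.current = {}   # per-key charged amount (like misc.current)
        self.max = {}       # per-key limit (like misc.max)

    def register(self, key, limit):
        # Runtime registration, in place of a fixed enum misc_res_type.
        self.current.setdefault(key, 0)
        self.max[key] = limit

    def charge(self, key, amount):
        if key not in self.max:
            raise KeyError(f"unregistered resource key: {key}")
        if self.current[key] + amount > self.max[key]:
            return False    # would exceed the key's max: reject
        self.current[key] += amount
        return True

    def uncharge(self, key, amount):
        self.current[key] = max(0, self.current[key] - amount)


acct = KeyedAccounting()
acct.register("dmabuf-system", 8 << 20)       # hypothetical heap key, 8 MiB cap
print(acct.charge("dmabuf-system", 4 << 20))  # True: fits under max
print(acct.charge("dmabuf-system", 5 << 20))  # False: would exceed 8 MiB
```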

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, other threads:[~2022-06-26 17:48 UTC | newest]

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-10 23:56 [PATCH v7 0/6] Proposal for a GPU cgroup controller T.J. Mercier
2022-05-10 23:56 ` [PATCH v7 1/6] gpu: rfc: " T.J. Mercier
2022-05-10 23:56 ` [PATCH v7 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory T.J. Mercier
2022-05-10 23:56 ` [PATCH v7 3/6] dmabuf: heaps: export system_heap buffers with GPU cgroup charging T.J. Mercier
2022-05-10 23:56 ` [PATCH v7 4/6] dmabuf: Add gpu cgroup charge transfer function T.J. Mercier
2022-05-10 23:56 ` [PATCH v7 5/6] binder: Add flags to relinquish ownership of fds T.J. Mercier
2022-05-10 23:56 ` [PATCH v7 6/6] selftests: Add binder cgroup gpu memory transfer tests T.J. Mercier
2022-05-21 10:15   ` Muhammad Usama Anjum
2022-05-11 13:21 ` [PATCH v7 0/6] Proposal for a GPU cgroup controller Nicolas Dufresne
2022-05-11 20:31   ` T.J. Mercier
2022-05-12 13:09     ` Nicolas Dufresne
2022-05-13  3:43       ` T.J. Mercier
2022-05-13 16:13         ` Tejun Heo
2022-05-17 23:30           ` T.J. Mercier
2022-05-20  7:47             ` Tejun Heo
2022-05-20 16:25               ` T.J. Mercier
2022-06-15 17:31                 ` T.J. Mercier
2022-06-24 20:17                   ` Daniel Vetter
2022-06-24 20:32                     ` John Stultz
2022-06-24 20:36                       ` Daniel Vetter
2022-06-24 21:17                         ` T.J. Mercier
2022-05-19  9:30 ` [PATCH v7 1/6] gpu: rfc: " eballetbo
2022-05-21  2:19   ` T.J. Mercier
2022-05-19 10:52 ` [PATCH v7 2/6] cgroup: gpu: Add a cgroup controller for allocator attribution of GPU memory eballetbo
2022-05-20 16:33   ` T.J. Mercier
