From: Alistair Popple <apopple@nvidia.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, jgg@nvidia.com, jhubbard@nvidia.com, tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com, mkoutny@suse.com, daniel@ffwll.ch
Subject: Re: [RFC PATCH 00/19] mm: Introduce a cgroup to limit the amount of locked and pinned memory
Date: Tue, 31 Jan 2023 11:54:20 +1100
Message-ID: <874js7zf38.fsf@nvidia.com>
In-Reply-To: <CAJD7tkavoSu9WOnw4Nbxz41nq+Rm6Sq5EeOjh3CTyA=AT5=ujg@mail.gmail.com>

Yosry Ahmed <yosryahmed@google.com> writes:

> On Mon, Jan 23, 2023 at 9:43 PM Alistair Popple <apopple@nvidia.com> wrote:
>>
>> Having large amounts of unmovable or unreclaimable memory in a system
>> can lead to system instability due to increasing the likelihood of
>> encountering out-of-memory conditions. Therefore it is desirable to
>> limit the amount of memory users can lock or pin.
>>
>> From userspace such limits can be enforced by setting
>> RLIMIT_MEMLOCK. However there is no standard method that drivers and
>> other in-kernel users can use to check and enforce this limit.
>>
>> This has led to a large number of inconsistencies in how limits are
>> enforced. For example some drivers will use mm->locked_vm while others
>> will use mm->pinned_vm or user->locked_vm. It is therefore possible to
>> have up to three times RLIMIT_MEMLOCK pinned.
>>
>> Having pinned memory limited per-task also makes it easy for users to
>> exceed the limit. For example, memory that drivers pin with
>> pin_user_pages() tends to remain pinned after fork. To deal with
>> this and other issues this series introduces a cgroup for tracking and
>> limiting the number of pages pinned or locked by tasks in the group.
>>
>> However the existing behaviour with regard to the rlimit needs to be
>> maintained. Therefore the lesser of the two limits is
>> enforced. Furthermore having CAP_IPC_LOCK usually bypasses the rlimit,
>> but this bypass is not allowed for the cgroup.
>>
>> The first part of this series converts existing drivers which
>> open-code the use of locked_vm/pinned_vm over to a common interface
>> which manages the refcounts of the associated task/mm/user
>> structs. This ensures accounting of pages is consistent and makes it
>> easier to add charging of the cgroup.
>>
>> The second part of the series adds the cgroup and converts core mm
>> code such as mlock over to charging the cgroup before finally
>> introducing some selftests.
>
>
> I didn't go through the entire series, so apologies if this was
> mentioned somewhere, but do you mind elaborating on why this is added
> as a separate cgroup controller rather than an extension of the memory
> cgroup controller?

One of my early prototypes actually did add this to the memcg
controller. However pinned pages fall under their own limit, and we
wanted to always account pages to the cgroup of the task using the
driver rather than, say, folio_memcg(). So adding it to memcg didn't seem
to have much benefit as we didn't end up using any of the infrastructure
provided by memcg. Hence I thought it was clearer to just add it as its
own controller.

 - Alistair

>>
>>
>> As I don't have access to systems with all the various devices I
>> haven't been able to test all driver changes. Any help there would be
>> appreciated.
>>
>> Alistair Popple (19):
>>   mm: Introduce vm_account
>>   drivers/vhost: Convert to use vm_account
>>   drivers/vdpa: Convert vdpa to use the new vm_structure
>>   infiniband/umem: Convert to use vm_account
>>   RMDA/siw: Convert to use vm_account
>>   RDMA/usnic: convert to use vm_account
>>   vfio/type1: Charge pinned pages to pinned_vm instead of locked_vm
>>   vfio/spapr_tce: Convert accounting to pinned_vm
>>   io_uring: convert to use vm_account
>>   net: skb: Switch to using vm_account
>>   xdp: convert to use vm_account
>>   kvm/book3s_64_vio: Convert account_locked_vm() to vm_account_pinned()
>>   fpga: dfl: afu: convert to use vm_account
>>   mm: Introduce a cgroup for pinned memory
>>   mm/util: Extend vm_account to charge pages against the pin cgroup
>>   mm/util: Refactor account_locked_vm
>>   mm: Convert mmap and mlock to use account_locked_vm
>>   mm/mmap: Charge locked memory to pins cgroup
>>   selftests/vm: Add pins-cgroup selftest for mlock/mmap
>>
>>  MAINTAINERS                              |   8 +-
>>  arch/powerpc/kvm/book3s_64_vio.c         |  10 +-
>>  arch/powerpc/mm/book3s64/iommu_api.c     |  29 +--
>>  drivers/fpga/dfl-afu-dma-region.c        |  11 +-
>>  drivers/fpga/dfl-afu.h                   |   1 +-
>>  drivers/infiniband/core/umem.c           |  16 +-
>>  drivers/infiniband/core/umem_odp.c       |   6 +-
>>  drivers/infiniband/hw/usnic/usnic_uiom.c |  13 +-
>>  drivers/infiniband/hw/usnic/usnic_uiom.h |   1 +-
>>  drivers/infiniband/sw/siw/siw.h          |   2 +-
>>  drivers/infiniband/sw/siw/siw_mem.c      |  20 +--
>>  drivers/infiniband/sw/siw/siw_verbs.c    |  15 +-
>>  drivers/vdpa/vdpa_user/vduse_dev.c       |  20 +--
>>  drivers/vfio/vfio_iommu_spapr_tce.c      |  15 +-
>>  drivers/vfio/vfio_iommu_type1.c          |  59 +----
>>  drivers/vhost/vdpa.c                     |   9 +-
>>  drivers/vhost/vhost.c                    |   2 +-
>>  drivers/vhost/vhost.h                    |   1 +-
>>  include/linux/cgroup.h                   |  20 ++-
>>  include/linux/cgroup_subsys.h            |   4 +-
>>  include/linux/io_uring_types.h           |   3 +-
>>  include/linux/kvm_host.h                 |   1 +-
>>  include/linux/mm.h                       |   5 +-
>>  include/linux/mm_types.h                 |  88 ++++++++-
>>  include/linux/skbuff.h                   |   6 +-
>>  include/net/sock.h                       |   2 +-
>>  include/net/xdp_sock.h                   |   2 +-
>>  include/rdma/ib_umem.h                   |   1 +-
>>  io_uring/io_uring.c                      |  20 +--
>>  io_uring/notif.c                         |   4 +-
>>  io_uring/notif.h                         |  10 +-
>>  io_uring/rsrc.c                          |  38 +---
>>  io_uring/rsrc.h                          |   9 +-
>>  mm/Kconfig                               |  11 +-
>>  mm/Makefile                              |   1 +-
>>  mm/internal.h                            |   2 +-
>>  mm/mlock.c                               |  76 +------
>>  mm/mmap.c                                |  76 +++----
>>  mm/mremap.c                              |  54 +++--
>>  mm/pins_cgroup.c                         | 273 ++++++++++++++++++++++++-
>>  mm/secretmem.c                           |   6 +-
>>  mm/util.c                                | 196 +++++++++++++++--
>>  net/core/skbuff.c                        |  47 +---
>>  net/rds/message.c                        |   9 +-
>>  net/xdp/xdp_umem.c                       |  38 +--
>>  tools/testing/selftests/vm/Makefile      |   1 +-
>>  tools/testing/selftests/vm/pins-cgroup.c | 271 ++++++++++++++++++++++++-
>>  virt/kvm/kvm_main.c                      |   3 +-
>>  48 files changed, 1114 insertions(+), 401 deletions(-)
>>  create mode 100644 mm/pins_cgroup.c
>>  create mode 100644 tools/testing/selftests/vm/pins-cgroup.c
>>
>> base-commit: 2241ab53cbb5cdb08a6b2d4688feb13971058f65
>> --
>> git-series 0.9.1
>>
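
For readers following the limit semantics in the quoted cover letter (the lesser of the rlimit and the cgroup limit is enforced, and CAP_IPC_LOCK bypasses only the rlimit), here is a minimal user-space sketch of that rule. It is not code from the series: `struct pin_limits`, `pin_charge_allowed()` and all field names are hypothetical, and the model assumes the per-mm and per-cgroup charges are tracked as two simple page counters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; none of these names come from the patch series. */
struct pin_limits {
	uint64_t rlimit_pages;   /* RLIMIT_MEMLOCK expressed in pages */
	uint64_t cgroup_pages;   /* pins cgroup limit expressed in pages */
	bool has_cap_ipc_lock;   /* task holds CAP_IPC_LOCK */
};

/*
 * Return true if charging nr more pages is allowed.
 *
 * mm_charged: pages already accounted against the rlimit (assumed counter)
 * cg_charged: pages already accounted against the cgroup (assumed counter)
 */
static bool pin_charge_allowed(const struct pin_limits *l,
			       uint64_t mm_charged, uint64_t cg_charged,
			       uint64_t nr)
{
	uint64_t mm_total = mm_charged + nr;
	uint64_t cg_total = cg_charged + nr;

	/* Reject requests that would wrap either counter. */
	if (mm_total < mm_charged || cg_total < cg_charged)
		return false;

	/* The cgroup limit always applies, even with CAP_IPC_LOCK. */
	if (cg_total > l->cgroup_pages)
		return false;

	/* The rlimit applies unless the capability bypasses it. */
	if (!l->has_cap_ipc_lock && mm_total > l->rlimit_pages)
		return false;

	return true;
}
```

Checking both limits independently gives the "lesser of the two" behaviour for an unprivileged task, while a task holding CAP_IPC_LOCK in this model is bounded by the cgroup limit alone.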