From: Jason Gunthorpe <jgg@nvidia.com>
To: Alistair Popple <apopple@nvidia.com>
Cc: Tejun Heo <tj@kernel.org>, Michal Hocko <mhocko@suse.com>,
	Yosry Ahmed <yosryahmed@google.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, jhubbard@nvidia.com,
	tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com,
	mkoutny@suse.com, daniel@ffwll.ch,
	"Daniel P . Berrange" <berrange@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Zefan Li <lizefan.x@bytedance.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory
Date: Wed, 22 Feb 2023 21:53:56 -0400
Message-ID: <Y/bHNO7A8T3QQ5T+@nvidia.com>
In-Reply-To: <87k009nvnr.fsf@nvidia.com>

On Thu, Feb 23, 2023 at 09:59:35AM +1100, Alistair Popple wrote:
> 
> Jason Gunthorpe <jgg@nvidia.com> writes:
> 
> > On Wed, Feb 22, 2023 at 10:38:25PM +1100, Alistair Popple wrote:
> >> When a driver unpins a page we scan the pinners list and assign
> >> ownership to the next driver pinning the page by updating memcg_data and
> >> removing the vm_account from the list.
> >
> > I don't see how this works with just the data structure you outlined??
> > Every unique page needs its own list_head in the vm_account, it is
> > doable just incredibly costly.
> 
> The idea was every driver already needs to allocate a pages array to
> pass to pin_user_pages(), and by necessity drivers have to keep a
> reference to the contents of that in one form or another. So
> conceptually the equivalent of:
> 
> struct vm_account {
>        struct list_head possible_pinners;
>        struct mem_cgroup *memcg;
>        struct page **pages;
>        [...]
> };
> 
> Unpinning involves finding a new owner by traversing the list of
> page->memcg_data->possible_pinners and iterating over *pages[] to figure
> out if that vm_account actually has this page pinned or not and could
> own it.

Oh, you are focusing on Tejun's DOS scenario. 

The DOS problem is to prevent a pin user in cgroup A from keeping
memory charged to cgroup B that cgroup B isn't using any more.

cgroup B doesn't need to be pinning the memory, it could just be
normal VMAs and "isn't using anymore" means it has unmapped all the
VMAs.

Solving that problem means figuring out when every cgroup stops using
the memory - pinning or not. That seems to be very costly.

AFAIK this problem also already exists today as the memcg of a page
doesn't change while it is pinned. So maybe we don't need to address
it.

Arguably the pins are not the problem. If we want to treat the pin
like an allocation then we simply charge the non-owning memcgs for the
pin as though it was an allocation. E.g. go over every page and if the
owning memcg is not the current memcg then charge the current memcg
for an allocation of the MAP_SHARED memory. Undoing this is trivial
enough.

This doesn't fix the DOS problem but it does sort of harmonize the pin
accounting with the memcg by multi-accounting every pin of a
MAP_SHARED page.
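
To make the multi-accounting idea concrete, here is a rough userspace
sketch. It is not kernel code: struct toy_memcg, struct toy_page,
toy_pin_charge() and toy_pin_uncharge() are all made-up names for
illustration, and the limit check is an assumed stand-in for whatever
the real charge path would do. It also assumes the owning memcg of a
page does not change while the page is pinned, so the undo can simply
recount.

```c
#include <stddef.h>

/* Toy stand-ins for illustration only; these are not kernel types. */
struct toy_memcg {
	long charged_pages;	/* pages currently charged to this group */
	long limit_pages;	/* assumed hard limit, in pages */
};

struct toy_page {
	struct toy_memcg *owner;	/* memcg the page is already charged to */
};

/* Count the pages in the array that 'cur' does not already own. */
static long toy_foreign_pages(const struct toy_memcg *cur,
			      struct toy_page **pages, size_t n)
{
	long foreign = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (pages[i]->owner != cur)
			foreign++;
	return foreign;
}

/*
 * Charge 'cur' for every pinned page owned by some other memcg, as
 * though 'cur' had allocated it. Returns 0 on success, or -1 without
 * charging anything if the limit would be exceeded.
 */
static int toy_pin_charge(struct toy_memcg *cur,
			  struct toy_page **pages, size_t n)
{
	long extra = toy_foreign_pages(cur, pages, n);

	if (cur->charged_pages + extra > cur->limit_pages)
		return -1;
	cur->charged_pages += extra;
	return 0;
}

/*
 * Undo a successful toy_pin_charge(). This relies on the owner of each
 * page being unchanged since the charge, so the recount matches.
 */
static void toy_pin_uncharge(struct toy_memcg *cur,
			     struct toy_page **pages, size_t n)
{
	cur->charged_pages -= toy_foreign_pages(cur, pages, n);
}
```

Note the owning memcg's charge is never touched here, which is why this
does nothing for the DOS scenario: the page stays charged to its
original owner for as long as anyone keeps it pinned.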

The other drawback is that this isn't the same thing as the current
rlimit. The rlimit is largely restricting the creation of unmovable
memory.

Though, AFAICT memcg seems to bundle unmovable memory (eg GFP_KERNEL)
along with movable user pages so it would be self-consistent.

I'm unclear if this is OK for libvirt..

> Agree this is costly though. And I don't think all drivers keep the
> array around so "iterating over *pages[]" may need to be a callback.

I think searching lists of pages is not reasonable. Things like VFIO &
KVM use cases effectively pin 90% of all system memory, that is
potentially TB of page lists that might need linear searching!
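
For a rough sense of scale, a back-of-the-envelope sketch follows. The
4 KiB page size and 8-byte pointer size used in the note below are
assumptions, not figures from this thread, and every toy_* name is
hypothetical. The last helper is the kind of per-vm_account linear scan
the quoted proposal would need on unpin.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of base pages needed to cover 'pinned_bytes' of memory. */
static uint64_t toy_page_count(uint64_t pinned_bytes, uint64_t page_size)
{
	return pinned_bytes / page_size;
}

/* Size in bytes of a flat array holding one pointer per pinned page. */
static uint64_t toy_page_array_bytes(uint64_t pinned_bytes,
				     uint64_t page_size, uint64_t ptr_size)
{
	return toy_page_count(pinned_bytes, page_size) * ptr_size;
}

/*
 * Linear membership test over one account's page array: O(n) per
 * lookup, repeated for every candidate account on the pinners list.
 */
static int toy_account_has_page(const void *const *pages, size_t n,
				const void *page)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (pages[i] == page)
			return 1;
	return 0;
}
```

Under those assumptions, 1 TiB of pinned memory is 268435456 pages, and
a flat pointer array covering it is 2 GiB, all of which might have to
be walked to answer "does this account still pin this page?".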

Jason
