From: Tejun Heo <tj@kernel.org>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michal Hocko <mhocko@suse.com>,
	Yosry Ahmed <yosryahmed@google.com>,
	Alistair Popple <apopple@nvidia.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, jhubbard@nvidia.com,
	tjmercier@google.com, hannes@cmpxchg.org, surenb@google.com,
	mkoutny@suse.com, daniel@ffwll.ch,
	"Daniel P . Berrange" <berrange@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Zefan Li <lizefan.x@bytedance.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory
Date: Tue, 21 Feb 2023 09:45:15 -1000
Message-ID: <Y/UfS8TDIXhUlJ/I@slm.duckdns.org>
In-Reply-To: <Y/Ua6VcNe/DFh7X4@nvidia.com>

Hello,

On Tue, Feb 21, 2023 at 03:26:33PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 21, 2023 at 08:07:13AM -1000, Tejun Heo wrote:
> > > AFAIK there are few real use cases to establish a pin on MAP_SHARED
> > > mappings outside your cgroup. However, it is possible, the APIs allow
> > > it, and for security sandbox purposes we can't allow a process inside
> > > a cgroup to trigger a charge on a different cgroup. That breaks the
> > > sandbox goal.
> > 
> > It seems broken anyway. Please consider the following scenario:
> 
> Yes, this is broken like this already today - memcg doesn't work
> entirely perfectly for MAP_SHARED scenarios, IMHO.

It is far from perfect, but the existing behavior isn't that broken. E.g., in
the same scenario without pinning, even if the larger cgroup keeps using the
same pages, the smaller cgroup should be able to evict them, as they are not
pinned and the cgroup is under heavy reclaim pressure. The larger cgroup will
then refault them back in and end up owning those pages.

memcg can't capture the case of the same pages being actively shared by
multiple cgroups concurrently (I think those cases should be handled by
pushing them to the common parent, as discussed elsewhere, but that's a
separate topic), but it can converge when page usage transfers across cgroups
if needed. Disassociating ownership and pinning will break that in an
irreversible way.
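
The convergence argument above can be sketched as a toy model. This is not
kernel code: the names (Cgroup, Page, touch, reclaim) are hypothetical, and
charging is reduced to the assumption that the first cgroup to fault a page
in owns it. It shows an unpinned page moving to the cgroup that keeps using
it, and a page pinned on behalf of another cgroup staying charged to its
original owner.

```python
class Cgroup:
    def __init__(self, name):
        self.name = name
        self.charged = set()   # pages currently charged to this cgroup


class Page:
    def __init__(self):
        self.owner = None      # cgroup the page is charged to, None if not resident
        self.pinned = False


def touch(cg, page):
    """Fault the page in if it is not resident; the first toucher is charged."""
    if page.owner is None:
        page.owner = cg
        cg.charged.add(page)


def reclaim(cg):
    """Evict every unpinned page charged to cg; pinned pages stay put."""
    for page in list(cg.charged):
        if not page.pinned:
            cg.charged.discard(page)
            page.owner = None


small, large = Cgroup("small"), Cgroup("large")

# Unpinned: ownership converges on the cgroup that keeps using the page.
p = Page()
touch(small, p)      # small faults it in first and is charged
touch(large, p)      # large uses it too, the charge stays with small
reclaim(small)       # small is under pressure and evicts it
touch(large, p)      # large refaults it and now owns it
converged_owner = p.owner.name

# Pinned for large but owned by small: reclaim cannot move the charge.
q = Page()
touch(small, q)
q.pinned = True      # pin taken on behalf of large, owner left unchanged
reclaim(small)
touch(large, q)
stuck_owner = q.owner.name
```

In this model the second page can never leave the smaller cgroup while the
pin is held, which is the irreversible case the paragraph above describes.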

> > > > for whatever reason is determining the pinning ownership or should the page
> > > > ownership be attributed the same way too? If they indeed need to differ,
> > > > that probably would need pretty strong justifications.
> > > 
> > > It is inherent to how pin_user_pages() works. It is an API that
> > > establishes pins on existing pages. There is nothing about it that says
> > > who the page's memcg owner is.
> > > 
> > > I don't think we can do anything about this without breaking things.
> > 
> > That's a discrepancy in an internal interface and we don't wanna codify
> > something like that into userspace interface. Semantically, it seems like if
> > pin_user_pages() wants to charge pinning to the cgroup associated with an fd
> > (or whatever), it should also claim the ownership of the pages
> > themselves.
> 
> Multiple cgroups can pin the same page, so it is not as simple as just
> transferring ownership, we need multi-ownership and to really fix the
> memcg limitations with MAP_SHARED without an API impact.
> 
> You are right that pinning is really just a special case of
> allocation, but there is a reason the memcg was left with weak support
> for MAP_SHARED, and changing that may be not just hard but an
> infeasible trade-off.
> 
> At least I don't have a good idea how to even approach building a
> reasonable data structure that can track the number of
> charges per-cgroup per page. :\

As I wrote above, I don't think the problem here is the case of pages being
shared by multiple cgroups concurrently. We can leave that problem for
another thread. However, if we want to support accounting and control of
pinned memory, we really shouldn't introduce a fundamental discrepancy like
the owner and pinner disagreeing with each other. At least conceptually, the
solution is rather straightforward - whoever pins a page should also claim
the ownership of it.
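
As a conceptual sketch only, the rule "whoever pins a page also claims its
ownership" could look like the following. Everything here is hypothetical
(the Page class, the pin() helper, the charges counter) and is not an
existing kernel interface. The case of several cgroups pinning the same page
is deliberately left out, in line with leaving concurrent sharing for
another thread.

```python
from collections import Counter

charges = Counter()   # cgroup name -> number of pages charged to it


class Page:
    def __init__(self, owner):
        self.owner = owner     # cgroup currently charged for the page
        self.pinner = None     # cgroup holding the pin, if any
        charges[owner] += 1


def pin(page, cgroup):
    """Pin the page and move its charge to the pinning cgroup."""
    if page.pinner is not None:
        # Assumption: concurrent pinners are out of scope for this sketch.
        raise RuntimeError("page is already pinned")
    if page.owner != cgroup:
        charges[page.owner] -= 1
        charges[cgroup] += 1
        page.owner = cgroup
    page.pinner = cgroup


page = Page("small")   # first charged to the smaller cgroup
pin(page, "large")     # the larger cgroup pins it and takes over the charge
```

After pin() returns, the owner and the pinner are the same cgroup, so the
accounting never has the two disagreeing.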

Thanks.

-- 
tejun
