archive mirror
 help / color / mirror / Atom feed
From: Mina Almasry <>
To: Mike Kravetz <>
Cc: "Aneesh Kumar" <>,
	shuah <>, "David Rientjes" <>,
	"Shakeel Butt" <>,
	"Greg Thelen" <>,
	"Andrew Morton" <>,,
	"open list" <>,,,, "Michal Koutný" <>
Subject: Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits
Date: Fri, 11 Oct 2019 13:41:36 -0700	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry <> wrote:
> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz <> wrote:
> >
> > On 9/19/19 3:24 PM, Mina Almasry wrote:
> > > Patch series implements hugetlb_cgroup reservation usage and limits, which
> > > track hugetlb reservations rather than hugetlb memory faulted in. Details of
> > > the approach is 1/7.
> >
> > Thanks for your continued efforts Mina.
> >
> > One thing that has bothered me with this approach from the beginning is that
> > hugetlb reservations are related to, but somewhat distinct from hugetlb
> > allocations.  The original (existing) huegtlb cgroup implementation does not
> > take reservations into account.  This is an issue you are trying to address
> > by adding a cgroup support for hugetlb reservations.  However, this new
> > reservation cgroup ignores hugetlb allocations at fault time.
> >
> > I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
> > of hugetlb pages.  Both the existing cgroup code and the reservation approach
> > have what I think are some serious flaws.  Consider a system with 100 hugetlb
> > pages available.  A sysadmin, has two groups A and B and wants to limit hugetlb
> > usage to 50 pages each.
> >
> > With the existing implementation, a task in group A could create a mmap of
> > 100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
> > nobody in group B can allocate ANY huge pages.  This is true even though
> > no pages have been allocated in A (or B).
> >
> > With the reservation implementation, a task in group A could use MAP_NORESERVE
> > and allocate all 100 pages without taking any reservations.
> >
> > As mentioned in your documentation, it would be possible to use both the
> > existing (allocation) and new reservation cgroups together.  Perhaps if both
> > are setup for the 50/50 split things would work a little better.
> >
> > However, instead of creating a new reservation crgoup how about adding
> > reservation support to the existing allocation cgroup support.  One could
> > even argue that a reservation is an allocation as it sets aside huge pages
> > that can only be used for a specific purpose.  Here is something that
> > may work.
> >
> > Starting with the existing allocation cgroup.
> > - When hugetlb pages are reserved, the cgroup of the task making the
> >   reservations is charged.  Tracking for the charged cgroup is done in the
> >   reservation map in the same way proposed by this patch set.
> > - At page fault time,
> >   - If a reservation already exists for that specific area do not charge the
> >     faulting task.  No tracking in page, just the reservation map.
> >   - If no reservation exists, charge the group of the faulting task.  Tracking
> >     of this information is in the page itself as implemented today.
> > - When the hugetlb object is removed, compare the reservation map with any
> >   allocated pages.  If cgroup tracking information exists in page, uncharge
> >   that group.  Otherwise, unharge the group (if any) in the reservation map.
> >
> Sorry for the late response here. I've been prototyping the
> suggestions from this conversation:
> 1. Supporting cgroup-v2 on the current controller seems trivial.
> Basically just specifying the dfl files seems to do it, and my tests
> on top of cgroup-v2 don't see any problems so far at least. In light
> of this I'm not sure it's best to create a new controller per say.
> Seems like it would duplicate a lot of code with the current
> controller, so I've tentatively just stuck to the plan in my current
> patchset, a new counter on the existing controller.
> 2. I've been working on transitioning the new counter to the behavior
> Mike specified in the email I'm responding to. So far I have a flow
> that works for shared mappings but not private mappings:
> - On reservation, charge the new counter and store the info in the
> resv_map. The counter gets uncharged when the resv_map entry gets
> removed (works fine).
> - On alloc_huge_page(), check if there is a reservation for the page
> being allocated. If not, charge the new counter and store the
> information in resv_map. The counter still gets uncharged when the
> resv_map entry gets removed.
> The above works for all shared mappings and reserved private mappings,
> but I' having trouble supporting private NORESERVE mappings. Charging
> can work the same as for shared mappings: charge the new counter on
> reservation and on allocations that do not have a reservation. But the
> question still comes up: where to store the counter to uncharge this
> page? I thought of a couple of things that don't seem to work:
> 1. I thought of putting the counter in resv_map->reservation_counter,
> so that it gets uncharged on vm_op_close. But, private NORESERVE
> mappings don't even have a resv_map allocated for them.
> 2. I thought of detecting on free_huge_page that the page being freed
> belonged to a private NORESERVE mapping, and uncharging the
> hugetlb_cgroup in the page itself, but free_huge_page only gets a
> struct page* and I can't seem to find a way to detect that that the
> page comes from a private NORESERVE mapping from only the struct
> page*.
> Mike, note your suggestion above to check if the page hugetlb_cgroup
> is null doesn't work if we want to keep the current counter working
> the same: the page will always have a hugetlb_cgroup that points that
> contains the old counter. Any ideas how to apply this new counter
> behavior to a private NORESERVE mappings? Is there maybe a flag I can
> set on the pages at allocation time that I can read on free time to
> know whether to uncharge the hugetlb_cgroup or not?

Reading the code and asking around a bit, it seems the pointer to the
hugetlb_cgroup is in page[2].private. Is it reasonable to use
page[3].private to store the hugetlb_cgroup to uncharge for the new
counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that
would solve my problem. When allocating a private NORESERVE page, set
page[3].private to the hugetlb_cgroup to uncharge, then on
free_huge_page, check page[3].private, if it is non-NULL, uncharge the
new counter on it.

  reply	other threads:[~2019-10-11 20:41 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-19 22:24 [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mina Almasry
2019-09-19 22:24 ` [PATCH v5 1/7] hugetlb_cgroup: Add hugetlb_cgroup reservation counter Mina Almasry
2019-09-19 22:24 ` [PATCH v5 2/7] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations Mina Almasry
2019-09-19 22:24 ` [PATCH v5 3/7] hugetlb_cgroup: add reservation accounting for private mappings Mina Almasry
2019-09-19 22:24 ` [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing Mina Almasry
2019-09-27 21:44   ` Mike Kravetz
2019-09-27 22:33     ` Mina Almasry
2019-09-19 22:24 ` [PATCH v5 5/7] hugetlb_cgroup: add accounting for shared mappings Mina Almasry
2019-09-19 22:24 ` [PATCH v5 6/7] hugetlb_cgroup: Add hugetlb_cgroup reservation tests Mina Almasry
2019-09-19 22:24 ` [PATCH v5 7/7] hugetlb_cgroup: Add hugetlb_cgroup reservation docs Mina Almasry
2019-09-23 17:47 ` [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits Mike Kravetz
2019-09-23 19:18   ` Mina Almasry
2019-09-23 21:27     ` Mike Kravetz
2019-09-24 22:42       ` Mina Almasry
2019-09-26 19:28         ` David Rientjes
2019-09-26 21:23           ` Mike Kravetz
2019-09-27  0:55             ` Mina Almasry
2019-09-27 21:59               ` Mike Kravetz
2019-09-27 22:51                 ` Mina Almasry
2019-09-27 22:56                   ` Mike Kravetz
2019-09-30 15:12               ` Michal Koutný
2019-10-11 19:10   ` Mina Almasry
2019-10-11 20:41     ` Mina Almasry [this message]
2019-10-14 17:33       ` Mike Kravetz
2019-10-14 18:01         ` Mina Almasry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='' \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).