From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.4 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3F3C0C47404 for ; Fri, 11 Oct 2019 20:41:51 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id C52952089F for ; Fri, 11 Oct 2019 20:41:50 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="O+quGUL9" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org C52952089F Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 0623B6B0003; Fri, 11 Oct 2019 16:41:50 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id F2D7F8E0005; Fri, 11 Oct 2019 16:41:49 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DF49B8E0001; Fri, 11 Oct 2019 16:41:49 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0045.hostedemail.com [216.40.44.45]) by kanga.kvack.org (Postfix) with ESMTP id B511D6B0003 for ; Fri, 11 Oct 2019 16:41:49 -0400 (EDT) Received: from smtpin15.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with SMTP id 5AE5052B4 for ; Fri, 11 Oct 2019 20:41:49 +0000 (UTC) X-FDA: 76032675138.15.table31_4a0d04524b35a X-HE-Tag: table31_4a0d04524b35a X-Filterd-Recvd-Size: 9433 Received: from mail-ot1-f66.google.com (mail-ot1-f66.google.com [209.85.210.66]) by imf34.hostedemail.com (Postfix) with ESMTP for ; Fri, 11 Oct 2019 20:41:48 +0000 (UTC) Received: by mail-ot1-f66.google.com with SMTP id s22so9075339otr.6 for ; Fri, 11 Oct 2019 13:41:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=QCcI0gZx24qP48bYZd6tERlQ1R1wkv8V/GwDJhxA6+4=; b=O+quGUL9GaTo+gzYgH53b4VZvt/bBFsOo+HoyW0Wa5cymiVBv3ZBICK2Ab1z8ywiOW 6UVDcg8sc6Kd9egwHekRJrJLEVqHBAcElTS2vmtilW/W523Gbm1xXC2JdGqy3WcC42zV qewmH9LJH5ZfFOhFpCYjQElqSODyKlrLrWR8vZlyS+UHQmC3h1AdXsq/YCnEgPTAhJgV qRFKcH01zJTADmivj0w3XEJlAU+KrMMsmZlgI/NUpjIM453sU5iM50jTq8axNxiArMBO iErIVWcGoBL9clvTA9iUAnIkCIQjpUW6CTHdEr0lqSvqUBNMxsgd+1SIXCsPFz7cpG6L zLOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=QCcI0gZx24qP48bYZd6tERlQ1R1wkv8V/GwDJhxA6+4=; b=IFRd8AtgrLwOng3DcYLCurowUoRnCFj6qv6551kyEiuRnbQEnOcMERyREZhJO8Rhj3 z04sOa2B1YrUL0c89dz+mmXpIghtCVXECfohtDbLBi/FeB/zlYhD/+vmW9eSsFZQbm8S Ui9LcIpj+njTCIAPzEpYmlvICS6pnG+ZpKiaiWNMcdQ3etyNHtISN+HAS0ibCRtkhY3w DUbFFMl5zDiE3EDgAosbYiW9cKLezjUeIa/Q+2aPTdHrEDHYhkz66B6d23SkCHXWQIMi TGwbmoihjMfLboEsiZW5F8mD8WbT5MUIptKDiM9tK6du1m0ql4kLrFZskhjWAHxAz8CU kK2A== X-Gm-Message-State: APjAAAUOXqYukQyu428yK/DuiRyeKlQyw6+8Tm84vzG9YItGehfuz9tP +pjgVhN/5tPDY4ZQ1CNk6ISyTZDLgOL8DMcTLCgHzg== X-Google-Smtp-Source: APXvYqzNjGsf2tP7ZzFAGiWPzmM+QG3u5csg/sFiRGr1eYLK5VU0i/5K52TZlUDNo2IZdUDXQG8XuW+YL+3Hn6n6F34= X-Received: by 2002:a9d:5c07:: with SMTP id o7mr14443071otk.33.1570826507552; Fri, 11 Oct 2019 13:41:47 -0700 (PDT) MIME-Version: 1.0 References: <20190919222421.27408-1-almasrymina@google.com> <3c73d2b7-f8d0-16bf-b0f0-86673c3e9ce3@oracle.com> In-Reply-To: From: Mina Almasry Date: Fri, 11 Oct 2019 13:41:36 -0700 Message-ID: Subject: Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits To: Mike Kravetz Cc: Aneesh Kumar , shuah , David Rientjes , Shakeel Butt , Greg Thelen , Andrew Morton , khalid.aziz@oracle.com, open list , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, cgroups@vger.kernel.org, =?UTF-8?Q?Michal_Koutn=C3=BD?= Content-Type: text/plain; charset="UTF-8" X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry wrote: > > On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz wrote: > > > > On 9/19/19 3:24 PM, Mina Almasry wrote: > > > Patch series implements hugetlb_cgroup reservation usage and limits, which > > > track hugetlb reservations rather than hugetlb memory faulted in. Details of > > > the approach is 1/7. > > > > Thanks for your continued efforts Mina. > > > > One thing that has bothered me with this approach from the beginning is that > > hugetlb reservations are related to, but somewhat distinct from hugetlb > > allocations. The original (existing) huegtlb cgroup implementation does not > > take reservations into account. This is an issue you are trying to address > > by adding a cgroup support for hugetlb reservations. However, this new > > reservation cgroup ignores hugetlb allocations at fault time. > > > > I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation > > of hugetlb pages. Both the existing cgroup code and the reservation approach > > have what I think are some serious flaws. Consider a system with 100 hugetlb > > pages available. A sysadmin, has two groups A and B and wants to limit hugetlb > > usage to 50 pages each. > > > > With the existing implementation, a task in group A could create a mmap of > > 100 pages in size and reserve all 100 pages. Since the pages are 'reserved', > > nobody in group B can allocate ANY huge pages. This is true even though > > no pages have been allocated in A (or B). > > > > With the reservation implementation, a task in group A could use MAP_NORESERVE > > and allocate all 100 pages without taking any reservations. > > > > As mentioned in your documentation, it would be possible to use both the > > existing (allocation) and new reservation cgroups together. Perhaps if both > > are setup for the 50/50 split things would work a little better. > > > > However, instead of creating a new reservation crgoup how about adding > > reservation support to the existing allocation cgroup support. One could > > even argue that a reservation is an allocation as it sets aside huge pages > > that can only be used for a specific purpose. Here is something that > > may work. > > > > Starting with the existing allocation cgroup. > > - When hugetlb pages are reserved, the cgroup of the task making the > > reservations is charged. Tracking for the charged cgroup is done in the > > reservation map in the same way proposed by this patch set. > > - At page fault time, > > - If a reservation already exists for that specific area do not charge the > > faulting task. No tracking in page, just the reservation map. > > - If no reservation exists, charge the group of the faulting task. Tracking > > of this information is in the page itself as implemented today. > > - When the hugetlb object is removed, compare the reservation map with any > > allocated pages. If cgroup tracking information exists in page, uncharge > > that group. Otherwise, unharge the group (if any) in the reservation map. > > > > Sorry for the late response here. I've been prototyping the > suggestions from this conversation: > > 1. Supporting cgroup-v2 on the current controller seems trivial. > Basically just specifying the dfl files seems to do it, and my tests > on top of cgroup-v2 don't see any problems so far at least. In light > of this I'm not sure it's best to create a new controller per say. > Seems like it would duplicate a lot of code with the current > controller, so I've tentatively just stuck to the plan in my current > patchset, a new counter on the existing controller. > > 2. I've been working on transitioning the new counter to the behavior > Mike specified in the email I'm responding to. So far I have a flow > that works for shared mappings but not private mappings: > > - On reservation, charge the new counter and store the info in the > resv_map. The counter gets uncharged when the resv_map entry gets > removed (works fine). > - On alloc_huge_page(), check if there is a reservation for the page > being allocated. If not, charge the new counter and store the > information in resv_map. The counter still gets uncharged when the > resv_map entry gets removed. > > The above works for all shared mappings and reserved private mappings, > but I' having trouble supporting private NORESERVE mappings. Charging > can work the same as for shared mappings: charge the new counter on > reservation and on allocations that do not have a reservation. But the > question still comes up: where to store the counter to uncharge this > page? I thought of a couple of things that don't seem to work: > > 1. I thought of putting the counter in resv_map->reservation_counter, > so that it gets uncharged on vm_op_close. But, private NORESERVE > mappings don't even have a resv_map allocated for them. > > 2. I thought of detecting on free_huge_page that the page being freed > belonged to a private NORESERVE mapping, and uncharging the > hugetlb_cgroup in the page itself, but free_huge_page only gets a > struct page* and I can't seem to find a way to detect that that the > page comes from a private NORESERVE mapping from only the struct > page*. > > Mike, note your suggestion above to check if the page hugetlb_cgroup > is null doesn't work if we want to keep the current counter working > the same: the page will always have a hugetlb_cgroup that points that > contains the old counter. Any ideas how to apply this new counter > behavior to a private NORESERVE mappings? Is there maybe a flag I can > set on the pages at allocation time that I can read on free time to > know whether to uncharge the hugetlb_cgroup or not? Reading the code and asking around a bit, it seems the pointer to the hugetlb_cgroup is in page[2].private. Is it reasonable to use page[3].private to store the hugetlb_cgroup to uncharge for the new counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that would solve my problem. When allocating a private NORESERVE page, set page[3].private to the hugetlb_cgroup to uncharge, then on free_huge_page, check page[3].private, if it is non-NULL, uncharge the new counter on it.