Date: Wed, 27 Apr 2022 16:04:56 -0400
From: Peter Xu
To: Zach O'Keefe
Cc: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox, Michal Hocko,
    Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Yang Shi, Zi Yan,
    linux-mm@kvack.org, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
    Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
    Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe, "Kirill A. Shutemov",
    Matt Turner, Max Filippov, Miaohe Lin, Minchan Kim, Patrick Xia,
    Pavel Begunkov, Thomas Bogendoerfer, kernel test robot
Subject: Re: [PATCH v3 03/12] mm/khugepaged: make hugepage allocation context-specific
References: <20220426144412.742113-1-zokeefe@google.com>
 <20220426144412.742113-4-zokeefe@google.com>
In-Reply-To: <20220426144412.742113-4-zokeefe@google.com>

On Tue, Apr 26, 2022 at 07:44:03AM -0700, Zach O'Keefe wrote:
> Add hugepage allocation context to struct collapse_context, allowing
> different collapse contexts to allocate hugepages differently. For
> example, khugepaged decides to allocate differently in NUMA and UMA
> configurations, and other collapse contexts shouldn't be coupled to this
> decision. Likewise for the gfp flags used for said allocation.
>
> Additionally, move [pre]allocated hugepage pointer into
> struct collapse_context.
>
> Signed-off-by: Zach O'Keefe
> Reported-by: kernel test robot

(Please remember to remove this one too, and the rest of them.. :)

> ---
>  mm/khugepaged.c | 102 ++++++++++++++++++++++++------------------------
>  1 file changed, 52 insertions(+), 50 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9d42fa330812..c4962191d6e1 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -92,6 +92,11 @@ struct collapse_control {
>
>         /* Last target selected in khugepaged_find_target_node() for this scan */
>         int last_target_node;
> +
> +       struct page *hpage;
> +       gfp_t (*gfp)(void);
> +       struct page* (*alloc_hpage)(struct collapse_control *cc, gfp_t gfp,
> +                                   int node);

Nit: s/page* /page */ looks a bit more consistent..
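
I.e., just spelling out that substitution (illustration only):

        struct page *(*alloc_hpage)(struct collapse_control *cc, gfp_t gfp,
                                    int node);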
> };
>
> /**
> @@ -877,21 +882,21 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>         return true;
> }
>
> -static struct page *
> -khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> +static struct page *khugepaged_alloc_page(struct collapse_control *cc,
> +                                          gfp_t gfp, int node)

Nit: I'd suggest we put the collapse_control pointer either always at the
start or always at the end of the parameter list, and keep doing that where
possible.  IIRC you used to always put it at the end, but now it's at the
start.  Not a big deal, but it'll be nice to keep the code self-consistent.

> {
> -       VM_BUG_ON_PAGE(*hpage, *hpage);
> +       VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
>
> -       *hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> -       if (unlikely(!*hpage)) {
> +       cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
> +       if (unlikely(!cc->hpage)) {
>                 count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
> -               *hpage = ERR_PTR(-ENOMEM);
> +               cc->hpage = ERR_PTR(-ENOMEM);
>                 return NULL;
>         }
>
> -       prep_transhuge_page(*hpage);
> +       prep_transhuge_page(cc->hpage);
>         count_vm_event(THP_COLLAPSE_ALLOC);
> -       return *hpage;
> +       return cc->hpage;
> }
> #else
> static int khugepaged_find_target_node(struct collapse_control *cc)
> @@ -953,12 +958,12 @@ static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
>         return true;
> }
>
> -static struct page *
> -khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
> +static struct page *khugepaged_alloc_page(struct collapse_control *cc,
> +                                          gfp_t gfp, int node)
> {
> -       VM_BUG_ON(!*hpage);
> +       VM_BUG_ON(!cc->hpage);
>
> -       return *hpage;
> +       return cc->hpage;
> }
> #endif
>
> @@ -1080,10 +1085,9 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>         return true;
> }
>
> -static void collapse_huge_page(struct mm_struct *mm,
> -                              unsigned long address,
> -                              struct page **hpage,
> -                              int node, int referenced, int unmapped)
> +static void collapse_huge_page(struct mm_struct *mm, unsigned long address,
> +                              struct collapse_control *cc, int referenced,
> +                              int unmapped)
> {
>         LIST_HEAD(compound_pagelist);
>         pmd_t *pmd, _pmd;
> @@ -1096,11 +1100,12 @@ static void collapse_huge_page(struct mm_struct *mm,
>         struct mmu_notifier_range range;
>         gfp_t gfp;
>         const struct cpumask *cpumask;
> +       int node;
>
>         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
>         /* Only allocate from the target node */
> -       gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> +       gfp = cc->gfp() | __GFP_THISNODE;
>
>         /*
>          * Before allocating the hugepage, release the mmap_lock read lock.
> @@ -1110,13 +1115,14 @@ static void collapse_huge_page(struct mm_struct *mm,
>          */
>         mmap_read_unlock(mm);
>
> +       node = khugepaged_find_target_node(cc);
>         /* sched to specified node before huage page memory copy */
>         if (task_node(current) != node) {
>                 cpumask = cpumask_of_node(node);
>                 if (!cpumask_empty(cpumask))
>                         set_cpus_allowed_ptr(current, cpumask);
>         }
> -       new_page = khugepaged_alloc_page(hpage, gfp, node);
> +       new_page = cc->alloc_hpage(cc, gfp, node);

AFAICT you removed all direct calls to khugepaged_alloc_page() in this
patch, so you'd better drop the function for both NUMA and UMA.

That said, I think it's actually better to keep them, as they do useful
things.  For example, AFAICT this ->alloc_hpage() change can leak the hpage
already allocated for UMA, which is why I think keeping them makes sense;
IIUC the BUG_ON would trigger with UMA already.
I saw that you've moved khugepaged_find_target_node() into this function,
which definitely looks good.  But if we keep khugepaged_alloc_page(), then
IMHO we could even move khugepaged_find_target_node() into it and drop the
"node" parameter from khugepaged_alloc_page().  I'd even consider moving
the cc->gfp() call into it as well if possible, since the gfp always seems
to be used with __GFP_THISNODE anyway.

What do you think?
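
Roughly what I have in mind, as a rough and untested sketch only (NUMA
version shown; it also glosses over the fact that collapse_huge_page()
still wants the node for the set_cpus_allowed_ptr() call):

static struct page *khugepaged_alloc_page(struct collapse_control *cc)
{
        int node = khugepaged_find_target_node(cc);
        /* Only allocate from the target node */
        gfp_t gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;

        VM_BUG_ON_PAGE(cc->hpage, cc->hpage);

        cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
        if (unlikely(!cc->hpage)) {
                count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
                cc->hpage = ERR_PTR(-ENOMEM);
                return NULL;
        }

        prep_transhuge_page(cc->hpage);
        count_vm_event(THP_COLLAPSE_ALLOC);
        return cc->hpage;
}

Callers would then not need to compute gfp or node at all, e.g.
"new_page = khugepaged_alloc_page(cc);" (or keep going through
cc->alloc_hpage(cc) if the hook stays).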
>         if (!new_page) {
>                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
>                 goto out_nolock;
>         }
> @@ -1238,15 +1244,15 @@ static void collapse_huge_page(struct mm_struct *mm,
>         update_mmu_cache_pmd(vma, address, pmd);
>         spin_unlock(pmd_ptl);
>
> -       *hpage = NULL;
> +       cc->hpage = NULL;
>
>         khugepaged_pages_collapsed++;
>         result = SCAN_SUCCEED;
> out_up_write:
>         mmap_write_unlock(mm);
> out_nolock:
> -       if (!IS_ERR_OR_NULL(*hpage))
> -               mem_cgroup_uncharge(page_folio(*hpage));
> +       if (!IS_ERR_OR_NULL(cc->hpage))
> +               mem_cgroup_uncharge(page_folio(cc->hpage));
>         trace_mm_collapse_huge_page(mm, isolated, result);
>         return;
> }
> @@ -1254,7 +1260,6 @@ static void collapse_huge_page(struct mm_struct *mm,
> static int khugepaged_scan_pmd(struct mm_struct *mm,
>                                struct vm_area_struct *vma,
>                                unsigned long address,
> -                              struct page **hpage,
>                                struct collapse_control *cc)
> {
>         pmd_t *pmd;
> @@ -1399,10 +1404,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> out_unmap:
>         pte_unmap_unlock(pte, ptl);
>         if (ret) {
> -               node = khugepaged_find_target_node(cc);
>                 /* collapse_huge_page will return with the mmap_lock released */
> -               collapse_huge_page(mm, address, hpage, node,
> -                                  referenced, unmapped);
> +               collapse_huge_page(mm, address, cc, referenced, unmapped);
>         }
> out:
>         trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
> @@ -1667,8 +1670,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  * @mm: process address space where collapse happens
>  * @file: file that collapse on
>  * @start: collapse start address
> - * @hpage: new allocated huge page for collapse
> - * @node: appointed node the new huge page allocate from
> + * @cc: collapse context and scratchpad
>  *
>  * Basic scheme is simple, details are more complex:
>  * - allocate and lock a new huge page;
> @@ -1686,8 +1688,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  * + unlock and free huge page;
>  */
> static void collapse_file(struct mm_struct *mm,
> -                         struct file *file, pgoff_t start,
> -                         struct page **hpage, int node)
> +                         struct file *file, pgoff_t start,
> +                         struct collapse_control *cc)
> {
>         struct address_space *mapping = file->f_mapping;
>         gfp_t gfp;
> @@ -1697,15 +1699,16 @@ static void collapse_file(struct mm_struct *mm,
>         XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
>         int nr_none = 0, result = SCAN_SUCCEED;
>         bool is_shmem = shmem_file(file);
> -       int nr;
> +       int nr, node;
>
>         VM_BUG_ON(!IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && !is_shmem);
>         VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
>
>         /* Only allocate from the target node */
> -       gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> +       gfp = cc->gfp() | __GFP_THISNODE;
> +       node = khugepaged_find_target_node(cc);
>
> -       new_page = khugepaged_alloc_page(hpage, gfp, node);
> +       new_page = cc->alloc_hpage(cc, gfp, node);
>         if (!new_page) {
>                 result = SCAN_ALLOC_HUGE_PAGE_FAIL;
>                 goto out;
>         }
> @@ -1998,7 +2001,7 @@ static void collapse_file(struct mm_struct *mm,
>          * Remove pte page tables, so we can re-fault the page as huge.
>          */
>         retract_page_tables(mapping, start);
> -       *hpage = NULL;
> +       cc->hpage = NULL;
>
>         khugepaged_pages_collapsed++;
> } else {
> @@ -2045,14 +2048,14 @@ static void collapse_file(struct mm_struct *mm,
>         unlock_page(new_page);
> out:
>         VM_BUG_ON(!list_empty(&pagelist));
> -       if (!IS_ERR_OR_NULL(*hpage))
> -               mem_cgroup_uncharge(page_folio(*hpage));
> +       if (!IS_ERR_OR_NULL(cc->hpage))
> +               mem_cgroup_uncharge(page_folio(cc->hpage));
>         /* TODO: tracepoints */
> }
>
> static void khugepaged_scan_file(struct mm_struct *mm,
> -               struct file *file, pgoff_t start, struct page **hpage,
> -               struct collapse_control *cc)
> +               struct file *file, pgoff_t start,
> +               struct collapse_control *cc)
> {
>         struct page *page = NULL;
>         struct address_space *mapping = file->f_mapping;
> @@ -2125,8 +2128,7 @@ static void khugepaged_scan_file(struct mm_struct *mm,
>                         result = SCAN_EXCEED_NONE_PTE;
>                         count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>                 } else {
> -                       node = khugepaged_find_target_node(cc);
> -                       collapse_file(mm, file, start, hpage, node);
> +                       collapse_file(mm, file, start, cc);
>                 }
>         }
>
> @@ -2134,8 +2136,8 @@ static void khugepaged_scan_file(struct mm_struct *mm,
> }
> #else
> static void khugepaged_scan_file(struct mm_struct *mm,
> -               struct file *file, pgoff_t start, struct page **hpage,
> -               struct collapse_control *cc)
> +               struct file *file, pgoff_t start,
> +               struct collapse_control *cc)
> {
>         BUILD_BUG();
> }
> @@ -2146,7 +2148,6 @@ static void khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
> #endif
>
> static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> -                                           struct page **hpage,
>                                             struct collapse_control *cc)
>         __releases(&khugepaged_mm_lock)
>         __acquires(&khugepaged_mm_lock)
> @@ -2223,12 +2224,11 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>
>                         mmap_read_unlock(mm);
>                         ret = 1;
> -                       khugepaged_scan_file(mm, file, pgoff, hpage, cc);
> +                       khugepaged_scan_file(mm, file, pgoff, cc);
>                         fput(file);
>                 } else {
>                         ret = khugepaged_scan_pmd(mm, vma,
> -                                       khugepaged_scan.address,
> -                                       hpage, cc);
> +                                       khugepaged_scan.address, cc);
>                 }
>                 /* move to next address */
>                 khugepaged_scan.address += HPAGE_PMD_SIZE;
> @@ -2286,15 +2286,15 @@ static int khugepaged_wait_event(void)
>
> static void khugepaged_do_scan(struct collapse_control *cc)
> {
> -       struct page *hpage = NULL;
>         unsigned int progress = 0, pass_through_head = 0;
>         unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
>         bool wait = true;
>
> +       cc->hpage = NULL;
>         lru_add_drain_all();
>
>         while (progress < pages) {
> -               if (!khugepaged_prealloc_page(&hpage, &wait))
> +               if (!khugepaged_prealloc_page(&cc->hpage, &wait))
>                         break;
>
>                 cond_resched();
> @@ -2308,14 +2308,14 @@ static void khugepaged_do_scan(struct collapse_control *cc)
>                 if (khugepaged_has_work() &&
>                     pass_through_head < 2)
>                         progress += khugepaged_scan_mm_slot(pages - progress,
> -                                                           &hpage, cc);
> +                                                           cc);
>                 else
>                         progress = pages;
>                 spin_unlock(&khugepaged_mm_lock);
>         }
>
> -       if (!IS_ERR_OR_NULL(hpage))
> -               put_page(hpage);
> +       if (!IS_ERR_OR_NULL(cc->hpage))
> +               put_page(cc->hpage);
> }
>
> static bool khugepaged_should_wakeup(void)
> @@ -2349,6 +2349,8 @@ static int khugepaged(void *none)
>         struct mm_slot *mm_slot;
>         struct collapse_control cc = {
>                 .last_target_node = NUMA_NO_NODE,
> +               .gfp = &alloc_hugepage_khugepaged_gfpmask,
> +               .alloc_hpage = &khugepaged_alloc_page,

I'm also wondering whether we could avoid the gfp() hook altogether, because
logically it can be inlined into the ->alloc_hpage() hook; then we don't need
the gfp parameter for ->alloc_hpage() anymore, which seems to be a win-win?
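
For instance (again just an untested sketch of where this could end up):

        struct page *hpage;
        struct page *(*alloc_hpage)(struct collapse_control *cc, int node);

I.e. struct collapse_control drops the ->gfp() hook completely and each
->alloc_hpage() implementation picks its own gfp mask internally (the
khugepaged one keeping alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE).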
-- 
Peter Xu