From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B4F57C2B9F4 for ; Mon, 28 Jun 2021 09:28:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 98FA461C4D for ; Mon, 28 Jun 2021 09:28:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232480AbhF1Ja1 (ORCPT ); Mon, 28 Jun 2021 05:30:27 -0400 Received: from szxga08-in.huawei.com ([45.249.212.255]:9282 "EHLO szxga08-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232200AbhF1JaZ (ORCPT ); Mon, 28 Jun 2021 05:30:25 -0400 Received: from dggemv711-chm.china.huawei.com (unknown [172.30.72.56]) by szxga08-in.huawei.com (SkyGuard) with ESMTP id 4GD2FZ26kZz1BSb5; Mon, 28 Jun 2021 17:22:42 +0800 (CST) Received: from dggema753-chm.china.huawei.com (10.1.198.195) by dggemv711-chm.china.huawei.com (10.1.198.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256) id 15.1.2176.2; Mon, 28 Jun 2021 17:27:59 +0800 Received: from huawei.com (10.174.179.206) by dggema753-chm.china.huawei.com (10.1.198.195) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.2176.2; Mon, 28 Jun 2021 17:27:58 +0800 From: Bin Wang To: CC: , , , , Subject: Re: Re: [PATCH v2] mm: hugetlb: add hwcrp_hugepages to record memory failure on hugetlbfs Date: Mon, 28 Jun 2021 17:27:52 +0800 Message-ID: <20210628092752.2135-1-wangbin224@huawei.com> X-Mailer: git-send-email 2.29.2.windows.3 In-Reply-To: References: MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII X-Originating-IP: [10.174.179.206] X-ClientProxiedBy: dggems705-chm.china.huawei.com (10.3.19.182) To dggema753-chm.china.huawei.com (10.1.198.195) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > > index 926eeb9bf4eb..ffb6e7b6756b 100644 > > --- a/fs/hugetlbfs/inode.c > > +++ b/fs/hugetlbfs/inode.c > > @@ -986,8 +986,7 @@ static int hugetlbfs_error_remove_page(struct address_space *mapping, > > pgoff_t index = page->index; > > > > remove_huge_page(page); > > - if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1))) > > - hugetlb_fix_reserve_counts(inode); > > As mentioned, huge page reserve counts are not used to record number of > poisioned pages. The calls to hugetlb_unreserve_pages and possibly > hugetlb_fix_reserve_counts are necessary for reserve accounting. They > can not be removed. Thanks for your explanation very much. I didn't get the point of the comments from the first patch. hugetlb_fix_reserve_counts() shouldn't be removed and I will fix this. > > + hugetlb_fix_hwcrp_counts(page); > > This new routine just counts memory errors on 'in use' huge pages. > I do not see a call anywhere to count memory errors on huge pages > not in use. It's my oversight. I should have considered this situation. I tested it with hwcrp_hugepages count, and this is the result: # cat /proc/meminfo |grep -E 'HugePages_|Hard' HardwareCorrupted: 0 kB HugePages_Total: 64 HugePages_Free: 64 HugePages_Rsvd: 0 HugePages_Surp: 0 HugePages_Hwcrp: 0 No count changes, even the HardwareCorrupted. I'm not sure if this is normal. This is what happens in kernel(stable master branch): static int memory_failure_hugetlb(unsigned long pfn, int flags) { ... /* TestSetPageHWPoison return 0 */ if (TestSetPageHWPoison(head)) { ... } num_poisoned_pages_inc(); /* HardwareCorrupted += PAGE_SIZE */ /* get_hwpoison_page() return 0 */ if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags, 0)) { /* * Check "filter hit" and "race with other subpage." */ lock_page(head); /* PageHWPoison() return 1 */ if (PageHWPoison(head)) { /* (p != head && TestSetPageHWPoison(head)) is hit */ if ((hwpoison_filter(p) && TestClearPageHWPoison(p)) || (p != head && TestSetPageHWPoison(head))) { /* HardwareCorrupted -= PAGE_SIZE */ num_poisoned_pages_dec(); unlock_page(head); return 0; } } ... } ... } It seems like that memory errors on huge pages not in use hit the "race with other subpage". I think I shouldn't add hwcrp_hugepages in this routine. Maybe we need more conditions to distinguish this. > > > > return 0; > > } > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > > index f7ca1a3870ea..1d5bada80aa5 100644 > > --- a/include/linux/hugetlb.h > > +++ b/include/linux/hugetlb.h > > @@ -171,6 +171,7 @@ void putback_active_hugepage(struct page *page); > > void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason); > > void free_huge_page(struct page *page); > > void hugetlb_fix_reserve_counts(struct inode *inode); > > +void hugetlb_fix_hwcrp_counts(struct page *page); > > extern struct mutex *hugetlb_fault_mutex_table; > > u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx); > > > > @@ -602,12 +603,14 @@ struct hstate { > > unsigned long free_huge_pages; > > unsigned long resv_huge_pages; > > unsigned long surplus_huge_pages; > > + unsigned long hwcrp_huge_pages; > > unsigned long nr_overcommit_huge_pages; > > struct list_head hugepage_activelist; > > struct list_head hugepage_freelists[MAX_NUMNODES]; > > unsigned int nr_huge_pages_node[MAX_NUMNODES]; > > unsigned int free_huge_pages_node[MAX_NUMNODES]; > > unsigned int surplus_huge_pages_node[MAX_NUMNODES]; > > + unsigned int hwcrp_huge_pages_node[MAX_NUMNODES]; > > I understand your requirement to count the number of memory errors on > hugetlb pages. However, we need to think carefully about we represent > that count. > > Noaya, do you have opinions on where would be the best place to store > this information? The hugetlb memory error code has the comment 'needs > work'. Ideally, we could isolate memory errors to a single base (4K for > x86) page and free the remaining base pages to buddy. We could also > potentially allocate a 'replacement' hugetlb page doing something like > alloc_and_dissolve_huge_page. > > If we get an error on a hugetlb page and can isolate it to a base page > and replace the huge page, is it still a huge page memory error? > > IMO, we should work on isolating memory errors to a base page and > replacing the huge page. Then, the existing count of base pages with > memory errors would be sufficient? > > This is something I would like to work, but I have higher priorities > right now. Yes, splitting the huge pages and isolating a base page is ideal. And we do this with dissolve_free_huge_page() when page_mapping() return NULL. I think there is a reason(but I do not get it) why we don't split huge pags in hugetlbfs_error_remove_page() or after. So I choose to add a new count. -- Bin Wang