Subject: Re: [PATCH 2/2] mm/mempolicy: Skip walking HUGETLB vma if MPOL_MF_STRICT is specified alone
From: Mike Kravetz
To: Li Xinhai, "linux-mm@kvack.org"
Cc: akpm, mhocko, "yang.shi", n-horiguchi
Date: Wed, 15 Jan 2020 13:07:17 -0800

Naoya, I would appreciate any comments you have, as this seems to be related to your changes adding huge page migration support.

On 1/14/20 5:07 PM, Mike Kravetz wrote:
> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>    - Is that leftover from the days when huge page migration was not
>      supported?
>    - Is it just because huge page migration is more likely to fail than
>      base page migration?

According to the man page, mbind was added in the v2.6.7 kernel.
Git history does not go back that far, so I first looked at v2.6.12.

Definitions from include/linux/mempolicy.h
------------------------------------------
/* Policies */
#define MPOL_DEFAULT	0
#define MPOL_PREFERRED	1
#define MPOL_BIND	2
#define MPOL_INTERLEAVE	3

/* Flags for mbind */
#define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */

Note that MPOL_MF_STRICT simply verified the current page placement. At this time, hugetlb pages were not part of this verification.

From the v2.6.12 routine check_range():
...
	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
		if (!vma->vm_next && vma->vm_end < end)
			return ERR_PTR(-EFAULT);
		if (prev && prev->vm_end < vma->vm_start)
			return ERR_PTR(-EFAULT);
		if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
			err = verify_pages(vma->vm_mm,
					   vma->vm_start, vma->vm_end, nodes);
			if (err) {
				first = ERR_PTR(err);
				break;
			}
		}
		prev = vma;
	}
...

The man page history does not go back this far. The earliest available mbind man page has some interesting details:
1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge page mappings (even though the code does this).
2) Support for huge page policy was added with 2.6.16.
3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to move pages to the requested node with this flag.

Next, a look at the v2.6.16 kernel
==================================
This kernel introduces the MPOL_MF_MOVE* flags:
#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */

Once again, the up-front check_range() routine skips hugetlb VMAs. Note that check_range() also isolates pages. Since hugetlb VMAs are skipped, no hugetlb pages are isolated, so none of them end up on the list of pages to be migrated. Therefore, hugetlb pages are effectively ignored for the new MPOL_MF_MOVE* flags.
This makes sense, as huge page migration support was not added until the v3.12 kernel. Note that when the MPOL_MF_MOVE* flags were added to the mbind man page, the statement "MPOL_MF_STRICT is ignored on huge page mappings right now." was added. This was later changed to "MPOL_MF_STRICT is ignored on huge page mappings."

Next, a look at the v3.12 kernel
================================
The v3.12 code looks similar to today's code. In the verify/isolation phase, the v3.12 routine queue_pages_hugetlb_pmd_range() looks very similar to queue_pages_hugetlb(). Neither will cause errors at this point in the process. Also, hugetlb pages with mapcount == 1 may be isolated and added to the list of pages to be migrated. In addition, if the subsequent call to migrate_pages() fails to migrate ANY page, we return -EIO if MPOL_MF_STRICT is set. This is true even if the only page(s) that failed to migrate were hugetlb pages.

It should also be noted that no mbind man page updates were made WRT hugetlb pages after migration support was added.

Summary
=======
It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page mappings." is left over from the original mbind implementation. When huge page migration support was added, I cannot be sure whether ignoring MPOL_MF_STRICT for huge pages during the verify/isolation phase was intentional. It seems like it was, as the return value from isolate_huge_page() is ignored.

What should we do?
==================
1) Nothing more than the optimizations by Li Xinhai. Behavior that could be seen as conflicting with the man page has existed since v3.12, and I am not aware of any complaints.
2) In addition to the optimizations by Li Xinhai, modify the code to truly ignore MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do after a failure of migrate_pages(): we could simply traverse the list of pages that were not migrated, looking for any non-hugetlb page.
3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
   and modify the code accordingly.

My suggestion would be for 1 or 2. Thoughts?
-- 
Mike Kravetz
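Option 2 could look roughly like the following user-space sketch, which contrasts the v3.12+ behavior described above (any page left unmigrated plus MPOL_MF_STRICT gives -EIO) with a check that only fails when a non-hugetlb page is left over. struct page_model, strict_result_current() and strict_result_option2() are hypothetical; in the kernel the leftover pages would sit on a list_head list and the hugetlb test would be something like PageHuge(), neither of which this sketch uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MPOL_MF_STRICT (1 << 0) /* as in include/linux/mempolicy.h */

/* Hypothetical stand-in for a page left on the "not migrated" list. */
struct page_model {
	bool is_hugetlb;         /* true for a hugetlb page */
	struct page_model *next; /* next page that failed to migrate */
};

/* v3.12+ behavior as described: any leftover page fails STRICT. */
static int strict_result_current(const struct page_model *failed, int flags)
{
	return ((flags & MPOL_MF_STRICT) && failed != NULL) ? -EIO : 0;
}

/*
 * Option 2 sketch: walk the pages that were not migrated and report -EIO
 * only if at least one of them is a non-hugetlb page, so MPOL_MF_STRICT
 * is truly ignored for huge page mappings.
 */
static int strict_result_option2(const struct page_model *failed, int flags)
{
	if (!(flags & MPOL_MF_STRICT))
		return 0;
	for (const struct page_model *p = failed; p != NULL; p = p->next) {
		if (!p->is_hugetlb)
			return -EIO;
	}
	return 0; /* only hugetlb pages (or nothing) failed to migrate */
}
```

With a failed list holding only a hugetlb page, strict_result_current() returns -EIO while strict_result_option2() returns 0; once a base page is also on the list, both return -EIO.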