From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Aneesh Kumar K.V"
To: Dave Hansen, Anshuman Khandual, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org
Cc: mhocko@suse.com, js1304@gmail.com, vbabka@suse.cz, mgorman@suse.de,
    minchan@kernel.org, akpm@linux-foundation.org, bsingharora@gmail.com
Subject: Re: [RFC 3/8] mm: Isolate coherent device memory nodes from HugeTLB allocation paths
In-Reply-To: <580E41F0.20601@intel.com>
References: <1477283517-2504-1-git-send-email-khandual@linux.vnet.ibm.com>
    <1477283517-2504-4-git-send-email-khandual@linux.vnet.ibm.com>
    <580E41F0.20601@intel.com>
Date: Tue, 25 Oct 2016 09:45:53 +0530
MIME-Version: 1.0
Content-Type: text/plain
Message-Id: <87d1ipawsm.fsf@linux.vnet.ibm.com>
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Dave Hansen writes:

> On 10/23/2016 09:31 PM, Anshuman Khandual wrote:
>> This change is part of the isolation requiring coherent device memory nodes
>> implementation.
>>
>> Isolation seeking coherent device memory node requires allocation isolation
>> from implicit memory allocations from user space. Towards that effect, the
>> memory should not be used for generic HugeTLB page pool allocations. This
>> modifies relevant functions to skip all coherent memory nodes present on
>> the system during allocation, freeing and auditing for HugeTLB pages.
>
> This seems really fragile.  You had to hit, what, 18 call sites?  What
> are the odds that this is going to stay working?

I guess a better approach is to introduce a new node_states entry, so
that we have one state that excludes coherent device memory NUMA nodes.
One possibility is to have both N_SYSTEM_MEMORY and N_MEMORY: the current
N_MEMORY becomes N_SYSTEM_MEMORY, and N_MEMORY covers system memory plus
device or any other memory that is coherent.

All the isolation can then be achieved based on the nodemask_t used for
the allocation. For allocations that we want to keep off the coherent
device, we use the N_SYSTEM_MEMORY mask or a derivative of it; where we
are OK to allocate from CDM with fallbacks, we use N_MEMORY.

The all-nodes zonelist will still have zones from the coherent device
nodes, but we will not end up allocating from a coherent device node zone
because of the node mask used. With multiple coherent device nodes
present, this also makes sure we allocate from the correct one, based on
its distance from the currently executing NUMA node.

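To make the nodemask idea above concrete, here is a rough userspace
sketch, not kernel code. Everything in it is an assumption made for
illustration: the four-node layout, the distance table, the free page
counts, and the names nodemask, n_system_memory, n_memory and
alloc_node() are all hypothetical. It only models "pick the nearest node
that the mask allows", which is the behaviour described above.

```c
#include <assert.h>
#include <limits.h>

#define MAX_NODES 4

/* Hypothetical node mask, one bit per NUMA node (not the kernel nodemask_t). */
typedef unsigned int nodemask;

/* Assumed layout: nodes 0-1 are system RAM, nodes 2-3 are coherent device memory. */
static const nodemask n_system_memory = 0x3; /* stands in for the proposed N_SYSTEM_MEMORY */
static const nodemask n_memory        = 0xf; /* stands in for N_MEMORY: system + CDM */

/* Made-up NUMA distances; distance[a][b] is the cost of reaching b from a. */
static const int distance[MAX_NODES][MAX_NODES] = {
	{ 10, 20, 40, 80 },
	{ 20, 10, 80, 40 },
	{ 40, 80, 10, 90 },
	{ 80, 40, 90, 10 },
};

/* Free pages per node, standing in for the zones on an all-nodes zonelist. */
static int free_pages[MAX_NODES] = { 1, 0, 4, 4 };

/*
 * Take one page from the nearest node to 'from' that is both set in
 * 'allowed' and has free pages. Returns the node id, or -1 if no
 * allowed node can satisfy the request.
 */
static int alloc_node(int from, nodemask allowed)
{
	int best = -1;
	int best_dist = INT_MAX;

	assert(from >= 0 && from < MAX_NODES);

	for (int nid = 0; nid < MAX_NODES; nid++) {
		if (!(allowed & (1u << nid)))
			continue;	/* masked out: never considered */
		if (free_pages[nid] == 0)
			continue;	/* nothing left on this node */
		if (distance[from][nid] < best_dist) {
			best = nid;
			best_dist = distance[from][nid];
		}
	}

	if (best >= 0)
		free_pages[best]--;
	return best;
}
```

With these made-up numbers, a caller on node 1 using n_system_memory
first gets node 0 and then fails once system RAM is exhausted, without
ever touching nodes 2 or 3. The same caller using n_memory falls back to
node 3, the closer of the two device nodes, while a caller on node 0
lands on node 2.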
>
>> @@ -2666,6 +2688,10 @@ static void __init hugetlb_register_all_nodes(void)
>>
>>          for_each_node_state(nid, N_MEMORY) {
>>                  struct node *node = node_devices[nid];
>> +
>> +                if (isolated_cdm_node(nid))
>> +                        continue;
>> +
>>                  if (node->dev.id == nid)
>>                          hugetlb_register_node(node);
>>          }
>
> This looks to be completely kneecapping hugetlbfs on these cdm nodes.
> Is that really what you want?
>
>> @@ -2819,8 +2845,12 @@ static unsigned int cpuset_mems_nr(unsigned int *array)
>>          int node;
>>          unsigned int nr = 0;
>>
>> -        for_each_node_mask(node, cpuset_current_mems_allowed)
>> +        for_each_node_mask(node, cpuset_current_mems_allowed) {
>> +                if (isolated_cdm_node(node))
>> +                        continue;
>> +
>>                  nr += array[node];
>> +        }
>>
>>          return nr;
>>  }
>> @@ -2940,7 +2970,10 @@ void hugetlb_show_meminfo(void)
>>          if (!hugepages_supported())
>>                  return;
>>
>> -        for_each_node_state(nid, N_MEMORY)
>> +        for_each_node_state(nid, N_MEMORY) {
>> +                if (isolated_cdm_node(nid))
>> +                        continue;
>> +
>>                  for_each_hstate(h)
>>                          pr_info("Node %d hugepages_total=%u hugepages_free=%u hugepages_surp=%u hugepages_size=%lukB\n",
>>                                  nid,
>> @@ -2948,6 +2981,7 @@ void hugetlb_show_meminfo(void)
>>                                  h->free_huge_pages_node[nid],
>>                                  h->surplus_huge_pages_node[nid],
>>                                  1UL << (huge_page_order(h) + PAGE_SHIFT - 10));
>> +        }
>>  }
>
> Your patch description talks about removing *implicit* memory
> allocations.  But, this removes even the ability to gather *stats* about
> huge pages sitting on one of these nodes.  That's a lot more drastic
> than just changing implicit policies.
>
> Is that patch description accurate?
>
> It looks to me like you just went through all the for_each_node*() loops
> in hugetlb.c and hacked your node check into them indiscriminately.
> This totally removes the ability to *do* hugetlb on this nodes.
>
> Isn't there some simpler way to do all this, like maybe changing the
> root cpuset to disallow allocations to these nodes?

-aneesh