From: "Huang, Ying"
To: Zi Yan
Cc: Dave Hansen, Yang Shi, Michal Hocko, Wei Xu, David Rientjes, Dan Williams, David Hildenbrand, osalvador
Subject: Re: [PATCH -V8 02/10] mm/numa: automatically generate node migration order
Date: Tue, 22 Jun 2021 09:14:29 +0800
Message-ID: <87sg1an1je.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <2AA3D792-7F14-4297-8EDD-3B5A7B31AECA@nvidia.com> (Zi Yan's message of "Mon, 21 Jun 2021 10:50:14 -0400")
References: <20210618061537.434999-1-ying.huang@intel.com> <20210618061537.434999-3-ying.huang@intel.com> <79397FE3-4B08-4DE5-8468-C5CAE36A3E39@nvidia.com> <87v96anu6o.fsf@yhuang6-desk2.ccr.corp.intel.com> <2AA3D792-7F14-4297-8EDD-3B5A7B31AECA@nvidia.com>

Zi Yan writes:

> On 19 Jun 2021, at 4:18, Huang, Ying wrote:
>
>> Zi Yan writes:
>>
>>> On 18 Jun 2021, at 2:15, Huang Ying wrote:

[snip]

>>>> +/*
>>>> + * When memory fills up on a node, memory contents can be
>>>> + * automatically migrated to another node instead of
>>>> + * discarded at reclaim.
>>>> + *
>>>> + * Establish a "migration path" which will start at nodes
>>>> + * with CPUs and will follow the priorities used to build the
>>>> + * page allocator zonelists.
>>>> + *
>>>> + * The difference here is that cycles must be avoided.  If
>>>> + * node0 migrates to node1, then neither node1, nor anything
>>>> + * node1 migrates to can migrate to node0.
>>>> + *
>>>> + * This function can run simultaneously with readers of
>>>> + * node_demotion[].  However, it can not run simultaneously
>>>> + * with itself.  Exclusion is provided by memory hotplug events
>>>> + * being single-threaded.
>>>> + */
>>>> +static void __set_migration_target_nodes(void)
>>>> +{
>>>> +	nodemask_t next_pass	= NODE_MASK_NONE;
>>>> +	nodemask_t this_pass	= NODE_MASK_NONE;
>>>> +	nodemask_t used_targets = NODE_MASK_NONE;
>>>> +	int node;
>>>> +
>>>> +	/*
>>>> +	 * Avoid any oddities like cycles that could occur
>>>> +	 * from changes in the topology.  This will leave
>>>> +	 * a momentary gap when migration is disabled.
>>>> +	 */
>>>> +	disable_all_migrate_targets();
>>>> +
>>>> +	/*
>>>> +	 * Ensure that the "disable" is visible across the system.
>>>> +	 * Readers will see either a combination of before+disable
>>>> +	 * state or disable+after.  They will never see before and
>>>> +	 * after state together.
>>>> +	 *
>>>> +	 * The before+after state together might have cycles and
>>>> +	 * could cause readers to do things like loop until this
>>>> +	 * function finishes.  This ensures they can only see a
>>>> +	 * single "bad" read and would, for instance, only loop
>>>> +	 * once.
>>>> +	 */
>>>> +	smp_wmb();
>>>> +
>>>> +	/*
>>>> +	 * Allocations go close to CPUs, first.  Assume that
>>>> +	 * the migration path starts at the nodes with CPUs.
>>>> +	 */
>>>> +	next_pass = node_states[N_CPU];
>>>
>>> Is there a plan of allowing user to change where the migration
>>> path starts? Or maybe one step further providing an interface
>>> to allow user to specify the demotion path. Something like
>>> /sys/devices/system/node/node*/node_demotion.
>>
>> I don't think that's necessary at least for now.  Do you know any real
>> world use case for this?
>
> In our P9+volta system, GPU memory is exposed as a NUMA node.
> For the GPU workloads with data size greater than GPU memory size,
> it will be very helpful to allow pages in GPU memory to be migrated/demoted
> to CPU memory. With your current assumption, GPU memory -> CPU memory
> demotion seems not possible, right? This should also apply to any
> system with a device memory exposed as a NUMA node and workloads running
> on the device and using CPU memory as a lower tier memory than the device
> memory.

Thanks a lot for your use case!  It appears that the demotion path
specified by users is one possible way to satisfy your requirement.  And
I think it's possible to enable that on top of this patchset.  But we
still have no specific plan to work on that at least for now.
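
For reference, below is a rough, self-contained user-space sketch of the
pass-based construction the comment above describes: nodes with CPUs seed
the first pass, the targets picked in one pass become the sources of the
next, and any node already used is never picked as a target again, so no
cycle can form.  The names (MAX_NODES, demotion_target, has_cpu) and the
simplistic "first unused online node" target choice are made up for
illustration only; the real code works on nodemask_t and follows
zonelist/distance order instead.

#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES  8
#define NO_TARGET  (-1)

/* Example topology: nodes 0 and 1 have CPUs; nodes 0-3 are online. */
static bool has_cpu[MAX_NODES]   = { true, true };
static bool is_online[MAX_NODES] = { true, true, true, true };
static int  demotion_target[MAX_NODES];   /* node -> next demotion tier */

static void build_demotion_order(void)
{
	bool this_pass[MAX_NODES];
	bool next_pass[MAX_NODES];
	bool used[MAX_NODES] = { false };
	int node, target;

	/* "Disable" everything first, then seed pass 0 with CPU nodes. */
	for (node = 0; node < MAX_NODES; node++) {
		demotion_target[node] = NO_TARGET;
		next_pass[node] = has_cpu[node];
	}

	for (;;) {
		bool any_source = false;

		/* Last pass's targets become this pass's sources... */
		for (node = 0; node < MAX_NODES; node++) {
			this_pass[node] = next_pass[node];
			next_pass[node] = false;
			/* ...and may never be chosen as targets again. */
			if (this_pass[node]) {
				used[node] = true;
				any_source = true;
			}
		}
		if (!any_source)
			break;

		for (node = 0; node < MAX_NODES; node++) {
			if (!this_pass[node])
				continue;
			/* Simplified policy: first unused online node wins. */
			for (target = 0; target < MAX_NODES; target++) {
				if (is_online[target] && !used[target]) {
					demotion_target[node] = target;
					next_pass[target] = true;
					used[target] = true;
					break;
				}
			}
		}
	}
}

int main(void)
{
	build_demotion_order();

	for (int node = 0; node < MAX_NODES; node++)
		if (is_online[node])
			printf("node %d -> %d\n", node, demotion_target[node]);
	return 0;
}

On the example topology (nodes 0/1 with CPUs, nodes 2/3 memory-only) this
prints "node 0 -> 2" and "node 1 -> 3", while nodes 2 and 3 stay terminal,
so no demotion cycle is possible.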
Best Regards,
Huang, Ying