All of lore.kernel.org
 help / color / mirror / Atom feed
From: Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
	Huang Ying <ying.huang@intel.com>,
	Greg Thelen <gthelen@google.com>, Yang Shi <shy828301@gmail.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Tim C Chen <tim.c.chen@intel.com>,
	Brice Goglin <brice.goglin@gmail.com>,
	Michal Hocko <mhocko@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Hesham Almatary <hesham.almatary@huawei.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Alistair Popple <apopple@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Feng Tang <feng.tang@intel.com>,
	Jagdish Gediya <jvgediya@linux.ibm.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	David Rientjes <rientjes@google.com>
Subject: Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
Date: Mon, 6 Jun 2022 21:31:16 +0530	[thread overview]
Message-ID: <3a557f74-cc3a-c0ee-78e8-2cf50bee5f2d@linux.ibm.com> (raw)
In-Reply-To: <20220606155920.00004ce9@Huawei.com>

On 6/6/22 8:29 PM, Jonathan Cameron wrote:
> On Fri, 3 Jun 2022 14:10:47 +0530
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
> 
>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
>>> On Fri, 27 May 2022 17:55:23 +0530
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>>>    
>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>
>>>> Add support to read/write the memory tierindex for a NUMA node.
>>>>
>>>> /sys/devices/system/node/nodeN/memtier
>>>>
>>>> where N = node id
>>>>
>>>> When read, It list the memory tier that the node belongs to.
>>>>
>>>> When written, the kernel moves the node into the specified
>>>> memory tier, the tier assignment of all other nodes are not
>>>> affected.
>>>>
>>>> If the memory tier does not exist, writing to the above file
>>>> create the tier and assign the NUMA node to that tier.
>>> creates
>>>
>>> There was some discussion in v2 of Wei Xu's RFC that what matter
>>> for creation is the rank, not the tier number.
>>>
>>> My suggestion is move to an explicit creation file such as
>>> memtier/create_tier_from_rank
>>> to which writing the rank gives results in a new tier
>>> with the next device ID and requested rank.
>>
>> I think the below workflow is much simpler.
>>
>> :/sys/devices/system# cat memtier/memtier1/nodelist
>> 1-3
>> :/sys/devices/system# cat node/node1/memtier
>> 1
>> :/sys/devices/system# ls memtier/memtier*
>> nodelist  power  rank  subsystem  uevent
>> /sys/devices/system# ls memtier/
>> default_rank  max_tier  memtier1  power  uevent
>> :/sys/devices/system# echo 2 > node/node1/memtier
>> :/sys/devices/system#
>>
>> :/sys/devices/system# ls memtier/
>> default_rank  max_tier  memtier1  memtier2  power  uevent
>> :/sys/devices/system# cat memtier/memtier1/nodelist
>> 2-3
>> :/sys/devices/system# cat memtier/memtier2/nodelist
>> 1
>> :/sys/devices/system#
>>
>> ie, to create a tier we just write the tier id/tier index to
>> node/nodeN/memtier file. That will create a new memory tier if needed
>> and add the node to that specific memory tier. Since for now we are
>> having 1:1 mapping between tier index to rank value, we can derive the
>> rank value from the memory tier index.
>>
>> For dynamic memory tier support, we can assign a rank value such that
>> new memory tiers are always created such that it comes last in the
>> demotion order.
> 
> I'm not keen on having to pass through an intermediate state where
> the rank may well be wrong, but I guess it's not that harmful even
> if it feels wrong ;)
> 

Any new memory tier added can be of lowest rank (rank - 0) and hence 
will appear as the highest memory tier in demotion order. User can then
assign the right rank value to the memory tier? Also the actual demotion 
target paths are built during memory block online which in most case 
would happen after we properly verify that the device got assigned to 
the right memory tier with correct rank value?

> Races are potentially a bit of a pain though depending on what we
> expect the usage model to be.
> 
> There are patterns (CXL regions for example) of guaranteeing the
> 'right' device is created by doing something like
> 
> cat create_tier > temp.txt
> #(temp gets 2 for example on first call then
> # next read of this file gets 3 etc)
> 
> cat temp.txt > create_tier
> # will fail if there hasn't been a read of the same value
> 
> Assuming all software keeps to the model, then there are no
> race conditions over creation.  Otherwise we have two new
> devices turn up very close to each other and userspace scripting
> tries to create two new tiers - if it races they may end up in
> the same tier when that wasn't the intent.  Then code to set
> the rank also races and we get two potentially very different
> memories in a tier with a randomly selected rank.
> 
> Fun and games...  And a fine illustration why sysfs based 'device'
> creation is tricky to get right (and lots of cases in the kernel
> don't).
> 

I would expect userspace to be careful and verify the memory tier and 
rank value before we online the memory blocks backed by the device. Even 
if we race, the result would be two device not intended to be part of 
the same memory tier appearing at the same tier. But then we won't be 
building demotion targets yet. So userspace could verify this, move the 
nodes out of the memory tier. Once it is verified, memory blocks can be 
onlined.

Having said that can you outline the usage of 
memtier/create_tier_from_rank ?

-aneesh

  reply	other threads:[~2022-06-06 16:01 UTC|newest]

Thread overview: 74+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-05-26 21:22 RFC: Memory Tiering Kernel Interfaces (v3) Wei Xu
2022-05-27  2:58 ` Ying Huang
2022-05-27 14:05   ` Hesham Almatary
2022-05-27 16:25     ` Wei Xu
2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
2022-05-27 12:25   ` [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
2022-05-27 13:59     ` Jonathan Cameron
2022-06-02  6:07     ` Ying Huang
2022-06-06  2:49       ` Ying Huang
2022-06-06  3:56         ` Aneesh Kumar K V
2022-06-06  5:33           ` Ying Huang
2022-06-06  6:01             ` Aneesh Kumar K V
2022-06-06  6:27               ` Aneesh Kumar K.V
2022-06-06  7:53                 ` Ying Huang
2022-06-06  8:01                   ` Aneesh Kumar K V
2022-06-06  8:52                     ` Ying Huang
2022-06-06  9:02                       ` Aneesh Kumar K V
2022-06-08  1:24                         ` Ying Huang
2022-06-08  7:16     ` Ying Huang
2022-06-08  8:24       ` Aneesh Kumar K V
2022-06-08  8:27         ` Ying Huang
2022-05-27 12:25   ` [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
2022-05-27 14:15     ` Jonathan Cameron
2022-06-03  8:40       ` Aneesh Kumar K V
2022-06-06 14:59         ` Jonathan Cameron
2022-06-06 16:01           ` Aneesh Kumar K V [this message]
2022-06-06 16:16             ` Jonathan Cameron
2022-06-06 16:39               ` Aneesh Kumar K V
2022-06-06 17:46                 ` Aneesh Kumar K.V
2022-06-07 14:32                   ` Jonathan Cameron
2022-05-28  1:33     ` kernel test robot
2022-06-08  7:18     ` Ying Huang
2022-06-08  8:25       ` Aneesh Kumar K V
2022-06-08  8:29         ` Ying Huang
2022-05-27 12:25   ` [RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
2022-05-27 14:31     ` Jonathan Cameron
2022-05-30  3:35     ` [mm/demotion] 8ebccd60c2: BUG:sleeping_function_called_from_invalid_context_at_mm/compaction.c kernel test robot
2022-05-30  3:35       ` kernel test robot
2022-05-27 12:25   ` [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
2022-06-01  6:29     ` Bharata B Rao
2022-06-01 13:49       ` Aneesh Kumar K V
2022-06-02  6:36         ` Bharata B Rao
2022-06-03  9:04           ` Aneesh Kumar K V
2022-06-06 10:11             ` Bharata B Rao
2022-06-06 10:16               ` Aneesh Kumar K V
2022-06-06 11:54                 ` Aneesh Kumar K.V
2022-06-06 12:09                   ` Bharata B Rao
2022-06-06 13:00                     ` Aneesh Kumar K V
2022-05-27 12:25   ` [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier Aneesh Kumar K.V
2022-05-27 14:45     ` Jonathan Cameron
2022-05-27 15:45       ` Aneesh Kumar K V
2022-05-30 12:36         ` Jonathan Cameron
2022-06-02  6:41     ` Ying Huang
2022-05-27 12:25   ` [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
2022-06-02  6:43     ` Ying Huang
2022-05-27 12:25   ` [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
2022-05-27 15:03     ` Jonathan Cameron
2022-06-02  7:35     ` Ying Huang
2022-06-03 15:09       ` Aneesh Kumar K V
2022-06-06  0:43         ` Ying Huang
2022-06-06  4:07           ` Aneesh Kumar K V
2022-06-06  5:26             ` Ying Huang
2022-06-06  6:21               ` Aneesh Kumar K.V
2022-06-06  7:42                 ` Ying Huang
2022-06-06  8:02                   ` Aneesh Kumar K V
2022-06-06  8:06                     ` Ying Huang
2022-06-06 17:07               ` Yang Shi
2022-05-27 13:40 ` RFC: Memory Tiering Kernel Interfaces (v3) Aneesh Kumar K V
2022-05-27 16:30   ` Wei Xu
2022-05-29  4:31     ` Ying Huang
2022-05-30 12:50       ` Jonathan Cameron
2022-05-31  1:57         ` Ying Huang
2022-06-07 19:25         ` Tim Chen
2022-06-08  4:41           ` Aneesh Kumar K V

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3a557f74-cc3a-c0ee-78e8-2cf50bee5f2d@linux.ibm.com \
    --to=aneesh.kumar@linux.ibm.com \
    --cc=Jonathan.Cameron@Huawei.com \
    --cc=akpm@linux-foundation.org \
    --cc=apopple@nvidia.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=brice.goglin@gmail.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@intel.com \
    --cc=dave@stgolabs.net \
    --cc=feng.tang@intel.com \
    --cc=gthelen@google.com \
    --cc=hesham.almatary@huawei.com \
    --cc=jvgediya@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=rientjes@google.com \
    --cc=shy828301@gmail.com \
    --cc=tim.c.chen@intel.com \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.