Re: [PATCH v4 1/5] hugetlb: add demote hugetlb page sysfs interfaces

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	David Hildenbrand <david@redhat.com>,
	Michal Hocko <mhocko@suse.com>, Zi Yan <ziy@nvidia.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	David Rientjes <rientjes@google.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>,
	Nghia Le <nghialm78@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v4 1/5] hugetlb: add demote hugetlb page sysfs interfaces
Date: Fri, 8 Oct 2021 13:24:28 -0700	[thread overview]
Message-ID: <47e53389-638a-1af1-e94f-b3c7e5e7459e@oracle.com> (raw)
In-Reply-To: <YV/4ZFCvoGRn2rtr@localhost.localdomain>

On 10/8/21 12:51 AM, Oscar Salvador wrote:
> On Thu, Oct 07, 2021 at 11:19:14AM -0700, Mike Kravetz wrote:
>> +static ssize_t demote_store(struct kobject *kobj,
>> +	       struct kobj_attribute *attr, const char *buf, size_t len)
>> +{
>> +	unsigned long nr_demote;
>> +	unsigned long nr_available;
>> +	nodemask_t nodes_allowed, *n_mask;
>> +	struct hstate *h;
>> +	int err = 0;
>> +	int nid;
>> +
>> +	err = kstrtoul(buf, 10, &nr_demote);
>> +	if (err)
>> +		return err;
>> +	h = kobj_to_hstate(kobj, &nid);
>> +
>> +	/* Synchronize with other sysfs operations modifying huge pages */
>> +	mutex_lock(&h->resize_lock);
>> +
>> +	if (nid != NUMA_NO_NODE) {
>> +		init_nodemask_of_node(&nodes_allowed, nid);
>> +		n_mask = &nodes_allowed;
>> +	} else {
>> +		n_mask = &node_states[N_MEMORY];
>> +	}
> 
> Why this needs to be protected by the resize_lock? I do not understand
> what are we really protecting here and from what.

In general, the resize_lock prevents unexpected consequences when
multiple users are modifying the number of pages in a pool concurrently
from the proc/sysfs interfaces.  The mutex is acquired here because we
are modifying (decreasing) the pool size.

When the mutex was added, Michal asked about the need.  Theoretically,
all code making pool adjustment should be safe because at a low level
the hugetlb_lock is taken when the hstate is modified.  However, I did
point out that there is a hstate variable 'next_nid_to_alloc' which is
used outside the lock which could result in pages being allocated from
the wrong node.  One could argue that if two (root) sysadmin users are
modifying the pool concurrently, they should not be surprised by such
consequences.  The mutex seemed like a small price to avoid any such
potential issues.  It is taken here to be consistent with this model.

Coincidentally, I was running some stress testing with this series last
night and noticed some unexpected behavior.   As a result, we will also
need to take the resize_mutex of the 'target_hstate'.  This is all in
patch 5 of the series.  I will add details there.
-- 
Mike Kravetz