On 3/6/20 3:12 PM, Mike Kravetz wrote:
> On 3/5/20 10:36 PM, Longpeng (Mike) wrote:
>> On 2020/3/6 8:09, Mike Kravetz wrote:
>>> On 3/4/20 7:30 PM, Longpeng(Mike) wrote:
>>>> From: Longpeng
>>>
>>> I am thinking we may want to have a more generic solution by allowing
>>> the default_hugepagesz= processing code to verify the passed size and
>>> set up the corresponding hstate.  This would require more cooperation
>>> between architecture specific and independent code.  This could be
>>> accomplished with a simple arch_hugetlb_valid_size() routine provided
>>> by the architectures.  Below is an untested patch to add such support
>>> to the architecture independent code and x86.  Other architectures
>>> would be similar.
>>>
>>> In addition, with architectures providing arch_hugetlb_valid_size() it
>>> should be possible to have a common routine in architecture independent
>>> code to read/process hugepagesz= command line arguments.
>>>
>> I just wanted to make the minimal changes needed to address this issue,
>> which is why I chose the approach in my patch.
>>
>> To be honest, the approach you suggested above is much better, though it
>> needs more changes.
>>
>>> Of course, another approach would be to simply require ALL architectures
>>> to set up hstates for ALL supported huge page sizes.
>>>
>> I think this is also needed; then we can request all supported sizes of
>> hugepages through sysfs (e.g. /sys/kernel/mm/hugepages/*) dynamically.
>> Currently (on x86) we can only request 1G hugepages through sysfs if we
>> boot with 'default_hugepagesz=1G', even with the first approach.
>
> I 'think' you can use sysfs for 1G huge pages on x86 today.  Just booted
> a system without any hugepage options on the command line.
>
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 0
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/^Cugepages
> # echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> 1
> # cat /sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
> 1
>
> x86 and riscv will set up an hstate for PUD_SIZE by default if
> CONFIG_CONTIG_ALLOC is enabled.  This is because of a somewhat recent
> feature that allowed dynamic allocation of gigantic (page order >=
> MAX_ORDER) pages.  Before that feature, it made no sense to set up an
> hstate for gigantic pages if they were not allocated at boot time and
> could not be dynamically added later.
>
> I'll code up a proposal that does the following:
> - Have arch specific code provide a list of supported huge page sizes
> - Arch independent code uses list to create all hstates
> - Move processing of "hugepagesz=" to arch independent code
> - Validate "default_hugepagesz=" when value is read from command line
>
> It may take a few days.  When ready, I will pull in the architecture
> specific people.

Hi Mike,

On platforms that support multiple huge page sizes, when 'hugepagesz=' is
not specified before 'hugepages=', hugepages are not allocated (for
example, when requesting 1GB hugepages).  In terms of reporting, meminfo
and /sys/kernel/../nr_hugepages report the expected results, but sysctl
vm.nr_hugepages reports a non-zero value because it reads max_huge_pages
from the default hstate instead of nr_huge_pages.  AFAIK nr_huge_pages is
the counter that indicates the number of huge pages that were successfully
allocated.

Is vm.nr_hugepages expected to report the maximum number of hugepages?  If
so, would it not make sense to rename the procname?  However, if we expect
nr_hugepages to report the number of successfully allocated hugepages,
then we should use nr_huge_pages in hugetlb_sysctl_handler_common().
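Something along these lines is what I have in mind -- just a rough sketch,
paraphrased from mm/hugetlb.c rather than a tested patch, so the
surrounding code may not match the current tree exactly.  Only the
initialization of 'tmp' changes; writes would still go through
__nr_hugepages_store_common() as before:

static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
			 struct ctl_table *table, int write,
			 void __user *buffer, size_t *length, loff_t *ppos)
{
	struct hstate *h = &default_hstate;
	unsigned long tmp;
	int ret;

	if (!hugepages_supported())
		return -EOPNOTSUPP;

	/*
	 * Report the number of huge pages actually allocated
	 * (nr_huge_pages) instead of the requested maximum
	 * (max_huge_pages).
	 */
	tmp = h->nr_huge_pages;		/* was: h->max_huge_pages */

	table->data = &tmp;
	table->maxlen = sizeof(unsigned long);
	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
	if (ret)
		goto out;

	if (write)
		ret = __nr_hugepages_store_common(obey_mempolicy, h,
						  NUMA_NO_NODE, tmp, *length);
out:
	return ret;
}

With that change, 'sysctl vm.nr_hugepages' would report 0 in the scenario
above, matching what the corresponding sysfs nr_hugepages file shows,
while a write would behave exactly as it does today.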
>
>> BTW, because it's not easy to discuss with you due to the time
>> difference, I have another question about the default hugepages to
>> consult you on here.  Why does /proc/meminfo only show the info about
>> the default hugepages, but not the others?  meminfo is better known than
>> sysfs; some ordinary users know meminfo but don't know how to use sysfs
>> to get the hugepage status (e.g. total, free).
>
> I believe that is simply history.  In the beginning there was only the
> default huge page size and that was added to meminfo.  People then wrote
> scripts to parse huge page information in meminfo.  When support for
> other huge pages was added, it was not added to meminfo as it could break
> user scripts parsing the file.  Adding information for all potential
> huge page sizes may create lots of entries that are unused.  I was not
> around when these decisions were made, but that is my understanding.
>
> BTW - A recently added meminfo field 'Hugetlb' displays the amount of
> memory consumed by huge pages of ALL sizes.

-- 
Nitesh