From: nzimmer <nzimmer@sgi.com>
To: Mel Gorman <mgorman@suse.de>, Waiman Long <waiman.long@hp.com>
Cc: Linux-MM <linux-mm@kvack.org>, Robin Holt <holt@sgi.com>,
	Daniel Rahn <drahn@suse.com>, Davidlohr Bueso <dbueso@suse.com>,
	Dave Hansen <dave.hansen@intel.com>, Tom Vaden <tom.vaden@hp.com>,
	Scott Norton <scott.norton@hp.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 0/14] Parallel memory initialisation
Date: Wed, 15 Apr 2015 16:37:20 -0500	[thread overview]
Message-ID: <552EDA10.60604@sgi.com> (raw)
In-Reply-To: <20150415154415.GH14842@suse.de>


On 04/15/2015 10:44 AM, Mel Gorman wrote:
> On Wed, Apr 15, 2015 at 10:50:45AM -0400, Waiman Long wrote:
>> On 04/15/2015 09:38 AM, Mel Gorman wrote:
>>> On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
>>>>> <SNIP>
>>>>> Patches are against 4.0-rc7.
>>>>>
>>>>>   Documentation/kernel-parameters.txt |   8 +
>>>>>   arch/ia64/mm/numa.c                 |  19 +-
>>>>>   arch/x86/Kconfig                    |   2 +
>>>>>   include/linux/memblock.h            |  18 ++
>>>>>   include/linux/mm.h                  |   8 +-
>>>>>   include/linux/mmzone.h              |  37 +++-
>>>>>   init/main.c                         |   1 +
>>>>>   mm/Kconfig                          |  29 +++
>>>>>   mm/bootmem.c                        |   6 +-
>>>>>   mm/internal.h                       |  23 ++-
>>>>>   mm/memblock.c                       |  34 ++-
>>>>>   mm/mm_init.c                        |   9 +-
>>>>>   mm/nobootmem.c                      |   7 +-
>>>>>   mm/page_alloc.c                     | 398 +++++++++++++++++++++++++++++++-----
>>>>>   mm/vmscan.c                         |   6 +-
>>>>>   15 files changed, 507 insertions(+), 98 deletions(-)
>>>>>
>>>> I had included your patch with the 4.0 kernel and booted up a
>>>> 16-socket 12-TB machine. I measured the elapsed time from the elilo
>>>> prompt to the availability of ssh login. Without the patch, the
>>>> bootup time was 404s. It was reduced to 298s with the patch. So
>>>> there was about 100s reduction in bootup time (1/4 of the total).
>>>>
>>> Cool, thanks for testing. Would you be able to state if this is really
>>> important or not? Does booting 100s faster on a 12TB machine really
>>> matter? I can then add that justification to the changelog to avoid a
>>> conversation with Andrew that goes something like
>>>
>>> Andrew: Why are we doing this?
>>> Mel:    Because we can and apparently people might want it.
>>> Andrew: What's the maintenance cost of this?
>>> Mel:    Magic beans
>>>
>>> I prefer talking to Andrew when it's harder to predict what he'll say.
>> Booting 100s faster is certainly something that is nice to have.
>> Right now, more time is spent in the firmware POST portion of the
>> bootup process than in the OS boot.
> I'm not surprised. On two different 1TB machines, I've seen a post time
> of 2 minutes and one of 35. No idea what it's doing for 35 minutes....
> plotting world domination probably.
>
>> So I would say this patch isn't
>> really critical right now as machines with that much memory are
>> relatively rare. However, if we look forward to the near future,
>> some new memory technology like persistent memory is coming and
>> machines with large amount of memory (whether persistent or not)
>> will become more common. This patch will certainly be useful if we
>> look forward into the future.
>>
> Whether persistent memory needs struct pages or not is up in the air and
> I'm not getting stuck in that can of worms. 100 seconds off kernel init
> time is a starting point. I can try pushing it on that basis but I
> really would like to see SGI and Intel people also chime in on how it
> affects their really large machines.
>
I will get some numbers from this patch set but I haven't had the 
opportunity yet.  I will grab them this weekend for sure if I can't get 
machine time sooner.


>>>> However, there were 2 bootup problems in the dmesg log that needed
>>>> to be addressed.
>>>> 1. There were 2 vmalloc allocation failures:
>>>> [    2.284686] vmalloc: allocation failure, allocated 16578404352 of
>>>> 17179873280 bytes
>>>> [   10.399938] vmalloc: allocation failure, allocated 7970922496 of
>>>> 8589938688 bytes
>>>>
>>>> 2. There were 2 soft lockup warnings:
>>>> [   57.319453] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s!
>>>> [swapper/0:1]
>>>> [   85.409263] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s!
>>>> [swapper/0:1]
>>>>
>>>> Once those problems are fixed, the patch should be in a pretty good
>>>> shape. I have attached the dmesg log for your reference.
>>>>
>>> The obvious conclusion is that initialising 1G per node is not enough for
>>> really large machines. Can you try this on top? It's untested but should
>>> work. The low value was chosen because it happened to work and I wanted
>>> to get test coverage on common hardware but broke is broke.
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index f2c96d02662f..6b3bec304e35 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -276,9 +276,9 @@ static inline bool update_defer_init(pg_data_t *pgdat,
>>>   	if (pgdat->first_deferred_pfn != ULONG_MAX)
>>>   		return false;
>>>
>>> -	/* Initialise at least 1G per zone */
>>> +	/* Initialise at least 32G per node */
>>>   	(*nr_initialised)++;
>>> -	if (*nr_initialised > (1UL << (30 - PAGE_SHIFT)) &&
>>> +	if (*nr_initialised > (32UL << (30 - PAGE_SHIFT)) &&
>>>  	    (pfn & (PAGES_PER_SECTION - 1)) == 0) {
>>>   		pgdat->first_deferred_pfn = pfn;
>>>   		return false;
>> I will try this out when I can get hold of the 12-TB machine again.
>>
> Thanks.
>
>> The vmalloc allocation failures were for the following hash tables:
>> - Dentry cache hash table entries
>> - Inode-cache hash table entries
>>
>> Those hash tables scale linearly with the amount of memory available
>> in the system. So instead of hardcoding a certain value, why don't
>> we make it a certain % of the total memory but bottomed out to 1G at
>> the low end?
>>
> Because then it becomes what percentage is the right percentage and what
> happens if it's a percentage of total memory but the NUMA nodes are not
> all the same size? I want to start simple until there is more data on
> what these really large machines look like and if it ever fails in the
> field, there is the command-line switch until a patch is available.
>



