From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wr0-f197.google.com (mail-wr0-f197.google.com [209.85.128.197]) by kanga.kvack.org (Postfix) with ESMTP id 21757831F4 for ; Thu, 4 May 2017 14:29:08 -0400 (EDT) Received: by mail-wr0-f197.google.com with SMTP id y43so2686340wrc.11 for ; Thu, 04 May 2017 11:29:08 -0700 (PDT) Received: from aserp1040.oracle.com (aserp1040.oracle.com. [141.146.126.69]) by mx.google.com with ESMTPS id r6si3273769wrb.94.2017.05.04.11.29.05 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 04 May 2017 11:29:06 -0700 (PDT) Subject: Re: [PATCH v3 4/4] mm: Adaptive hash table scaling From: Pasha Tatashin References: <1488432825-92126-1-git-send-email-pasha.tatashin@oracle.com> <1488432825-92126-5-git-send-email-pasha.tatashin@oracle.com> <20170303153247.f16a31c95404c02a8f3e2c5f@linux-foundation.org> <20170426201126.GA32407@dhcp22.suse.cz> <40f72efa-3928-b3c6-acca-0740f1a15ba4@oracle.com> Message-ID: <429c8506-c498-0599-4258-7bac947fe29c@oracle.com> Date: Thu, 4 May 2017 14:28:51 -0400 MIME-Version: 1.0 In-Reply-To: <40f72efa-3928-b3c6-acca-0740f1a15ba4@oracle.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Michal Hocko , Andrew Morton Cc: linux-mm@kvack.org, sparclinux@vger.kernel.org, linux-fsdevel@vger.kernel.org, Al Viro BTW, I am OK with your patch on top of this "Adaptive hash table" patch, but I do not know what high_limit should be from where HASH_ADAPT will kick in. 128M sound reasonable to you? Pasha On 05/04/2017 02:23 PM, Pasha Tatashin wrote: > Hi Michal, > > I do not really want to impose any hard limit, because I do not know > what it should be. > > The owners of the subsystems that use these large hash table should make > a call, and perhaps pass high_limit, if needed into > alloc_large_system_hash(). > > Previous growth rate was unacceptable, because in addition to allocating > large tables (which is acceptable if we take a total system memory > size), we also needed to zero that, and zeroing while we have only one > CPU available was significantly reducing the boot time. > > Now, on 32T the hash table is 1G instead of 32G, so the call is 32 times > faster to finish. While it is not a good idea to waste memory, both 1G > and 32G is insignificant amount of memory compared to the total amount > of such 32T systems (0.09% and 0.003% accordingly). > > Here is boot log on 32T system without this fix: > https://hastebin.com/muruzoveno.go > > [ 769.622359] Dentry cache hash table entries: 2147483648 (order: 21, > 17179869184 bytes) > [ 791.942136] Inode-cache hash table entries: 2147483648 (order: 21, > 17179869184 bytes) > [ 810.810745] Mount-cache hash table entries: 67108864 (order: 16, > 536870912 bytes) > [ 810.922322] Mountpoint-cache hash table entries: 67108864 (order: 16, > 536870912 bytes) > [ 812.125398] ftrace: allocating 20650 entries in 41 pages > > Total time 42.5s > > With this fix (and some other unrelated for this interval fixes): > https://hastebin.com/buxucurawa.go > > [ 12.621164] Dentry cache hash table entries: 134217728 (order: 17, > 1073741824 bytes) > [ 12.869462] Inode-cache hash table entries: 67108864 (order: 16, > 536870912 bytes) > [ 13.101963] Mount-cache hash table entries: 67108864 (order: 16, > 536870912 bytes) > [ 13.331988] Mountpoint-cache hash table entries: 67108864 (order: 16, > 536870912 bytes) > [ 13.364661] ftrace: allocating 20650 entries in 41 pages > > Total time 0.76s. > > So, it scales well for 32T systems, and will scale well for perceivable > future without adding a hard ceiling limit. > > Pasha > > On 04/26/2017 04:11 PM, Michal Hocko wrote: >> On Fri 03-03-17 15:32:47, Andrew Morton wrote: >>> On Thu, 2 Mar 2017 00:33:45 -0500 Pavel Tatashin >>> wrote: >>> >>>> Allow hash tables to scale with memory but at slower pace, when >>>> HASH_ADAPT >>>> is provided every time memory quadruples the sizes of hash tables >>>> will only >>>> double instead of quadrupling as well. This algorithm starts working >>>> only >>>> when memory size reaches a certain point, currently set to 64G. >>>> >>>> This is example of dentry hash table size, before and after four >>>> various >>>> memory configurations: >>>> >>>> MEMORY SCALE HASH_SIZE >>>> old new old new >>>> 8G 13 13 8M 8M >>>> 16G 13 13 16M 16M >>>> 32G 13 13 32M 32M >>>> 64G 13 13 64M 64M >>>> 128G 13 14 128M 64M >>>> 256G 13 14 256M 128M >>>> 512G 13 15 512M 128M >>>> 1024G 13 15 1024M 256M >>>> 2048G 13 16 2048M 256M >>>> 4096G 13 16 4096M 512M >>>> 8192G 13 17 8192M 512M >>>> 16384G 13 17 16384M 1024M >>>> 32768G 13 18 32768M 1024M >>>> 65536G 13 18 65536M 2048M >>> >>> OK, but what are the runtime effects? Presumably some workloads will >>> slow down a bit. How much? How do we know that this is a worthwhile >>> tradeoff? >>> >>> If the effect of this change is "undetectable" then those hash tables >>> are simply too large, and additional tuning is needed, yes? >> >> I am playing with a 3TB and have hit the following >> [ 0.961309] Dentry cache hash table entries: 536870912 (order: 20, >> 4294967296 bytes) >> [ 2.300012] vmalloc: allocation failure, allocated 1383612416 of >> 2147487744 bytes >> [ 2.307473] swapper/0: page allocation failure: order:0, >> mode:0x2080020(GFP_ATOMIC) >> [ 2.315101] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G >> W 4.4.49-hotplug19-default #1 >> [ 2.324017] Hardware name: Huawei 9008/IT91SMUB, BIOS BLXSV607 >> 04/17/2017 >> [ 2.330775] ffffffff8101aba5 ffffffff8130efa0 ffffffff81863f48 >> ffffffff81c03e40 >> [ 2.338201] ffffffff8118c9a2 02080020fff00300 ffffffff81863f48 >> ffffffff81c03de0 >> [ 2.345628] 0000000000000018 ffffffff81c03e50 ffffffff81c03df8 >> ffffffff811d28e6 >> [ 2.353056] Call Trace: >> [ 2.355507] [] dump_trace+0x59/0x310 >> [ 2.360710] [] show_stack_log_lvl+0xea/0x170 >> [ 2.366605] [] show_stack+0x21/0x40 >> [ 2.371723] [] dump_stack+0x5c/0x7c >> [ 2.376842] [] warn_alloc_failed+0xe2/0x150 >> [ 2.382655] [] __vmalloc_node_range+0x240/0x280 >> [ 2.388814] [] __vmalloc+0x47/0x50 >> [ 2.393851] [] alloc_large_system_hash+0x189/0x25d >> [ 2.400264] [] inode_init+0x74/0xa3 >> [ 2.405381] [] vfs_caches_init+0x59/0xe1 >> [ 2.410930] [] start_kernel+0x474/0x4d0 >> [ 2.416392] [] x86_64_start_kernel+0x147/0x156 >> >> Allocating 4G for a hash table is just ridiculous. 512MB which this >> patch should give looks much reasonable, although I would argue it is >> still a _lot_. >> I cannot say I would be really happy about the chosen approach, >> though. Why HASH_ADAPT is not implicit? Which hash table would need >> gigabytes of memory and still benefit from it? Even if there is such an >> example then it should use the explicit high_limit. I do not like this >> opt-in because it is just too easy to miss that and hit the same issue >> again. And in fact only few users of alloc_large_system_hash are using >> the flag. E.g. why {dcache,inode}_init_early do not have the flag? I >> am pretty sure that having a physically contiguous hash table would be >> better over vmalloc from the TLB point of view. >> >> mount_hashtable resp. mountpoint_hashtable are another example. Other >> users just have a reasonable max value. So can we do the following >> on top of your commit? I think that we should rethink the scaling as >> well but I do not have a good answer for the maximum size so let's just >> start with a more reasonable API first. >> --- >> diff --git a/fs/dcache.c b/fs/dcache.c >> index 808ea99062c2..363502faa328 100644 >> --- a/fs/dcache.c >> +++ b/fs/dcache.c >> @@ -3585,7 +3585,7 @@ static void __init dcache_init(void) >> sizeof(struct hlist_bl_head), >> dhash_entries, >> 13, >> - HASH_ZERO | HASH_ADAPT, >> + HASH_ZERO, >> &d_hash_shift, >> &d_hash_mask, >> 0, >> diff --git a/fs/inode.c b/fs/inode.c >> index a9caf53df446..b3c0731ec1fe 100644 >> --- a/fs/inode.c >> +++ b/fs/inode.c >> @@ -1950,7 +1950,7 @@ void __init inode_init(void) >> sizeof(struct hlist_head), >> ihash_entries, >> 14, >> - HASH_ZERO | HASH_ADAPT, >> + HASH_ZERO, >> &i_hash_shift, >> &i_hash_mask, >> 0, >> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h >> index dbaf312b3317..e223d91b6439 100644 >> --- a/include/linux/bootmem.h >> +++ b/include/linux/bootmem.h >> @@ -359,7 +359,6 @@ extern void *alloc_large_system_hash(const char >> *tablename, >> #define HASH_SMALL 0x00000002 /* sub-page allocation allowed, min >> * shift passed via *_hash_shift */ >> #define HASH_ZERO 0x00000004 /* Zero allocated hash table */ >> -#define HASH_ADAPT 0x00000008 /* Adaptive scale for large >> memory */ >> /* Only NUMA needs hash distribution. 64bit NUMA architectures have >> * sufficient vmalloc space. >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index fa752de84eef..3bf60669d200 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -7226,7 +7226,7 @@ void *__init alloc_large_system_hash(const char >> *tablename, >> if (PAGE_SHIFT < 20) >> numentries = round_up(numentries, (1<<20)/PAGE_SIZE); >> - if (flags & HASH_ADAPT) { >> + if (!high_limit) { >> unsigned long adapt; >> for (adapt = ADAPT_SCALE_NPAGES; adapt < numentries; >> > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org