From: Daniel Jordan <daniel.m.jordan@oracle.com>
To: "Elliott, Robert (Persistent Memory)" <elliott@hpe.com>
Cc: daniel.m.jordan@oracle.com, linux-mm@kvack.org, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, aarcange@redhat.com, aaron.lu@intel.com,
    akpm@linux-foundation.org, alex.williamson@redhat.com, bsd@redhat.com,
    darrick.wong@oracle.com, dave.hansen@linux.intel.com, jgg@mellanox.com,
    jwadams@google.com, jiangshanlai@gmail.com, mhocko@kernel.org,
    mike.kravetz@oracle.com, Pavel.Tatashin@microsoft.com,
    prasad.singamsetty@oracle.com, rdunlap@infradead.org,
    steven.sistare@oracle.com, tim.c.chen@intel.com, tj@kernel.org,
    vbabka@suse.cz, peterz@infradead.org, dhaval.giani@oracle.com
Subject: Re: [RFC PATCH v4 11/13] mm: parallelize deferred struct page initialization within each node
Date: Tue, 27 Nov 2018 12:23:59 -0800
Message-ID: <20181127202359.biav42vbfchprmo5@ca-dmjordan1.us.oracle.com>
In-Reply-To: <AT5PR8401MB1169AA00F542BA2E3204FC24ABD00@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM>

On Tue, Nov 27, 2018 at 12:12:28AM +0000, Elliott, Robert (Persistent Memory) wrote:
> I ran a short test with:
> * HPE ProLiant DL360 Gen9 system
> * Intel Xeon E5-2699 CPU with 18 physical cores (0-17) and
>   18 hyperthreaded cores (36-53)
> * DDR4 NVDIMM-Ns (which run at regular DRAM DIMM speeds)
> * fio workload generator
> * cores on one CPU socket talking to a pmem device on the same CPU
> * large (1 MiB) random writes (to minimize the threads getting CPU
>   cache hits from each other)
>
> Results:
> * 31.7 GB/s  four threads, four physical cores (0,1,2,3)
> * 22.2 GB/s  four threads, two physical cores (0,1,36,37)
> * 21.4 GB/s  two threads, two physical cores (0,1)
> * 12.1 GB/s  two threads, one physical core (0,36)
> * 11.2 GB/s  one thread, one physical core (0)
>
> So, I think it's important that the initialization threads run on
> separate physical cores.

Thanks for running this.  And fair enough: in this test, using both
siblings of a core gives only a 4-8% speedup over one, so it makes sense
to count only physical cores in the calculation.

As for how to actually do this, some arches have smp_num_siblings, but
there should be a generic interface to provide that.  It's also possible
to calculate this from the existing topology_sibling_cpumask, but the
first option is better IMHO.  Open to suggestions.  (A rough sketch of
the cpumask approach is below.)

> For the number of cores to use, one approach is:
>     memory bandwidth (number of interleaved channels * speed)
>   divided by
>     CPU core max sustained write bandwidth
>
> For example, this 2133 MT/s system is roughly:
>     68 GB/s (4 * 17 GB/s nominal)
>   divided by
>     11.2 GB/s (one core's performance)
>   which is
>     6 cores
>
> ACPI HMAT will report that 68 GB/s number.  I'm not sure of
> a good way to discover the 11.2 GB/s number.
Yes, this would be nice to do if we could know the per-core number.  One
caveat is that a single number like this is really only valid for the
CPU-memory pair it was measured on, but the kernel could at least
calculate it for jobs operating on local memory.  Some BogoMIPS-like
calibration might work for the per-core side, but I'll wait for ACPI
HMAT support to land in the kernel.  (Given both numbers, the
thread-count calculation itself is trivial; a sketch of that follows
too.)
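To make that concrete, the calculation would be just the division Robert
describes, clamped to the node's core count.  This is only a sketch: the
helper name and both bandwidth arguments are hypothetical, since neither
the HMAT-reported number nor a per-core calibration exists in the kernel
today.

#include <linux/kernel.h>

/*
 * Hypothetical helper: node_mem_bw would come from ACPI HMAT and
 * core_wr_bw from some calibration loop, both in MB/s.  Returns how
 * many page-init threads to run on the node.
 */
static int deferred_init_nthreads(unsigned int node_mem_bw,
                                  unsigned int core_wr_bw, int node_cores)
{
        int n = node_mem_bw / core_wr_bw;

        /* at least one thread, and no benefit past the core count */
        return clamp(n, 1, node_cores);
}

With Robert's numbers, 68000 / 11200 gives 6 threads, matching his
estimate.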
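And here's roughly what I meant by deriving the core count from
topology_sibling_cpumask -- untested, and it ignores CPU hotplug, so
treat it as a sketch rather than something from the series:

#include <linux/cpumask.h>
#include <linux/topology.h>

/*
 * Count physical cores, not SMT siblings, among a node's CPUs by
 * counting only the first CPU of each sibling group.
 */
static int node_physical_cores(int node)
{
        int cpu, cores = 0;

        for_each_cpu(cpu, cpumask_of_node(node))
                if (cpu == cpumask_first(topology_sibling_cpumask(cpu)))
                        cores++;

        return cores;
}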
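Finally, for anyone wanting to reproduce the measurement, a fio job
along these lines should be close.  Only the 1 MiB random writes and the
CPU pinning come from Robert's description; the ioengine, path, size,
and runtime are my guesses:

[pmem-randwrite]
ioengine=libpmem
filename=/mnt/pmem0/fio.dat
size=4g
rw=randwrite
bs=1m
numjobs=4
cpus_allowed=0-3
cpus_allowed_policy=split
time_based=1
runtime=30
group_reporting=1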