From: Gautham R Shenoy <ego@linux.vnet.ibm.com>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>,
	Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Michael Neuling <mikey@neuling.org>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	Ingo Molnar <mingo@kernel.org>,
	Oliver O'Halloran <oohall@gmail.com>,
	Jordan Niethe <jniethe5@gmail.com>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Valentin Schneider <valentin.schneider@arm.com>
Subject: Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
Date: Fri, 31 Jul 2020 13:06:18 +0530	[thread overview]
Message-ID: <20200731073618.GA28399@in.ibm.com> (raw)
In-Reply-To: <20200729061355.GA14603@linux.vnet.ibm.com>

Hi Srikar, Valentin,

On Wed, Jul 29, 2020 at 11:43:55AM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2020-07-28 16:03:11]:
>

[..snip..]

> At this time the current topology would be good enough, i.e. BIGCORE would
> always be equal to MC. However, in future we could have chips with fewer or
> more CPUs in the LLC than in a BIGCORE, or we could have granular or split
> L3 caches within a DIE. In such a case BIGCORE != MC.
> 
> Also, on the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than the latency to a distant
> core-pair, and cache latency within a core-pair is far better than the
> latency within a quad. So if we have only 4 threads running on a DIE, all
> of them accessing the same cache-lines, then we could probably benefit if
> all the tasks were to run within the quad, aka MC/Coregroup.
> 
> I have found some latency-sensitive benchmarks that benefit from grouping
> at the quad level (using kernel hacks, not backed by firmware changes).
> Gautham also found similar results in his experiments, but he only used
> binding within the stock kernel.
> 
> I am not setting SD_SHARE_PKG_RESOURCES in the MC/Coregroup sd_flags, since
> the MC domain need not be the LLC domain on Power.

I am observing that SD_SHARE_PKG_RESOURCES at the L2 level provides the best
results for POWER9 in terms of cache benefits during wakeup. To measure this,
I ran a producer-consumer test case
(https://github.com/gautshen/misc/blob/master/producer_consumer/producer_consumer.c)
on a POWER9 Boston machine.

The test case creates two threads, one Producer and one Consumer, which work
on a fairly large shared array of size 64M. In each iteration, the Producer
stores to 1024 random locations and wakes up the Consumer, and in its
iteration the Consumer loads from those exact 1024 locations.

We measure the number of Consumer iterations per second and the average time
for each Consumer iteration; the smaller the average time, the better.
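
(For reference, a bare-bones sketch of the idea is below. It is not the exact
test case from the URL above; the synchronization details and helper names
here are illustrative only.)

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE	(64UL * 1024 * 1024)	/* 64M shared array */
#define NR_LOCATIONS	1024			/* locations touched per iteration */
#define NR_ITERATIONS	5000

static char *shared_array;
static size_t locations[NR_LOCATIONS];

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int data_ready;

/* Producer: store to 1024 random locations, then wake up the Consumer. */
static void *producer(void *arg)
{
	(void)arg;
	for (int i = 0; i < NR_ITERATIONS; i++) {
		pthread_mutex_lock(&lock);
		while (data_ready)		/* wait until the Consumer is done */
			pthread_cond_wait(&cond, &lock);
		for (int j = 0; j < NR_LOCATIONS; j++) {
			locations[j] = (size_t)rand() % ARRAY_SIZE;
			shared_array[locations[j]] = (char)j;
		}
		data_ready = 1;
		pthread_cond_signal(&cond);	/* wake up the Consumer */
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

/* Consumer: load from the same 1024 locations and time each iteration. */
static void *consumer(void *arg)
{
	unsigned long long total_ns = 0;
	volatile char sink = 0;

	(void)arg;
	for (int i = 0; i < NR_ITERATIONS; i++) {
		struct timespec start, end;

		pthread_mutex_lock(&lock);
		while (!data_ready)		/* wait for the Producer's wakeup */
			pthread_cond_wait(&cond, &lock);

		clock_gettime(CLOCK_MONOTONIC, &start);
		for (int j = 0; j < NR_LOCATIONS; j++)
			sink += shared_array[locations[j]];
		clock_gettime(CLOCK_MONOTONIC, &end);

		data_ready = 0;
		pthread_cond_signal(&cond);	/* let the Producer run again */
		pthread_mutex_unlock(&lock);

		total_ns += (end.tv_sec - start.tv_sec) * 1000000000LL +
			    (end.tv_nsec - start.tv_nsec);
	}
	printf("%d iterations, avg time: %llu ns\n",
	       NR_ITERATIONS, total_ns / NR_ITERATIONS);
	return NULL;
}

int main(void)
{
	pthread_t prod, cons;

	shared_array = malloc(ARRAY_SIZE);
	if (!shared_array)
		return 1;

	pthread_create(&cons, NULL, consumer, NULL);
	pthread_create(&prod, NULL, producer, NULL);
	pthread_join(prod, NULL);
	pthread_join(cons, NULL);
	return 0;
}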

The following results were obtained by pinning the Producer and Consumer to
different combinations of CPUs to cover: the same small core, the same
big-core, a neighbouring big-core, a far-off core within the same chip, and
different chips. There is also a case where they are not affined anywhere,
and we let the scheduler wake them up where it sees fit.
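
(The pinning itself can be done with pthread_setaffinity_np(); a minimal
sketch, with the CPU number passed in by the caller:)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU (error handling trimmed). */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* e.g. the Producer calls pin_to_cpu(1), the Consumer pin_to_cpu(7). */
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}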

We find the best results when the Producer and Consumer are within the
same L2 domain. These numbers are also close to the numbers that we
get when we let the Scheduler wake them up (where LLC is L2).

## Same Small core (4 threads: Shares L1, L2, L3, Frequency Domain)

Consumer affined to  CPU 3
Producer affined to  CPU 1
    4698 iterations, avg time: 20034 ns
    4951 iterations, avg time: 20012 ns
    4957 iterations, avg time: 19971 ns
    4968 iterations, avg time: 19985 ns
    4970 iterations, avg time: 19977 ns


## Same Big Core (8 threads: Shares L2, L3, Frequency Domain)

Consumer affined to  CPU 7
Producer affined to  CPU 1
    4580 iterations, avg time: 19403 ns
    4851 iterations, avg time: 19373 ns
    4849 iterations, avg time: 19394 ns
    4856 iterations, avg time: 19394 ns
    4867 iterations, avg time: 19353 ns

## Neighbouring Big-core (Faster data-snooping from L2. Shares L3, Frequency Domain)

Producer affined to  CPU 1
Consumer affined to  CPU 11
    4270 iterations, avg time: 24158 ns
    4491 iterations, avg time: 24157 ns
    4500 iterations, avg time: 24148 ns
    4516 iterations, avg time: 24164 ns
    4518 iterations, avg time: 24165 ns


## Any other Big-core from Same Chip (Shares L3)

Producer affined to  CPU 1
Consumer affined to  CPU 87
    4176 iterations, avg time: 27953 ns
    4417 iterations, avg time: 27925 ns
    4415 iterations, avg time: 27934 ns
    4417 iterations, avg time: 27983 ns
    4430 iterations, avg time: 27958 ns


## Different Chips (No cache-sharing)

Consumer affined to  CPU 175
Producer affined to  CPU 1
    3277 iterations, avg time: 50786 ns
    3063 iterations, avg time: 50732 ns
    2831 iterations, avg time: 50737 ns
    2859 iterations, avg time: 50688 ns
    2849 iterations, avg time: 50722 ns

## Without affining them (Let the Scheduler wake them up appropriately)

Consumer affined to  CPUs 0-175
Producer affined to  CPUs 0-175
    4821 iterations, avg time: 19412 ns
    4863 iterations, avg time: 19435 ns
    4855 iterations, avg time: 19381 ns
    4811 iterations, avg time: 19458 ns
    4892 iterations, avg time: 19429 ns
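
So on this system, the wakeup benefit comes from the L2 (CACHE) domain, which
is the LLC here and already carries SD_SHARE_PKG_RESOURCES; leaving the flag
off MC/Coregroup looks reasonable. For illustration only, the topology table
would roughly take the shape below (a sketch with assumed helper names for
the mask and flags callbacks, not necessarily the exact code in the patch):

/*
 * Rough sketch: SD_SHARE_PKG_RESOURCES stays on the level whose mask is the
 * L2 domain, while the MC/Coregroup level above it carries no cache-sharing
 * flag. shared_cache_mask(), cpu_mc_mask() and smt_flags() are illustrative.
 */
static int shared_cache_flags(void)
{
	return SD_SHARE_PKG_RESOURCES;
}

static struct sched_domain_topology_level powerpc_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, smt_flags, SD_INIT_NAME(SMT) },
#endif
	{ shared_cache_mask, shared_cache_flags, SD_INIT_NAME(CACHE) },
	{ cpu_mc_mask, SD_INIT_NAME(MC) },
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};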


--
Thanks and Regards
gautham.


Thread overview: 32+ messages
2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 01/10] powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 02/10] powerpc/smp: Merge Power9 topology with Power topology Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 03/10] powerpc/smp: Move powerpc_topology above Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 04/10] powerpc/smp: Move topology fixups into a new function Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain Srikar Dronamraju
2020-07-30  5:55   ` Gautham R Shenoy
2020-07-31  7:45   ` Michael Ellerman
2020-07-31  9:29     ` Srikar Dronamraju
2020-07-31 12:22       ` Michael Ellerman
2020-07-27  5:32 ` [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup Srikar Dronamraju
2020-07-31  7:49   ` Michael Ellerman
2020-07-31  9:18     ` Srikar Dronamraju
2020-07-31 11:31       ` Michael Ellerman
2020-07-27  5:32 ` [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group Srikar Dronamraju
2020-07-31  7:52   ` Michael Ellerman
2020-07-31  9:49     ` Srikar Dronamraju
2020-07-31 12:14       ` Michael Ellerman
2020-07-27  5:32 ` [PATCH v4 09/10] Powerpc/smp: Create coregroup domain Srikar Dronamraju
2020-07-27 18:52   ` Gautham R Shenoy
2020-07-28 15:03   ` Valentin Schneider
2020-07-29  6:13     ` Srikar Dronamraju
2020-07-31  1:05       ` Valentin Schneider
2020-08-03  6:01         ` Srikar Dronamraju
2020-07-31  7:36       ` Gautham R Shenoy [this message]
2020-07-27  5:32 ` [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id Srikar Dronamraju
2020-07-31  8:02   ` Michael Ellerman
2020-07-31  9:58     ` Srikar Dronamraju
2020-07-31 11:29       ` Michael Ellerman
2020-07-30 17:22 ` [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
