linux-kernel.vger.kernel.org archive mirror
From: Barry Song <21cnbao@gmail.com>
To: Waiman Long <llong@redhat.com>
Cc: alex.kogan@oracle.com, Arnd Bergmann <arnd@arndb.de>,
	Borislav Petkov <bp@alien8.de>,
	daniel.m.jordan@oracle.com, dave.dice@oracle.com,
	guohanjun@huawei.com, "H. Peter Anvin" <hpa@zytor.com>,
	jglauber@marvell.com, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	LKML <linux-kernel@vger.kernel.org>,
	linux@armlinux.org.uk, Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	steven.sistare@oracle.com, Thomas Gleixner <tglx@linutronix.de>,
	Will Deacon <will.deacon@arm.com>,
	x86@kernel.org
Subject: Re: [PATCH v15 0/6] Add NUMA-awareness to qspinlock
Date: Fri, 1 Oct 2021 11:57:49 +1300	[thread overview]
Message-ID: <CAGsJ_4wtyLOSwYH0n5vbJ3YFyXcxyVstXxn7q=nr=bPuX5oNaQ@mail.gmail.com> (raw)
In-Reply-To: <a6340beb-3b4a-2518-9340-ea0fc7583dbe@redhat.com>

On Fri, Oct 1, 2021 at 5:58 AM Waiman Long <llong@redhat.com> wrote:
>
> On 9/30/21 5:44 AM, Barry Song wrote:
> >> We have done some performance evaluation with the locktorture module
> >> as well as with several benchmarks from the will-it-scale repo.
> >> The following locktorture results are from an Oracle X5-4 server
> >> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> >> cores each). Each number represents an average (over 25 runs) of the
> >> total number of ops (x10^7) reported at the end of each run. The
> >> standard deviation is also reported in (), and in general is about 3%
> >> from the average. The 'stock' kernel is v5.12.0,
> > I assume the X5-4 server has a crossbar topology, that its NUMA
> > diameter is 1 hop, and that all tests were done on this kind of
> > symmetrical topology. Am I right?
> >
> >      ┌─┐                 ┌─┐
> >      │ ├─────────────────┤ │
> >      └─┤1               1└┬┘
> >        │  1           1   │
> >        │    1       1     │
> >        │      1   1       │
> >        │        1         │
> >        │      1   1       │
> >        │     1      1     │
> >        │   1         1    │
> >       ┌┼┐1             1  ├─┐
> >       │┼┼─────────────────┤ │
> >       └─┘                 └─┘
> >
> >
> > What if the hardware uses a ring topology, or another topology with
> > 2 hops or even 3 hops, such as:
> >
> >       ┌─┐                 ┌─┐
> >       │ ├─────────────────┤ │
> >       └─┤                 └┬┘
> >         │                  │
> >         │                  │
> >         │                  │
> >         │                  │
> >         │                  │
> >         │                  │
> >         │                  │
> >        ┌┤                  ├─┐
> >        │┼┬─────────────────┤ │
> >        └─┘                 └─┘
> >
> >
> > or:
> >
> >
> >      ┌───┐       ┌───┐      ┌────┐      ┌─────┐
> >      │   │       │   │      │    │      │     │
> >      │   │       │   │      │    │      │     │
> >      ├───┼───────┼───┼──────┼────┼──────┼─────┤
> >      │   │       │   │      │    │      │     │
> >      └───┘       └───┘      └────┘      └─────┘
> >
> > Do we need to consider the distances between NUMA nodes in the
> > secondary queue? Does it still make sense to treat everyone else as
> > equal in the secondary queue?
>
> The purpose of this patch series is to minimize cacheline transfers from
> one NUMA node to another. Taking the fine-grained details of the NUMA
> topology into account would complicate the code without much performance
> benefit, from my point of view. Let's keep it simple first. We can always
> improve it later on if someone can show a real benefit of doing so.

For sure, I am not expecting the complex NUMA topology to be taken into
account at this point. I am just curious how things would differ if the
topology isn't a crossbar with 1-hop distances only.
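
To make the queues I keep referring to concrete, here is my rough
user-space model of the hand-off (only a sketch of my understanding of
the idea, not the actual kernel code, and all the names below are made
up): the lock holder scans the main queue for the first waiter on its
own NUMA node and moves the waiters it skips over onto the secondary
queue.

#include <stdio.h>

struct waiter {
	int cpu;
	int node;		/* NUMA node this waiter runs on */
	struct waiter *next;
};

/*
 * Return the first waiter on @node, starting from *head.  Waiters that
 * are skipped over are pushed onto the @secondary list (prepended here
 * for brevity; the real series keeps them in order).
 */
static struct waiter *pick_same_node(struct waiter **head, int node,
				     struct waiter **secondary)
{
	struct waiter *w = *head;

	while (w) {
		struct waiter *next = w->next;

		if (w->node == node) {
			*head = next;	/* remainder stays in the main queue */
			w->next = NULL;
			return w;
		}
		/* Different node: move this waiter to the secondary queue. */
		w->next = *secondary;
		*secondary = w;
		w = next;
	}
	*head = NULL;
	return NULL;
}

int main(void)
{
	/* Arrival order: cpu 17 (node 1), cpu 3 (node 0), cpu 33 (node 1). */
	struct waiter c = { .cpu = 33, .node = 1, .next = NULL };
	struct waiter b = { .cpu = 3,  .node = 0, .next = &c };
	struct waiter a = { .cpu = 17, .node = 1, .next = &b };
	struct waiter *main_q = &a, *secondary_q = NULL;

	/* The current holder runs on node 0 and hands the lock over. */
	struct waiter *next = pick_same_node(&main_q, 0, &secondary_q);

	printf("lock passed to cpu %d\n", next ? next->cpu : -1);
	for (struct waiter *w = main_q; w; w = w->next)
		printf("still in main queue:  cpu %d (node %d)\n", w->cpu, w->node);
	for (struct waiter *w = secondary_q; w; w = w->next)
		printf("in secondary queue:   cpu %d (node %d)\n", w->cpu, w->node);
	return 0;
}

My questions below are about what happens when that scan finds nothing
on the local node, and about whether "same node or not" is too coarse a
distinction on some machines.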

When the main queue is empty, the distance to the NUMA node the spinlock
will jump to will affect performance, but I am not quite sure by how much.
Just like a disk, bouncing back and forth between far-apart cylinders and
sectors can waste a lot of time.

On the other hand, some NUMA nodes might be very close to each other while
others are very far apart. For example, if one socket contains several dies
and the machine has several sockets, the cacheline coherence overhead
between the NUMA nodes of dies within one socket might be much lower than
that between NUMA nodes in different sockets. I assume maintaining the
main/secondary queues has some overhead, especially when the system has
many cores and multiple NUMA nodes; in that case, making neighboring NUMA
nodes share one main queue might win.
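
If anyone ever wants to experiment with that, the grouping itself looks
cheap. A rough user-space sketch (again not kernel code; the distance
table is a made-up SLIT-style example, and the kernel would read the real
values via node_distance()) of collapsing "near" nodes into one queue
group:

#include <stdio.h>

#define NR_NODES	4
#define NEAR_THRESHOLD	16	/* arbitrary cut-off for "same group" */

/* 10 = local, 12 = other die on the same socket, 32 = remote socket. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 12, 32, 32 },
	{ 12, 10, 32, 32 },
	{ 32, 32, 10, 12 },
	{ 32, 32, 12, 10 },
};

int main(void)
{
	int group[NR_NODES];
	int nr_groups = 0;

	for (int n = 0; n < NR_NODES; n++) {
		group[n] = -1;
		/* Join the group of the first earlier node that is "near". */
		for (int m = 0; m < n; m++) {
			if (distance[n][m] <= NEAR_THRESHOLD) {
				group[n] = group[m];
				break;
			}
		}
		if (group[n] < 0)
			group[n] = nr_groups++;
		printf("node %d -> queue group %d\n", n, group[n]);
	}
	return 0;
}

With the table above this prints two groups, one per socket, so the
"is this waiter on my node?" test would become "is it in my group?".
Whether that actually wins anything is exactly the benchmarking question.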

Anyway, we need a lot of benchmarking on this before we can really do
anything about it. For the moment, ignoring the complicated topology is
the better way to start.

>
> Cheers,
> Longman

Thanks
barry

Thread overview: 28+ messages
2021-05-14 20:07 [PATCH v15 0/6] Add NUMA-awareness to qspinlock Alex Kogan
2021-05-14 20:07 ` [PATCH v15 1/6] locking/qspinlock: Rename mcs lock/unlock macros and make them more generic Alex Kogan
2021-05-14 20:07 ` [PATCH v15 2/6] locking/qspinlock: Refactor the qspinlock slow path Alex Kogan
2021-05-14 20:07 ` [PATCH v15 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock Alex Kogan
2021-09-22 19:25   ` Davidlohr Bueso
2021-09-22 19:52     ` Waiman Long
2023-08-04  1:49     ` Guo Ren
2021-09-30 10:05   ` Barry Song
2023-08-02 23:14   ` Guo Ren
2023-08-03  8:50     ` Peter Zijlstra
2023-08-03 10:28       ` Guo Ren
2023-08-03 11:56         ` Peter Zijlstra
2023-08-04  1:33           ` Guo Ren
2023-08-04  1:38             ` Guo Ren
2023-08-04  8:25             ` Peter Zijlstra
2023-08-04 14:17               ` Guo Ren
2023-08-04 18:23                 ` Peter Zijlstra
2023-08-05  0:19                   ` Guo Ren
2021-05-14 20:07 ` [PATCH v15 4/6] locking/qspinlock: Introduce starvation avoidance into CNA Alex Kogan
2021-05-14 20:07 ` [PATCH v15 5/6] locking/qspinlock: Avoid moving certain threads between waiting queues in CNA Alex Kogan
2021-05-14 20:07 ` [PATCH v15 6/6] locking/qspinlock: Introduce the shuffle reduction optimization into CNA Alex Kogan
2021-09-30  9:44 ` [PATCH v15 0/6] Add NUMA-awareness to qspinlock Barry Song
2021-09-30 16:58   ` Waiman Long
2021-09-30 22:57     ` Barry Song [this message]
2021-09-30 23:51   ` Alex Kogan
2021-12-13 20:37 ` Alex Kogan
2021-12-15 15:13   ` Alex Kogan
2022-04-11 17:09 ` Alex Kogan
