linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Feng Tang <feng.tang@intel.com>, Waiman Long <longman@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
	kernel test robot <oliver.sang@intel.com>,
	John Hubbard <jhubbard@nvidia.com>, Jan Kara <jack@suse.cz>,
	Peter Xu <peterx@redhat.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	Christoph Hellwig <hch@lst.de>, Hugh Dickins <hughd@google.com>,
	Jann Horn <jannh@google.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	Michal Hocko <mhocko@suse.com>, Oleg Nesterov <oleg@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, kernel test robot <lkp@intel.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	zhengjun.xing@intel.com
Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
Date: Sun, 6 Jun 2021 12:20:46 -0700	[thread overview]
Message-ID: <CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com> (raw)
In-Reply-To: <20210606101623.GA48020@shbuild999.sh.intel.com>

[ Adding Waiman Long to the participants, because this seems to be a
very specific cacheline alignment behavior of rwsems, maybe Waiman has
some comments ]

On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <feng.tang@intel.com> wrote:
>
> * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
>   data structure change
>
>   - old kernel
>
>     - first cacheline
>         mmap_lock->count (75%)
>         mm->mapcount (14%)
>
>     - second cacheline
>         mmap_lock->owner (97%)
>
>   - new kernel
>
>     mainly in the cacheline of 'mmap_lock'
>
>     mmap_lock->count (~2%)
>     mmap_lock->owner (95%)

Oooh.

It looks like pretty much all the contention is on mmap_lock, and the
difference is that the old kernel just _happened_ to split the
mmap_lock rwsem at *exactly* the right place.

The rw_semaphore structure looks like this:

        struct rw_semaphore {
                atomic_long_t count;
                atomic_long_t owner;
                struct optimistic_spin_queue osq; /* spinner MCS lock */
                ...

and before the addition of the 'write_protect_seq' field, the mmap_sem
was at offset 120 in 'struct mm_struct'.

Which meant that count and owner were in two different cachelines, and
then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind of
layout you want.

Because first the rwsem_write_trylock() will do a cmpxchg on the first
cacheline (for the optimistic fast-path), and then in the case of
contention, rwsem_down_write_slowpath() will just access the second
cacheline.

Which is probably just optimal for a load that spends a lot of time
contended - new waiters touch that first cacheline, and then they
queue themselves up on the second cacheline. Waiman, does that sound
believable?

Anyway, I'm certainly ok with the patch that just moves
'write_protect_seq' down, it might be worth commenting about how this
is about some very special cache layout of the mmap_sem part of the
structure.

That said, this means that it all is very subtle dependent on a lot of
kernel config options, and I'm not sure how relevant the exact kernel
options are that the test robot has been using. But even if this is
just a "kernel test robot reports", I think it's an interesting case
and worth a comment for when this happens next time...

                Linus

  reply	other threads:[~2021-06-06 19:22 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-25  3:16 [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression kernel test robot
2021-05-25  3:11 ` Linus Torvalds
2021-06-04  7:04   ` Feng Tang
2021-06-04  7:52     ` Feng Tang
2021-06-04 17:57       ` Linus Torvalds
2021-06-06 10:16         ` Feng Tang
2021-06-06 19:20           ` Linus Torvalds [this message]
2021-06-06 22:13             ` Waiman Long
2021-06-07  6:05             ` Feng Tang
2021-06-08  0:03               ` Linus Torvalds
2021-06-04 17:58       ` John Hubbard
2021-06-06  4:47         ` Feng Tang
2021-06-04  8:37   ` [LKP] " Xing Zhengjun

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=whoqV=cX5VC80mmR9rr+Z+yQ6fiQZm36Fb-izsanHg23w@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=feng.tang@intel.com \
    --cc=hch@lst.de \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kirill@shutemov.name \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lkp@intel.com \
    --cc=lkp@lists.01.org \
    --cc=longman@redhat.com \
    --cc=mhocko@suse.com \
    --cc=oleg@redhat.com \
    --cc=oliver.sang@intel.com \
    --cc=peterx@redhat.com \
    --cc=ying.huang@intel.com \
    --cc=zhengjun.xing@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).