From: Arne Welzel <arne.welzel@corelight.com>
To: dm-devel@redhat.com, dm-crypt@saout.de
Cc: Arne Welzel <arne.welzel@corelight.com>,
DJ Gregor <dj@corelight.com>,
mpatocka@redhat.com, agk@redhat.com, snitzer@redhat.com
Subject: [dm-devel] [PATCH] dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
Date: Sun, 8 Aug 2021 15:42:05 +0200
Message-ID: <20210808134205.1981531-1-arne.welzel@corelight.com>

On many-core systems using dm-crypt, heavy spinlock contention in
percpu_counter_compare() can be observed when the dm-crypt page allocation
limit for a given device is reached or close to being reached. This is
due to percpu_counter_compare() taking a spinlock to compute an exact
result on potentially many CPUs at the same time.

Switch to a non-exact comparison of allocated and allowed pages by using
the value returned by percpu_counter_read_positive(). This may over- or
underestimate the actual number of allocated pages by at most
(batch - 1) * num_online_cpus() (assuming my understanding of the
percpu_counter logic is correct).

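To illustrate the difference, here is a simplified model of the two paths,
based on my reading of lib/percpu_counter.c (abridged and renamed for
illustration, not the literal kernel source):

struct percpu_counter {
	raw_spinlock_t lock;
	s64 count;		/* global base value */
	s32 __percpu *counters;	/* per-CPU deltas, each within +/- batch */
};

/*
 * Exact path taken by percpu_counter_compare() when the counter is close
 * to the limit: every caller serializes on fbc->lock while folding in all
 * per-CPU deltas. This is the spinlock the kcryptd threads pile up on.
 */
static s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	unsigned long flags;
	s64 ret;
	int cpu;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	for_each_online_cpu(cpu)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}

/*
 * Approximate path used by this patch: a single plain read of the base
 * value, no lock taken, at the cost of missing up to (batch - 1)
 * not-yet-folded pages per CPU.
 */
static s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
	s64 ret = fbc->count;

	return ret < 0 ? 0 : ret;
}
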
Currently, batch is bounded by 32. The system on which this issue was
first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
change may over- or underestimate by at most 31MB. With ~10GB (2%) allowed
for dm-crypt allocations, this seems an acceptable error, and certainly
preferable to running into the spinlock contention.

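For concreteness, the error bound works out as follows (a hypothetical
standalone check using the numbers quoted above, not part of the patch):

#include <stdio.h>

int main(void)
{
	long batch = 32;	/* percpu_counter batch bound quoted above */
	long cpus = 256;	/* CPUs on the affected system */
	long page_size = 4096;	/* 4k page size */

	long max_error_pages = (batch - 1) * cpus;		/* 7936 pages */
	long max_error_bytes = max_error_pages * page_size;	/* 32505856 bytes */

	/* Prints "max error: 7936 pages = 31 MiB". */
	printf("max error: %ld pages = %ld MiB\n",
	       max_error_pages, max_error_bytes >> 20);
	return 0;
}
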
This behavior was separately/artificially reproduced on an EC2 c5.24xlarge
instance with 96 CPUs and 192GB RAM as follows, but can also be provoked
on systems with fewer CPUs:

* Disable swap
* Tune vm settings to promote regular writeback
$ echo 50 > /proc/sys/vm/dirty_expire_centisecs
$ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
$ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
* Create 8 dmcrypt devices based on files on a tmpfs
* Create and mount an ext4 filesystem on each crypt device
* Run stress-ng --hdd 8 within one of the above filesystems

Total %system usage shown via sysstat reaches ~35%, while write throughput
on the underlying loop device is ~2GB/s. Profiling an individual kcryptd
kworker thread with perf shows the following, indicating heavy spinlock
contention in percpu_counter_compare():

99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
  |
  ---ret_from_fork
     kthread
     worker_thread
     |
      --99.92%--process_one_work
                |
                |--80.52%--kcryptd_crypt
                |          |
                |          |--62.58%--mempool_alloc
                |          |          |
                |          |           --62.24%--crypt_page_alloc
                |          |                     |
                |          |                      --61.51%--__percpu_counter_compare
                |          |                                |
                |          |                                 --61.34%--__percpu_counter_sum
                |          |                                           |
                |          |                                           |--58.68%--_raw_spin_lock_irqsave
                |          |                                           |          |
                |          |                                           |           --58.30%--native_queued_spin_lock_slowpath
                |          |                                           |
                |          |                                            --0.69%--cpumask_next
                |          |                                                      |
                |          |                                                       --0.51%--_find_next_bit
                |          |
                |          |--10.61%--crypt_convert
                |          |          |
                |          |          |--6.05%--xts_crypt
                ...

After applying this change, %system usage drops to ~7% and write
throughput on the underlying loop device increases to 2.7GB/s. The profile
now shows mempool_alloc() at ~8% rather than ~62%, with the
percpu_counter spinlock no longer appearing:

|--8.15%--mempool_alloc
|          |
|          |--3.93%--crypt_page_alloc
|          |          |
|          |           --3.75%--__alloc_pages
|          |                     |
|          |                      --3.62%--get_page_from_freelist
|          |                                |
|          |                                 --3.22%--rmqueue_bulk
|          |                                           |
|          |                                            --2.59%--_raw_spin_lock
|          |                                                      |
|          |                                                       --2.57%--native_queued_spin_lock_slowpath
|          |
|           --3.05%--_raw_spin_lock_irqsave
|                     |
|                      --2.49%--native_queued_spin_lock_slowpath

Suggested-by: DJ Gregor <dj@corelight.com>
Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
---
drivers/md/dm-crypt.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 50f4cbd600d5..2ae481610f12 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2661,7 +2661,12 @@ static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
 	struct crypt_config *cc = pool_data;
 	struct page *page;
 
-	if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
+	/*
+	 * Note, percpu_counter_read_positive() may over (and under) estimate
+	 * the current usage by at most (batch - 1) * num_online_cpus() pages,
+	 * but avoids potential spinlock contention of an exact result.
+	 */
+	if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) > dm_crypt_pages_per_client) &&
 	    likely(gfp_mask & __GFP_NORETRY))
 		return NULL;
 
--
2.20.1
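
For context on the likely(gfp_mask & __GFP_NORETRY) test above:
mempool_alloc() itself sets __GFP_NORETRY on its first allocation attempt
and falls back to the preallocated pool elements when the allocation
callback fails, so returning NULL here only diverts requests to the
reserve instead of failing I/O. A rough sketch of that flow, simplified
from my reading of mm/mempool.c (wait/retry logic omitted, name is mine):

void *mempool_alloc_sketch(mempool_t *pool, gfp_t gfp_mask)
{
	void *element;
	unsigned long flags;

	/* The first attempt never loops and may fail quietly. */
	gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;

	element = pool->alloc(gfp_mask, pool->pool_data); /* crypt_page_alloc() */
	if (element)
		return element;

	/*
	 * Callback refused, e.g. because the dm-crypt page limit was hit:
	 * hand out one of the preallocated emergency elements instead.
	 */
	spin_lock_irqsave(&pool->lock, flags);
	if (pool->curr_nr) {
		element = pool->elements[--pool->curr_nr];
		spin_unlock_irqrestore(&pool->lock, flags);
		return element;
	}
	spin_unlock_irqrestore(&pool->lock, flags);

	/* The real code sleeps and retries here; omitted in this sketch. */
	return NULL;
}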