dm-devel.redhat.com archive mirror
* [dm-devel] [PATCH] dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
@ 2021-08-08 13:42 Arne Welzel
  2021-08-10 18:21 ` Mikulas Patocka
  0 siblings, 1 reply; 4+ messages in thread
From: Arne Welzel @ 2021-08-08 13:42 UTC (permalink / raw)
  To: dm-devel, dm-crypt; +Cc: Arne Welzel, DJ Gregor, mpatocka, agk, snitzer

On many-core systems using dm-crypt, heavy spinlock contention in
percpu_counter_compare() can be observed when the dmcrypt page allocation
limit for a given device is reached or close to being reached. This is due
to percpu_counter_compare() taking a spinlock to compute an exact
result on potentially many CPUs at the same time.
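
As a rough user-space illustration of why the slow path is hit near the
limit: percpu_counter_compare() first looks at the approximate count and
only falls back to the locked, exact sum when that count is within
batch * num_online_cpus() of the value compared against. The function
name and parameters below are hypothetical and only model that decision;
this is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical model: returns true when a comparison in the style of
 * percpu_counter_compare() could not be decided from the approximate
 * count alone and would need the exact sum taken under the spinlock.
 */
static bool model_compare_needs_exact_sum(int64_t approx, int64_t rhs,
					  int32_t batch, int nr_cpus)
{
	assert(batch > 0 && nr_cpus > 0);

	/* Far enough from rhs: the approximate count decides the result. */
	return llabs(approx - rhs) <= (int64_t)batch * nr_cpus;
}
```

With batch = 32 and 256 CPUs the window is 8192 pages on either side of
the limit, so once usage hovers around the limit nearly every allocation
attempt ends up in the exact sum.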

Switch to non-exact comparison of allocated and allowed pages by using
the value returned by percpu_counter_read_positive().

This may over/under estimate the actual number of allocated pages by at
most (batch-1) * num_online_cpus() (assuming my understanding of the
percpu_counter logic is correct).
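
A minimal user-space sketch of where that bound comes from, assuming a
simplified single-threaded model of the counter: each CPU keeps a local
delta and folds it into the shared count only once the delta reaches the
batch size. All model_* names and the CPU count of 4 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4		/* hypothetical CPU count for the model */
#define MODEL_BATCH   32	/* mirrors the batch bound discussed here */

struct model_counter {
	int64_t count;			/* shared, approximate total */
	int32_t local[MODEL_NR_CPUS];	/* per-CPU deltas not yet folded in */
};

/* Add amount on behalf of one CPU, folding into the shared total per batch. */
static void model_add(struct model_counter *c, int cpu, int32_t amount)
{
	int32_t v;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	v = c->local[cpu] + amount;
	if (v >= MODEL_BATCH || v <= -MODEL_BATCH) {
		c->count += v;		/* the kernel takes the lock here */
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = v;
	}
}

/* Cheap read: shared total only, clamped at zero. */
static int64_t model_read_positive(const struct model_counter *c)
{
	return c->count > 0 ? c->count : 0;
}

/* Exact read: shared total plus every per-CPU delta. */
static int64_t model_sum(const struct model_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}
```

Adding 31 single pages on each of the 4 modeled CPUs leaves
model_read_positive() at 0 while model_sum() reports 124, which is the
full (batch - 1) * nr_cpus drift.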

Currently, batch is bounded by 32. The system on which this issue was
first observed has 256 CPUs and 512G of RAM. With a 4k page size, this
change may over/under estimate by 31MB. With ~10G (2%) allowed for dmcrypt
allocations, this seems an acceptable error. It is certainly preferable to
running into the spinlock contention.
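
The 31MB figure can be rechecked with a few lines of C. The helper names
are made up for this example; the inputs (batch 32, 256 CPUs, 4k pages)
are the ones given above.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case drift of the approximate count, in pages. */
static uint64_t max_drift_pages(uint64_t batch, uint64_t nr_cpus)
{
	assert(batch > 0);
	return (batch - 1) * nr_cpus;
}

/* Convert a page count to bytes for a given page size. */
static uint64_t pages_to_bytes(uint64_t pages, uint64_t page_size)
{
	return pages * page_size;
}
```

max_drift_pages(32, 256) is 7936 pages, and pages_to_bytes(7936, 4096)
is 32505856 bytes, i.e. exactly 31 * 1024 * 1024.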

This behavior was separately/artificially reproduced on an EC2 c5.24xlarge
instance with 96 CPUs and 192GB RAM as follows, but can be
provoked on systems with fewer CPUs.

 * Disable swap
 * Tune vm settings to promote regular writeback
     $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
     $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
     $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes

 * Create 8 dmcrypt devices based on files on a tmpfs
 * Create and mount an ext4 filesystem on each crypt device
 * Run stress-ng --hdd 8 within one of the above filesystems

Total %system usage shown via sysstat goes to ~35%, write throughput on the
underlying loop device is ~2GB/s. perf profiling an individual kworker
kcryptd thread shows the following, indicating that it hits
heavy spinlock contention in percpu_counter_compare():

    99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
            |
            ---ret_from_fork
               kthread
               worker_thread
               |
                --99.92%--process_one_work
                          |
                          |--80.52%--kcryptd_crypt
                          |          |
                          |          |--62.58%--mempool_alloc
                          |          |          |
                          |          |           --62.24%--crypt_page_alloc
                          |          |                     |
                          |          |                      --61.51%--__percpu_counter_compare
                          |          |                                |
                          |          |                                 --61.34%--__percpu_counter_sum
                          |          |                                           |
                          |          |                                           |--58.68%--_raw_spin_lock_irqsave
                          |          |                                           |          |
                          |          |                                           |           --58.30%--native_queued_spin_lock_slowpath
                          |          |                                           |
                          |          |                                            --0.69%--cpumask_next
                          |          |                                                      |
                          |          |                                                       --0.51%--_find_next_bit
                          |          |
                          |          |--10.61%--crypt_convert
                          |          |          |
                          |          |          |--6.05%--xts_crypt
                          ...

After applying this change, %system usage drops to ~7% and
write throughput on the loop device increases to 2.7GB/s.
The profile shows mempool_alloc() at ~8% rather than ~62% and
no longer hitting the percpu_counter spinlock.

    |--8.15%--mempool_alloc
    |          |
    |          |--3.93%--crypt_page_alloc
    |          |          |
    |          |           --3.75%--__alloc_pages
    |          |                     |
    |          |                      --3.62%--get_page_from_freelist
    |          |                                |
    |          |                                 --3.22%--rmqueue_bulk
    |          |                                           |
    |          |                                            --2.59%--_raw_spin_lock
    |          |                                                      |
    |          |                                                       --2.57%--native_queued_spin_lock_slowpath
    |          |
    |           --3.05%--_raw_spin_lock_irqsave
    |                     |
    |                      --2.49%--native_queued_spin_lock_slowpath

Suggested-by: DJ Gregor <dj@corelight.com>
Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
---
 drivers/md/dm-crypt.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 50f4cbd600d5..2ae481610f12 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2661,7 +2661,12 @@ static void *crypt_page_alloc(gfp_t gfp_mask, void *pool_data)
 	struct crypt_config *cc = pool_data;
 	struct page *page;
 
-	if (unlikely(percpu_counter_compare(&cc->n_allocated_pages, dm_crypt_pages_per_client) >= 0) &&
+	/*
+	 * Note, percpu_counter_read_positive() may over (and under) estimate
+	 * the current usage by at most (batch - 1) * num_online_cpus() pages,
+	 * but avoids potential spinlock contention of an exact result.
+	 */
+	if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) > dm_crypt_pages_per_client) &&
 	    likely(gfp_mask & __GFP_NORETRY))
 		return NULL;
 
-- 
2.20.1

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


* Re: [dm-devel] [PATCH] dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
  2021-08-08 13:42 [dm-devel] [PATCH] dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() Arne Welzel
@ 2021-08-10 18:21 ` Mikulas Patocka
  2021-08-12 19:47   ` Arne Welzel
  0 siblings, 1 reply; 4+ messages in thread
From: Mikulas Patocka @ 2021-08-10 18:21 UTC (permalink / raw)
  To: Arne Welzel; +Cc: dm-crypt, dm-devel, DJ Gregor, agk, snitzer

Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>


* Re: [dm-devel] [PATCH] dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
  2021-08-10 18:21 ` Mikulas Patocka
@ 2021-08-12 19:47   ` Arne Welzel
  2021-08-12 20:37     ` Mikulas Patocka
  0 siblings, 1 reply; 4+ messages in thread
From: Arne Welzel @ 2021-08-12 19:47 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: dm-crypt, dm-devel, DJ Gregor, agk, snitzer

Mikulas,

On Tue, 10 Aug 2021, Mikulas Patocka wrote:

> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
>

Thank you for the review. After looking at the submitted patch again,
it seems more correct to use >= as the condition:

> > + if (unlikely(percpu_counter_read_positive(&cc->n_allocated_pages) > dm_crypt_pages_per_client) &&
                                                                        ^^
                                                                        >=
Would it be okay if I resend the patch with this changed and still add
your Reviewed-by? I would also fix some wording in the description and
dedent the perf report output somewhat.
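
To spell out the difference, a tiny user-space sketch with hypothetical
helper names: the replaced check, percpu_counter_compare(...) >= 0,
treated "allocated == limit" as limit reached, which only the >= form
preserves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Condition as posted: the limit counts as reached only once exceeded. */
static bool limit_reached_gt(uint64_t approx_pages, uint64_t limit)
{
	assert(limit > 0);
	return approx_pages > limit;
}

/* Proposed condition: reached as soon as usage equals the limit. */
static bool limit_reached_ge(uint64_t approx_pages, uint64_t limit)
{
	assert(limit > 0);
	return approx_pages >= limit;
}
```

The two only disagree when the approximate count equals the limit
exactly.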

Thanks,
   Arne

* Re: [dm-devel] [PATCH] dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc()
  2021-08-12 19:47   ` Arne Welzel
@ 2021-08-12 20:37     ` Mikulas Patocka
  0 siblings, 0 replies; 4+ messages in thread
From: Mikulas Patocka @ 2021-08-12 20:37 UTC (permalink / raw)
  To: Arne Welzel; +Cc: dm-crypt, dm-devel, DJ Gregor, agk, snitzer



OK - you can resend the patch with my "Reviewed-by".

Mikulas

