* [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache
@ 2020-05-14  7:04 Huang Ying
  2020-05-15 22:19 ` Andrew Morton
  2020-05-15 23:51 ` Daniel Jordan
  0 siblings, 2 replies; 8+ messages in thread
From: Huang Ying @ 2020-05-14  7:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Michal Hocko, Minchan Kim,
	Tim Chen, Hugh Dickins

In some swap scalability tests, heavy lock contention on the swap
cache is observed, even though commit 4b3ef9daa4fc ("mm/swap: split
swap cache into 64MB trunks") already split the single swap cache
radix tree per swap device into one radix tree per 64 MB trunk.

The reason is as follows.  After the swap device becomes fragmented
so that there is no free swap cluster, the swap device is scanned
linearly to find free swap slots.  swap_info_struct->cluster_next is
the next scanning base and is shared by all CPUs, so nearby free swap
slots are allocated to different CPUs.  The probability that multiple
CPUs operate on the same 64 MB trunk is therefore high.  This causes
the lock contention on the swap cache.

To solve the issue, this patch adds a percpu next scanning base
(cluster_next_cpu) for SSD swap devices.  Every CPU uses its own next
scanning base, so the probability that multiple CPUs operate on the
same 64 MB trunk is greatly reduced, and the lock contention is
reduced with it.  For HDD, because sequential access is more
important for IO performance, the original shared next scanning base
is kept.

To test the patch, we ran a 16-process pmbench memory benchmark on a
2-socket server machine with 48 cores.  One ram disk per socket is
configured as a swap device.  The pmbench working-set size is much
larger than the available memory, so swapping is triggered.  The
memory read/write ratio is 80/20 and the access pattern is random.
With the original implementation, the lock contention on the swap
cache is heavy.  The perf profiling data of the lock contention code
paths is as follows,

_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      7.93
_raw_spin_lock_irqsave.__remove_mapping.shrink_page_list: 		7.03
_raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page: 		3.7
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.9
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:	1.32
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 	1.01
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 	0.87

After applying this patch, it becomes,

_raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page:		3.99
_raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.0
_raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      1.47
_raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 	1.31
_raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 	0.88
_raw_spin_lock.scan_swap_map_slots.get_swap_pages.get_swap_page: 	0.76
_raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      0.53

The lock contention on the swap cache is almost eliminated.

And the pmbench score increases 15.9%.  The swapin throughput
increases 16.2%, from 2.84 GB/s to 3.3 GB/s, while the swapout
throughput increases 16.1%, from 2.87 GB/s to 3.33 GB/s.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
---
 include/linux/swap.h |  1 +
 mm/swapfile.c        | 27 +++++++++++++++++++++++++--
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b42fb47d8cbe..e96820fb7472 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,6 +252,7 @@ struct swap_info_struct {
 	unsigned int inuse_pages;	/* number of those currently in use */
 	unsigned int cluster_next;	/* likely index for next allocation */
 	unsigned int cluster_nr;	/* countdown to next cluster search */
+	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 35be7a7271f4..9f1343b066c1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	 */
 
 	si->flags += SWP_SCANNING;
-	scan_base = offset = si->cluster_next;
+	/*
+	 * Use percpu scan base for SSD to reduce lock contention on
+	 * cluster and swap cache.  For HDD, sequential access is more
+	 * important.
+	 */
+	if (si->flags & SWP_SOLIDSTATE)
+		scan_base = this_cpu_read(*si->cluster_next_cpu);
+	else
+		scan_base = si->cluster_next;
+	offset = scan_base;
 
 	/* SSD algorithm */
 	if (si->cluster_info) {
@@ -835,7 +844,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	unlock_cluster(ci);
 
 	swap_range_alloc(si, offset, 1);
-	si->cluster_next = offset + 1;
+	if (si->flags & SWP_SOLIDSTATE)
+		this_cpu_write(*si->cluster_next_cpu, offset + 1);
+	else
+		si->cluster_next = offset + 1;
 	slots[n_ret++] = swp_entry(si->type, offset);
 
 	/* got enough slots or reach max slots? */
@@ -2828,6 +2840,11 @@ static struct swap_info_struct *alloc_swap_info(void)
 	p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
 	if (!p)
 		return ERR_PTR(-ENOMEM);
+	p->cluster_next_cpu = alloc_percpu(unsigned int);
+	if (!p->cluster_next_cpu) {
+		kvfree(p);
+		return ERR_PTR(-ENOMEM);
+	}
 
 	spin_lock(&swap_lock);
 	for (type = 0; type < nr_swapfiles; type++) {
@@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
 
 	p->lowest_bit  = 1;
 	p->cluster_next = 1;
+	for_each_possible_cpu(i)
+		per_cpu(*p->cluster_next_cpu, i) = 1;
 	p->cluster_nr = 0;
 
 	maxpages = max_swapfile_size();
@@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		 * SSD
 		 */
 		p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
+		for_each_possible_cpu(cpu) {
+			per_cpu(*p->cluster_next_cpu, cpu) =
+				1 + prandom_u32_max(p->highest_bit);
+		}
 		nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 
 		cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
-- 
2.26.2



* Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache
  2020-05-14  7:04 [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache Huang Ying
@ 2020-05-15 22:19 ` Andrew Morton
  2020-05-18  5:52     ` Huang, Ying
  2020-05-15 23:51 ` Daniel Jordan
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2020-05-15 22:19 UTC (permalink / raw)
  To: Huang Ying
  Cc: linux-mm, linux-kernel, Michal Hocko, Minchan Kim, Tim Chen,
	Hugh Dickins

On Thu, 14 May 2020 15:04:24 +0800 Huang Ying <ying.huang@intel.com> wrote:

> In some swap scalability test, it is found that there are heavy lock
> contention on swap cache even if we have split one swap cache radix
> tree per swap device to one swap cache radix tree every 64 MB trunk in
> commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
> 
> The reason is as follow.  After the swap device becomes fragmented so
> that there's no free swap cluster, the swap device will be scanned
> linearly to find the free swap slots.  swap_info_struct->cluster_next
> is the next scanning base that is shared by all CPUs.  So nearby free
> swap slots will be allocated for different CPUs.  The probability for
> multiple CPUs to operate on the same 64 MB trunk is high.  This causes
> the lock contention on the swap cache.
> 
> To solve the issue, in this patch, for SSD swap device, a percpu
> version next scanning base (cluster_next_cpu) is added.  Every CPU
> will use its own next scanning base.  So the probability for multiple
> CPUs to operate on the same 64 MB trunk is reduced greatly.  Thus the
> lock contention is reduced too.  For HDD, because sequential access is
> more important for IO performance, the original shared next scanning
> base is used.
> 
> To test the patch, we have run 16-process pmbench memory benchmark on
> a 2-socket server machine with 48 cores.  One ram disk is configured
> as the swap device per socket.  The pmbench working-set size is much
> larger than the available memory so that swapping is triggered.  The
> memory read/write ratio is 80/20 and the accessing pattern is random.
> In the original implementation, the lock contention on the swap cache
> is heavy.  The perf profiling data of the lock contention code path is
> as following,
> 
> _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      7.93
> _raw_spin_lock_irqsave.__remove_mapping.shrink_page_list: 		7.03
> _raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page: 		3.7
> _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 2.9
> _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:	1.32
> _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 	1.01
> _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 	0.87
> 
> After applying this patch, it becomes,
> 
> _raw_spin_lock_irq.mem_cgroup_commit_charge.do_swap_page:		3.99
> _raw_spin_lock.swapcache_free_entries.free_swap_slot.__swap_entry_free: 3.0
> _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:      1.47
> _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node: 	1.31
> _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages: 	0.88
> _raw_spin_lock.scan_swap_map_slots.get_swap_pages.get_swap_page: 	0.76
> _raw_spin_lock_irq.add_to_swap_cache.add_to_swap.shrink_page_list:      0.53
> 
> The lock contention on the swap cache is almost eliminated.
> 
> And the pmbench score increases 15.9%.  The swapin throughput
> increases 16.2% from 2.84 GB/s to 3.3 GB/s.  While the swapout
> throughput increases 16.1% from 2.87 GB/s to 3.33 GB/s.
> 
> ...
>
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -252,6 +252,7 @@ struct swap_info_struct {
>  	unsigned int inuse_pages;	/* number of those currently in use */
>  	unsigned int cluster_next;	/* likely index for next allocation */
>  	unsigned int cluster_nr;	/* countdown to next cluster search */
> +	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>  	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>  	struct block_device *bdev;	/* swap device or bdev of swap file */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 35be7a7271f4..9f1343b066c1 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	 */
>  
>  	si->flags += SWP_SCANNING;
> -	scan_base = offset = si->cluster_next;
> +	/*
> +	 * Use percpu scan base for SSD to reduce lock contention on
> +	 * cluster and swap cache.  For HDD, sequential access is more
> +	 * important.
> +	 */
> +	if (si->flags & SWP_SOLIDSTATE)
> +		scan_base = this_cpu_read(*si->cluster_next_cpu);
> +	else
> +		scan_base = si->cluster_next;
> +	offset = scan_base;

Do we need to make SSD differ from spinning here?  Do bad things happen
if !SWP_SOLIDSTATE devices use the per-cpu cache?

>  	/* SSD algorithm */
>  	if (si->cluster_info) {
> @@ -835,7 +844,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	unlock_cluster(ci);
>  
>  	swap_range_alloc(si, offset, 1);
> -	si->cluster_next = offset + 1;
> +	if (si->flags & SWP_SOLIDSTATE)
> +		this_cpu_write(*si->cluster_next_cpu, offset + 1);
> +	else
> +		si->cluster_next = offset + 1;
>  	slots[n_ret++] = swp_entry(si->type, offset);
>  
>  	/* got enough slots or reach max slots? */
> @@ -2828,6 +2840,11 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
>  	if (!p)
>  		return ERR_PTR(-ENOMEM);
> +	p->cluster_next_cpu = alloc_percpu(unsigned int);
> +	if (!p->cluster_next_cpu) {
> +		kvfree(p);
> +		return ERR_PTR(-ENOMEM);
> +	}
>  
>  	spin_lock(&swap_lock);
>  	for (type = 0; type < nr_swapfiles; type++) {
> @@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
>  
>  	p->lowest_bit  = 1;
>  	p->cluster_next = 1;
> +	for_each_possible_cpu(i)
> +		per_cpu(*p->cluster_next_cpu, i) = 1;
>  	p->cluster_nr = 0;
>  
>  	maxpages = max_swapfile_size();
> @@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  		 * SSD
>  		 */
>  		p->cluster_next = 1 + prandom_u32_max(p->highest_bit);

We shouldn't need to do this now?

> +		for_each_possible_cpu(cpu) {
> +			per_cpu(*p->cluster_next_cpu, cpu) =
> +				1 + prandom_u32_max(p->highest_bit);
> +		}

Would there be any benefit in spreading these out evenly?  Intervals of
(p->highest_bit/num_possible_cpus())?  That would reduce collisions,
but not for very long I guess.
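
For illustration only (not part of the patch; reusing the variables
from the hunk above), such an even spread might look something like:

	/* Hypothetical sketch, not from the patch: spread bases evenly. */
	unsigned int n = 0;
	unsigned int step = (p->highest_bit / num_possible_cpus()) ?: 1;

	for_each_possible_cpu(cpu)
		per_cpu(*p->cluster_next_cpu, cpu) =
			1 + (n++ * step) % p->highest_bit;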

Speaking of which, I wonder if there are failure modes in which all the
CPUs end up getting into sync.

And is it the case that if two or more CPUs have the same (or similar)
per_cpu(*p->cluster_next_cpu, cpu), they'll each end up pointlessly
scanning slots which another CPU has just scanned, thus rather
defeating the purpose of having the cluster_next cache?

IOW, should there be some additional collision avoidance scheme to
prevent a CPU from pointing its cluster_next into a 64MB trunk which
another CPU is already using?

And should it really be a per-cpu thing?  That's rather arbitrary. 
Perhaps we would get better swap locality by making swap_cluster_next a
per-process (per-mm_struct) thing?

>  		nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>  
>  		cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),



* Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache
  2020-05-14  7:04 [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache Huang Ying
  2020-05-15 22:19 ` Andrew Morton
@ 2020-05-15 23:51 ` Daniel Jordan
  2020-05-18  6:37     ` Huang, Ying
  1 sibling, 1 reply; 8+ messages in thread
From: Daniel Jordan @ 2020-05-15 23:51 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko, Minchan Kim,
	Tim Chen, Hugh Dickins

On Thu, May 14, 2020 at 03:04:24PM +0800, Huang Ying wrote:
> And the pmbench score increases 15.9%.

What metric is that, and how long did you run the benchmark for?

Given that this thing is probabilistic, did you notice much variance from run
to run?

> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 35be7a7271f4..9f1343b066c1 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	 */
>  
>  	si->flags += SWP_SCANNING;
> -	scan_base = offset = si->cluster_next;
> +	/*
> +	 * Use percpu scan base for SSD to reduce lock contention on
> +	 * cluster and swap cache.  For HDD, sequential access is more
> +	 * important.
> +	 */
> +	if (si->flags & SWP_SOLIDSTATE)
> +		scan_base = this_cpu_read(*si->cluster_next_cpu);
> +	else
> +		scan_base = si->cluster_next;
> +	offset = scan_base;
>  
>  	/* SSD algorithm */
>  	if (si->cluster_info) {

It's just a nit but SWP_SOLIDSTATE and 'if (si->cluster_info)' are two ways to
check the same thing and I'd stick with the one that's already there.

> @@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
>  
>  	p->lowest_bit  = 1;
>  	p->cluster_next = 1;
> +	for_each_possible_cpu(i)
> +		per_cpu(*p->cluster_next_cpu, i) = 1;

These are later overwritten if the device is an SSD which seems to be the only
case where these are used, so why have this?

>  	p->cluster_nr = 0;
>  
>  	maxpages = max_swapfile_size();
> @@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  		 * SSD
>  		 */
>  		p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
> +		for_each_possible_cpu(cpu) {
> +			per_cpu(*p->cluster_next_cpu, cpu) =
> +				1 + prandom_u32_max(p->highest_bit);
> +		}

Is there a reason for adding one?  The history didn't enlighten me about why
cluster_next does it.


* Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache
  2020-05-15 22:19 ` Andrew Morton
@ 2020-05-18  5:52     ` Huang, Ying
  0 siblings, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2020-05-18  5:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Michal Hocko, Minchan Kim, Tim Chen,
	Hugh Dickins

Hi, Andrew,

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 14 May 2020 15:04:24 +0800 Huang Ying <ying.huang@intel.com> wrote:
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 35be7a7271f4..9f1343b066c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	 */
>>  
>>  	si->flags += SWP_SCANNING;
>> -	scan_base = offset = si->cluster_next;
>> +	/*
>> +	 * Use percpu scan base for SSD to reduce lock contention on
>> +	 * cluster and swap cache.  For HDD, sequential access is more
>> +	 * important.
>> +	 */
>> +	if (si->flags & SWP_SOLIDSTATE)
>> +		scan_base = this_cpu_read(*si->cluster_next_cpu);
>> +	else
>> +		scan_base = si->cluster_next;
>> +	offset = scan_base;
>
> Do we need to make SSD differ from spinning here?  Do bad things happen
> if !SWP_SOLIDSTATE devices use the per-cpu cache?

I think the swapout throughput may be affected, because HDD seeks
become necessary to swap out from multiple CPUs if per-cpu
cluster_next is used.  But I just realized that the per-cpu swap
slots cache will cause seeks too.  If we really care about the
performance of using HDD as swap, maybe we should disable the per-cpu
swap slots cache for HDD too?

>>  	/* SSD algorithm */
>>  	if (si->cluster_info) {
>> @@ -835,7 +844,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	unlock_cluster(ci);
>>  
>>  	swap_range_alloc(si, offset, 1);
>> -	si->cluster_next = offset + 1;
>> +	if (si->flags & SWP_SOLIDSTATE)
>> +		this_cpu_write(*si->cluster_next_cpu, offset + 1);
>> +	else
>> +		si->cluster_next = offset + 1;
>>  	slots[n_ret++] = swp_entry(si->type, offset);
>>  
>>  	/* got enough slots or reach max slots? */
>> @@ -2828,6 +2840,11 @@ static struct swap_info_struct *alloc_swap_info(void)
>>  	p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
>>  	if (!p)
>>  		return ERR_PTR(-ENOMEM);
>> +	p->cluster_next_cpu = alloc_percpu(unsigned int);
>> +	if (!p->cluster_next_cpu) {
>> +		kvfree(p);
>> +		return ERR_PTR(-ENOMEM);
>> +	}
>>  
>>  	spin_lock(&swap_lock);
>>  	for (type = 0; type < nr_swapfiles; type++) {
>> @@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
>>  
>>  	p->lowest_bit  = 1;
>>  	p->cluster_next = 1;
>> +	for_each_possible_cpu(i)
>> +		per_cpu(*p->cluster_next_cpu, i) = 1;
>>  	p->cluster_nr = 0;
>>  
>>  	maxpages = max_swapfile_size();
>> @@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  		 * SSD
>>  		 */
>>  		p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>
> We shouldn't need to do this now?

Yes.  Thanks for pointing this out.  Will delete this in a future
version.

>> +		for_each_possible_cpu(cpu) {
>> +			per_cpu(*p->cluster_next_cpu, cpu) =
>> +				1 + prandom_u32_max(p->highest_bit);
>> +		}
>
> Would there be any benefit in spreading these out evenly?  Intervals of
> (p->highest_bit/num_possible_cpus())?  That would reduce collisions,
> but not for very long I guess.

These may be spread more evenly with
(p->highest_bit/num_possible_cpus()).  I just worry about the
possible situation where num_possible_cpus() >> num_online_cpus(),
in which the current method may work better.

> Speaking of which, I wonder if there are failure modes in which all the
> CPUs end up getting into sync.
>
> And is it the case that if two or more CPUs have the same (or similar)
> per_cpu(*p->cluster_next_cpu, cpu), they'll each end up pointlessly
> scanning slots which another CPU has just scanned, thus rather
> defeating the purpose of having the cluster_next cache?
>
> IOW, should there be some additional collision avoidance scheme to
> prevent a CPU from pointing its cluster_ext into a 64MB trunk which
> another CPU is already using?

Yes.  That sounds reasonable.  How about something like the
following:

When per-cpu cluster_next is assigned, if the new value falls in a
different 64MB (or larger) trunk than the old value, we assign a
random value between p->lowest_bit and p->highest_bit to the per-cpu
cluster_next instead.

This reduces the probability of collision to almost 0 if there are
enough free swap slots.  And it is easy to implement, especially
considering the following situation,

  (p->highest_bit - p->lowest_bit) / 64MB < num_online_cpus()
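
As a purely hypothetical sketch (the helper name is made up; it would
be called in scan_swap_map_slots() instead of the plain
this_cpu_write()), it could look like:

	/* Hypothetical helper, not part of this patch. */
	static void set_cluster_next(struct swap_info_struct *si,
				     unsigned int next)
	{
		unsigned int prev = this_cpu_read(*si->cluster_next_cpu);

		/* Crossed into a different 64MB trunk: re-randomize. */
		if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
		    (next >> SWAP_ADDRESS_SPACE_SHIFT))
			next = si->lowest_bit +
			       prandom_u32_max(si->highest_bit -
					       si->lowest_bit + 1);

		this_cpu_write(*si->cluster_next_cpu, next);
	}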

> And should it really be a per-cpu thing?  That's rather arbitrary. 
> Perhaps we would get better swap locality by making swap_cluster_next a
> per-process (per-mm_struct) thing?

I think per-cpu is enough.  Because this is a scalability issue, as
long as different CPUs work on different 64MB trunks, the scalability
will be good.  I don't see any value in using different 64MB trunks
on a single CPU.

Best Regards,
Huang, Ying


* Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache
  2020-05-15 23:51 ` Daniel Jordan
@ 2020-05-18  6:37     ` Huang, Ying
  0 siblings, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2020-05-18  6:37 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, linux-mm, linux-kernel, Michal Hocko, Minchan Kim,
	Tim Chen, Hugh Dickins

Daniel Jordan <daniel.m.jordan@oracle.com> writes:

> On Thu, May 14, 2020 at 03:04:24PM +0800, Huang Ying wrote:
>> And the pmbench score increases 15.9%.
>
> What metric is that, and how long did you run the benchmark for?

I ran the benchmark for 1800s.  The metric comes from the following
output of pmbench,

[1] Benchmark done - took 1800.088 sec for 122910000 page access

That is, the throughput is 122910000 / 1800.088 = 68280.0 accesses/s.
Then we sum the values from the different processes.

> Given that this thing is probabilistic, did you notice much variance from run
> to run?

The results look quite stable to me.  The standard deviation from run
to run is less than 1%.

>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 35be7a7271f4..9f1343b066c1 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -746,7 +746,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>>  	 */
>>  
>>  	si->flags += SWP_SCANNING;
>> -	scan_base = offset = si->cluster_next;
>> +	/*
>> +	 * Use percpu scan base for SSD to reduce lock contention on
>> +	 * cluster and swap cache.  For HDD, sequential access is more
>> +	 * important.
>> +	 */
>> +	if (si->flags & SWP_SOLIDSTATE)
>> +		scan_base = this_cpu_read(*si->cluster_next_cpu);
>> +	else
>> +		scan_base = si->cluster_next;
>> +	offset = scan_base;
>>  
>>  	/* SSD algorithm */
>>  	if (si->cluster_info) {
>
> It's just a nit but SWP_SOLIDSTATE and 'if (si->cluster_info)' are two ways to
> check the same thing and I'd stick with the one that's already there.

Yes.  In effect, (si->flags & SWP_SOLIDSTATE) and (si->cluster_info)
always have the same value, at least for now.  But I don't think they
have exactly the same semantics, so I would rather use each according
to its exact semantics.

>> @@ -2962,6 +2979,8 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
>>  
>>  	p->lowest_bit  = 1;
>>  	p->cluster_next = 1;
>> +	for_each_possible_cpu(i)
>> +		per_cpu(*p->cluster_next_cpu, i) = 1;
>
> These are later overwritten if the device is an SSD which seems to be the only
> case where these are used, so why have this?

Yes.  You are right.  Will remove this in future versions.

>>  	p->cluster_nr = 0;
>>  
>>  	maxpages = max_swapfile_size();
>> @@ -3204,6 +3223,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  		 * SSD
>>  		 */
>>  		p->cluster_next = 1 + prandom_u32_max(p->highest_bit);
>> +		for_each_possible_cpu(cpu) {
>> +			per_cpu(*p->cluster_next_cpu, cpu) =
>> +				1 + prandom_u32_max(p->highest_bit);
>> +		}
>
> Is there a reason for adding one?  The history didn't enlighten me about why
> cluster_next does it.

The first swap slot holds the swap partition header; you can find the
corresponding code in the swapon() syscall, below the comment "Read
the swap header".

Best Regards,
Huang, Ying


* Re: [PATCH] swap: Add percpu cluster_next to reduce lock contention on swap cache
  2020-05-18  6:37     ` Huang, Ying
@ 2020-05-19  1:39     ` Daniel Jordan
  -1 siblings, 0 replies; 8+ messages in thread
From: Daniel Jordan @ 2020-05-19  1:39 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Daniel Jordan, Andrew Morton, linux-mm, linux-kernel,
	Michal Hocko, Minchan Kim, Tim Chen, Hugh Dickins

On Mon, May 18, 2020 at 02:37:15PM +0800, Huang, Ying wrote:
> Daniel Jordan <daniel.m.jordan@oracle.com> writes:
> > On Thu, May 14, 2020 at 03:04:24PM +0800, Huang Ying wrote:
> >> And the pmbench score increases 15.9%.
> >
> > What metric is that, and how long did you run the benchmark for?
> 
> I run the benchmark for 1800s.  The metric comes from the following
> output of the pmbench,
> 
> [1] Benchmark done - took 1800.088 sec for 122910000 page access
> 
> That is, the throughput is 122910000 / 1800.088 = 68280.0 (accesses/s).
> Then we sum the values from the different processes.

Ok.

> > It's just a nit but SWP_SOLIDSTATE and 'if (si->cluster_info)' are two ways to
> > check the same thing and I'd stick with the one that's already there.
> 
> Yes.  In effect, (si->flags & SWP_SOLIDSTATE) and (si->cluster_info)
> always has same value at least for now.  But I don't think they are
> exactly same in semantics.  So I would rather to use their exact
> semantics.

Oh, but I thought the swap clusters were for scaling the locking for fast
devices, so that both checks have the same semantics now, and presumably would
in the future.

It's a minor point, I'm fine either way.

> The first swap slot is the swap partition header, you cand find the
> corresponding code in syscall swapon function, below comments "Read the

Aha, thanks.
