All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/3] A few cleanup and fixup patches for swap
@ 2022-06-08 14:40 Miaohe Lin
  2022-06-08 14:40 ` [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected Miaohe Lin
                   ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Miaohe Lin @ 2022-06-08 14:40 UTC (permalink / raw)
  To: akpm; +Cc: david, linux-mm, linux-kernel, linmiaohe

Hi everyone,
This series contains a cleaup patch to remove unneeded swap_cache_info
statistics, and two bugfix patches to avoid possible data races of
inuse_pages and so on. More details can be found in the respective
changelogs. Thanks!

---
v2:
  collect Reviewed-by tag per David
  drop patch "mm/swapfile: avoid confusing swap cache statistics"
  add a new patch to remove swap_cache_info statistics per David
  Many thanks David for review and comment.
---
Miaohe Lin (3):
  mm/swapfile: make security_vm_enough_memory_mm() work as expected
  mm/swapfile: fix possible data races of inuse_pages
  mm/swap: remove swap_cache_info statistics

 mm/swap_state.c | 17 -----------------
 mm/swapfile.c   | 14 +++++++++-----
 2 files changed, 9 insertions(+), 22 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-08 14:40 [PATCH v2 0/3] A few cleanup and fixup patches for swap Miaohe Lin
@ 2022-06-08 14:40 ` Miaohe Lin
  2022-06-17  7:33   ` David Hildenbrand
  2022-06-20  7:31   ` Huang, Ying
  2022-06-08 14:40 ` [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages Miaohe Lin
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 31+ messages in thread
From: Miaohe Lin @ 2022-06-08 14:40 UTC (permalink / raw)
  To: akpm; +Cc: david, linux-mm, linux-kernel, linmiaohe

security_vm_enough_memory_mm() checks whether a process has enough memory
to allocate a new virtual mapping. And total_swap_pages is considered as
available memory while swapoff tries to make sure there's enough memory
that can hold the swapped out memory. But total_swap_pages contains the
swap space that is being swapoff. So security_vm_enough_memory_mm() will
success even if there's no memory to hold the swapped out memory because
total_swap_pages always greater than or equal to p->pages.

In order to fix it, p->pages should be retracted from total_swap_pages
first and then check whether there's enough memory for inuse swap pages.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/swapfile.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index ec4c1b276691..d2bead7b8b70 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2398,6 +2398,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	struct filename *pathname;
 	int err, found = 0;
 	unsigned int old_block_size;
+	unsigned int inuse_pages;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -2428,9 +2429,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		spin_unlock(&swap_lock);
 		goto out_dput;
 	}
-	if (!security_vm_enough_memory_mm(current->mm, p->pages))
-		vm_unacct_memory(p->pages);
+
+	total_swap_pages -= p->pages;
+	inuse_pages = READ_ONCE(p->inuse_pages);
+	if (!security_vm_enough_memory_mm(current->mm, inuse_pages))
+		vm_unacct_memory(inuse_pages);
 	else {
+		total_swap_pages += p->pages;
 		err = -ENOMEM;
 		spin_unlock(&swap_lock);
 		goto out_dput;
@@ -2453,7 +2458,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	}
 	plist_del(&p->list, &swap_active_head);
 	atomic_long_sub(p->pages, &nr_swap_pages);
-	total_swap_pages -= p->pages;
 	p->flags &= ~SWP_WRITEOK;
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-08 14:40 [PATCH v2 0/3] A few cleanup and fixup patches for swap Miaohe Lin
  2022-06-08 14:40 ` [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected Miaohe Lin
@ 2022-06-08 14:40 ` Miaohe Lin
  2022-06-20  7:54   ` Huang, Ying
  2022-06-08 14:40 ` [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics Miaohe Lin
  2022-06-17  2:37 ` [PATCH v2 0/3] A few cleanup and fixup patches for swap Andrew Morton
  3 siblings, 1 reply; 31+ messages in thread
From: Miaohe Lin @ 2022-06-08 14:40 UTC (permalink / raw)
  To: akpm; +Cc: david, linux-mm, linux-kernel, linmiaohe

si->inuse_pages could still be accessed concurrently now. The plain reads
outside si->lock critical section, i.e. swap_show and si_swapinfo, which
results in data races. But these should be ok because they're just used
for showing swap info.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 mm/swapfile.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index d2bead7b8b70..3fa26f6971e9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2646,7 +2646,7 @@ static int swap_show(struct seq_file *swap, void *v)
 	}
 
 	bytes = si->pages << (PAGE_SHIFT - 10);
-	inuse = si->inuse_pages << (PAGE_SHIFT - 10);
+	inuse = READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10);
 
 	file = si->swap_file;
 	len = seq_file_path(swap, file, " \t\n\\");
@@ -3265,7 +3265,7 @@ void si_swapinfo(struct sysinfo *val)
 		struct swap_info_struct *si = swap_info[type];
 
 		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
-			nr_to_be_unused += si->inuse_pages;
+			nr_to_be_unused += READ_ONCE(si->inuse_pages);
 	}
 	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
 	val->totalswap = total_swap_pages + nr_to_be_unused;
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics
  2022-06-08 14:40 [PATCH v2 0/3] A few cleanup and fixup patches for swap Miaohe Lin
  2022-06-08 14:40 ` [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected Miaohe Lin
  2022-06-08 14:40 ` [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages Miaohe Lin
@ 2022-06-08 14:40 ` Miaohe Lin
  2022-06-08 15:16   ` David Hildenbrand
                     ` (2 more replies)
  2022-06-17  2:37 ` [PATCH v2 0/3] A few cleanup and fixup patches for swap Andrew Morton
  3 siblings, 3 replies; 31+ messages in thread
From: Miaohe Lin @ 2022-06-08 14:40 UTC (permalink / raw)
  To: akpm; +Cc: david, linux-mm, linux-kernel, linmiaohe

swap_cache_info are not statistics that could be easily used to tune system
performance because they are not easily accessile. Also they can't provide
really useful info when OOM occurs. Remove these statistics can also help
mitigate unneeded global swap_cache_info cacheline contention.

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/swap_state.c | 17 -----------------
 1 file changed, 17 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 0a2021fc55ad..41c6a6053d5c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -59,24 +59,11 @@ static bool enable_vma_readahead __read_mostly = true;
 #define GET_SWAP_RA_VAL(vma)					\
 	(atomic_long_read(&(vma)->swap_readahead_info) ? : 4)
 
-#define INC_CACHE_INFO(x)	data_race(swap_cache_info.x++)
-#define ADD_CACHE_INFO(x, nr)	data_race(swap_cache_info.x += (nr))
-
-static struct {
-	unsigned long add_total;
-	unsigned long del_total;
-	unsigned long find_success;
-	unsigned long find_total;
-} swap_cache_info;
-
 static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
 
 void show_swap_cache_info(void)
 {
 	printk("%lu pages in swap cache\n", total_swapcache_pages());
-	printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n",
-		swap_cache_info.add_total, swap_cache_info.del_total,
-		swap_cache_info.find_success, swap_cache_info.find_total);
 	printk("Free swap  = %ldkB\n",
 		get_nr_swap_pages() << (PAGE_SHIFT - 10));
 	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
@@ -133,7 +120,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
 		address_space->nrpages += nr;
 		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
 		__mod_lruvec_page_state(page, NR_SWAPCACHE, nr);
-		ADD_CACHE_INFO(add_total, nr);
 unlock:
 		xas_unlock_irq(&xas);
 	} while (xas_nomem(&xas, gfp));
@@ -172,7 +158,6 @@ void __delete_from_swap_cache(struct page *page,
 	address_space->nrpages -= nr;
 	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
 	__mod_lruvec_page_state(page, NR_SWAPCACHE, -nr);
-	ADD_CACHE_INFO(del_total, nr);
 }
 
 /**
@@ -348,12 +333,10 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
 	page = find_get_page(swap_address_space(entry), swp_offset(entry));
 	put_swap_device(si);
 
-	INC_CACHE_INFO(find_total);
 	if (page) {
 		bool vma_ra = swap_use_vma_readahead();
 		bool readahead;
 
-		INC_CACHE_INFO(find_success);
 		/*
 		 * At the moment, we don't support PG_readahead for anon THP
 		 * so let's bail out rather than confusing the readahead stat.
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics
  2022-06-08 14:40 ` [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics Miaohe Lin
@ 2022-06-08 15:16   ` David Hildenbrand
  2022-06-20  8:08   ` Huang, Ying
  2022-06-20  9:30   ` Muchun Song
  2 siblings, 0 replies; 31+ messages in thread
From: David Hildenbrand @ 2022-06-08 15:16 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: linux-mm, linux-kernel

On 08.06.22 16:40, Miaohe Lin wrote:
> swap_cache_info are not statistics that could be easily used to tune system
> performance because they are not easily accessile. Also they can't provide
> really useful info when OOM occurs. Remove these statistics can also help
> mitigate unneeded global swap_cache_info cacheline contention.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] A few cleanup and fixup patches for swap
  2022-06-08 14:40 [PATCH v2 0/3] A few cleanup and fixup patches for swap Miaohe Lin
                   ` (2 preceding siblings ...)
  2022-06-08 14:40 ` [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics Miaohe Lin
@ 2022-06-17  2:37 ` Andrew Morton
  2022-06-17  3:00   ` Miaohe Lin
  3 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2022-06-17  2:37 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: david, linux-mm, linux-kernel, Hugh Dickins

On Wed, 8 Jun 2022 22:40:28 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:

> This series contains a cleaup patch to remove unneeded swap_cache_info
> statistics, and two bugfix patches to avoid possible data races of
> inuse_pages and so on. More details can be found in the respective
> changelogs. 

It would be nice to get [1/3] reviewed please.



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 0/3] A few cleanup and fixup patches for swap
  2022-06-17  2:37 ` [PATCH v2 0/3] A few cleanup and fixup patches for swap Andrew Morton
@ 2022-06-17  3:00   ` Miaohe Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Miaohe Lin @ 2022-06-17  3:00 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Hugh Dickins, Huang Ying
  Cc: linux-mm, linux-kernel

On 2022/6/17 10:37, Andrew Morton wrote:
> On Wed, 8 Jun 2022 22:40:28 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
>> This series contains a cleaup patch to remove unneeded swap_cache_info
>> statistics, and two bugfix patches to avoid possible data races of
>> inuse_pages and so on. More details can be found in the respective
>> changelogs. 
> 
> It would be nice to get [1/3] reviewed please.

I'd like too.

Hi David & Hugh & Huang, Ying,
It's very kind of you if you can help review this patch!

Thanks!

BTW: It should be convenient if there are mm/swap reviewers. ;)

> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-08 14:40 ` [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected Miaohe Lin
@ 2022-06-17  7:33   ` David Hildenbrand
  2022-06-18  2:43     ` Miaohe Lin
  2022-06-20  7:31   ` Huang, Ying
  1 sibling, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2022-06-17  7:33 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: linux-mm, linux-kernel

On 08.06.22 16:40, Miaohe Lin wrote:
> security_vm_enough_memory_mm() checks whether a process has enough memory
> to allocate a new virtual mapping. And total_swap_pages is considered as
> available memory while swapoff tries to make sure there's enough memory
> that can hold the swapped out memory. But total_swap_pages contains the
> swap space that is being swapoff. So security_vm_enough_memory_mm() will
> success even if there's no memory to hold the swapped out memory because

s/success/succeed/

> total_swap_pages always greater than or equal to p->pages.
> 
> In order to fix it, p->pages should be retracted from total_swap_pages

s/retracted/subtracted/

> first and then check whether there's enough memory for inuse swap pages.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/swapfile.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ec4c1b276691..d2bead7b8b70 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2398,6 +2398,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	struct filename *pathname;
>  	int err, found = 0;
>  	unsigned int old_block_size;
> +	unsigned int inuse_pages;
>  
>  	if (!capable(CAP_SYS_ADMIN))
>  		return -EPERM;
> @@ -2428,9 +2429,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  		spin_unlock(&swap_lock);
>  		goto out_dput;
>  	}
> -	if (!security_vm_enough_memory_mm(current->mm, p->pages))
> -		vm_unacct_memory(p->pages);
> +
> +	total_swap_pages -= p->pages;
> +	inuse_pages = READ_ONCE(p->inuse_pages);
> +	if (!security_vm_enough_memory_mm(current->mm, inuse_pages))
> +		vm_unacct_memory(inuse_pages);
>  	else {
> +		total_swap_pages += p->pages;

That implies that whenever we fail in security_vm_enough_memory_mm(),
that other concurrent users might see a wrong total_swap_pages.

Assume 4 GiB memory and 8 GiB swap. Let's assume 10 GiB are in use.

Temporarily, we'd have

CommitLimit    4 GiB
Committed_AS  10 GiB

Not sure if relevant, but I wonder if it could be avoided somehow?


Apart from that, LGTM.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-17  7:33   ` David Hildenbrand
@ 2022-06-18  2:43     ` Miaohe Lin
  2022-06-18  7:10       ` David Hildenbrand
  0 siblings, 1 reply; 31+ messages in thread
From: Miaohe Lin @ 2022-06-18  2:43 UTC (permalink / raw)
  To: David Hildenbrand, akpm; +Cc: linux-mm, linux-kernel

On 2022/6/17 15:33, David Hildenbrand wrote:
> On 08.06.22 16:40, Miaohe Lin wrote:
>> security_vm_enough_memory_mm() checks whether a process has enough memory
>> to allocate a new virtual mapping. And total_swap_pages is considered as
>> available memory while swapoff tries to make sure there's enough memory
>> that can hold the swapped out memory. But total_swap_pages contains the
>> swap space that is being swapoff. So security_vm_enough_memory_mm() will
>> success even if there's no memory to hold the swapped out memory because
> 
> s/success/succeed/

OK. Thanks.

> 
>> total_swap_pages always greater than or equal to p->pages.
>>
>> In order to fix it, p->pages should be retracted from total_swap_pages
> 
> s/retracted/subtracted/

OK. Thanks.

> 
>> first and then check whether there's enough memory for inuse swap pages.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/swapfile.c | 10 +++++++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index ec4c1b276691..d2bead7b8b70 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -2398,6 +2398,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>  	struct filename *pathname;
>>  	int err, found = 0;
>>  	unsigned int old_block_size;
>> +	unsigned int inuse_pages;
>>  
>>  	if (!capable(CAP_SYS_ADMIN))
>>  		return -EPERM;
>> @@ -2428,9 +2429,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>  		spin_unlock(&swap_lock);
>>  		goto out_dput;
>>  	}
>> -	if (!security_vm_enough_memory_mm(current->mm, p->pages))
>> -		vm_unacct_memory(p->pages);
>> +
>> +	total_swap_pages -= p->pages;
>> +	inuse_pages = READ_ONCE(p->inuse_pages);
>> +	if (!security_vm_enough_memory_mm(current->mm, inuse_pages))
>> +		vm_unacct_memory(inuse_pages);
>>  	else {
>> +		total_swap_pages += p->pages;
> 
> That implies that whenever we fail in security_vm_enough_memory_mm(),
> that other concurrent users might see a wrong total_swap_pages.
> 
> Assume 4 GiB memory and 8 GiB swap. Let's assume 10 GiB are in use.
> 
> Temporarily, we'd have
> 
> CommitLimit    4 GiB
> Committed_AS  10 GiB

IIUC, even if without this change, the other concurrent users if come after vm_acct_memory()
is done in __vm_enough_memory(), they might see

CommitLimit   12 GiB (4 GiB memory + 8GiB total swap)
Committed_AS  18 GiB (10 GiB in use + 8GiB swap space to swapoff)

Or am I miss something?

> 
> Not sure if relevant, but I wonder if it could be avoided somehow?

It seems this race exists already and is benign. The worst case is concurrent users might
fail to allocate the memory. But that window should be really small and swapoff is a rare
ops. Or should I try to fix this race?

> 
> 
> Apart from that, LGTM.

Many thanks for comment! :)

> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-18  2:43     ` Miaohe Lin
@ 2022-06-18  7:10       ` David Hildenbrand
  2022-06-18  7:31         ` Miaohe Lin
  0 siblings, 1 reply; 31+ messages in thread
From: David Hildenbrand @ 2022-06-18  7:10 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: linux-mm, linux-kernel

On 18.06.22 04:43, Miaohe Lin wrote:
> On 2022/6/17 15:33, David Hildenbrand wrote:
>> On 08.06.22 16:40, Miaohe Lin wrote:
>>> security_vm_enough_memory_mm() checks whether a process has enough memory
>>> to allocate a new virtual mapping. And total_swap_pages is considered as
>>> available memory while swapoff tries to make sure there's enough memory
>>> that can hold the swapped out memory. But total_swap_pages contains the
>>> swap space that is being swapoff. So security_vm_enough_memory_mm() will
>>> success even if there's no memory to hold the swapped out memory because
>>
>> s/success/succeed/
> 
> OK. Thanks.
> 
>>
>>> total_swap_pages always greater than or equal to p->pages.
>>>
>>> In order to fix it, p->pages should be retracted from total_swap_pages
>>
>> s/retracted/subtracted/
> 
> OK. Thanks.
> 
>>
>>> first and then check whether there's enough memory for inuse swap pages.
>>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>>  mm/swapfile.c | 10 +++++++---
>>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index ec4c1b276691..d2bead7b8b70 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -2398,6 +2398,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>>  	struct filename *pathname;
>>>  	int err, found = 0;
>>>  	unsigned int old_block_size;
>>> +	unsigned int inuse_pages;
>>>  
>>>  	if (!capable(CAP_SYS_ADMIN))
>>>  		return -EPERM;
>>> @@ -2428,9 +2429,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>>  		spin_unlock(&swap_lock);
>>>  		goto out_dput;
>>>  	}
>>> -	if (!security_vm_enough_memory_mm(current->mm, p->pages))
>>> -		vm_unacct_memory(p->pages);
>>> +
>>> +	total_swap_pages -= p->pages;
>>> +	inuse_pages = READ_ONCE(p->inuse_pages);
>>> +	if (!security_vm_enough_memory_mm(current->mm, inuse_pages))
>>> +		vm_unacct_memory(inuse_pages);
>>>  	else {
>>> +		total_swap_pages += p->pages;
>>
>> That implies that whenever we fail in security_vm_enough_memory_mm(),
>> that other concurrent users might see a wrong total_swap_pages.
>>
>> Assume 4 GiB memory and 8 GiB swap. Let's assume 10 GiB are in use.
>>
>> Temporarily, we'd have
>>
>> CommitLimit    4 GiB
>> Committed_AS  10 GiB
> 
> IIUC, even if without this change, the other concurrent users if come after vm_acct_memory()
> is done in __vm_enough_memory(), they might see
> 
> CommitLimit   12 GiB (4 GiB memory + 8GiB total swap)
> Committed_AS  18 GiB (10 GiB in use + 8GiB swap space to swapoff)
> 
> Or am I miss something?
> 

I think you are right!

Reviewed-by: David Hildenbrand <david@redhat.com>


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-18  7:10       ` David Hildenbrand
@ 2022-06-18  7:31         ` Miaohe Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Miaohe Lin @ 2022-06-18  7:31 UTC (permalink / raw)
  To: David Hildenbrand, akpm; +Cc: linux-mm, linux-kernel

On 2022/6/18 15:10, David Hildenbrand wrote:
> On 18.06.22 04:43, Miaohe Lin wrote:
>> On 2022/6/17 15:33, David Hildenbrand wrote:
>>> On 08.06.22 16:40, Miaohe Lin wrote:
>>>> security_vm_enough_memory_mm() checks whether a process has enough memory
>>>> to allocate a new virtual mapping. And total_swap_pages is considered as
>>>> available memory while swapoff tries to make sure there's enough memory
>>>> that can hold the swapped out memory. But total_swap_pages contains the
>>>> swap space that is being swapoff. So security_vm_enough_memory_mm() will
>>>> success even if there's no memory to hold the swapped out memory because
>>>
>>> s/success/succeed/
>>
>> OK. Thanks.
>>
>>>
>>>> total_swap_pages always greater than or equal to p->pages.
>>>>
>>>> In order to fix it, p->pages should be retracted from total_swap_pages
>>>
>>> s/retracted/subtracted/
>>
>> OK. Thanks.
>>
>>>
>>>> first and then check whether there's enough memory for inuse swap pages.
>>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> ---
>>>>  mm/swapfile.c | 10 +++++++---
>>>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index ec4c1b276691..d2bead7b8b70 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -2398,6 +2398,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>>>  	struct filename *pathname;
>>>>  	int err, found = 0;
>>>>  	unsigned int old_block_size;
>>>> +	unsigned int inuse_pages;
>>>>  
>>>>  	if (!capable(CAP_SYS_ADMIN))
>>>>  		return -EPERM;
>>>> @@ -2428,9 +2429,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>>>>  		spin_unlock(&swap_lock);
>>>>  		goto out_dput;
>>>>  	}
>>>> -	if (!security_vm_enough_memory_mm(current->mm, p->pages))
>>>> -		vm_unacct_memory(p->pages);
>>>> +
>>>> +	total_swap_pages -= p->pages;
>>>> +	inuse_pages = READ_ONCE(p->inuse_pages);
>>>> +	if (!security_vm_enough_memory_mm(current->mm, inuse_pages))
>>>> +		vm_unacct_memory(inuse_pages);
>>>>  	else {
>>>> +		total_swap_pages += p->pages;
>>>
>>> That implies that whenever we fail in security_vm_enough_memory_mm(),
>>> that other concurrent users might see a wrong total_swap_pages.
>>>
>>> Assume 4 GiB memory and 8 GiB swap. Let's assume 10 GiB are in use.
>>>
>>> Temporarily, we'd have
>>>
>>> CommitLimit    4 GiB
>>> Committed_AS  10 GiB
>>
>> IIUC, even if without this change, the other concurrent users if come after vm_acct_memory()
>> is done in __vm_enough_memory(), they might see
>>
>> CommitLimit   12 GiB (4 GiB memory + 8GiB total swap)
>> Committed_AS  18 GiB (10 GiB in use + 8GiB swap space to swapoff)
>>
>> Or am I miss something?
>>
> 
> I think you are right!
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks a lot!

> 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-08 14:40 ` [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected Miaohe Lin
  2022-06-17  7:33   ` David Hildenbrand
@ 2022-06-20  7:31   ` Huang, Ying
  2022-06-20 12:12     ` Miaohe Lin
  1 sibling, 1 reply; 31+ messages in thread
From: Huang, Ying @ 2022-06-20  7:31 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, david, linux-mm, linux-kernel

Miaohe Lin <linmiaohe@huawei.com> writes:

> security_vm_enough_memory_mm() checks whether a process has enough memory
> to allocate a new virtual mapping. And total_swap_pages is considered as
> available memory while swapoff tries to make sure there's enough memory
> that can hold the swapped out memory. But total_swap_pages contains the
> swap space that is being swapoff. So security_vm_enough_memory_mm() will
> success even if there's no memory to hold the swapped out memory because
> total_swap_pages always greater than or equal to p->pages.

Per my understanding, swapoff will not allocate virtual mapping by
itself.  But after swapoff, the overcommit limit could be exceeded.
security_vm_enough_memory_mm() is used to check that.  For example, in a
system with 4GB memory and 8GB swap, and 10GB is in use,

CommitLimit:    4+8 = 12GB
Committed_AS:   10GB

security_vm_enough_memory_mm() in swapoff() will fail because
10+8 = 18 > 12.  This is expected because after swapoff, the overcommit
limit will be exceeded.

If 3GB is in use,

CommitLimit:    4+8 = 12GB
Committed_AS:   3GB

security_vm_enough_memory_mm() in swapoff() will succeed because
3+8 = 11 < 12.  This is expected because after swapoff, the overcommit
limit will not be exceeded.

So, what's the real problem of the original implementation?  Can you
show it with an example as above?

Best Regards,
Huang, Ying

> In order to fix it, p->pages should be retracted from total_swap_pages
> first and then check whether there's enough memory for inuse swap pages.
>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

[snip]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-08 14:40 ` [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages Miaohe Lin
@ 2022-06-20  7:54   ` Huang, Ying
  2022-06-20  9:04     ` Miaohe Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Huang, Ying @ 2022-06-20  7:54 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, david, linux-mm, linux-kernel

Miaohe Lin <linmiaohe@huawei.com> writes:

> si->inuse_pages could still be accessed concurrently now. The plain reads
> outside si->lock critical section, i.e. swap_show and si_swapinfo, which
> results in data races. But these should be ok because they're just used
> for showing swap info.
>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/swapfile.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index d2bead7b8b70..3fa26f6971e9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2646,7 +2646,7 @@ static int swap_show(struct seq_file *swap, void *v)
>  	}
>  
>  	bytes = si->pages << (PAGE_SHIFT - 10);
> -	inuse = si->inuse_pages << (PAGE_SHIFT - 10);
> +	inuse = READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10);
>  
>  	file = si->swap_file;
>  	len = seq_file_path(swap, file, " \t\n\\");
> @@ -3265,7 +3265,7 @@ void si_swapinfo(struct sysinfo *val)
>  		struct swap_info_struct *si = swap_info[type];
>  
>  		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
> -			nr_to_be_unused += si->inuse_pages;
> +			nr_to_be_unused += READ_ONCE(si->inuse_pages);
>  	}
>  	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
>  	val->totalswap = total_swap_pages + nr_to_be_unused;

READ_ONCE() should be paired with WRITE_ONCE().  So, change the writer
side too?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics
  2022-06-08 14:40 ` [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics Miaohe Lin
  2022-06-08 15:16   ` David Hildenbrand
@ 2022-06-20  8:08   ` Huang, Ying
  2022-06-20  9:05     ` Miaohe Lin
  2022-06-20  9:30   ` Muchun Song
  2 siblings, 1 reply; 31+ messages in thread
From: Huang, Ying @ 2022-06-20  8:08 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, david, linux-mm, linux-kernel

Miaohe Lin <linmiaohe@huawei.com> writes:

> swap_cache_info are not statistics that could be easily used to tune system
> performance because they are not easily accessile. Also they can't provide
> really useful info when OOM occurs. Remove these statistics can also help
> mitigate unneeded global swap_cache_info cacheline contention.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/swap_state.c | 17 -----------------
>  1 file changed, 17 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 0a2021fc55ad..41c6a6053d5c 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -59,24 +59,11 @@ static bool enable_vma_readahead __read_mostly = true;
>  #define GET_SWAP_RA_VAL(vma)					\
>  	(atomic_long_read(&(vma)->swap_readahead_info) ? : 4)
>  
> -#define INC_CACHE_INFO(x)	data_race(swap_cache_info.x++)
> -#define ADD_CACHE_INFO(x, nr)	data_race(swap_cache_info.x += (nr))
> -
> -static struct {
> -	unsigned long add_total;
> -	unsigned long del_total;
> -	unsigned long find_success;
> -	unsigned long find_total;
> -} swap_cache_info;
> -
>  static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
>  
>  void show_swap_cache_info(void)
>  {
>  	printk("%lu pages in swap cache\n", total_swapcache_pages());
> -	printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n",
> -		swap_cache_info.add_total, swap_cache_info.del_total,
> -		swap_cache_info.find_success, swap_cache_info.find_total);
>  	printk("Free swap  = %ldkB\n",
>  		get_nr_swap_pages() << (PAGE_SHIFT - 10));
>  	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
> @@ -133,7 +120,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
>  		address_space->nrpages += nr;
>  		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
>  		__mod_lruvec_page_state(page, NR_SWAPCACHE, nr);
> -		ADD_CACHE_INFO(add_total, nr);
>  unlock:
>  		xas_unlock_irq(&xas);
>  	} while (xas_nomem(&xas, gfp));
> @@ -172,7 +158,6 @@ void __delete_from_swap_cache(struct page *page,
>  	address_space->nrpages -= nr;
>  	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
>  	__mod_lruvec_page_state(page, NR_SWAPCACHE, -nr);
> -	ADD_CACHE_INFO(del_total, nr);
>  }
>  
>  /**
> @@ -348,12 +333,10 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
>  	page = find_get_page(swap_address_space(entry), swp_offset(entry));
>  	put_swap_device(si);
>  
> -	INC_CACHE_INFO(find_total);
>  	if (page) {
>  		bool vma_ra = swap_use_vma_readahead();
>  		bool readahead;
>  
> -		INC_CACHE_INFO(find_success);
>  		/*
>  		 * At the moment, we don't support PG_readahead for anon THP
>  		 * so let's bail out rather than confusing the readahead stat.

This looks reasonable.  And if we want to do some statistics for swap
cache in the future, we can use BPF, that is even more convenient.

Acked-by: "Huang, Ying" <ying.huang@intel.com>

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-20  7:54   ` Huang, Ying
@ 2022-06-20  9:04     ` Miaohe Lin
  2022-06-20  9:23       ` Muchun Song
  0 siblings, 1 reply; 31+ messages in thread
From: Miaohe Lin @ 2022-06-20  9:04 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, david, linux-mm, linux-kernel

On 2022/6/20 15:54, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> si->inuse_pages could still be accessed concurrently now. The plain reads
>> outside si->lock critical section, i.e. swap_show and si_swapinfo, which
>> results in data races. But these should be ok because they're just used
>> for showing swap info.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/swapfile.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index d2bead7b8b70..3fa26f6971e9 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -2646,7 +2646,7 @@ static int swap_show(struct seq_file *swap, void *v)
>>  	}
>>  
>>  	bytes = si->pages << (PAGE_SHIFT - 10);
>> -	inuse = si->inuse_pages << (PAGE_SHIFT - 10);
>> +	inuse = READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10);
>>  
>>  	file = si->swap_file;
>>  	len = seq_file_path(swap, file, " \t\n\\");
>> @@ -3265,7 +3265,7 @@ void si_swapinfo(struct sysinfo *val)
>>  		struct swap_info_struct *si = swap_info[type];
>>  
>>  		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
>> -			nr_to_be_unused += si->inuse_pages;
>> +			nr_to_be_unused += READ_ONCE(si->inuse_pages);
>>  	}
>>  	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
>>  	val->totalswap = total_swap_pages + nr_to_be_unused;
> 
> READ_ONCE() should be paired with WRITE_ONCE().  So, change the writer
> side too?

READ_ONCE() is used to fix the complaint of concurrent accessing to si->inuse_pages from KCSAN here.
The similar commit is 218209487c3d ("mm/swapfile: fix data races in try_to_unuse()"). IMHO, it's fine
to see a not-uptodate value of si->inuse_pages because it's just used for showing swap info. So
WRITE_ONCE() is not obligatory. Or am I miss something?

> 
> Best Regards,
> Huang, Ying

Thanks!

> .
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics
  2022-06-20  8:08   ` Huang, Ying
@ 2022-06-20  9:05     ` Miaohe Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Miaohe Lin @ 2022-06-20  9:05 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, david, linux-mm, linux-kernel

On 2022/6/20 16:08, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> swap_cache_info are not statistics that could be easily used to tune system
>> performance because they are not easily accessile. Also they can't provide
>> really useful info when OOM occurs. Remove these statistics can also help
>> mitigate unneeded global swap_cache_info cacheline contention.
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/swap_state.c | 17 -----------------
>>  1 file changed, 17 deletions(-)
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 0a2021fc55ad..41c6a6053d5c 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -59,24 +59,11 @@ static bool enable_vma_readahead __read_mostly = true;
>>  #define GET_SWAP_RA_VAL(vma)					\
>>  	(atomic_long_read(&(vma)->swap_readahead_info) ? : 4)
>>  
>> -#define INC_CACHE_INFO(x)	data_race(swap_cache_info.x++)
>> -#define ADD_CACHE_INFO(x, nr)	data_race(swap_cache_info.x += (nr))
>> -
>> -static struct {
>> -	unsigned long add_total;
>> -	unsigned long del_total;
>> -	unsigned long find_success;
>> -	unsigned long find_total;
>> -} swap_cache_info;
>> -
>>  static atomic_t swapin_readahead_hits = ATOMIC_INIT(4);
>>  
>>  void show_swap_cache_info(void)
>>  {
>>  	printk("%lu pages in swap cache\n", total_swapcache_pages());
>> -	printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n",
>> -		swap_cache_info.add_total, swap_cache_info.del_total,
>> -		swap_cache_info.find_success, swap_cache_info.find_total);
>>  	printk("Free swap  = %ldkB\n",
>>  		get_nr_swap_pages() << (PAGE_SHIFT - 10));
>>  	printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
>> @@ -133,7 +120,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry,
>>  		address_space->nrpages += nr;
>>  		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
>>  		__mod_lruvec_page_state(page, NR_SWAPCACHE, nr);
>> -		ADD_CACHE_INFO(add_total, nr);
>>  unlock:
>>  		xas_unlock_irq(&xas);
>>  	} while (xas_nomem(&xas, gfp));
>> @@ -172,7 +158,6 @@ void __delete_from_swap_cache(struct page *page,
>>  	address_space->nrpages -= nr;
>>  	__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
>>  	__mod_lruvec_page_state(page, NR_SWAPCACHE, -nr);
>> -	ADD_CACHE_INFO(del_total, nr);
>>  }
>>  
>>  /**
>> @@ -348,12 +333,10 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
>>  	page = find_get_page(swap_address_space(entry), swp_offset(entry));
>>  	put_swap_device(si);
>>  
>> -	INC_CACHE_INFO(find_total);
>>  	if (page) {
>>  		bool vma_ra = swap_use_vma_readahead();
>>  		bool readahead;
>>  
>> -		INC_CACHE_INFO(find_success);
>>  		/*
>>  		 * At the moment, we don't support PG_readahead for anon THP
>>  		 * so let's bail out rather than confusing the readahead stat.
> 
> This looks reasonable.  And if we want to do some statistics for swap
> cache in the future, we can use BPF, that is even more convenient.

BPF should be very convenient. Many thanks for reviewing!

> 
> Acked-by: "Huang, Ying" <ying.huang@intel.com>
> 
> Best Regards,
> Huang, Ying
> 
> .
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-20  9:04     ` Miaohe Lin
@ 2022-06-20  9:23       ` Muchun Song
  2022-06-20 12:23         ` Miaohe Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Muchun Song @ 2022-06-20  9:23 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: Huang, Ying, akpm, david, linux-mm, linux-kernel

On Mon, Jun 20, 2022 at 05:04:50PM +0800, Miaohe Lin wrote:
> On 2022/6/20 15:54, Huang, Ying wrote:
> > Miaohe Lin <linmiaohe@huawei.com> writes:
> > 
> >> si->inuse_pages could still be accessed concurrently now. The plain reads
> >> outside si->lock critical section, i.e. swap_show and si_swapinfo, which
> >> results in data races. But these should be ok because they're just used
> >> for showing swap info.
> >>
> >> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >> Reviewed-by: David Hildenbrand <david@redhat.com>
> >> ---
> >>  mm/swapfile.c | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> index d2bead7b8b70..3fa26f6971e9 100644
> >> --- a/mm/swapfile.c
> >> +++ b/mm/swapfile.c
> >> @@ -2646,7 +2646,7 @@ static int swap_show(struct seq_file *swap, void *v)
> >>  	}
> >>  
> >>  	bytes = si->pages << (PAGE_SHIFT - 10);
> >> -	inuse = si->inuse_pages << (PAGE_SHIFT - 10);
> >> +	inuse = READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10);
> >>  
> >>  	file = si->swap_file;
> >>  	len = seq_file_path(swap, file, " \t\n\\");
> >> @@ -3265,7 +3265,7 @@ void si_swapinfo(struct sysinfo *val)
> >>  		struct swap_info_struct *si = swap_info[type];
> >>  
> >>  		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
> >> -			nr_to_be_unused += si->inuse_pages;
> >> +			nr_to_be_unused += READ_ONCE(si->inuse_pages);
> >>  	}
> >>  	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
> >>  	val->totalswap = total_swap_pages + nr_to_be_unused;
> > 
> > READ_ONCE() should be paired with WRITE_ONCE().  So, change the writer
> > side too?
> 
> READ_ONCE() is used to fix the complaint of concurrent accessing to si->inuse_pages from KCSAN here.
> The similar commit is 218209487c3d ("mm/swapfile: fix data races in try_to_unuse()"). IMHO, it's fine

I think the fix 218209487c3d is incomplete. The write side in swap_range_free() should
also be fixed. Otherwise, IIUC, it cannot stop KCSAN complaining.

> to see a not-uptodate value of si->inuse_pages because it's just used for showing swap info. So
> WRITE_ONCE() is not obligatory. Or am I miss something?
> 
> > 
> > Best Regards,
> > Huang, Ying
> 
> Thanks!
> 
> > .
> > 
> 
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics
  2022-06-08 14:40 ` [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics Miaohe Lin
  2022-06-08 15:16   ` David Hildenbrand
  2022-06-20  8:08   ` Huang, Ying
@ 2022-06-20  9:30   ` Muchun Song
  2 siblings, 0 replies; 31+ messages in thread
From: Muchun Song @ 2022-06-20  9:30 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, david, linux-mm, linux-kernel

On Wed, Jun 08, 2022 at 10:40:31PM +0800, Miaohe Lin wrote:
> swap_cache_info are not statistics that could be easily used to tune system
> performance because they are not easily accessile. Also they can't provide
> really useful info when OOM occurs. Remove these statistics can also help
> mitigate unneeded global swap_cache_info cacheline contention.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-20  7:31   ` Huang, Ying
@ 2022-06-20 12:12     ` Miaohe Lin
  2022-06-21  1:35       ` Huang, Ying
  0 siblings, 1 reply; 31+ messages in thread
From: Miaohe Lin @ 2022-06-20 12:12 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, david, linux-mm, linux-kernel

On 2022/6/20 15:31, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> security_vm_enough_memory_mm() checks whether a process has enough memory
>> to allocate a new virtual mapping. And total_swap_pages is considered as
>> available memory while swapoff tries to make sure there's enough memory
>> that can hold the swapped out memory. But total_swap_pages contains the
>> swap space that is being swapoff. So security_vm_enough_memory_mm() will
>> success even if there's no memory to hold the swapped out memory because
>> total_swap_pages always greater than or equal to p->pages.
> 
> Per my understanding, swapoff will not allocate virtual mapping by
> itself.  But after swapoff, the overcommit limit could be exceeded.
> security_vm_enough_memory_mm() is used to check that.  For example, in a
> system with 4GB memory and 8GB swap, and 10GB is in use,
> 
> CommitLimit:    4+8 = 12GB
> Committed_AS:   10GB
> 
> security_vm_enough_memory_mm() in swapoff() will fail because
> 10+8 = 18 > 12.  This is expected because after swapoff, the overcommit
> limit will be exceeded.
> 
> If 3GB is in use,
> 
> CommitLimit:    4+8 = 12GB
> Committed_AS:   3GB
> 
> security_vm_enough_memory_mm() in swapoff() will succeed because
> 3+8 = 11 < 12.  This is expected because after swapoff, the overcommit
> limit will not be exceeded.

In OVERCOMMIT_NEVER scene, I think you're right.

> 
> So, what's the real problem of the original implementation?  Can you
> show it with an example as above?

In OVERCOMMIT_GUESS scene, in a system with 4GB memory and 8GB swap, and 10GB is in use,
pages below is 8GB, totalram_pages() + total_swap_pages is 12GB, so swapoff() will succeed
instead of expected failure because 8 < 12. The overcommit limit is always *ignored* in the
below case.

	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
		if (pages > totalram_pages() + total_swap_pages)
			goto error;
		return 0;
	}

Or am I miss something?

> 
> Best Regards,
> Huang, Ying

Thanks!

> 
>> In order to fix it, p->pages should be retracted from total_swap_pages
>> first and then check whether there's enough memory for inuse swap pages.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> 
> [snip]
> 
> .
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-20  9:23       ` Muchun Song
@ 2022-06-20 12:23         ` Miaohe Lin
  2022-06-20 12:32           ` Miaohe Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Miaohe Lin @ 2022-06-20 12:23 UTC (permalink / raw)
  To: Muchun Song, Huang, Ying, cai; +Cc: akpm, david, linux-mm, linux-kernel

On 2022/6/20 17:23, Muchun Song wrote:
> On Mon, Jun 20, 2022 at 05:04:50PM +0800, Miaohe Lin wrote:
>> On 2022/6/20 15:54, Huang, Ying wrote:
>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>
>>>> si->inuse_pages could still be accessed concurrently now. The plain reads
>>>> outside si->lock critical section, i.e. swap_show and si_swapinfo, which
>>>> results in data races. But these should be ok because they're just used
>>>> for showing swap info.
>>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>  mm/swapfile.c | 4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index d2bead7b8b70..3fa26f6971e9 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -2646,7 +2646,7 @@ static int swap_show(struct seq_file *swap, void *v)
>>>>  	}
>>>>  
>>>>  	bytes = si->pages << (PAGE_SHIFT - 10);
>>>> -	inuse = si->inuse_pages << (PAGE_SHIFT - 10);
>>>> +	inuse = READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10);
>>>>  
>>>>  	file = si->swap_file;
>>>>  	len = seq_file_path(swap, file, " \t\n\\");
>>>> @@ -3265,7 +3265,7 @@ void si_swapinfo(struct sysinfo *val)
>>>>  		struct swap_info_struct *si = swap_info[type];
>>>>  
>>>>  		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
>>>> -			nr_to_be_unused += si->inuse_pages;
>>>> +			nr_to_be_unused += READ_ONCE(si->inuse_pages);
>>>>  	}
>>>>  	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
>>>>  	val->totalswap = total_swap_pages + nr_to_be_unused;
>>>
>>> READ_ONCE() should be paired with WRITE_ONCE().  So, change the writer
>>> side too?
>>
>> READ_ONCE() is used to fix the complaint of concurrent accessing to si->inuse_pages from KCSAN here.
>> The similar commit is 218209487c3d ("mm/swapfile: fix data races in try_to_unuse()"). IMHO, it's fine
> 
> I think the fix 218209487c3d is incomplete. The write side in swap_range_free() should
> also be fixed. Otherwise, IIUC, it cannot stop KCSAN complaining.

I tend to agree with you. READ_ONCE() should be paired with WRITE_ONCE() theoretically. But WRITTE_ONCE()
is ignored while the commit is introduced. Add Qian Cai for helping verify it. It's very kind of @Qian Cai
if he could tell us whether WRITTE_ONCE() is ignored deliberately.

Thanks all of you. :)

> 
>> to see a not-uptodate value of si->inuse_pages because it's just used for showing swap info. So
>> WRITE_ONCE() is not obligatory. Or am I miss something?
>>
>>>
>>> Best Regards,
>>> Huang, Ying
>>
>> Thanks!
>>
>>> .
>>>
>>
>>
> .
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-20 12:23         ` Miaohe Lin
@ 2022-06-20 12:32           ` Miaohe Lin
  2022-06-20 13:46             ` Qian Cai
  0 siblings, 1 reply; 31+ messages in thread
From: Miaohe Lin @ 2022-06-20 12:32 UTC (permalink / raw)
  To: Muchun Song, Huang, Ying, Qian Cai; +Cc: akpm, david, linux-mm, linux-kernel

On 2022/6/20 20:23, Miaohe Lin wrote:
> On 2022/6/20 17:23, Muchun Song wrote:
>> On Mon, Jun 20, 2022 at 05:04:50PM +0800, Miaohe Lin wrote:
>>> On 2022/6/20 15:54, Huang, Ying wrote:
>>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>>
>>>>> si->inuse_pages could still be accessed concurrently now. The plain reads
>>>>> outside si->lock critical section, i.e. swap_show and si_swapinfo, which
>>>>> results in data races. But these should be ok because they're just used
>>>>> for showing swap info.
>>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>> Reviewed-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>>  mm/swapfile.c | 4 ++--
>>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>> index d2bead7b8b70..3fa26f6971e9 100644
>>>>> --- a/mm/swapfile.c
>>>>> +++ b/mm/swapfile.c
>>>>> @@ -2646,7 +2646,7 @@ static int swap_show(struct seq_file *swap, void *v)
>>>>>  	}
>>>>>  
>>>>>  	bytes = si->pages << (PAGE_SHIFT - 10);
>>>>> -	inuse = si->inuse_pages << (PAGE_SHIFT - 10);
>>>>> +	inuse = READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10);
>>>>>  
>>>>>  	file = si->swap_file;
>>>>>  	len = seq_file_path(swap, file, " \t\n\\");
>>>>> @@ -3265,7 +3265,7 @@ void si_swapinfo(struct sysinfo *val)
>>>>>  		struct swap_info_struct *si = swap_info[type];
>>>>>  
>>>>>  		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
>>>>> -			nr_to_be_unused += si->inuse_pages;
>>>>> +			nr_to_be_unused += READ_ONCE(si->inuse_pages);
>>>>>  	}
>>>>>  	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
>>>>>  	val->totalswap = total_swap_pages + nr_to_be_unused;
>>>>
>>>> READ_ONCE() should be paired with WRITE_ONCE().  So, change the writer
>>>> side too?
>>>
>>> READ_ONCE() is used to fix the complaint of concurrent accessing to si->inuse_pages from KCSAN here.
>>> The similar commit is 218209487c3d ("mm/swapfile: fix data races in try_to_unuse()"). IMHO, it's fine
>>
>> I think the fix 218209487c3d is incomplete. The write side in swap_range_free() should
>> also be fixed. Otherwise, IIUC, it cannot stop KCSAN complaining.
> 
> I tend to agree with you. READ_ONCE() should be paired with WRITE_ONCE() theoretically. But WRITTE_ONCE()
> is ignored while the commit is introduced. Add Qian Cai for helping verify it. It's very kind of @Qian Cai
> if he could tell us whether WRITTE_ONCE() is ignored deliberately.

Update the email address of Qian Cai.

> 
> Thanks all of you. :)
> 
>>
>>> to see a not-uptodate value of si->inuse_pages because it's just used for showing swap info. So
>>> WRITE_ONCE() is not obligatory. Or am I miss something?
>>>
>>>>
>>>> Best Regards,
>>>> Huang, Ying
>>>
>>> Thanks!
>>>
>>>> .
>>>>
>>>
>>>
>> .
>>
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-20 12:32           ` Miaohe Lin
@ 2022-06-20 13:46             ` Qian Cai
  2022-06-20 14:20               ` Muchun Song
  0 siblings, 1 reply; 31+ messages in thread
From: Qian Cai @ 2022-06-20 13:46 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: Muchun Song, Huang, Ying, akpm, david, linux-mm, linux-kernel

On Mon, Jun 20, 2022 at 08:32:27PM +0800, Miaohe Lin wrote:
> >>>>> --- a/mm/swapfile.c
> >>>>> +++ b/mm/swapfile.c
> >>>>> @@ -2646,7 +2646,7 @@ static int swap_show(struct seq_file *swap, void *v)
> >>>>>  	}
> >>>>>  
> >>>>>  	bytes = si->pages << (PAGE_SHIFT - 10);
> >>>>> -	inuse = si->inuse_pages << (PAGE_SHIFT - 10);
> >>>>> +	inuse = READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10);
> >>>>>  
> >>>>>  	file = si->swap_file;
> >>>>>  	len = seq_file_path(swap, file, " \t\n\\");
> >>>>> @@ -3265,7 +3265,7 @@ void si_swapinfo(struct sysinfo *val)
> >>>>>  		struct swap_info_struct *si = swap_info[type];
> >>>>>  
> >>>>>  		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
> >>>>> -			nr_to_be_unused += si->inuse_pages;
> >>>>> +			nr_to_be_unused += READ_ONCE(si->inuse_pages);
> >>>>>  	}
> >>>>>  	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
> >>>>>  	val->totalswap = total_swap_pages + nr_to_be_unused;
> >>>>
> >>>> READ_ONCE() should be paired with WRITE_ONCE().  So, change the writer
> >>>> side too?
> >>>
> >>> READ_ONCE() is used to fix the complaint of concurrent accessing to si->inuse_pages from KCSAN here.
> >>> The similar commit is 218209487c3d ("mm/swapfile: fix data races in try_to_unuse()"). IMHO, it's fine
> >>
> >> I think the fix 218209487c3d is incomplete. The write side in swap_range_free() should
> >> also be fixed. Otherwise, IIUC, it cannot stop KCSAN complaining.
> > 
> > I tend to agree with you. READ_ONCE() should be paired with WRITE_ONCE() theoretically. But WRITTE_ONCE()
> > is ignored while the commit is introduced. Add Qian Cai for helping verify it. It's very kind of @Qian Cai
> > if he could tell us whether WRITTE_ONCE() is ignored deliberately.

The write side should be protected by the lock swap_info_struct::lock. Is
that not the case here?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-20 13:46             ` Qian Cai
@ 2022-06-20 14:20               ` Muchun Song
  2022-06-20 21:36                 ` Qian Cai
  0 siblings, 1 reply; 31+ messages in thread
From: Muchun Song @ 2022-06-20 14:20 UTC (permalink / raw)
  To: Qian Cai; +Cc: Miaohe Lin, Huang, Ying, akpm, david, linux-mm, linux-kernel

On Mon, Jun 20, 2022 at 09:46:47AM -0400, Qian Cai wrote:
> On Mon, Jun 20, 2022 at 08:32:27PM +0800, Miaohe Lin wrote:
> > >>>>> --- a/mm/swapfile.c
> > >>>>> +++ b/mm/swapfile.c
> > >>>>> @@ -2646,7 +2646,7 @@ static int swap_show(struct seq_file *swap, void *v)
> > >>>>>  	}
> > >>>>>  
> > >>>>>  	bytes = si->pages << (PAGE_SHIFT - 10);
> > >>>>> -	inuse = si->inuse_pages << (PAGE_SHIFT - 10);
> > >>>>> +	inuse = READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10);
> > >>>>>  
> > >>>>>  	file = si->swap_file;
> > >>>>>  	len = seq_file_path(swap, file, " \t\n\\");
> > >>>>> @@ -3265,7 +3265,7 @@ void si_swapinfo(struct sysinfo *val)
> > >>>>>  		struct swap_info_struct *si = swap_info[type];
> > >>>>>  
> > >>>>>  		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
> > >>>>> -			nr_to_be_unused += si->inuse_pages;
> > >>>>> +			nr_to_be_unused += READ_ONCE(si->inuse_pages);
> > >>>>>  	}
> > >>>>>  	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
> > >>>>>  	val->totalswap = total_swap_pages + nr_to_be_unused;
> > >>>>
> > >>>> READ_ONCE() should be paired with WRITE_ONCE().  So, change the writer
> > >>>> side too?
> > >>>
> > >>> READ_ONCE() is used to fix the complaint of concurrent accessing to si->inuse_pages from KCSAN here.
> > >>> The similar commit is 218209487c3d ("mm/swapfile: fix data races in try_to_unuse()"). IMHO, it's fine
> > >>
> > >> I think the fix 218209487c3d is incomplete. The write side in swap_range_free() should
> > >> also be fixed. Otherwise, IIUC, it cannot stop KCSAN complaining.
> > > 
> > > I tend to agree with you. READ_ONCE() should be paired with WRITE_ONCE() theoretically. But WRITTE_ONCE()
> > > is ignored while the commit is introduced. Add Qian Cai for helping verify it. It's very kind of @Qian Cai
> > > if he could tell us whether WRITTE_ONCE() is ignored deliberately.
> 
> The write side should be protected by the lock swap_info_struct::lock. Is
> that not the case here?
>

The lock does not protect the read sides. So the write side should be
fixed by WRITTE_ONCE().

Thanks.
 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-20 14:20               ` Muchun Song
@ 2022-06-20 21:36                 ` Qian Cai
  2022-06-21  1:14                   ` Huang, Ying
  0 siblings, 1 reply; 31+ messages in thread
From: Qian Cai @ 2022-06-20 21:36 UTC (permalink / raw)
  To: Muchun Song; +Cc: Miaohe Lin, Huang, Ying, akpm, david, linux-mm, linux-kernel

On Mon, Jun 20, 2022 at 10:20:07PM +0800, Muchun Song wrote:
> The lock does not protect the read sides. So the write side should be
> fixed by WRITTE_ONCE().

https://lwn.net/Articles/816854/

"Unmarked writes (aligned and up to word size) can be treated as if they had
used WRITE_ONCE() by building with
CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=y (also selected by default).
Experience has shown that compilers are much less likely to destructively
optimize in-kernel writes than reads. Some developers might therefore
choose to use READ_ONCE() but omit the corresponding WRITE_ONCE(). Other
developers might prefer the documentation benefits and long-term peace of
mind accruing from explicit use of WRITE_ONCE()..."

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-20 21:36                 ` Qian Cai
@ 2022-06-21  1:14                   ` Huang, Ying
  2022-06-21  3:39                     ` Muchun Song
  0 siblings, 1 reply; 31+ messages in thread
From: Huang, Ying @ 2022-06-21  1:14 UTC (permalink / raw)
  To: Qian Cai, Miaohe Lin; +Cc: Muchun Song, akpm, david, linux-mm, linux-kernel

Qian Cai <quic_qiancai@quicinc.com> writes:

> On Mon, Jun 20, 2022 at 10:20:07PM +0800, Muchun Song wrote:
>> The lock does not protect the read sides. So the write side should be
>> fixed by WRITTE_ONCE().
>
> https://lwn.net/Articles/816854/
>
> "Unmarked writes (aligned and up to word size) can be treated as if they had
> used WRITE_ONCE() by building with
> CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=y (also selected by default).
> Experience has shown that compilers are much less likely to destructively
> optimize in-kernel writes than reads. Some developers might therefore
> choose to use READ_ONCE() but omit the corresponding WRITE_ONCE(). Other
> developers might prefer the documentation benefits and long-term peace of
> mind accruing from explicit use of WRITE_ONCE()..."

Thanks for pointing me to this great article.  So although not required
by KCSAN strictly, WRITE_ONCE() is still good for documentation, etc.
Just like we have done for swap_info_struct->highest_bit, etc.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-20 12:12     ` Miaohe Lin
@ 2022-06-21  1:35       ` Huang, Ying
  2022-06-21  7:37         ` Miaohe Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Huang, Ying @ 2022-06-21  1:35 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, david, linux-mm, linux-kernel

Miaohe Lin <linmiaohe@huawei.com> writes:

> On 2022/6/20 15:31, Huang, Ying wrote:
>> Miaohe Lin <linmiaohe@huawei.com> writes:
>> 
>>> security_vm_enough_memory_mm() checks whether a process has enough memory
>>> to allocate a new virtual mapping. And total_swap_pages is considered as
>>> available memory while swapoff tries to make sure there's enough memory
>>> that can hold the swapped out memory. But total_swap_pages contains the
>>> swap space that is being swapoff. So security_vm_enough_memory_mm() will
>>> success even if there's no memory to hold the swapped out memory because
>>> total_swap_pages always greater than or equal to p->pages.
>> 
>> Per my understanding, swapoff will not allocate virtual mapping by
>> itself.  But after swapoff, the overcommit limit could be exceeded.
>> security_vm_enough_memory_mm() is used to check that.  For example, in a
>> system with 4GB memory and 8GB swap, and 10GB is in use,
>> 
>> CommitLimit:    4+8 = 12GB
>> Committed_AS:   10GB
>> 
>> security_vm_enough_memory_mm() in swapoff() will fail because
>> 10+8 = 18 > 12.  This is expected because after swapoff, the overcommit
>> limit will be exceeded.
>> 
>> If 3GB is in use,
>> 
>> CommitLimit:    4+8 = 12GB
>> Committed_AS:   3GB
>> 
>> security_vm_enough_memory_mm() in swapoff() will succeed because
>> 3+8 = 11 < 12.  This is expected because after swapoff, the overcommit
>> limit will not be exceeded.
>
> In OVERCOMMIT_NEVER scene, I think you're right.
>
>> 
>> So, what's the real problem of the original implementation?  Can you
>> show it with an example as above?
>
> In OVERCOMMIT_GUESS scene, in a system with 4GB memory and 8GB swap, and 10GB is in use,
> pages below is 8GB, totalram_pages() + total_swap_pages is 12GB, so swapoff() will succeed
> instead of expected failure because 8 < 12. The overcommit limit is always *ignored* in the
> below case.
>
> 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
> 		if (pages > totalram_pages() + total_swap_pages)
> 			goto error;
> 		return 0;
> 	}
>
> Or am I miss something?

Per my understanding, with OVERCOMMIT_GUESS, the number of in-use pages
isn't checked at all.  The only restriction is that the size of the
virtual mapping created should be less than total RAM + total swap
pages.  Because swapoff() will not create virtual mapping, so it's
expected that security_vm_enough_memory_mm() in swapoff() always
succeeds.

Best Regards,
Huang, Ying

>
> Thanks!
>
>> 
>>> In order to fix it, p->pages should be retracted from total_swap_pages
>>> first and then check whether there's enough memory for inuse swap pages.
>>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> 
>> [snip]
>> 
>> .
>> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-21  1:14                   ` Huang, Ying
@ 2022-06-21  3:39                     ` Muchun Song
  2022-06-21  6:40                       ` Miaohe Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Muchun Song @ 2022-06-21  3:39 UTC (permalink / raw)
  To: Huang, Ying, Qian Cai; +Cc: Miaohe Lin, akpm, david, linux-mm, linux-kernel

On Tue, Jun 21, 2022 at 09:14:00AM +0800, Huang, Ying wrote:
> Qian Cai <quic_qiancai@quicinc.com> writes:
> 
> > On Mon, Jun 20, 2022 at 10:20:07PM +0800, Muchun Song wrote:
> >> The lock does not protect the read sides. So the write side should be
> >> fixed by WRITTE_ONCE().
> >
> > https://lwn.net/Articles/816854/
> >
> > "Unmarked writes (aligned and up to word size) can be treated as if they had
> > used WRITE_ONCE() by building with
> > CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=y (also selected by default).

All right, CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC help us avoid KCSAN
complaining.

> > Experience has shown that compilers are much less likely to destructively
> > optimize in-kernel writes than reads. Some developers might therefore
> > choose to use READ_ONCE() but omit the corresponding WRITE_ONCE(). Other
> > developers might prefer the documentation benefits and long-term peace of
> > mind accruing from explicit use of WRITE_ONCE()..."
> 
> Thanks for pointing me to this great article.  So although not required
> by KCSAN strictly, WRITE_ONCE() is still good for documentation, etc.
> Just like we have done for swap_info_struct->highest_bit, etc.
> 

+1

> Best Regards,
> Huang, Ying
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages
  2022-06-21  3:39                     ` Muchun Song
@ 2022-06-21  6:40                       ` Miaohe Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Miaohe Lin @ 2022-06-21  6:40 UTC (permalink / raw)
  To: Muchun Song, Huang, Ying, Qian Cai; +Cc: akpm, david, linux-mm, linux-kernel

On 2022/6/21 11:39, Muchun Song wrote:
> On Tue, Jun 21, 2022 at 09:14:00AM +0800, Huang, Ying wrote:
>> Qian Cai <quic_qiancai@quicinc.com> writes:
>>
>>> On Mon, Jun 20, 2022 at 10:20:07PM +0800, Muchun Song wrote:
>>>> The lock does not protect the read sides. So the write side should be
>>>> fixed by WRITTE_ONCE().
>>>
>>> https://lwn.net/Articles/816854/
>>>
>>> "Unmarked writes (aligned and up to word size) can be treated as if they had
>>> used WRITE_ONCE() by building with
>>> CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=y (also selected by default).
> 
> All right, CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC help us avoid KCSAN
> complaining.
> 
>>> Experience has shown that compilers are much less likely to destructively
>>> optimize in-kernel writes than reads. Some developers might therefore
>>> choose to use READ_ONCE() but omit the corresponding WRITE_ONCE(). Other
>>> developers might prefer the documentation benefits and long-term peace of
>>> mind accruing from explicit use of WRITE_ONCE()..."
>>
>> Thanks for pointing me to this great article.  So although not required
>> by KCSAN strictly, WRITE_ONCE() is still good for documentation, etc.
>> Just like we have done for swap_info_struct->highest_bit, etc.
>>
> 
> +1

I tend to agree with Muchun & Huang, Ying. Thanks all of you.

> 
>> Best Regards,
>> Huang, Ying
>>
> .
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-21  1:35       ` Huang, Ying
@ 2022-06-21  7:37         ` Miaohe Lin
  2022-06-21  7:42           ` Huang, Ying
  0 siblings, 1 reply; 31+ messages in thread
From: Miaohe Lin @ 2022-06-21  7:37 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, david, linux-mm, linux-kernel

On 2022/6/21 9:35, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> On 2022/6/20 15:31, Huang, Ying wrote:
>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>
>>>> security_vm_enough_memory_mm() checks whether a process has enough memory
>>>> to allocate a new virtual mapping. And total_swap_pages is considered as
>>>> available memory while swapoff tries to make sure there's enough memory
>>>> that can hold the swapped out memory. But total_swap_pages contains the
>>>> swap space that is being swapoff. So security_vm_enough_memory_mm() will
>>>> success even if there's no memory to hold the swapped out memory because
>>>> total_swap_pages always greater than or equal to p->pages.
>>>
>>> Per my understanding, swapoff will not allocate virtual mapping by
>>> itself.  But after swapoff, the overcommit limit could be exceeded.
>>> security_vm_enough_memory_mm() is used to check that.  For example, in a
>>> system with 4GB memory and 8GB swap, and 10GB is in use,
>>>
>>> CommitLimit:    4+8 = 12GB
>>> Committed_AS:   10GB
>>>
>>> security_vm_enough_memory_mm() in swapoff() will fail because
>>> 10+8 = 18 > 12.  This is expected because after swapoff, the overcommit
>>> limit will be exceeded.
>>>
>>> If 3GB is in use,
>>>
>>> CommitLimit:    4+8 = 12GB
>>> Committed_AS:   3GB
>>>
>>> security_vm_enough_memory_mm() in swapoff() will succeed because
>>> 3+8 = 11 < 12.  This is expected because after swapoff, the overcommit
>>> limit will not be exceeded.
>>
>> In OVERCOMMIT_NEVER scene, I think you're right.
>>
>>>
>>> So, what's the real problem of the original implementation?  Can you
>>> show it with an example as above?
>>
>> In OVERCOMMIT_GUESS scene, in a system with 4GB memory and 8GB swap, and 10GB is in use,
>> pages below is 8GB, totalram_pages() + total_swap_pages is 12GB, so swapoff() will succeed
>> instead of expected failure because 8 < 12. The overcommit limit is always *ignored* in the
>> below case.
>>
>> 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
>> 		if (pages > totalram_pages() + total_swap_pages)
>> 			goto error;
>> 		return 0;
>> 	}
>>
>> Or am I miss something?
> 
> Per my understanding, with OVERCOMMIT_GUESS, the number of in-use pages
> isn't checked at all.  The only restriction is that the size of the
> virtual mapping created should be less than total RAM + total swap

Do you mean the only restriction is that the size of the virtual mapping
*created every time* should be less than total RAM + total swap pages but
*total virtual mapping* is not limited in OVERCOMMIT_GUESS scene? If so,
the current behavior should be sane and I will drop this patch.

Thanks!

> pages.  Because swapoff() will not create virtual mapping, so it's
> expected that security_vm_enough_memory_mm() in swapoff() always
> succeeds.
> 
> Best Regards,
> Huang, Ying
> 
>>
>> Thanks!
>>
>>>
>>>> In order to fix it, p->pages should be retracted from total_swap_pages
>>>> first and then check whether there's enough memory for inuse swap pages.
>>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>
>>> [snip]
>>>
>>> .
>>>
> 
> .
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-21  7:37         ` Miaohe Lin
@ 2022-06-21  7:42           ` Huang, Ying
  2022-06-21  8:20             ` Miaohe Lin
  0 siblings, 1 reply; 31+ messages in thread
From: Huang, Ying @ 2022-06-21  7:42 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, david, linux-mm, linux-kernel

Miaohe Lin <linmiaohe@huawei.com> writes:

> On 2022/6/21 9:35, Huang, Ying wrote:
>> Miaohe Lin <linmiaohe@huawei.com> writes:
>> 
>>> On 2022/6/20 15:31, Huang, Ying wrote:
>>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>>
>>>>> security_vm_enough_memory_mm() checks whether a process has enough memory
>>>>> to allocate a new virtual mapping. And total_swap_pages is considered as
>>>>> available memory while swapoff tries to make sure there's enough memory
>>>>> that can hold the swapped out memory. But total_swap_pages contains the
>>>>> swap space that is being swapoff. So security_vm_enough_memory_mm() will
>>>>> success even if there's no memory to hold the swapped out memory because
>>>>> total_swap_pages always greater than or equal to p->pages.
>>>>
>>>> Per my understanding, swapoff will not allocate virtual mapping by
>>>> itself.  But after swapoff, the overcommit limit could be exceeded.
>>>> security_vm_enough_memory_mm() is used to check that.  For example, in a
>>>> system with 4GB memory and 8GB swap, and 10GB is in use,
>>>>
>>>> CommitLimit:    4+8 = 12GB
>>>> Committed_AS:   10GB
>>>>
>>>> security_vm_enough_memory_mm() in swapoff() will fail because
>>>> 10+8 = 18 > 12.  This is expected because after swapoff, the overcommit
>>>> limit will be exceeded.
>>>>
>>>> If 3GB is in use,
>>>>
>>>> CommitLimit:    4+8 = 12GB
>>>> Committed_AS:   3GB
>>>>
>>>> security_vm_enough_memory_mm() in swapoff() will succeed because
>>>> 3+8 = 11 < 12.  This is expected because after swapoff, the overcommit
>>>> limit will not be exceeded.
>>>
>>> In OVERCOMMIT_NEVER scene, I think you're right.
>>>
>>>>
>>>> So, what's the real problem of the original implementation?  Can you
>>>> show it with an example as above?
>>>
>>> In OVERCOMMIT_GUESS scene, in a system with 4GB memory and 8GB swap, and 10GB is in use,
>>> pages below is 8GB, totalram_pages() + total_swap_pages is 12GB, so swapoff() will succeed
>>> instead of expected failure because 8 < 12. The overcommit limit is always *ignored* in the
>>> below case.
>>>
>>> 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
>>> 		if (pages > totalram_pages() + total_swap_pages)
>>> 			goto error;
>>> 		return 0;
>>> 	}
>>>
>>> Or am I miss something?
>> 
>> Per my understanding, with OVERCOMMIT_GUESS, the number of in-use pages
>> isn't checked at all.  The only restriction is that the size of the
>> virtual mapping created should be less than total RAM + total swap
>
> Do you mean the only restriction is that the size of the virtual mapping
> *created every time* should be less than total RAM + total swap pages but
> *total virtual mapping* is not limited in OVERCOMMIT_GUESS scene? If so,
> the current behavior should be sane and I will drop this patch.

Yes.  This is my understanding.

Best Regards,
Huang, Ying

> Thanks!
>
>> pages.  Because swapoff() will not create virtual mapping, so it's
>> expected that security_vm_enough_memory_mm() in swapoff() always
>> succeeds.
>> 
>> Best Regards,
>> Huang, Ying
>> 
>>>
>>> Thanks!
>>>
>>>>
>>>>> In order to fix it, p->pages should be retracted from total_swap_pages
>>>>> first and then check whether there's enough memory for inuse swap pages.
>>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>
>>>> [snip]
>>>>
>>>> .
>>>>
>> 
>> .
>> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected
  2022-06-21  7:42           ` Huang, Ying
@ 2022-06-21  8:20             ` Miaohe Lin
  0 siblings, 0 replies; 31+ messages in thread
From: Miaohe Lin @ 2022-06-21  8:20 UTC (permalink / raw)
  To: Huang, Ying; +Cc: akpm, david, linux-mm, linux-kernel

On 2022/6/21 15:42, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
> 
>> On 2022/6/21 9:35, Huang, Ying wrote:
>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>
>>>> On 2022/6/20 15:31, Huang, Ying wrote:
>>>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>>>
>>>>>> security_vm_enough_memory_mm() checks whether a process has enough memory
>>>>>> to allocate a new virtual mapping. And total_swap_pages is considered as
>>>>>> available memory while swapoff tries to make sure there's enough memory
>>>>>> that can hold the swapped out memory. But total_swap_pages contains the
>>>>>> swap space that is being swapoff. So security_vm_enough_memory_mm() will
>>>>>> success even if there's no memory to hold the swapped out memory because
>>>>>> total_swap_pages always greater than or equal to p->pages.
>>>>>
>>>>> Per my understanding, swapoff will not allocate virtual mapping by
>>>>> itself.  But after swapoff, the overcommit limit could be exceeded.
>>>>> security_vm_enough_memory_mm() is used to check that.  For example, in a
>>>>> system with 4GB memory and 8GB swap, and 10GB is in use,
>>>>>
>>>>> CommitLimit:    4+8 = 12GB
>>>>> Committed_AS:   10GB
>>>>>
>>>>> security_vm_enough_memory_mm() in swapoff() will fail because
>>>>> 10+8 = 18 > 12.  This is expected because after swapoff, the overcommit
>>>>> limit will be exceeded.
>>>>>
>>>>> If 3GB is in use,
>>>>>
>>>>> CommitLimit:    4+8 = 12GB
>>>>> Committed_AS:   3GB
>>>>>
>>>>> security_vm_enough_memory_mm() in swapoff() will succeed because
>>>>> 3+8 = 11 < 12.  This is expected because after swapoff, the overcommit
>>>>> limit will not be exceeded.
>>>>
>>>> In OVERCOMMIT_NEVER scene, I think you're right.
>>>>
>>>>>
>>>>> So, what's the real problem of the original implementation?  Can you
>>>>> show it with an example as above?
>>>>
>>>> In OVERCOMMIT_GUESS scene, in a system with 4GB memory and 8GB swap, and 10GB is in use,
>>>> pages below is 8GB, totalram_pages() + total_swap_pages is 12GB, so swapoff() will succeed
>>>> instead of expected failure because 8 < 12. The overcommit limit is always *ignored* in the
>>>> below case.
>>>>
>>>> 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
>>>> 		if (pages > totalram_pages() + total_swap_pages)
>>>> 			goto error;
>>>> 		return 0;
>>>> 	}
>>>>
>>>> Or am I miss something?
>>>
>>> Per my understanding, with OVERCOMMIT_GUESS, the number of in-use pages
>>> isn't checked at all.  The only restriction is that the size of the
>>> virtual mapping created should be less than total RAM + total swap
>>
>> Do you mean the only restriction is that the size of the virtual mapping
>> *created every time* should be less than total RAM + total swap pages but
>> *total virtual mapping* is not limited in OVERCOMMIT_GUESS scene? If so,
>> the current behavior should be sane and I will drop this patch.
> 
> Yes.  This is my understanding.

I see. Thank you.

> 
> Best Regards,
> Huang, Ying
> 
>> Thanks!
>>
>>> pages.  Because swapoff() will not create virtual mapping, so it's
>>> expected that security_vm_enough_memory_mm() in swapoff() always
>>> succeeds.
>>>
>>> Best Regards,
>>> Huang, Ying
>>>
>>>>
>>>> Thanks!
>>>>
>>>>>
>>>>>> In order to fix it, p->pages should be retracted from total_swap_pages
>>>>>> first and then check whether there's enough memory for inuse swap pages.
>>>>>>
>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>
>>>>> [snip]
>>>>>
>>>>> .
>>>>>
>>>
>>> .
>>>
> 
> .
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2022-06-21  8:21 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-08 14:40 [PATCH v2 0/3] A few cleanup and fixup patches for swap Miaohe Lin
2022-06-08 14:40 ` [PATCH v2 1/3] mm/swapfile: make security_vm_enough_memory_mm() work as expected Miaohe Lin
2022-06-17  7:33   ` David Hildenbrand
2022-06-18  2:43     ` Miaohe Lin
2022-06-18  7:10       ` David Hildenbrand
2022-06-18  7:31         ` Miaohe Lin
2022-06-20  7:31   ` Huang, Ying
2022-06-20 12:12     ` Miaohe Lin
2022-06-21  1:35       ` Huang, Ying
2022-06-21  7:37         ` Miaohe Lin
2022-06-21  7:42           ` Huang, Ying
2022-06-21  8:20             ` Miaohe Lin
2022-06-08 14:40 ` [PATCH v2 2/3] mm/swapfile: fix possible data races of inuse_pages Miaohe Lin
2022-06-20  7:54   ` Huang, Ying
2022-06-20  9:04     ` Miaohe Lin
2022-06-20  9:23       ` Muchun Song
2022-06-20 12:23         ` Miaohe Lin
2022-06-20 12:32           ` Miaohe Lin
2022-06-20 13:46             ` Qian Cai
2022-06-20 14:20               ` Muchun Song
2022-06-20 21:36                 ` Qian Cai
2022-06-21  1:14                   ` Huang, Ying
2022-06-21  3:39                     ` Muchun Song
2022-06-21  6:40                       ` Miaohe Lin
2022-06-08 14:40 ` [PATCH v2 3/3] mm/swap: remove swap_cache_info statistics Miaohe Lin
2022-06-08 15:16   ` David Hildenbrand
2022-06-20  8:08   ` Huang, Ying
2022-06-20  9:05     ` Miaohe Lin
2022-06-20  9:30   ` Muchun Song
2022-06-17  2:37 ` [PATCH v2 0/3] A few cleanup and fixup patches for swap Andrew Morton
2022-06-17  3:00   ` Miaohe Lin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.