All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/3] A few fixup patches for mm
@ 2022-04-24  9:11 Miaohe Lin
  2022-04-24  9:11 ` [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails Miaohe Lin
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Miaohe Lin @ 2022-04-24  9:11 UTC (permalink / raw)
  To: akpm
  Cc: willy, vbabka, dhowells, neilb, david, apopple, surenb, minchan,
	peterx, sfr, naoya.horiguchi, linux-mm, linux-kernel, linmiaohe

Hi everyone,
This series contains a few patches to avoid mapping random data if swap
read fails and fix lost swap bits in unuse_pte. Also we free hwpoison and
swapin error entry in madvise_free_pte_range. More details can be found
in the respective changelogs. Thanks!

---
v3:
  collect Acked-by tag per David
  remove unneeded pte wrprotect per David
v2:
  make the terminology consistent and collect Acked-by tag per David
  fix lost swap bits in unuse_pte per Peter
  free hwpoison and swapin error entry per Alistair
  Many thanks Alistair, David and Peter for review!
---
Miaohe Lin (3):
  mm/swapfile: unuse_pte can map random data if swap read fails
  mm/swapfile: Fix lost swap bits in unuse_pte()
  mm/madvise: free hwpoison and swapin error entry in
    madvise_free_pte_range

 include/linux/swap.h    |  7 ++++++-
 include/linux/swapops.h | 10 ++++++++++
 mm/madvise.c            | 13 ++++++++-----
 mm/memory.c             |  5 ++++-
 mm/swapfile.c           | 21 ++++++++++++++++++---
 5 files changed, 46 insertions(+), 10 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-24  9:11 [PATCH v3 0/3] A few fixup patches for mm Miaohe Lin
@ 2022-04-24  9:11 ` Miaohe Lin
  2022-04-25  1:08   ` HORIGUCHI NAOYA(堀口 直也)
                     ` (2 more replies)
  2022-04-24  9:11 ` [PATCH v3 2/3] mm/swapfile: Fix lost swap bits in unuse_pte() Miaohe Lin
  2022-04-24  9:11 ` [PATCH v3 3/3] mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range Miaohe Lin
  2 siblings, 3 replies; 25+ messages in thread
From: Miaohe Lin @ 2022-04-24  9:11 UTC (permalink / raw)
  To: akpm
  Cc: willy, vbabka, dhowells, neilb, david, apopple, surenb, minchan,
	peterx, sfr, naoya.horiguchi, linux-mm, linux-kernel, linmiaohe

There is a bug in unuse_pte(): when swap page happens to be unreadable,
page filled with random data is mapped into user address space.  In case
of error, a special swap entry indicating swap read fails is set to the
page table.  So the swapcache page can be freed and the user won't end up
with a permanently mounted swap because a sector is bad.  And if the page
is accessed later, the user process will be killed so that corrupted data
is never consumed.  On the other hand, if the page is never accessed, the
user won't even notice it.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 include/linux/swap.h    |  7 ++++++-
 include/linux/swapops.h | 10 ++++++++++
 mm/memory.c             |  5 ++++-
 mm/swapfile.c           | 11 +++++++++++
 4 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5553189d0215..b82c196d8867 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -55,6 +55,10 @@ static inline int current_is_kswapd(void)
  * actions on faults.
  */
 
+#define SWP_SWAPIN_ERROR_NUM 1
+#define SWP_SWAPIN_ERROR     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
+			     SWP_MIGRATION_NUM + SWP_DEVICE_NUM + \
+			     SWP_PTE_MARKER_NUM)
 /*
  * PTE markers are used to persist information onto PTEs that are mapped with
  * file-backed memories.  As its name "PTE" hints, it should only be applied to
@@ -120,7 +124,8 @@ static inline int current_is_kswapd(void)
 
 #define MAX_SWAPFILES \
 	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
-	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - SWP_PTE_MARKER_NUM)
+	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - \
+	SWP_PTE_MARKER_NUM - SWP_SWAPIN_ERROR_NUM)
 
 /*
  * Magic header for a swap area. The first part of the union is
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index a291f210e7f8..9d989ed049a6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -108,6 +108,16 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 	return xa_mk_value(entry.val);
 }
 
+static inline swp_entry_t make_swapin_error_entry(struct page *page)
+{
+	return swp_entry(SWP_SWAPIN_ERROR, page_to_pfn(page));
+}
+
+static inline int is_swapin_error_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_SWAPIN_ERROR;
+}
+
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
 static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
diff --git a/mm/memory.c b/mm/memory.c
index f4161fb07ffa..626f63858e0c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1488,7 +1488,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			/* Only drop the uffd-wp marker if explicitly requested */
 			if (!zap_drop_file_uffd_wp(details))
 				continue;
-		} else if (is_hwpoison_entry(entry)) {
+		} else if (is_hwpoison_entry(entry) ||
+			   is_swapin_error_entry(entry)) {
 			if (!should_zap_cows(details))
 				continue;
 		} else {
@@ -3728,6 +3729,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
+		} else if (is_swapin_error_entry(entry)) {
+			ret = VM_FAULT_SIGBUS;
 		} else if (is_pte_marker_entry(entry)) {
 			ret = handle_pte_marker(vmf);
 		} else {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9398e915b36b..95b63f69f388 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1797,6 +1797,17 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		goto out;
 	}
 
+	if (unlikely(!PageUptodate(page))) {
+		pte_t pteval;
+
+		dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+		pteval = swp_entry_to_pte(make_swapin_error_entry(page));
+		set_pte_at(vma->vm_mm, addr, pte, pteval);
+		swap_free(entry);
+		ret = 0;
+		goto out;
+	}
+
 	/* See do_swap_page() */
 	BUG_ON(!PageAnon(page) && PageMappedToDisk(page));
 	BUG_ON(PageAnon(page) && PageAnonExclusive(page));
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 2/3] mm/swapfile: Fix lost swap bits in unuse_pte()
  2022-04-24  9:11 [PATCH v3 0/3] A few fixup patches for mm Miaohe Lin
  2022-04-24  9:11 ` [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails Miaohe Lin
@ 2022-04-24  9:11 ` Miaohe Lin
  2022-04-25  7:39   ` David Hildenbrand
  2022-04-24  9:11 ` [PATCH v3 3/3] mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range Miaohe Lin
  2 siblings, 1 reply; 25+ messages in thread
From: Miaohe Lin @ 2022-04-24  9:11 UTC (permalink / raw)
  To: akpm
  Cc: willy, vbabka, dhowells, neilb, david, apopple, surenb, minchan,
	peterx, sfr, naoya.horiguchi, linux-mm, linux-kernel, linmiaohe

This is observed by code review only but not any real report.

When we turn off swapping we could have lost the bits stored in the swap
ptes. The new rmap-exclusive bit is fine since that turned into a page
flag, but not for soft-dirty and uffd-wp. Add them.

Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/swapfile.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 95b63f69f388..522a0eb16bf1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1783,7 +1783,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 {
 	struct page *swapcache;
 	spinlock_t *ptl;
-	pte_t *pte;
+	pte_t *pte, new_pte;
 	int ret = 1;
 
 	swapcache = page;
@@ -1832,8 +1832,12 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		page_add_new_anon_rmap(page, vma, addr);
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	}
-	set_pte_at(vma->vm_mm, addr, pte,
-		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
+	new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
+	if (pte_swp_soft_dirty(*pte))
+		new_pte = pte_mksoft_dirty(new_pte);
+	if (pte_swp_uffd_wp(*pte))
+		new_pte = pte_mkuffd_wp(new_pte);
+	set_pte_at(vma->vm_mm, addr, pte, new_pte);
 	swap_free(entry);
 out:
 	pte_unmap_unlock(pte, ptl);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 3/3] mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range
  2022-04-24  9:11 [PATCH v3 0/3] A few fixup patches for mm Miaohe Lin
  2022-04-24  9:11 ` [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails Miaohe Lin
  2022-04-24  9:11 ` [PATCH v3 2/3] mm/swapfile: Fix lost swap bits in unuse_pte() Miaohe Lin
@ 2022-04-24  9:11 ` Miaohe Lin
  2022-04-24 23:41   ` HORIGUCHI NAOYA(堀口 直也)
  2 siblings, 1 reply; 25+ messages in thread
From: Miaohe Lin @ 2022-04-24  9:11 UTC (permalink / raw)
  To: akpm
  Cc: willy, vbabka, dhowells, neilb, david, apopple, surenb, minchan,
	peterx, sfr, naoya.horiguchi, linux-mm, linux-kernel, linmiaohe

Once the MADV_FREE operation has succeeded, callers can expect they might
get zero-fill pages if accessing the memory again. Therefore it should be
safe to delete the hwpoison entry and swapin error entry. There is no
reason to kill the process if it has called MADV_FREE on the range.

Suggested-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/madvise.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d6592488b51..5f4537511532 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -624,11 +624,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			swp_entry_t entry;
 
 			entry = pte_to_swp_entry(ptent);
-			if (non_swap_entry(entry))
-				continue;
-			nr_swap--;
-			free_swap_and_cache(entry);
-			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			if (!non_swap_entry(entry)) {
+				nr_swap--;
+				free_swap_and_cache(entry);
+				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			} else if (is_hwpoison_entry(entry) ||
+				   is_swapin_error_entry(entry)) {
+				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+			}
 			continue;
 		}
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/3] mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range
  2022-04-24  9:11 ` [PATCH v3 3/3] mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range Miaohe Lin
@ 2022-04-24 23:41   ` HORIGUCHI NAOYA(堀口 直也)
  2022-04-25  1:59     ` Miaohe Lin
  0 siblings, 1 reply; 25+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-04-24 23:41 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: akpm, willy, vbabka, dhowells, neilb, david, apopple, surenb,
	minchan, peterx, sfr, linux-mm, linux-kernel

On Sun, Apr 24, 2022 at 05:11:05PM +0800, Miaohe Lin wrote:
> Once the MADV_FREE operation has succeeded, callers can expect they might
> get zero-fill pages if accessing the memory again. Therefore it should be
> safe to delete the hwpoison entry and swapin error entry. There is no
> reason to kill the process if it has called MADV_FREE on the range.
> 
> Suggested-by: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Acked-by: David Hildenbrand <david@redhat.com>

I confirmed that hwpoison entry is properly removed with madvise(MADV_FREE)
with this patch. This provides applications with the ability to recover from
memory errors in simpler way (applications don't have to munmap then mmap again
the error address). That's a good small improvement. Thank you.

Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

> ---
>  mm/madvise.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 4d6592488b51..5f4537511532 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -624,11 +624,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			swp_entry_t entry;
>  
>  			entry = pte_to_swp_entry(ptent);
> -			if (non_swap_entry(entry))
> -				continue;
> -			nr_swap--;
> -			free_swap_and_cache(entry);
> -			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +			if (!non_swap_entry(entry)) {
> +				nr_swap--;
> +				free_swap_and_cache(entry);
> +				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +			} else if (is_hwpoison_entry(entry) ||
> +				   is_swapin_error_entry(entry)) {
> +				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
> +			}
>  			continue;
>  		}
>  
> -- 
> 2.23.0

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-24  9:11 ` [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails Miaohe Lin
@ 2022-04-25  1:08   ` HORIGUCHI NAOYA(堀口 直也)
  2022-04-25  2:20     ` Miaohe Lin
  2022-04-25  7:45     ` David Hildenbrand
  2022-04-25  7:41   ` ying.huang
  2022-05-10  6:17   ` HORIGUCHI NAOYA(堀口 直也)
  2 siblings, 2 replies; 25+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-04-25  1:08 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: akpm, willy, vbabka, dhowells, neilb, david, apopple, surenb,
	minchan, peterx, sfr, linux-mm, linux-kernel

On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
> There is a bug in unuse_pte(): when swap page happens to be unreadable,
> page filled with random data is mapped into user address space.  In case
> of error, a special swap entry indicating swap read fails is set to the
> page table.  So the swapcache page can be freed and the user won't end up
> with a permanently mounted swap because a sector is bad.  And if the page
> is accessed later, the user process will be killed so that corrupted data
> is never consumed.  On the other hand, if the page is never accessed, the
> user won't even notice it.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Acked-by: David Hildenbrand <david@redhat.com>

Hi Miaohe,

This bug sounds relatively serious to me, and it seems old, so is it worth
sending to -stable?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/3] mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range
  2022-04-24 23:41   ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-04-25  1:59     ` Miaohe Lin
  0 siblings, 0 replies; 25+ messages in thread
From: Miaohe Lin @ 2022-04-25  1:59 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, willy, vbabka, dhowells, neilb, david, apopple, surenb,
	minchan, peterx, sfr, linux-mm, linux-kernel

On 2022/4/25 7:41, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Sun, Apr 24, 2022 at 05:11:05PM +0800, Miaohe Lin wrote:
>> Once the MADV_FREE operation has succeeded, callers can expect they might
>> get zero-fill pages if accessing the memory again. Therefore it should be
>> safe to delete the hwpoison entry and swapin error entry. There is no
>> reason to kill the process if it has called MADV_FREE on the range.
>>
>> Suggested-by: Alistair Popple <apopple@nvidia.com>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
> 
> I confirmed that hwpoison entry is properly removed with madvise(MADV_FREE)
> with this patch. This provides applications with the ability to recover from
> memory errors in simpler way (applications don't have to munmap then mmap again
> the error address). That's a good small improvement. Thank you.
> 
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Many thanks for your testing and review! :)

> 
>> ---
>>  mm/madvise.c | 13 ++++++++-----
>>  1 file changed, 8 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 4d6592488b51..5f4537511532 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -624,11 +624,14 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>  			swp_entry_t entry;
>>  
>>  			entry = pte_to_swp_entry(ptent);
>> -			if (non_swap_entry(entry))
>> -				continue;
>> -			nr_swap--;
>> -			free_swap_and_cache(entry);
>> -			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> +			if (!non_swap_entry(entry)) {
>> +				nr_swap--;
>> +				free_swap_and_cache(entry);
>> +				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> +			} else if (is_hwpoison_entry(entry) ||
>> +				   is_swapin_error_entry(entry)) {
>> +				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
>> +			}
>>  			continue;
>>  		}
>>  
>> -- 
>> 2.23.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  1:08   ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-04-25  2:20     ` Miaohe Lin
  2022-04-25  2:51       ` HORIGUCHI NAOYA(堀口 直也)
  2022-04-25  7:45     ` David Hildenbrand
  1 sibling, 1 reply; 25+ messages in thread
From: Miaohe Lin @ 2022-04-25  2:20 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, willy, vbabka, dhowells, neilb, david, apopple, surenb,
	minchan, peterx, sfr, linux-mm, linux-kernel

On 2022/4/25 9:08, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>> page filled with random data is mapped into user address space.  In case
>> of error, a special swap entry indicating swap read fails is set to the
>> page table.  So the swapcache page can be freed and the user won't end up
>> with a permanently mounted swap because a sector is bad.  And if the page
>> is accessed later, the user process will be killed so that corrupted data
>> is never consumed.  On the other hand, if the page is never accessed, the
>> user won't even notice it.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
> 
> Hi Miaohe,
> 
> This bug sounds relatively serious to me, and it seems old, so is it worth
> sending to -stable?

This bug is really old but it's never seen yet because swapoff is supposed only to
be done before rebooting the system. But swapoff can happen anytime. Poor guys might
come across it and get wrong data. So I think it's worth sending to -stable.

BTW: This patch should be revised in order to go to the stable version.

Thanks!

> 
> Thanks,
> Naoya Horiguchi
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  2:20     ` Miaohe Lin
@ 2022-04-25  2:51       ` HORIGUCHI NAOYA(堀口 直也)
  2022-04-25  3:10         ` Miaohe Lin
  0 siblings, 1 reply; 25+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-04-25  2:51 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: akpm, willy, vbabka, dhowells, neilb, david, apopple, surenb,
	minchan, peterx, sfr, linux-mm, linux-kernel

On Mon, Apr 25, 2022 at 10:20:23AM +0800, Miaohe Lin wrote:
> On 2022/4/25 9:08, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
> >> There is a bug in unuse_pte(): when swap page happens to be unreadable,
> >> page filled with random data is mapped into user address space.  In case
> >> of error, a special swap entry indicating swap read fails is set to the
> >> page table.  So the swapcache page can be freed and the user won't end up
> >> with a permanently mounted swap because a sector is bad.  And if the page
> >> is accessed later, the user process will be killed so that corrupted data
> >> is never consumed.  On the other hand, if the page is never accessed, the
> >> user won't even notice it.
> >>
> >> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >> Acked-by: David Hildenbrand <david@redhat.com>
> > 
> > Hi Miaohe,
> > 
> > This bug sounds relatively serious to me, and it seems old, so is it worth
> > sending to -stable?
> 
> This bug is really old but it's never seen yet because swapoff is supposed only to
> be done before rebooting the system. But swapoff can happen anytime. Poor guys might
> come across it and get wrong data. So I think it's worth sending to -stable.
> 
> BTW: This patch should be revised in order to go to the stable version.

I sometimes have the same wonder, but I'm not sure about the rule.  If you
choose to send another version, could you update subject line (subject line
is supposed to show what the patch does rather than what the problem is).

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  2:51       ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-04-25  3:10         ` Miaohe Lin
  0 siblings, 0 replies; 25+ messages in thread
From: Miaohe Lin @ 2022-04-25  3:10 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, willy, vbabka, dhowells, neilb, david, apopple, surenb,
	minchan, peterx, sfr, linux-mm, linux-kernel

On 2022/4/25 10:51, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Apr 25, 2022 at 10:20:23AM +0800, Miaohe Lin wrote:
>> On 2022/4/25 9:08, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
>>>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>>>> page filled with random data is mapped into user address space.  In case
>>>> of error, a special swap entry indicating swap read fails is set to the
>>>> page table.  So the swapcache page can be freed and the user won't end up
>>>> with a permanently mounted swap because a sector is bad.  And if the page
>>>> is accessed later, the user process will be killed so that corrupted data
>>>> is never consumed.  On the other hand, if the page is never accessed, the
>>>> user won't even notice it.
>>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>
>>> Hi Miaohe,
>>>
>>> This bug sounds relatively serious to me, and it seems old, so is it worth
>>> sending to -stable?
>>
>> This bug is really old but it's never seen yet because swapoff is supposed only to
>> be done before rebooting the system. But swapoff can happen anytime. Poor guys might
>> come across it and get wrong data. So I think it's worth sending to -stable.
>>
>> BTW: This patch should be revised in order to go to the stable version.
> 
> I sometimes have the same wonder, but I'm not sure about the rule.  If you
> choose to send another version, could you update subject line (subject line

What I mean is that SWP_PTE_MARKER is newly added and it will conflict with the stable version.
So this patch might need to be revised for specified stable version in order to fix the possible
conflict beforehand. Or that should be done when it goes to the stable ?

> is supposed to show what the patch does rather than what the problem is).

If a specified version for stable is required, I will do this.

Thanks!

> 
> Thanks,
> Naoya Horiguchi
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/3] mm/swapfile: Fix lost swap bits in unuse_pte()
  2022-04-24  9:11 ` [PATCH v3 2/3] mm/swapfile: Fix lost swap bits in unuse_pte() Miaohe Lin
@ 2022-04-25  7:39   ` David Hildenbrand
  0 siblings, 0 replies; 25+ messages in thread
From: David Hildenbrand @ 2022-04-25  7:39 UTC (permalink / raw)
  To: Miaohe Lin, akpm
  Cc: willy, vbabka, dhowells, neilb, apopple, surenb, minchan, peterx,
	sfr, naoya.horiguchi, linux-mm, linux-kernel

On 24.04.22 11:11, Miaohe Lin wrote:
> This is observed by code review only but not any real report.
> 
> When we turn off swapping we could have lost the bits stored in the swap
> ptes. The new rmap-exclusive bit is fine since that turned into a page
> flag, but not for soft-dirty and uffd-wp. Add them.
> 
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/swapfile.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 95b63f69f388..522a0eb16bf1 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1783,7 +1783,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  {
>  	struct page *swapcache;
>  	spinlock_t *ptl;
> -	pte_t *pte;
> +	pte_t *pte, new_pte;
>  	int ret = 1;
>  
>  	swapcache = page;
> @@ -1832,8 +1832,12 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  		page_add_new_anon_rmap(page, vma, addr);
>  		lru_cache_add_inactive_or_unevictable(page, vma);
>  	}
> -	set_pte_at(vma->vm_mm, addr, pte,
> -		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
> +	new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
> +	if (pte_swp_soft_dirty(*pte))
> +		new_pte = pte_mksoft_dirty(new_pte);
> +	if (pte_swp_uffd_wp(*pte))
> +		new_pte = pte_mkuffd_wp(new_pte);
> +	set_pte_at(vma->vm_mm, addr, pte, new_pte);
>  	swap_free(entry);
>  out:
>  	pte_unmap_unlock(pte, ptl);

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-24  9:11 ` [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails Miaohe Lin
  2022-04-25  1:08   ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-04-25  7:41   ` ying.huang
  2022-04-25  7:49     ` David Hildenbrand
  2022-05-10  6:17   ` HORIGUCHI NAOYA(堀口 直也)
  2 siblings, 1 reply; 25+ messages in thread
From: ying.huang @ 2022-04-25  7:41 UTC (permalink / raw)
  To: Miaohe Lin, akpm
  Cc: willy, vbabka, dhowells, neilb, david, apopple, surenb, minchan,
	peterx, sfr, naoya.horiguchi, linux-mm, linux-kernel

Hi, Miaohe,

On Sun, 2022-04-24 at 17:11 +0800, Miaohe Lin wrote:
> There is a bug in unuse_pte(): when swap page happens to be unreadable,
> page filled with random data is mapped into user address space.  In case
> of error, a special swap entry indicating swap read fails is set to the
> page table.  So the swapcache page can be freed and the user won't end up
> with a permanently mounted swap because a sector is bad.  And if the page
> is accessed later, the user process will be killed so that corrupted data
> is never consumed.  On the other hand, if the page is never accessed, the
> user won't even notice it.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/swap.h    |  7 ++++++-
>  include/linux/swapops.h | 10 ++++++++++
>  mm/memory.c             |  5 ++++-
>  mm/swapfile.c           | 11 +++++++++++
>  4 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 5553189d0215..b82c196d8867 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -55,6 +55,10 @@ static inline int current_is_kswapd(void)
>   * actions on faults.
>   */
> 
> +#define SWP_SWAPIN_ERROR_NUM 1
> +#define SWP_SWAPIN_ERROR     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
> +			     SWP_MIGRATION_NUM + SWP_DEVICE_NUM + \
> +			     SWP_PTE_MARKER_NUM)
> 
> 

It appears wasteful to use another swap device number.  Is it possible
to use a special swap offset?  For example, 0 or -1?

Best Regards,
Huang, Ying


[snip]



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  1:08   ` HORIGUCHI NAOYA(堀口 直也)
  2022-04-25  2:20     ` Miaohe Lin
@ 2022-04-25  7:45     ` David Hildenbrand
  2022-04-25  8:47       ` Miaohe Lin
  1 sibling, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2022-04-25  7:45 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), Miaohe Lin
  Cc: akpm, willy, vbabka, dhowells, neilb, apopple, surenb, minchan,
	peterx, sfr, linux-mm, linux-kernel

On 25.04.22 03:08, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>> page filled with random data is mapped into user address space.  In case
>> of error, a special swap entry indicating swap read fails is set to the
>> page table.  So the swapcache page can be freed and the user won't end up
>> with a permanently mounted swap because a sector is bad.  And if the page
>> is accessed later, the user process will be killed so that corrupted data
>> is never consumed.  On the other hand, if the page is never accessed, the
>> user won't even notice it.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
> 
> Hi Miaohe,
> 
> This bug sounds relatively serious to me, and it seems old, so is it worth
> sending to -stable?

I'm not sure if this is worth -stable, but no strong opinion.

The do_swap_page() part was added in 2005:

commit b81074800b98ac50b64d4c8d34e8abf0fda5e3d1
Author: Kirill Korotaev <dev@sw.ru>
Date:   Mon May 16 21:53:50 2005 -0700

    [PATCH] do_swap_page() can map random data if swap read fails
    
    There is a bug in do_swap_page(): when swap page happens to be unreadable,
    page filled with random data is mapped into user address space.  The fix is
    to check for PageUptodate and send SIGBUS in case of error.
    
    Signed-Off-By: Kirill Korotaev <dev@sw.ru>
    Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
    Acked-by: Hugh Dickins <hugh@veritas.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

So the do_swap_page() part has been fixed for quite a while already.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  7:41   ` ying.huang
@ 2022-04-25  7:49     ` David Hildenbrand
  2022-04-25  7:55       ` ying.huang
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2022-04-25  7:49 UTC (permalink / raw)
  To: ying.huang, Miaohe Lin, akpm
  Cc: willy, vbabka, dhowells, neilb, apopple, surenb, minchan, peterx,
	sfr, naoya.horiguchi, linux-mm, linux-kernel

On 25.04.22 09:41, ying.huang@intel.com wrote:
> Hi, Miaohe,
> 
> On Sun, 2022-04-24 at 17:11 +0800, Miaohe Lin wrote:
>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>> page filled with random data is mapped into user address space.  In case
>> of error, a special swap entry indicating swap read fails is set to the
>> page table.  So the swapcache page can be freed and the user won't end up
>> with a permanently mounted swap because a sector is bad.  And if the page
>> is accessed later, the user process will be killed so that corrupted data
>> is never consumed.  On the other hand, if the page is never accessed, the
>> user won't even notice it.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
>> ---
>>  include/linux/swap.h    |  7 ++++++-
>>  include/linux/swapops.h | 10 ++++++++++
>>  mm/memory.c             |  5 ++++-
>>  mm/swapfile.c           | 11 +++++++++++
>>  4 files changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 5553189d0215..b82c196d8867 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -55,6 +55,10 @@ static inline int current_is_kswapd(void)
>>   * actions on faults.
>>   */
>>
>> +#define SWP_SWAPIN_ERROR_NUM 1
>> +#define SWP_SWAPIN_ERROR     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
>> +			     SWP_MIGRATION_NUM + SWP_DEVICE_NUM + \
>> +			     SWP_PTE_MARKER_NUM)
>>
>>
> 
> It appears wasteful to use another swap device number. 

Do we really care?

We currently use 5 bits for swap types, so we have a total of 32.

SWP_HWPOISON_NUM -> 1
SWP_MIGRATION_NUM -> 3
SWP_PTE_MARKER_NUM -> 1
SWP_DEVICE_NUM -> 4
SWP_SWAPIN_ERROR_NUM -> 1

Which would leave us with 32 - 10 = 22 swap devices. IMHO that's plenty
for real life scenarios.

I'd prefer reworking this when we really run into trouble (and we could
think about using more bits for applicable architectures then, for
select 64bit architectures it might be fairly easily possible).

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  7:49     ` David Hildenbrand
@ 2022-04-25  7:55       ` ying.huang
  2022-04-25  8:01         ` David Hildenbrand
  0 siblings, 1 reply; 25+ messages in thread
From: ying.huang @ 2022-04-25  7:55 UTC (permalink / raw)
  To: David Hildenbrand, Miaohe Lin, akpm
  Cc: willy, vbabka, dhowells, neilb, apopple, surenb, minchan, peterx,
	sfr, naoya.horiguchi, linux-mm, linux-kernel, Tim C Chen

On Mon, 2022-04-25 at 09:49 +0200, David Hildenbrand wrote:
> On 25.04.22 09:41, ying.huang@intel.com wrote:
> > Hi, Miaohe,
> > 
> > On Sun, 2022-04-24 at 17:11 +0800, Miaohe Lin wrote:
> > > There is a bug in unuse_pte(): when swap page happens to be unreadable,
> > > page filled with random data is mapped into user address space.  In case
> > > of error, a special swap entry indicating swap read fails is set to the
> > > page table.  So the swapcache page can be freed and the user won't end up
> > > with a permanently mounted swap because a sector is bad.  And if the page
> > > is accessed later, the user process will be killed so that corrupted data
> > > is never consumed.  On the other hand, if the page is never accessed, the
> > > user won't even notice it.
> > > 
> > > Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> > > Acked-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >  include/linux/swap.h    |  7 ++++++-
> > >  include/linux/swapops.h | 10 ++++++++++
> > >  mm/memory.c             |  5 ++++-
> > >  mm/swapfile.c           | 11 +++++++++++
> > >  4 files changed, 31 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 5553189d0215..b82c196d8867 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -55,6 +55,10 @@ static inline int current_is_kswapd(void)
> > >   * actions on faults.
> > >   */
> > > 
> > > +#define SWP_SWAPIN_ERROR_NUM 1
> > > +#define SWP_SWAPIN_ERROR     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
> > > +			     SWP_MIGRATION_NUM + SWP_DEVICE_NUM + \
> > > +			     SWP_PTE_MARKER_NUM)
> > > 
> > > 
> > 
> > It appears wasteful to use another swap device number. 
> 
> Do we really care?
> 
> We currently use 5 bits for swap types, so we have a total of 32.
> 
> SWP_HWPOISON_NUM -> 1
> SWP_MIGRATION_NUM -> 3
> SWP_PTE_MARKER_NUM -> 1
> SWP_DEVICE_NUM -> 4
> SWP_SWAPIN_ERROR_NUM -> 1
> 
> Which would leave us with 32 - 10 = 22 swap devices. IMHO that's plenty
> for real life scenarios.

Creating multiple swap partitions on one disk can improve the
scalability of swap subsystem, although we usually don't have so many
disks for swap. 

> I'd prefer reworking this when we really run into trouble (and we could
> think about using more bits for applicable architectures then, for
> select 64bit architectures it might be fairly easily possible).

Best Regards,
Huang, Ying



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  7:55       ` ying.huang
@ 2022-04-25  8:01         ` David Hildenbrand
  2022-04-25  8:51           ` Miaohe Lin
  0 siblings, 1 reply; 25+ messages in thread
From: David Hildenbrand @ 2022-04-25  8:01 UTC (permalink / raw)
  To: ying.huang, Miaohe Lin, akpm
  Cc: willy, vbabka, dhowells, neilb, apopple, surenb, minchan, peterx,
	sfr, naoya.horiguchi, linux-mm, linux-kernel, Tim C Chen

On 25.04.22 09:55, ying.huang@intel.com wrote:
> On Mon, 2022-04-25 at 09:49 +0200, David Hildenbrand wrote:
>> On 25.04.22 09:41, ying.huang@intel.com wrote:
>>> Hi, Miaohe,
>>>
>>> On Sun, 2022-04-24 at 17:11 +0800, Miaohe Lin wrote:
>>>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>>>> page filled with random data is mapped into user address space.  In case
>>>> of error, a special swap entry indicating swap read fails is set to the
>>>> page table.  So the swapcache page can be freed and the user won't end up
>>>> with a permanently mounted swap because a sector is bad.  And if the page
>>>> is accessed later, the user process will be killed so that corrupted data
>>>> is never consumed.  On the other hand, if the page is never accessed, the
>>>> user won't even notice it.
>>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>  include/linux/swap.h    |  7 ++++++-
>>>>  include/linux/swapops.h | 10 ++++++++++
>>>>  mm/memory.c             |  5 ++++-
>>>>  mm/swapfile.c           | 11 +++++++++++
>>>>  4 files changed, 31 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 5553189d0215..b82c196d8867 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -55,6 +55,10 @@ static inline int current_is_kswapd(void)
>>>>   * actions on faults.
>>>>   */
>>>>
>>>> +#define SWP_SWAPIN_ERROR_NUM 1
>>>> +#define SWP_SWAPIN_ERROR     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
>>>> +			     SWP_MIGRATION_NUM + SWP_DEVICE_NUM + \
>>>> +			     SWP_PTE_MARKER_NUM)
>>>>
>>>>
>>>
>>> It appears wasteful to use another swap device number. 
>>
>> Do we really care?
>>
>> We currently use 5 bits for swap types, so we have a total of 32.
>>
>> SWP_HWPOISON_NUM -> 1
>> SWP_MIGRATION_NUM -> 3
>> SWP_PTE_MARKER_NUM -> 1
>> SWP_DEVICE_NUM -> 4
>> SWP_SWAPIN_ERROR_NUM -> 1
>>
>> Which would leave us with 32 - 10 = 22 swap devices. IMHO that's plenty
>> for real life scenarios.
> 
> Creating multiple swap partitions on one disk can improve the
> scalability of swap subsystem, although we usually don't have so many
> disks for swap. 

Exactly, and IMHO if we have 22 or 23 doesn't make a real difference
here ...

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  7:45     ` David Hildenbrand
@ 2022-04-25  8:47       ` Miaohe Lin
  2022-04-26  0:31         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 25+ messages in thread
From: Miaohe Lin @ 2022-04-25  8:47 UTC (permalink / raw)
  To: David Hildenbrand, HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, willy, vbabka, dhowells, neilb, apopple, surenb, minchan,
	peterx, sfr, linux-mm, linux-kernel

On 2022/4/25 15:45, David Hildenbrand wrote:
> On 25.04.22 03:08, HORIGUCHI NAOYA(堀口 直也) wrote:
>> On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
>>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>>> page filled with random data is mapped into user address space.  In case
>>> of error, a special swap entry indicating swap read fails is set to the
>>> page table.  So the swapcache page can be freed and the user won't end up
>>> with a permanently mounted swap because a sector is bad.  And if the page
>>> is accessed later, the user process will be killed so that corrupted data
>>> is never consumed.  On the other hand, if the page is never accessed, the
>>> user won't even notice it.
>>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> Acked-by: David Hildenbrand <david@redhat.com>
>>
>> Hi Miaohe,
>>
>> This bug sounds relatively serious to me, and it seems old, so is it worth
>> sending to -stable?
> 
> I'm not sure if this is worth -stable, but no strong opinion.

I have no strong opinion too. I'm just afraid someone might run into it. But swapoff is
expected to be a rare operation anyway...

> 
> The do_swap_page() part was added in 2005:
> 
> commit b81074800b98ac50b64d4c8d34e8abf0fda5e3d1
> Author: Kirill Korotaev <dev@sw.ru>
> Date:   Mon May 16 21:53:50 2005 -0700
> 
>     [PATCH] do_swap_page() can map random data if swap read fails
>     
>     There is a bug in do_swap_page(): when swap page happens to be unreadable,
>     page filled with random data is mapped into user address space.  The fix is
>     to check for PageUptodate and send SIGBUS in case of error.
>     
>     Signed-Off-By: Kirill Korotaev <dev@sw.ru>
>     Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
>     Acked-by: Hugh Dickins <hugh@veritas.com>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> 
> So the do_swap_page() part has been fixed for quite a while already.

Does this mean only do_swap_page maps random data if swap read fails is observed from that time on?
So this might not be worth -stable as it's never seen more than a decade?

Thanks!

>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  8:01         ` David Hildenbrand
@ 2022-04-25  8:51           ` Miaohe Lin
  0 siblings, 0 replies; 25+ messages in thread
From: Miaohe Lin @ 2022-04-25  8:51 UTC (permalink / raw)
  To: David Hildenbrand, ying.huang
  Cc: willy, vbabka, dhowells, neilb, apopple, surenb, minchan, peterx,
	sfr, naoya.horiguchi, linux-mm, linux-kernel, Tim C Chen,
	Andrew Morton

On 2022/4/25 16:01, David Hildenbrand wrote:
> On 25.04.22 09:55, ying.huang@intel.com wrote:
>> On Mon, 2022-04-25 at 09:49 +0200, David Hildenbrand wrote:
>>> On 25.04.22 09:41, ying.huang@intel.com wrote:
>>>> Hi, Miaohe,
>>>>
>>>> On Sun, 2022-04-24 at 17:11 +0800, Miaohe Lin wrote:
>>>>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>>>>> page filled with random data is mapped into user address space.  In case
>>>>> of error, a special swap entry indicating swap read fails is set to the
>>>>> page table.  So the swapcache page can be freed and the user won't end up
>>>>> with a permanently mounted swap because a sector is bad.  And if the page
>>>>> is accessed later, the user process will be killed so that corrupted data
>>>>> is never consumed.  On the other hand, if the page is never accessed, the
>>>>> user won't even notice it.
>>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>>  include/linux/swap.h    |  7 ++++++-
>>>>>  include/linux/swapops.h | 10 ++++++++++
>>>>>  mm/memory.c             |  5 ++++-
>>>>>  mm/swapfile.c           | 11 +++++++++++
>>>>>  4 files changed, 31 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index 5553189d0215..b82c196d8867 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -55,6 +55,10 @@ static inline int current_is_kswapd(void)
>>>>>   * actions on faults.
>>>>>   */
>>>>>
>>>>> +#define SWP_SWAPIN_ERROR_NUM 1
>>>>> +#define SWP_SWAPIN_ERROR     (MAX_SWAPFILES + SWP_HWPOISON_NUM + \
>>>>> +			     SWP_MIGRATION_NUM + SWP_DEVICE_NUM + \
>>>>> +			     SWP_PTE_MARKER_NUM)
>>>>>
>>>>>
>>>>
>>>> It appears wasteful to use another swap device number. 
>>>
>>> Do we really care?
>>>
>>> We currently use 5 bits for swap types, so we have a total of 32.
>>>
>>> SWP_HWPOISON_NUM -> 1
>>> SWP_MIGRATION_NUM -> 3
>>> SWP_PTE_MARKER_NUM -> 1
>>> SWP_DEVICE_NUM -> 4
>>> SWP_SWAPIN_ERROR_NUM -> 1
>>>
>>> Which would leave us with 32 - 10 = 22 swap devices. IMHO that's plenty
>>> for real life scenarios.
>>
>> Creating multiple swap partitions on one disk can improve the
>> scalability of swap subsystem, although we usually don't have so many
>> disks for swap. 
> 
> Exactly, and IMHO if we have 22 or 23 doesn't make a real difference
> here ...

I tend to agree with David. Thanks both!

> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-25  8:47       ` Miaohe Lin
@ 2022-04-26  0:31         ` HORIGUCHI NAOYA(堀口 直也)
  2022-04-26  7:06           ` Miaohe Lin
  0 siblings, 1 reply; 25+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-04-26  0:31 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: David Hildenbrand, akpm, willy, vbabka, dhowells, neilb, apopple,
	surenb, minchan, peterx, sfr, linux-mm, linux-kernel

On Mon, Apr 25, 2022 at 04:47:41PM +0800, Miaohe Lin wrote:
> On 2022/4/25 15:45, David Hildenbrand wrote:
> > On 25.04.22 03:08, HORIGUCHI NAOYA(堀口 直也) wrote:
> >> On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
> >>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
> >>> page filled with random data is mapped into user address space.  In case
> >>> of error, a special swap entry indicating swap read fails is set to the
> >>> page table.  So the swapcache page can be freed and the user won't end up
> >>> with a permanently mounted swap because a sector is bad.  And if the page
> >>> is accessed later, the user process will be killed so that corrupted data
> >>> is never consumed.  On the other hand, if the page is never accessed, the
> >>> user won't even notice it.
> >>>
> >>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>> Acked-by: David Hildenbrand <david@redhat.com>
> >>
> >> Hi Miaohe,
> >>
> >> This bug sounds relatively serious to me, and it seems old, so is it worth
> >> sending to -stable?
> > 
> > I'm not sure if this is worth -stable, but no strong opinion.
> 
> I have no strong opinion too. I'm just afraid someone might run into it. But swapoff is
> expected to be a rare operation anyway...
> 
> > 
> > The do_swap_page() part was added in 2005:
> > 
> > commit b81074800b98ac50b64d4c8d34e8abf0fda5e3d1
> > Author: Kirill Korotaev <dev@sw.ru>
> > Date:   Mon May 16 21:53:50 2005 -0700
> > 
> >     [PATCH] do_swap_page() can map random data if swap read fails
> >     
> >     There is a bug in do_swap_page(): when swap page happens to be unreadable,
> >     page filled with random data is mapped into user address space.  The fix is
> >     to check for PageUptodate and send SIGBUS in case of error.
> >     
> >     Signed-Off-By: Kirill Korotaev <dev@sw.ru>
> >     Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
> >     Acked-by: Hugh Dickins <hugh@veritas.com>
> >     Signed-off-by: Andrew Morton <akpm@osdl.org>
> >     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> > 
> > So the do_swap_page() part has been fixed for quite a while already.
> 
> Does this mean only do_swap_page maps random data if swap read fails is observed from that time on?
> So this might not be worth -stable as it's never seen more than a decade?

OK, both choices seems possible, so not sending to -stable is fine to me.
It's finally up to you.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-26  0:31         ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-04-26  7:06           ` Miaohe Lin
  0 siblings, 0 replies; 25+ messages in thread
From: Miaohe Lin @ 2022-04-26  7:06 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: David Hildenbrand, akpm, willy, vbabka, dhowells, neilb, apopple,
	surenb, minchan, peterx, sfr, linux-mm, linux-kernel

On 2022/4/26 8:31, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Apr 25, 2022 at 04:47:41PM +0800, Miaohe Lin wrote:
>> On 2022/4/25 15:45, David Hildenbrand wrote:
>>> On 25.04.22 03:08, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>> On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
>>>>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>>>>> page filled with random data is mapped into user address space.  In case
>>>>> of error, a special swap entry indicating swap read fails is set to the
>>>>> page table.  So the swapcache page can be freed and the user won't end up
>>>>> with a permanently mounted swap because a sector is bad.  And if the page
>>>>> is accessed later, the user process will be killed so that corrupted data
>>>>> is never consumed.  On the other hand, if the page is never accessed, the
>>>>> user won't even notice it.
>>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>>
>>>> Hi Miaohe,
>>>>
>>>> This bug sounds relatively serious to me, and it seems old, so is it worth
>>>> sending to -stable?
>>>
>>> I'm not sure if this is worth -stable, but no strong opinion.
>>
>> I have no strong opinion too. I'm just afraid someone might run into it. But swapoff is
>> expected to be a rare operation anyway...
>>
>>>
>>> The do_swap_page() part was added in 2005:
>>>
>>> commit b81074800b98ac50b64d4c8d34e8abf0fda5e3d1
>>> Author: Kirill Korotaev <dev@sw.ru>
>>> Date:   Mon May 16 21:53:50 2005 -0700
>>>
>>>     [PATCH] do_swap_page() can map random data if swap read fails
>>>     
>>>     There is a bug in do_swap_page(): when swap page happens to be unreadable,
>>>     page filled with random data is mapped into user address space.  The fix is
>>>     to check for PageUptodate and send SIGBUS in case of error.
>>>     
>>>     Signed-Off-By: Kirill Korotaev <dev@sw.ru>
>>>     Signed-Off-By: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
>>>     Acked-by: Hugh Dickins <hugh@veritas.com>
>>>     Signed-off-by: Andrew Morton <akpm@osdl.org>
>>>     Signed-off-by: Linus Torvalds <torvalds@osdl.org>
>>>
>>> So the do_swap_page() part has been fixed for quite a while already.
>>
>> Does this mean only do_swap_page maps random data if swap read fails is observed from that time on?
>> So this might not be worth -stable as it's never seen more than a decade?
> 
> OK, both choices seems possible, so not sending to -stable is fine to me.
> It's finally up to you.

I tend not to send it to -stable due to the above concern now.

Thanks!

> 
> Thanks,
> Naoya Horiguchi
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-04-24  9:11 ` [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails Miaohe Lin
  2022-04-25  1:08   ` HORIGUCHI NAOYA(堀口 直也)
  2022-04-25  7:41   ` ying.huang
@ 2022-05-10  6:17   ` HORIGUCHI NAOYA(堀口 直也)
  2022-05-10  6:58     ` Miaohe Lin
  2022-05-10 12:46     ` Miaohe Lin
  2 siblings, 2 replies; 25+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-05-10  6:17 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: akpm, willy, vbabka, dhowells, neilb, david, apopple, surenb,
	minchan, peterx, sfr, linux-mm, linux-kernel

On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
> There is a bug in unuse_pte(): when swap page happens to be unreadable,
> page filled with random data is mapped into user address space.  In case
> of error, a special swap entry indicating swap read fails is set to the
> page table.  So the swapcache page can be freed and the user won't end up
> with a permanently mounted swap because a sector is bad.  And if the page
> is accessed later, the user process will be killed so that corrupted data
> is never consumed.  On the other hand, if the page is never accessed, the
> user won't even notice it.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> Acked-by: David Hildenbrand <david@redhat.com>

When I reproduced the issue (generated read error with dm-dust), I saw
infinite loop in the while loop in shmem_unuse_inode() (and this happens
even with this patch). I confirmed that shmem_swapin_page() returns -EIO,
but shmem_unuse_swap_entries() does not return the error to the callers,
so the while loop in shmem_unuse_inode() seems not break.

So maybe you need more code around shmem_unuse_inode() to handle the error?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-05-10  6:17   ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-05-10  6:58     ` Miaohe Lin
  2022-05-13  0:42       ` Andrew Morton
  2022-05-10 12:46     ` Miaohe Lin
  1 sibling, 1 reply; 25+ messages in thread
From: Miaohe Lin @ 2022-05-10  6:58 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, willy, vbabka, dhowells, neilb, david, apopple, surenb,
	minchan, peterx, sfr, linux-mm, linux-kernel

On 2022/5/10 14:17, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>> page filled with random data is mapped into user address space.  In case
>> of error, a special swap entry indicating swap read fails is set to the
>> page table.  So the swapcache page can be freed and the user won't end up
>> with a permanently mounted swap because a sector is bad.  And if the page
>> is accessed later, the user process will be killed so that corrupted data
>> is never consumed.  On the other hand, if the page is never accessed, the
>> user won't even notice it.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
> 
> When I reproduced the issue (generated read error with dm-dust), I saw
> infinite loop in the while loop in shmem_unuse_inode() (and this happens
> even with this patch). I confirmed that shmem_swapin_page() returns -EIO,
> but shmem_unuse_swap_entries() does not return the error to the callers,
> so the while loop in shmem_unuse_inode() seems not break.
> 

Many thanks for your report! I didn't test the shmem case because I saw -EIO
is returned. So I just focus on the normal page case. Sorry about it. :(

> So maybe you need more code around shmem_unuse_inode() to handle the error?

I will try to reproduce it and come up a fixup patch asap! And if you like, you
can kindly solve this issue too. ;)

Thanks a lot!

> 
> Thanks,
> Naoya Horiguchi
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-05-10  6:17   ` HORIGUCHI NAOYA(堀口 直也)
  2022-05-10  6:58     ` Miaohe Lin
@ 2022-05-10 12:46     ` Miaohe Lin
  1 sibling, 0 replies; 25+ messages in thread
From: Miaohe Lin @ 2022-05-10 12:46 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), akpm, willy, david
  Cc: vbabka, dhowells, neilb, apopple, surenb, minchan, peterx, sfr,
	linux-mm, linux-kernel

On 2022/5/10 14:17, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Sun, Apr 24, 2022 at 05:11:03PM +0800, Miaohe Lin wrote:
>> There is a bug in unuse_pte(): when swap page happens to be unreadable,
>> page filled with random data is mapped into user address space.  In case
>> of error, a special swap entry indicating swap read fails is set to the
>> page table.  So the swapcache page can be freed and the user won't end up
>> with a permanently mounted swap because a sector is bad.  And if the page
>> is accessed later, the user process will be killed so that corrupted data
>> is never consumed.  On the other hand, if the page is never accessed, the
>> user won't even notice it.
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> Acked-by: David Hildenbrand <david@redhat.com>
> 
> When I reproduced the issue (generated read error with dm-dust), I saw
> infinite loop in the while loop in shmem_unuse_inode() (and this happens
> even with this patch). I confirmed that shmem_swapin_page() returns -EIO,
> but shmem_unuse_swap_entries() does not return the error to the callers,
> so the while loop in shmem_unuse_inode() seems not break.

In the current implementation, try_to_unuse will keep trying to do shmem_unuse unless -ENOMEM is
returned from shmem_swapin_folio. This could be easily fixed by return -EIO when swapin error
occurs. But the user will end up with a permanently mounted swap just because a sector was bad.
One alternative is inventing a way to proceed the swapoff while preventing user from accessing
the wrong data. But this might complicate the code a lot and I need to learn more about shmem.
Any suggestion will be really grateful!

Thanks! :)

> 
> So maybe you need more code around shmem_unuse_inode() to handle the error?
> 
> Thanks,
> Naoya Horiguchi
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-05-10  6:58     ` Miaohe Lin
@ 2022-05-13  0:42       ` Andrew Morton
  2022-05-13  3:14         ` Miaohe Lin
  0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2022-05-13  0:42 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: HORIGUCHI NAOYA, willy, vbabka, dhowells, neilb, david, apopple,
	surenb, minchan, peterx, sfr, linux-mm, linux-kernel

On Tue, 10 May 2022 14:58:05 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:

> >> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >> Acked-by: David Hildenbrand <david@redhat.com>
> > 
> > When I reproduced the issue (generated read error with dm-dust), I saw
> > infinite loop in the while loop in shmem_unuse_inode() (and this happens
> > even with this patch). I confirmed that shmem_swapin_page() returns -EIO,
> > but shmem_unuse_swap_entries() does not return the error to the callers,
> > so the while loop in shmem_unuse_inode() seems not break.
> > 
> 
> Many thanks for your report! I didn't test the shmem case because I saw -EIO
> is returned. So I just focus on the normal page case. Sorry about it. :(
> 
> > So maybe you need more code around shmem_unuse_inode() to handle the error?
> 
> I will try to reproduce it and come up a fixup patch asap! And if you like, you
> can kindly solve this issue too. ;)

Seems that this patch didn't cause the infinite loop, so as far as I
can tell it is good to be merged up.  But the problem it solves isn't
urgent and fixing that infinite loop might impact this change so I
think I'll drop this version.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails
  2022-05-13  0:42       ` Andrew Morton
@ 2022-05-13  3:14         ` Miaohe Lin
  0 siblings, 0 replies; 25+ messages in thread
From: Miaohe Lin @ 2022-05-13  3:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: HORIGUCHI NAOYA (堀口 直也),
	willy, vbabka, dhowells, neilb, david, apopple, surenb, minchan,
	peterx, sfr, linux-mm, linux-kernel

On 2022/5/13 8:42, Andrew Morton wrote:
> On Tue, 10 May 2022 14:58:05 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> Acked-by: David Hildenbrand <david@redhat.com>
>>>
>>> When I reproduced the issue (generated read error with dm-dust), I saw
>>> infinite loop in the while loop in shmem_unuse_inode() (and this happens
>>> even with this patch). I confirmed that shmem_swapin_page() returns -EIO,
>>> but shmem_unuse_swap_entries() does not return the error to the callers,
>>> so the while loop in shmem_unuse_inode() seems not break.
>>>
>>
>> Many thanks for your report! I didn't test the shmem case because I saw -EIO
>> is returned. So I just focus on the normal page case. Sorry about it. :(
>>
>>> So maybe you need more code around shmem_unuse_inode() to handle the error?
>>
>> I will try to reproduce it and come up a fixup patch asap! And if you like, you
>> can kindly solve this issue too. ;)
> 
> Seems that this patch didn't cause the infinite loop, so as far as I
> can tell it is good to be merged up.  But the problem it solves isn't
> urgent and fixing that infinite loop might impact this change so I
> think I'll drop this version.

I will update and resend the corresponding patch series when I fix this infinite loop.

Thanks!

> .
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-05-13  3:14 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-24  9:11 [PATCH v3 0/3] A few fixup patches for mm Miaohe Lin
2022-04-24  9:11 ` [PATCH v3 1/3] mm/swapfile: unuse_pte can map random data if swap read fails Miaohe Lin
2022-04-25  1:08   ` HORIGUCHI NAOYA(堀口 直也)
2022-04-25  2:20     ` Miaohe Lin
2022-04-25  2:51       ` HORIGUCHI NAOYA(堀口 直也)
2022-04-25  3:10         ` Miaohe Lin
2022-04-25  7:45     ` David Hildenbrand
2022-04-25  8:47       ` Miaohe Lin
2022-04-26  0:31         ` HORIGUCHI NAOYA(堀口 直也)
2022-04-26  7:06           ` Miaohe Lin
2022-04-25  7:41   ` ying.huang
2022-04-25  7:49     ` David Hildenbrand
2022-04-25  7:55       ` ying.huang
2022-04-25  8:01         ` David Hildenbrand
2022-04-25  8:51           ` Miaohe Lin
2022-05-10  6:17   ` HORIGUCHI NAOYA(堀口 直也)
2022-05-10  6:58     ` Miaohe Lin
2022-05-13  0:42       ` Andrew Morton
2022-05-13  3:14         ` Miaohe Lin
2022-05-10 12:46     ` Miaohe Lin
2022-04-24  9:11 ` [PATCH v3 2/3] mm/swapfile: Fix lost swap bits in unuse_pte() Miaohe Lin
2022-04-25  7:39   ` David Hildenbrand
2022-04-24  9:11 ` [PATCH v3 3/3] mm/madvise: free hwpoison and swapin error entry in madvise_free_pte_range Miaohe Lin
2022-04-24 23:41   ` HORIGUCHI NAOYA(堀口 直也)
2022-04-25  1:59     ` Miaohe Lin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.