linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
@ 2021-11-11  8:46 Ajay Garg
  2021-11-11 11:06 ` Muchun Song
  2021-11-11 17:45 ` Ajay Garg
  0 siblings, 2 replies; 16+ messages in thread
From: Ajay Garg @ 2021-11-11  8:46 UTC (permalink / raw)
  To: hughd, akpm, linux-mm, linux-kernel; +Cc: Ajay Garg

commit b9d02f1bdd98
("mm: shmem: don't truncate page if memory failure happens")

introduced a PageHWPoison(page) call in "shmem_read_mapping_page_gfp"
in shmem.c.

Now, if "shmem_getpage_gfp" returns an error, page is set to ERR-page.
Therafter, calling PageHWPoison() on this ERR-page, causes KASAN to OOP
the kernel :

 #############################
BUG: unable to handle page fault for address: fffffffffffffff4
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 18e019067 P4D 18e019067 PUD 18e01b067 PMD 0 
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
CPU: 0 PID: 4836 Comm: MATLAB Not tainted 5.15.0+ #18
Hardware name: Dell Inc. Latitude E6320/0GJF11, BIOS A19 11/14/2013
RIP: 0010:shmem_read_mapping_page_gfp+0xd3/0x140
Code: 4c 89 ff e8 6f eb ff ff 5a 59 85 c0 74 64 48 63 d8 48 89 5d 98 be 08 00 00 00 48 89 df e8 e5 67 0c 00 48 89 df e8 6d 5c 0c 00 <48> 8b 13 48 c7 c0 fb ff ff ff f7 c2 00 00 80 00 74 30 48 ba 00 00
RSP: 0018:ffff88806b33f998 EFLAGS: 00010246
RAX: 0000000000000000 RBX: fffffffffffffff4 RCX: ffffffffb7a37ba3
RDX: 0000000000000003 RSI: dffffc0000000000 RDI: fffffffffffffff4
RBP: ffff88806b33fa20 R08: 1ffffffffffffffe R09: fffffffffffffffb
R10: fffffbffffffffff R11: 0000000000000001 R12: 1ffff1100d667f33
R13: 00000000001120d2 R14: 00000000000005db R15: ffff88814e64e2d8
FS:  00007f379a384640(0000) GS:ffff888161a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffff4 CR3: 00000000269dc004 CR4: 00000000000606f0
Call Trace:
<TASK>
? shmem_fault+0x480/0x480
? __cond_resched+0x1c/0x30
? __kasan_check_read+0x11/0x20
shmem_get_pages+0x3a4/0xa70 [i915]
? shmem_writeback+0x3b0/0x3b0 [i915]
? i915_gem_object_wait_reservation+0x330/0x560 [i915]   
...
...
 ################################

So, we proceed with PageHWPoison() call, only if the page is not a 
ERR-page.


P.S. : Alternative (optimised) solution :
===========================================

We could save some CPU cycles, if we directly replace

        if (error)
                page = ERR_PTR(error);
        else
                unlock_page(page);

with

        if (error)
                return ERR_PTR(error);


Fixes: b9d02f1bdd98 ("mm: shmem: don't truncate page if memory failure happens")
Signed-off-by: Ajay Garg <ajaygargnsit@gmail.com>
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 23c91a8beb78..427863cbf0dc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4222,7 +4222,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 	else
 		unlock_page(page);
 
-	if (PageHWPoison(page))
+	if (!IS_ERR(page) && PageHWPoison(page))
 		page = ERR_PTR(-EIO);
 
 	return page;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-11  8:46 [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page Ajay Garg
@ 2021-11-11 11:06 ` Muchun Song
  2021-11-11 11:40   ` Ajay Garg
  2021-11-11 18:42   ` Yang Shi
  2021-11-11 17:45 ` Ajay Garg
  1 sibling, 2 replies; 16+ messages in thread
From: Muchun Song @ 2021-11-11 11:06 UTC (permalink / raw)
  To: Ajay Garg, hughd, akpm, linux-mm, linux-kernel



On 2021/11/11 16:46, Ajay Garg wrote:
> commit b9d02f1bdd98
> ("mm: shmem: don't truncate page if memory failure happens")
> 
> introduced a PageHWPoison(page) call in "shmem_read_mapping_page_gfp"
> in shmem.c.
> 
> Now, if "shmem_getpage_gfp" returns an error, page is set to ERR-page.
> Therafter, calling PageHWPoison() on this ERR-page, causes KASAN to OOP
> the kernel :
> 
>   #############################
> BUG: unable to handle page fault for address: fffffffffffffff4
> PF: supervisor read access in kernel mode
> PF: error_code(0x0000) - not-present page
> PGD 18e019067 P4D 18e019067 PUD 18e01b067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
> CPU: 0 PID: 4836 Comm: MATLAB Not tainted 5.15.0+ #18
> Hardware name: Dell Inc. Latitude E6320/0GJF11, BIOS A19 11/14/2013
> RIP: 0010:shmem_read_mapping_page_gfp+0xd3/0x140
> Code: 4c 89 ff e8 6f eb ff ff 5a 59 85 c0 74 64 48 63 d8 48 89 5d 98 be 08 00 00 00 48 89 df e8 e5 67 0c 00 48 89 df e8 6d 5c 0c 00 <48> 8b 13 48 c7 c0 fb ff ff ff f7 c2 00 00 80 00 74 30 48 ba 00 00
> RSP: 0018:ffff88806b33f998 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: fffffffffffffff4 RCX: ffffffffb7a37ba3
> RDX: 0000000000000003 RSI: dffffc0000000000 RDI: fffffffffffffff4
> RBP: ffff88806b33fa20 R08: 1ffffffffffffffe R09: fffffffffffffffb
> R10: fffffbffffffffff R11: 0000000000000001 R12: 1ffff1100d667f33
> R13: 00000000001120d2 R14: 00000000000005db R15: ffff88814e64e2d8
> FS:  00007f379a384640(0000) GS:ffff888161a00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: fffffffffffffff4 CR3: 00000000269dc004 CR4: 00000000000606f0
> Call Trace:
> <TASK>
> ? shmem_fault+0x480/0x480
> ? __cond_resched+0x1c/0x30
> ? __kasan_check_read+0x11/0x20
> shmem_get_pages+0x3a4/0xa70 [i915]
> ? shmem_writeback+0x3b0/0x3b0 [i915]
> ? i915_gem_object_wait_reservation+0x330/0x560 [i915]
> ...
> ...
>   ################################
> 
> So, we proceed with PageHWPoison() call, only if the page is not a
> ERR-page.
> 
> 
> P.S. : Alternative (optimised) solution :
> ===========================================
> 
> We could save some CPU cycles, if we directly replace
> 
>          if (error)
>                  page = ERR_PTR(error);
>          else
>                  unlock_page(page);
> 
> with
> 
>          if (error)
>                  return ERR_PTR(error);
> 
> 
> Fixes: b9d02f1bdd98 ("mm: shmem: don't truncate page if memory failure happens")
> Signed-off-by: Ajay Garg <ajaygargnsit@gmail.com>
> ---
>   mm/shmem.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 23c91a8beb78..427863cbf0dc 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -4222,7 +4222,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>   	else
>   		unlock_page(page);
>   
> -	if (PageHWPoison(page))
> +	if (!IS_ERR(page) && PageHWPoison(page))
>   		page = ERR_PTR(-EIO);

How about the following changes since the above if block
already do the judgment?

diff --git a/mm/shmem.c b/mm/shmem.c
index f0eee4e221a7..0c84b6624026 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4195,13 +4195,13 @@ struct page *shmem_read_mapping_page_gfp(struct 
address_space *mapping,
         BUG_ON(!shmem_mapping(mapping));
         error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE,
                                   gfp, NULL, NULL, NULL);
-       if (error)
+       if (error) {
                 page = ERR_PTR(error);
-       else
+       } else {
                 unlock_page(page);
-
-       if (PageHWPoison(page))
-               page = ERR_PTR(-EIO);
+               if (PageHWPoison(page))
+                       page = ERR_PTR(-EIO);
+       }

         return page;
  #else

>   
>   	return page;
> 

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-11 11:06 ` Muchun Song
@ 2021-11-11 11:40   ` Ajay Garg
  2021-11-11 12:11     ` Muchun Song
  2021-11-11 18:42   ` Yang Shi
  1 sibling, 1 reply; 16+ messages in thread
From: Ajay Garg @ 2021-11-11 11:40 UTC (permalink / raw)
  To: Muchun Song; +Cc: hughd, akpm, linux-mm, linux-kernel

>
> How about the following changes since the above if block
> already do the judgment?
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f0eee4e221a7..0c84b6624026 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -4195,13 +4195,13 @@ struct page *shmem_read_mapping_page_gfp(struct
> address_space *mapping,
>          BUG_ON(!shmem_mapping(mapping));
>          error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE,
>                                    gfp, NULL, NULL, NULL);
> -       if (error)
> +       if (error) {
>                  page = ERR_PTR(error);
> -       else
> +       } else {
>                  unlock_page(page);
> -
> -       if (PageHWPoison(page))
> -               page = ERR_PTR(-EIO);
> +               if (PageHWPoison(page))
> +                       page = ERR_PTR(-EIO);
> +       }
>
>          return page;


You have

* simply put braces (not required for 1-liner if/else blocks)
* contributed nothing to the issue the patch addresses.

I hope this is not a deliberate joke/spam.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-11 11:40   ` Ajay Garg
@ 2021-11-11 12:11     ` Muchun Song
  2021-11-11 12:21       ` Ajay Garg
  0 siblings, 1 reply; 16+ messages in thread
From: Muchun Song @ 2021-11-11 12:11 UTC (permalink / raw)
  To: Ajay Garg; +Cc: Hugh Dickins, Andrew Morton, Linux Memory Management List, LKML

On Thu, Nov 11, 2021 at 7:40 PM Ajay Garg <ajaygargnsit@gmail.com> wrote:
>
> >
> > How about the following changes since the above if block
> > already do the judgment?
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index f0eee4e221a7..0c84b6624026 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -4195,13 +4195,13 @@ struct page *shmem_read_mapping_page_gfp(struct
> > address_space *mapping,
> >          BUG_ON(!shmem_mapping(mapping));
> >          error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE,
> >                                    gfp, NULL, NULL, NULL);
> > -       if (error)
> > +       if (error) {
> >                  page = ERR_PTR(error);
> > -       else
> > +       } else {
> >                  unlock_page(page);
> > -
> > -       if (PageHWPoison(page))
> > -               page = ERR_PTR(-EIO);
> > +               if (PageHWPoison(page))
> > +                       page = ERR_PTR(-EIO);
> > +       }
> >
> >          return page;
>
>
> You have
>
> * simply put braces (not required for 1-liner if/else blocks)
> * contributed nothing to the issue the patch addresses.
>
> I hope this is not a deliberate joke/spam.

What? I put "if (PageHWPoison(page))" into the else branch,
it should be added braces since it's not 1-line blocks. Why do
you think it does not address the issue? What am I missing
here?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-11 12:11     ` Muchun Song
@ 2021-11-11 12:21       ` Ajay Garg
  2021-11-11 13:09         ` Muchun Song
  0 siblings, 1 reply; 16+ messages in thread
From: Ajay Garg @ 2021-11-11 12:21 UTC (permalink / raw)
  To: Muchun Song
  Cc: Hugh Dickins, Andrew Morton, Linux Memory Management List, LKML

>
> What? I put "if (PageHWPoison(page))" into the else branch,
> it should be added braces since it's not 1-line blocks.

Oh, ok.

> Why do
> you think it does not address the issue? What am I missing
> here?

That might work, however I haven't tested it.

Upto the reviewers to see what they see as the best way going forward.



Thanks and Regards,
Ajay

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-11 12:21       ` Ajay Garg
@ 2021-11-11 13:09         ` Muchun Song
  0 siblings, 0 replies; 16+ messages in thread
From: Muchun Song @ 2021-11-11 13:09 UTC (permalink / raw)
  To: Ajay Garg; +Cc: Hugh Dickins, Andrew Morton, Linux Memory Management List, LKML

On Thu, Nov 11, 2021 at 8:21 PM Ajay Garg <ajaygargnsit@gmail.com> wrote:
>
> >
> > What? I put "if (PageHWPoison(page))" into the else branch,
> > it should be added braces since it's not 1-line blocks.
>
> Oh, ok.
>
> > Why do
> > you think it does not address the issue? What am I missing
> > here?
>
> That might work, however I haven't tested it.
>
> Upto the reviewers to see what they see as the best way going forward.

I just suggest, it's up to you.

>
>
>
> Thanks and Regards,
> Ajay

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-11  8:46 [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page Ajay Garg
  2021-11-11 11:06 ` Muchun Song
@ 2021-11-11 17:45 ` Ajay Garg
  2021-11-11 17:59   ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Ajay Garg @ 2021-11-11 17:45 UTC (permalink / raw)
  To: hughd, akpm, linux-mm, linux-kernel, Jens Axboe

Another report of the issue (different call-flow, but the same error
at "shmem_read_mapping_page_gfp") at :
https://lore.kernel.org/lkml/6bb8c25c-cdcf-8bca-3db2-9871a90d518f@kernel.dk/T/#m52d98b6bdb05764524a118b15cec048b34e5ca76

with a tentative approval for the patch :
https://lore.kernel.org/lkml/6bb8c25c-cdcf-8bca-3db2-9871a90d518f@kernel.dk/T/#m24c2888a879d428cde5b34c43838301de544eb7e


Thanks and Regards,
Ajay


On Thu, Nov 11, 2021 at 2:16 PM Ajay Garg <ajaygargnsit@gmail.com> wrote:
>
> commit b9d02f1bdd98
> ("mm: shmem: don't truncate page if memory failure happens")
>
> introduced a PageHWPoison(page) call in "shmem_read_mapping_page_gfp"
> in shmem.c.
>
> Now, if "shmem_getpage_gfp" returns an error, page is set to ERR-page.
> Therafter, calling PageHWPoison() on this ERR-page, causes KASAN to OOP
> the kernel :
>
>  #############################
> BUG: unable to handle page fault for address: fffffffffffffff4
> PF: supervisor read access in kernel mode
> PF: error_code(0x0000) - not-present page
> PGD 18e019067 P4D 18e019067 PUD 18e01b067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
> CPU: 0 PID: 4836 Comm: MATLAB Not tainted 5.15.0+ #18
> Hardware name: Dell Inc. Latitude E6320/0GJF11, BIOS A19 11/14/2013
> RIP: 0010:shmem_read_mapping_page_gfp+0xd3/0x140
> Code: 4c 89 ff e8 6f eb ff ff 5a 59 85 c0 74 64 48 63 d8 48 89 5d 98 be 08 00 00 00 48 89 df e8 e5 67 0c 00 48 89 df e8 6d 5c 0c 00 <48> 8b 13 48 c7 c0 fb ff ff ff f7 c2 00 00 80 00 74 30 48 ba 00 00
> RSP: 0018:ffff88806b33f998 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: fffffffffffffff4 RCX: ffffffffb7a37ba3
> RDX: 0000000000000003 RSI: dffffc0000000000 RDI: fffffffffffffff4
> RBP: ffff88806b33fa20 R08: 1ffffffffffffffe R09: fffffffffffffffb
> R10: fffffbffffffffff R11: 0000000000000001 R12: 1ffff1100d667f33
> R13: 00000000001120d2 R14: 00000000000005db R15: ffff88814e64e2d8
> FS:  00007f379a384640(0000) GS:ffff888161a00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: fffffffffffffff4 CR3: 00000000269dc004 CR4: 00000000000606f0
> Call Trace:
> <TASK>
> ? shmem_fault+0x480/0x480
> ? __cond_resched+0x1c/0x30
> ? __kasan_check_read+0x11/0x20
> shmem_get_pages+0x3a4/0xa70 [i915]
> ? shmem_writeback+0x3b0/0x3b0 [i915]
> ? i915_gem_object_wait_reservation+0x330/0x560 [i915]
> ...
> ...
>  ################################
>
> So, we proceed with PageHWPoison() call, only if the page is not a
> ERR-page.
>
>
> P.S. : Alternative (optimised) solution :
> ===========================================
>
> We could save some CPU cycles, if we directly replace
>
>         if (error)
>                 page = ERR_PTR(error);
>         else
>                 unlock_page(page);
>
> with
>
>         if (error)
>                 return ERR_PTR(error);
>
>
> Fixes: b9d02f1bdd98 ("mm: shmem: don't truncate page if memory failure happens")
> Signed-off-by: Ajay Garg <ajaygargnsit@gmail.com>
> ---
>  mm/shmem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 23c91a8beb78..427863cbf0dc 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -4222,7 +4222,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>         else
>                 unlock_page(page);
>
> -       if (PageHWPoison(page))
> +       if (!IS_ERR(page) && PageHWPoison(page))
>                 page = ERR_PTR(-EIO);
>
>         return page;
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-11 17:45 ` Ajay Garg
@ 2021-11-11 17:59   ` Jens Axboe
  2021-11-13 17:21     ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-11-11 17:59 UTC (permalink / raw)
  To: Ajay Garg, hughd, akpm, linux-mm, linux-kernel

On 11/11/21 10:45 AM, Ajay Garg wrote:
> Another report of the issue (different call-flow, but the same error
> at "shmem_read_mapping_page_gfp") at :
> https://lore.kernel.org/lkml/6bb8c25c-cdcf-8bca-3db2-9871a90d518f@kernel.dk/T/#m52d98b6bdb05764524a118b15cec048b34e5ca76
> 
> with a tentative approval for the patch :
> https://lore.kernel.org/lkml/6bb8c25c-cdcf-8bca-3db2-9871a90d518f@kernel.dk/T/#m24c2888a879d428cde5b34c43838301de544eb7e

Andrew, I've hit this multiple times on resume on my laptop, and it's
a very apparent bug in that b9d02f1bdd98 commit that it doesn't factor
in error pointers. My oops clearly shows it being one too, with ...fff4
being the dereference address.

Can we get this pushed upstream asap if you're fine with this patch?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-11 11:06 ` Muchun Song
  2021-11-11 11:40   ` Ajay Garg
@ 2021-11-11 18:42   ` Yang Shi
  1 sibling, 0 replies; 16+ messages in thread
From: Yang Shi @ 2021-11-11 18:42 UTC (permalink / raw)
  To: Muchun Song
  Cc: Ajay Garg, Hugh Dickins, Andrew Morton, Linux MM,
	Linux Kernel Mailing List

On Thu, Nov 11, 2021 at 3:20 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
>
>
> On 2021/11/11 16:46, Ajay Garg wrote:
> > commit b9d02f1bdd98
> > ("mm: shmem: don't truncate page if memory failure happens")
> >
> > introduced a PageHWPoison(page) call in "shmem_read_mapping_page_gfp"
> > in shmem.c.
> >
> > Now, if "shmem_getpage_gfp" returns an error, page is set to ERR-page.
> > Therafter, calling PageHWPoison() on this ERR-page, causes KASAN to OOP
> > the kernel :
> >
> >   #############################
> > BUG: unable to handle page fault for address: fffffffffffffff4
> > PF: supervisor read access in kernel mode
> > PF: error_code(0x0000) - not-present page
> > PGD 18e019067 P4D 18e019067 PUD 18e01b067 PMD 0
> > Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
> > CPU: 0 PID: 4836 Comm: MATLAB Not tainted 5.15.0+ #18
> > Hardware name: Dell Inc. Latitude E6320/0GJF11, BIOS A19 11/14/2013
> > RIP: 0010:shmem_read_mapping_page_gfp+0xd3/0x140
> > Code: 4c 89 ff e8 6f eb ff ff 5a 59 85 c0 74 64 48 63 d8 48 89 5d 98 be 08 00 00 00 48 89 df e8 e5 67 0c 00 48 89 df e8 6d 5c 0c 00 <48> 8b 13 48 c7 c0 fb ff ff ff f7 c2 00 00 80 00 74 30 48 ba 00 00
> > RSP: 0018:ffff88806b33f998 EFLAGS: 00010246
> > RAX: 0000000000000000 RBX: fffffffffffffff4 RCX: ffffffffb7a37ba3
> > RDX: 0000000000000003 RSI: dffffc0000000000 RDI: fffffffffffffff4
> > RBP: ffff88806b33fa20 R08: 1ffffffffffffffe R09: fffffffffffffffb
> > R10: fffffbffffffffff R11: 0000000000000001 R12: 1ffff1100d667f33
> > R13: 00000000001120d2 R14: 00000000000005db R15: ffff88814e64e2d8
> > FS:  00007f379a384640(0000) GS:ffff888161a00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: fffffffffffffff4 CR3: 00000000269dc004 CR4: 00000000000606f0
> > Call Trace:
> > <TASK>
> > ? shmem_fault+0x480/0x480
> > ? __cond_resched+0x1c/0x30
> > ? __kasan_check_read+0x11/0x20
> > shmem_get_pages+0x3a4/0xa70 [i915]
> > ? shmem_writeback+0x3b0/0x3b0 [i915]
> > ? i915_gem_object_wait_reservation+0x330/0x560 [i915]
> > ...
> > ...
> >   ################################
> >
> > So, we proceed with PageHWPoison() call, only if the page is not a
> > ERR-page.
> >
> >
> > P.S. : Alternative (optimised) solution :
> > ===========================================
> >
> > We could save some CPU cycles, if we directly replace
> >
> >          if (error)
> >                  page = ERR_PTR(error);
> >          else
> >                  unlock_page(page);
> >
> > with
> >
> >          if (error)
> >                  return ERR_PTR(error);
> >
> >
> > Fixes: b9d02f1bdd98 ("mm: shmem: don't truncate page if memory failure happens")
> > Signed-off-by: Ajay Garg <ajaygargnsit@gmail.com>
> > ---
> >   mm/shmem.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 23c91a8beb78..427863cbf0dc 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -4222,7 +4222,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
> >       else
> >               unlock_page(page);
> >
> > -     if (PageHWPoison(page))
> > +     if (!IS_ERR(page) && PageHWPoison(page))
> >               page = ERR_PTR(-EIO);
>
> How about the following changes since the above if block
> already do the judgment?
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f0eee4e221a7..0c84b6624026 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -4195,13 +4195,13 @@ struct page *shmem_read_mapping_page_gfp(struct
> address_space *mapping,
>          BUG_ON(!shmem_mapping(mapping));
>          error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE,
>                                    gfp, NULL, NULL, NULL);
> -       if (error)
> +       if (error) {
>                  page = ERR_PTR(error);
> -       else
> +       } else {
>                  unlock_page(page);
> -
> -       if (PageHWPoison(page))
> -               page = ERR_PTR(-EIO);
> +               if (PageHWPoison(page))
> +                       page = ERR_PTR(-EIO);
> +       }

Thanks guys. My bad. It is an apparent bug. IMHO, Muchun's version
seems better to me.

Anyway, whatever Andrew picks up could have Reviewed-by: Yang Shi
<shy828301@gmail.com>

>
>          return page;
>   #else
>
> >
> >       return page;
> >
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-11 17:59   ` Jens Axboe
@ 2021-11-13 17:21     ` Jens Axboe
  2021-11-13 20:16       ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2021-11-13 17:21 UTC (permalink / raw)
  To: Ajay Garg, hughd, akpm, linux-mm, linux-kernel, Linus Torvalds

On 11/11/21 10:59 AM, Jens Axboe wrote:
> On 11/11/21 10:45 AM, Ajay Garg wrote:
>> Another report of the issue (different call-flow, but the same error
>> at "shmem_read_mapping_page_gfp") at :
>> https://lore.kernel.org/lkml/6bb8c25c-cdcf-8bca-3db2-9871a90d518f@kernel.dk/T/#m52d98b6bdb05764524a118b15cec048b34e5ca76
>>
>> with a tentative approval for the patch :
>> https://lore.kernel.org/lkml/6bb8c25c-cdcf-8bca-3db2-9871a90d518f@kernel.dk/T/#m24c2888a879d428cde5b34c43838301de544eb7e
> 
> Andrew, I've hit this multiple times on resume on my laptop, and it's
> a very apparent bug in that b9d02f1bdd98 commit that it doesn't factor
> in error pointers. My oops clearly shows it being one too, with ...fff4
> being the dereference address.
> 
> Can we get this pushed upstream asap if you're fine with this patch?

Maybe Andrew is out - Linus, if you follow this thread, there are two
proposed fixes for this. It'd be nice to have one of them in -rc1.

Backstory is that commit:

commit b9d02f1bdd98f38e6e5ecacc9786a8f58f3f8b2c
Author: Yang Shi <shy828301@gmail.com>
Date:   Fri Nov 5 13:41:10 2021 -0700

    mm: shmem: don't truncate page if memory failure happens

adds a pretty obvious error, where PageHWPoison() is called on an error
pointer in shmem_read_mapping_page_gfp(). I can trivially hit this oops
on resuming the laptop.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-13 17:21     ` Jens Axboe
@ 2021-11-13 20:16       ` Linus Torvalds
  2021-11-13 20:21         ` Jens Axboe
  2021-11-13 20:23         ` Linus Torvalds
  0 siblings, 2 replies; 16+ messages in thread
From: Linus Torvalds @ 2021-11-13 20:16 UTC (permalink / raw)
  To: Jens Axboe, Arnd Bergmann, Yang Shi
  Cc: Ajay Garg, Hugh Dickins, Andrew Morton, Linux-MM,
	Linux Kernel Mailing List

On Sat, Nov 13, 2021 at 9:21 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Maybe Andrew is out - Linus, if you follow this thread, there are two
> proposed fixes for this. It'd be nice to have one of them in -rc1.

Neither of the fixes were sent to me, and honestly, I think the real
issue is that the original commit is just too broken for words.

The error handling in that commit is just entirely lacking.

I've reverted the commit rather than try to fix up the breakage.

Because at some point it's just better to say "that was broken" rather
than try to fix it up. And that original commit was too broken to be
worth fixing up. It already had one fix for uninitialized variables
before it even hit my tree, and then caused problems and had obvious
broken error handling.

             Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-13 20:16       ` Linus Torvalds
@ 2021-11-13 20:21         ` Jens Axboe
  2021-11-13 20:23         ` Linus Torvalds
  1 sibling, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2021-11-13 20:21 UTC (permalink / raw)
  To: Linus Torvalds, Arnd Bergmann, Yang Shi
  Cc: Ajay Garg, Hugh Dickins, Andrew Morton, Linux-MM,
	Linux Kernel Mailing List

On 11/13/21 1:16 PM, Linus Torvalds wrote:
> On Sat, Nov 13, 2021 at 9:21 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Maybe Andrew is out - Linus, if you follow this thread, there are two
>> proposed fixes for this. It'd be nice to have one of them in -rc1.
> 
> Neither of the fixes were sent to me, and honestly, I think the real
> issue is that the original commit is just too broken for words.
> 
> The error handling in that commit is just entirely lacking.
> 
> I've reverted the commit rather than try to fix up the breakage.
> 
> Because at some point it's just better to say "that was broken" rather
> than try to fix it up. And that original commit was too broken to be
> worth fixing up. It already had one fix for uninitialized variables
> before it even hit my tree, and then caused problems and had obvious
> broken error handling.

Thanks, that obviously fixes the problem as far as I'm concerned as well.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-13 20:16       ` Linus Torvalds
  2021-11-13 20:21         ` Jens Axboe
@ 2021-11-13 20:23         ` Linus Torvalds
  2021-11-13 22:29           ` Yang Shi
  1 sibling, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2021-11-13 20:23 UTC (permalink / raw)
  To: Jens Axboe, Arnd Bergmann, Yang Shi
  Cc: Ajay Garg, Hugh Dickins, Andrew Morton, Linux-MM,
	Linux Kernel Mailing List

On Sat, Nov 13, 2021 at 12:16 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Neither of the fixes were sent to me, and honestly, I think the real
> issue is that the original commit is just too broken for words.

Side note: the one you pointed to (by Ajay), had the comment that it
could be done differently as an optimization.

And I very much agree with that, although I think it would be a lot
more than an optimization: just doing

        if (error)
                return ERR_PTR(error);

earlier in the function would have avoided the issue entirely, and
would have made the code a lot easier to read too.

But what made me decide to just revert it entirely was that the
original commit that caused this all also had stuff like this:

-       return shmem_getpage(inode, index, pagep, SGP_WRITE);
+       ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
+
+       if (*pagep && PageHWPoison(*pagep)) {
+               unlock_page(*pagep);
+               put_page(*pagep);
+               ret = -EIO;
+       }
+
+       return ret;

which is another example of exactly the same issue: ignoring errors,
and then acting on other information and creating new errors.

Again, that code should have checked and handled errors first, and
then - if there wasn't an error - added that new HWpoison handling.

So that just made me go "this is not worth fixing up, this just needs
re-doing", and thus I just went for the revert instead.

                   Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-13 20:23         ` Linus Torvalds
@ 2021-11-13 22:29           ` Yang Shi
  2021-11-13 22:58             ` Linus Torvalds
  0 siblings, 1 reply; 16+ messages in thread
From: Yang Shi @ 2021-11-13 22:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Arnd Bergmann, Ajay Garg, Hugh Dickins,
	Andrew Morton, Linux-MM, Linux Kernel Mailing List

On Sat, Nov 13, 2021 at 12:23 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, Nov 13, 2021 at 12:16 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Neither of the fixes were sent to me, and honestly, I think the real
> > issue is that the original commit is just too broken for words.
>
> Side note: the one you pointed to (by Ajay), had the comment that it
> could be done differently as an optimization.
>
> And I very much agree with that, although I think it would be a lot
> more than an optimization: just doing
>
>         if (error)
>                 return ERR_PTR(error);
>
> earlier in the function would have avoided the issue entirely, and
> would have made the code a lot easier to read too.
>
> But what made me decide to just revert it entirely was that the
> original commit that caused this all also had stuff like this:
>
> -       return shmem_getpage(inode, index, pagep, SGP_WRITE);
> +       ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> +
> +       if (*pagep && PageHWPoison(*pagep)) {
> +               unlock_page(*pagep);
> +               put_page(*pagep);
> +               ret = -EIO;
> +       }
> +
> +       return ret;
>
> which is another example of exactly the same issue: ignoring errors,
> and then acting on other information and creating new errors.
>
> Again, that code should have checked and handled errors first, and
> then - if there wasn't an error - added that new HWpoison handling.
>
> So that just made me go "this is not worth fixing up, this just needs
> re-doing", and thus I just went for the revert instead.

I'm so sorry for the inconvenience.

The above snippet is actually ok since if *pagep returned via
shmem_getpage()'s parameter is not NULL, then ret is 0. When
shmem_getpage() returns error code, *pagep is NULL IIUC. So it
actually doesn't ignore errors then create and return new error.

But I do agree it seems tricky for someone who is not familiar with
shmem code. And if shmem code is changed in the future it may be error
prone. I could rewrite it to:

if (ret < 0)
    goto out;

 if (*pagep && PageHWPoison(*pagep)) {
               unlock_page(*pagep);
              put_page(*pagep);
              ret = -EIO;
}

out:
    return ret;

And fold in Ajay's fix (will take Muchun's version which returns error
earlier). Hopefully it still can make -rc1. Of course rc2 is fine
either.

>
>                    Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-13 22:29           ` Yang Shi
@ 2021-11-13 22:58             ` Linus Torvalds
  2021-11-14  3:10               ` Yang Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Linus Torvalds @ 2021-11-13 22:58 UTC (permalink / raw)
  To: Yang Shi
  Cc: Jens Axboe, Arnd Bergmann, Ajay Garg, Hugh Dickins,
	Andrew Morton, Linux-MM, Linux Kernel Mailing List

On Sat, Nov 13, 2021 at 2:30 PM Yang Shi <shy828301@gmail.com> wrote:
>
> The above snippet is actually ok since if *pagep returned via
> shmem_getpage()'s parameter is not NULL, then ret is 0.

That's a random implementation detail, and is not ok to rely on.

It may or may not be true, and is not part of the rules of error handling.

If a function returns an error, you shouldn't be looking at the other
stuff it returned.

Here's a very recent example of the same kind of problem:

    https://lore.kernel.org/lkml/163663333331.414.639840290224641315.tip-bot2@tip-bot2/

where people didn't actually look properly at the return value of the
function, and instead looked at the page pointers that the function
filled in.

See? EXACT same logic. And completely buggy.

> When  shmem_getpage() returns error code, *pagep is NULL IIUC.

No.

When a function returns an error code, you check for the error code,
and don't rely on weather the function then filled in other data (or
left it alone, or whatever).

So the code should

 (a) check and handle error returns properly

 (b) be legible

That (b) basically means that if it's not entirely trivial (and none
of this was entirely trivial), then when you get an error, you just
deal with it right away. You return early, and undo anything you need
to undo.

You don't do "oh, let's keep that error, and then do something else
that maybe also generates an error".

That "don't handle the error directly" was why
shmem_read_mapping_page_gfp() was buggy and would cause an oops.

And while the shmem_write_begin() code migth not cause an oops, it had
the same fundamental bad pattern.

Error handling is where 99% of all problems occur. But that also means
that you should do the obvious thing wrt error handling, and not have
some crazy "if function X returned an error, it will have left the
return array untouched" which may or may not be true.

When a function returns an error code, you do error handling based on
that code. Not on some random other state.

                Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page
  2021-11-13 22:58             ` Linus Torvalds
@ 2021-11-14  3:10               ` Yang Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Yang Shi @ 2021-11-14  3:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Arnd Bergmann, Ajay Garg, Hugh Dickins,
	Andrew Morton, Linux-MM, Linux Kernel Mailing List

On Sat, Nov 13, 2021 at 2:58 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, Nov 13, 2021 at 2:30 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > The above snippet is actually ok since if *pagep returned via
> > shmem_getpage()'s parameter is not NULL, then ret is 0.
>
> That's a random implementation detail, and is not ok to rely on.
>
> It may or may not be true, and is not part of the rules of error handling.
>
> If a function returns an error, you shouldn't be looking at the other
> stuff it returned.
>
> Here's a very recent example of the same kind of problem:
>
>     https://lore.kernel.org/lkml/163663333331.414.639840290224641315.tip-bot2@tip-bot2/
>
> where people didn't actually look properly at the return value of the
> function, and instead looked at the page pointers that the function
> filled in.
>
> See? EXACT same logic. And completely buggy.

Yes, I agree it is too fragile to rely on.

>
> > When  shmem_getpage() returns error code, *pagep is NULL IIUC.
>
> No.
>
> When a function returns an error code, you check for the error code,
> and don't rely on weather the function then filled in other data (or
> left it alone, or whatever).
>
> So the code should
>
>  (a) check and handle error returns properly
>
>  (b) be legible
>
> That (b) basically means that if it's not entirely trivial (and none
> of this was entirely trivial), then when you get an error, you just
> deal with it right away. You return early, and undo anything you need
> to undo.
>
> You don't do "oh, let's keep that error, and then do something else
> that maybe also generates an error".
>
> That "don't handle the error directly" was why
> shmem_read_mapping_page_gfp() was buggy and would cause an oops.
>
> And while the shmem_write_begin() code migth not cause an oops, it had
> the same fundamental bad pattern.
>
> Error handling is where 99% of all problems occur. But that also means
> that you should do the obvious thing wrt error handling, and not have
> some crazy "if function X returned an error, it will have left the
> return array untouched" which may or may not be true.
>
> When a function returns an error code, you do error handling based on
> that code. Not on some random other state.

Thanks a lot for the thorough explanation. Preparing a new patch.

>
>                 Linus

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-11-14  3:13 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-11  8:46 [PATCH] mm: shmem: do not call PageHWPoison on a ERR-page Ajay Garg
2021-11-11 11:06 ` Muchun Song
2021-11-11 11:40   ` Ajay Garg
2021-11-11 12:11     ` Muchun Song
2021-11-11 12:21       ` Ajay Garg
2021-11-11 13:09         ` Muchun Song
2021-11-11 18:42   ` Yang Shi
2021-11-11 17:45 ` Ajay Garg
2021-11-11 17:59   ` Jens Axboe
2021-11-13 17:21     ` Jens Axboe
2021-11-13 20:16       ` Linus Torvalds
2021-11-13 20:21         ` Jens Axboe
2021-11-13 20:23         ` Linus Torvalds
2021-11-13 22:29           ` Yang Shi
2021-11-13 22:58             ` Linus Torvalds
2021-11-14  3:10               ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).