linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
@ 2020-08-22  9:53 Muchun Song
  2020-08-24 20:59 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Muchun Song @ 2020-08-22  9:53 UTC (permalink / raw)
  To: mike.kravetz, akpm
  Cc: npiggin, agl, ak, nacc, linux-mm, linux-kernel, Muchun Song

There is a race between the assignment of `table->data` and write value
to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
Fix this by duplicating the `table`, and only update the duplicate of
it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
simplify the code.

The following oops was seen:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    Code: Bad RIP value.
    ...
    Call Trace:
     ? set_max_huge_pages+0x3da/0x4f0
     ? alloc_pool_huge_page+0x150/0x150
     ? proc_doulongvec_minmax+0x46/0x60
     ? hugetlb_sysctl_handler_common+0x1c7/0x200
     ? nr_hugepages_store+0x20/0x20
     ? copy_fd_bitmaps+0x170/0x170
     ? hugetlb_sysctl_handler+0x1e/0x20
     ? proc_sys_call_handler+0x2f1/0x300
     ? unregister_sysctl_table+0xb0/0xb0
     ? __fd_install+0x78/0x100
     ? proc_sys_write+0x14/0x20
     ? __vfs_write+0x4d/0x90
     ? vfs_write+0xef/0x240
     ? ksys_write+0xc0/0x160
     ? __ia32_sys_read+0x50/0x50
     ? __close_fd+0x129/0x150
     ? __x64_sys_write+0x43/0x50
     ? do_syscall_64+0x6c/0x200
     ? entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a301c2d672bf..818d6125af49 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3454,6 +3454,23 @@ static unsigned int allowed_mems_nr(struct hstate *h)
 }
 
 #ifdef CONFIG_SYSCTL
+static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write,
+					  void *buffer, size_t *length,
+					  loff_t *ppos, unsigned long *out)
+{
+	struct ctl_table dup_table;
+
+	/*
+	 * In order to avoid races with __do_proc_doulongvec_minmax(), we
+	 * can duplicate the @table and alter the duplicate of it.
+	 */
+	dup_table = *table;
+	dup_table.data = out;
+	dup_table.maxlen = sizeof(unsigned long);
+
+	return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
+}
+
 static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 			 struct ctl_table *table, int write,
 			 void *buffer, size_t *length, loff_t *ppos)
@@ -3465,9 +3482,8 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
 	if (!hugepages_supported())
 		return -EOPNOTSUPP;
 
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;
 
@@ -3510,9 +3526,8 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
 	if (write && hstate_is_gigantic(h))
 		return -EINVAL;
 
-	table->data = &tmp;
-	table->maxlen = sizeof(unsigned long);
-	ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
+	ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
+					     &tmp);
 	if (ret)
 		goto out;
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-22  9:53 [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers Muchun Song
@ 2020-08-24 20:59 ` Andrew Morton
  2020-08-24 21:19   ` Mike Kravetz
  2020-08-25  2:42 ` Muchun Song
  2020-08-25 15:25 ` Andi Kleen
  2 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2020-08-24 20:59 UTC (permalink / raw)
  To: Muchun Song; +Cc: mike.kravetz, npiggin, agl, ak, nacc, linux-mm, linux-kernel

On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:

> There is a race between the assignment of `table->data` and write value
> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().

Where does __do_proc_doulongvec_minmax() write to table->data?

I think you're saying that there is a race between the assignment of
ctl_table->table in hugetlb_sysctl_handler_common() and the assignment
of the same ctl_table->table in hugetlb_overcommit_handler()?

Or not, maybe I'm being thick.  Can you please describe the race more
carefully and completely?

> Fix this by duplicating the `table`, and only update the duplicate of
> it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
> simplify the code.
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-24 20:59 ` Andrew Morton
@ 2020-08-24 21:19   ` Mike Kravetz
  2020-08-25  3:01     ` [External] " Muchun Song
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2020-08-24 21:19 UTC (permalink / raw)
  To: Andrew Morton, Muchun Song; +Cc: npiggin, agl, ak, nacc, linux-mm, linux-kernel

On 8/24/20 1:59 PM, Andrew Morton wrote:
> On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
> 
>> There is a race between the assignment of `table->data` and write value
>> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
> 
> Where does __do_proc_doulongvec_minmax() write to table->data?
> 
> I think you're saying that there is a race between the assignment of
> ctl_table->table in hugetlb_sysctl_handler_common() and the assignment
> of the same ctl_table->table in hugetlb_overcommit_handler()?
> 
> Or not, maybe I'm being thick.  Can you please describe the race more
> carefully and completely?
> 

I too am looking at this now and do not completely understand the race.
It could be that:

hugetlb_sysctl_handler_common
...
	table->data = &tmp;

and, do_proc_doulongvec_minmax()
...
	return __do_proc_doulongvec_minmax(table->data, table, write, ...
with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
...
	i = (unsigned long *) data;
	...
		*i = val;
	
So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
in one thread when hugetlb_sysctl_handler_common is setting it in another?

Another confusing part of the message is the stack trace which includes
...
     ? set_max_huge_pages+0x3da/0x4f0
     ? alloc_pool_huge_page+0x150/0x150

which are 'downstream' from these routines.  I don't understand why these
are in the trace.

If the race is with the pointer set and dereference/write, then this type of
fix is OK.  However, if you really have two 'sysadmin type' global operations
racing then one or both are not going to get what they expected.  Instead of
changing the code to 'handle the race', I think it might be acceptable to just
put a big semaphore around it.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-22  9:53 [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers Muchun Song
  2020-08-24 20:59 ` Andrew Morton
@ 2020-08-25  2:42 ` Muchun Song
  2020-08-25 15:25 ` Andi Kleen
  2 siblings, 0 replies; 11+ messages in thread
From: Muchun Song @ 2020-08-25  2:42 UTC (permalink / raw)
  To: mike.kravetz, Andrew Morton; +Cc: ak, Linux Memory Management List, LKML

Hi Andrew and Mike,

On Sat, Aug 22, 2020 at 5:53 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> There is a race between the assignment of `table->data` and write value
> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
> Fix this by duplicating the `table`, and only update the duplicate of
> it. And introduce a helper of proc_hugetlb_doulongvec_minmax() to
> simplify the code.

I am sorry, I didn't expose more details about how the race happened.

CPU0:                                     CPU1:
                                          proc_sys_write
hugetlb_sysctl_handler                      proc_sys_call_handler
hugetlb_sysctl_handler_common                 hugetlb_sysctl_handler
  table->data = &tmp;                           hugetlb_sysctl_handler_common
                                                  table->data = &tmp;
    proc_doulongvec_minmax
      do_proc_doulongvec_minmax             sysctl_head_finish
        __do_proc_doulongvec_minmax
          i = table->data;
          *i = val;     // corrupt CPU1 stack

>
> The following oops was seen:
>
>     BUG: kernel NULL pointer dereference, address: 0000000000000000
>     #PF: supervisor instruction fetch in kernel mode
>     #PF: error_code(0x0010) - not-present page
>     Code: Bad RIP value.

Here we can see the "Bad RIP value", so the stack frame is corrupted by
others.

>     ...
>     Call Trace:
>      ? set_max_huge_pages+0x3da/0x4f0
>      ? alloc_pool_huge_page+0x150/0x150
>      ? proc_doulongvec_minmax+0x46/0x60
>      ? hugetlb_sysctl_handler_common+0x1c7/0x200
>      ? nr_hugepages_store+0x20/0x20
>      ? copy_fd_bitmaps+0x170/0x170
>      ? hugetlb_sysctl_handler+0x1e/0x20
>      ? proc_sys_call_handler+0x2f1/0x300
>      ? unregister_sysctl_table+0xb0/0xb0
>      ? __fd_install+0x78/0x100
>      ? proc_sys_write+0x14/0x20
>      ? __vfs_write+0x4d/0x90
>      ? vfs_write+0xef/0x240
>      ? ksys_write+0xc0/0x160
>      ? __ia32_sys_read+0x50/0x50
>      ? __close_fd+0x129/0x150
>      ? __x64_sys_write+0x43/0x50
>      ? do_syscall_64+0x6c/0x200
>      ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
>  mm/hugetlb.c | 27 +++++++++++++++++++++------
>  1 file changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a301c2d672bf..818d6125af49 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3454,6 +3454,23 @@ static unsigned int allowed_mems_nr(struct hstate *h)
>  }
>
>  #ifdef CONFIG_SYSCTL
> +static int proc_hugetlb_doulongvec_minmax(struct ctl_table *table, int write,
> +                                         void *buffer, size_t *length,
> +                                         loff_t *ppos, unsigned long *out)
> +{
> +       struct ctl_table dup_table;
> +
> +       /*
> +        * In order to avoid races with __do_proc_doulongvec_minmax(), we
> +        * can duplicate the @table and alter the duplicate of it.
> +        */
> +       dup_table = *table;
> +       dup_table.data = out;
> +       dup_table.maxlen = sizeof(unsigned long);
> +
> +       return proc_doulongvec_minmax(&dup_table, write, buffer, length, ppos);
> +}
> +
>  static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>                          struct ctl_table *table, int write,
>                          void *buffer, size_t *length, loff_t *ppos)
> @@ -3465,9 +3482,8 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
>         if (!hugepages_supported())
>                 return -EOPNOTSUPP;
>
> -       table->data = &tmp;
> -       table->maxlen = sizeof(unsigned long);
> -       ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
> +       ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
> +                                            &tmp);
>         if (ret)
>                 goto out;
>
> @@ -3510,9 +3526,8 @@ int hugetlb_overcommit_handler(struct ctl_table *table, int write,
>         if (write && hstate_is_gigantic(h))
>                 return -EINVAL;
>
> -       table->data = &tmp;
> -       table->maxlen = sizeof(unsigned long);
> -       ret = proc_doulongvec_minmax(table, write, buffer, length, ppos);
> +       ret = proc_hugetlb_doulongvec_minmax(table, write, buffer, length, ppos,
> +                                            &tmp);
>         if (ret)
>                 goto out;
>
> --
> 2.11.0
>


-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [External] Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-24 21:19   ` Mike Kravetz
@ 2020-08-25  3:01     ` Muchun Song
  2020-08-26  0:01       ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: Muchun Song @ 2020-08-25  3:01 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Andrew Morton, ak, Linux Memory Management List, LKML

On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/24/20 1:59 PM, Andrew Morton wrote:
> > On Sat, 22 Aug 2020 17:53:28 +0800 Muchun Song <songmuchun@bytedance.com> wrote:
> >
> >> There is a race between the assignment of `table->data` and write value
> >> to the pointer of `table->data` in the __do_proc_doulongvec_minmax().
> >
> > Where does __do_proc_doulongvec_minmax() write to table->data?
> >
> > I think you're saying that there is a race between the assignment of
> > ctl_table->table in hugetlb_sysctl_handler_common() and the assignment
> > of the same ctl_table->table in hugetlb_overcommit_handler()?
> >
> > Or not, maybe I'm being thick.  Can you please describe the race more
> > carefully and completely?
> >
>
> I too am looking at this now and do not completely understand the race.
> It could be that:
>
> hugetlb_sysctl_handler_common
> ...
>         table->data = &tmp;
>
> and, do_proc_doulongvec_minmax()
> ...
>         return __do_proc_doulongvec_minmax(table->data, table, write, ...
> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
> ...
>         i = (unsigned long *) data;
>         ...
>                 *i = val;
>
> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
> in one thread when hugetlb_sysctl_handler_common is setting it in another?

Yes, you are right.

>
> Another confusing part of the message is the stack trace which includes
> ...
>      ? set_max_huge_pages+0x3da/0x4f0
>      ? alloc_pool_huge_page+0x150/0x150
>
> which are 'downstream' from these routines.  I don't understand why these
> are in the trace.

I am also confused. But this issue can be reproduced easily by letting more
than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
the issue can not be reproduced and disappears.

>
> If the race is with the pointer set and dereference/write, then this type of
> fix is OK.  However, if you really have two 'sysadmin type' global operations
> racing then one or both are not going to get what they expected.  Instead of

In our team, more than one developer shares one server which is a test server.
I guess that the panic happens when two people write
`/proc/sys/vm/nr_hugepages`.

> changing the code to 'handle the race', I think it might be acceptable to just
> put a big semaphore around it.

It is also a good idea to fix this issue. Thanks.

> --
> Mike Kravetz



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-22  9:53 [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers Muchun Song
  2020-08-24 20:59 ` Andrew Morton
  2020-08-25  2:42 ` Muchun Song
@ 2020-08-25 15:25 ` Andi Kleen
  2020-08-26  2:34   ` [Phishing Risk] [External] " Muchun Song
  2 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2020-08-25 15:25 UTC (permalink / raw)
  To: Muchun Song
  Cc: mike.kravetz, akpm, npiggin, agl, nacc, linux-mm, linux-kernel

> Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")

I don't think the Fixes line is correct. The original patch
just used a global variable and didn't have this race.
It must have been added later in some other patch that added
hugetlb_sysctl_handler_common.

-Andi

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [External] Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-25  3:01     ` [External] " Muchun Song
@ 2020-08-26  0:01       ` Mike Kravetz
  2020-08-26  2:47         ` Muchun Song
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2020-08-26  0:01 UTC (permalink / raw)
  To: Muchun Song; +Cc: Andrew Morton, ak, Linux Memory Management List, LKML

On 8/24/20 8:01 PM, Muchun Song wrote:
> On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> I too am looking at this now and do not completely understand the race.
>> It could be that:
>>
>> hugetlb_sysctl_handler_common
>> ...
>>         table->data = &tmp;
>>
>> and, do_proc_doulongvec_minmax()
>> ...
>>         return __do_proc_doulongvec_minmax(table->data, table, write, ...
>> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
>> ...
>>         i = (unsigned long *) data;
>>         ...
>>                 *i = val;
>>
>> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
>> in one thread when hugetlb_sysctl_handler_common is setting it in another?
> 
> Yes, you are right.
> 
>>
>> Another confusing part of the message is the stack trace which includes
>> ...
>>      ? set_max_huge_pages+0x3da/0x4f0
>>      ? alloc_pool_huge_page+0x150/0x150
>>
>> which are 'downstream' from these routines.  I don't understand why these
>> are in the trace.
> 
> I am also confused. But this issue can be reproduced easily by letting more
> than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
> the issue can not be reproduced and disappears.

There certainly is an issue here as one thread can modify data in another.
However, I am having a hard time seeing what causes the 'kernel NULL pointer
dereference'.

I tried to reproduce the issue myself but was unsuccessful.  I have 16 threads
writing to /proc/sys/vm/nr_hugepages in an infinite loop.  After several hours
running, I did not hit the issue.  Just curious, what architecture is the
system?  any special config or compiler options?

If you can easily reproduce, can you post the detailed oops message?

The 'NULL pointer' seems strange because after the first assignment to
table->data the value should never be NULL.  Certainly it can be modified
by another thread, but I can not see how it can be NULL.  At the beginning
of __do_proc_doulongvec_minmax, there is a check for NULL pointer with:

	if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
		*lenp = 0;
		return 0;
	}

I looked at the code my compiler produced for __do_proc_doulongvec_minmax.
It appears to use the same value/register for the pointer throughout the
routine.  IOW, I do not see how the pointer can be NULL for the assignment
when the routine does:

			*i = val;

Again, your analysis/patch points out a real issue.  I just want to get
a better understanding to make sure there is not another issue causing
the NULL pointer dereference.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Phishing Risk] [External] Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-25 15:25 ` Andi Kleen
@ 2020-08-26  2:34   ` Muchun Song
  0 siblings, 0 replies; 11+ messages in thread
From: Muchun Song @ 2020-08-26  2:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: mike.kravetz, Andrew Morton, npiggin, agl, nacc,
	Linux Memory Management List, LKML

HI Andi,

On Tue, Aug 25, 2020 at 11:34 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> > Fixes: e5ff215941d5 ("hugetlb: multiple hstates for multiple page sizes")
>
> I don't think the Fixes line is correct. The original patch
> just used a global variable and didn't have this race.
> It must have been added later in some other patch that added
> hugetlb_sysctl_handler_common.

I don't agree with you. The 'e5ff215941d5' is not used a global
variable. Below is the code snippet of this patch. Thanks.

@@ -1037,8 +1109,19 @@ int hugetlb_sysctl_handler(struct ctl_table
*table, int write,
                           struct file *file, void __user *buffer,
                           size_t *length, loff_t *ppos)
 {
+       struct hstate *h = &default_hstate;
+       unsigned long tmp;

Here is a local variable of tmp.

+
+       if (!write)
+               tmp = h->max_huge_pages;
+
+       table->data = &tmp;
+       table->maxlen = sizeof(unsigned long);
        proc_doulongvec_minmax(table, write, file, buffer, length, ppos);
-       max_huge_pages = set_max_huge_pages(max_huge_pages);
+
+       if (write)
+               h->max_huge_pages = set_max_huge_pages(h, tmp);
+
        return 0;
 }


>
> -Andi



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [External] Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-26  0:01       ` Mike Kravetz
@ 2020-08-26  2:47         ` Muchun Song
  2020-08-27 21:51           ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: Muchun Song @ 2020-08-26  2:47 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Andrew Morton, ak, Linux Memory Management List, LKML

On Wed, Aug 26, 2020 at 8:03 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/24/20 8:01 PM, Muchun Song wrote:
> > On Tue, Aug 25, 2020 at 5:21 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> I too am looking at this now and do not completely understand the race.
> >> It could be that:
> >>
> >> hugetlb_sysctl_handler_common
> >> ...
> >>         table->data = &tmp;
> >>
> >> and, do_proc_doulongvec_minmax()
> >> ...
> >>         return __do_proc_doulongvec_minmax(table->data, table, write, ...
> >> with __do_proc_doulongvec_minmax(void *data, struct ctl_table *table, ...
> >> ...
> >>         i = (unsigned long *) data;
> >>         ...
> >>                 *i = val;
> >>
> >> So, __do_proc_doulongvec_minmax can be dereferencing and writing to the pointer
> >> in one thread when hugetlb_sysctl_handler_common is setting it in another?
> >
> > Yes, you are right.
> >
> >>
> >> Another confusing part of the message is the stack trace which includes
> >> ...
> >>      ? set_max_huge_pages+0x3da/0x4f0
> >>      ? alloc_pool_huge_page+0x150/0x150
> >>
> >> which are 'downstream' from these routines.  I don't understand why these
> >> are in the trace.
> >
> > I am also confused. But this issue can be reproduced easily by letting more
> > than one thread write to `/proc/sys/vm/nr_hugepages`. With this patch applied,
> > the issue can not be reproduced and disappears.
>
> There certainly is an issue here as one thread can modify data in another.
> However, I am having a hard time seeing what causes the 'kernel NULL pointer
> dereference'.

If you write 0 to '/proc/sys/vm/nr_hugepages', you will get the
    kernel NULL pointer dereference, address: 0000000000000000

If you write 1024 to '/proc/sys/vm/nr_hugepages', you will get the
    kernel NULL pointer dereference, address: 0000000000000400

The address of dereference is the value which you write to the
'/proc/sys/vm/nr_hugepages'.

>
> I tried to reproduce the issue myself but was unsuccessful.  I have 16 threads
> writing to /proc/sys/vm/nr_hugepages in an infinite loop.  After several hours
> running, I did not hit the issue.  Just curious, what architecture is the
> system?  any special config or compiler options?
>
> If you can easily reproduce, can you post the detailed oops message?
>
> The 'NULL pointer' seems strange because after the first assignment to
> table->data the value should never be NULL.  Certainly it can be modified
> by another thread, but I can not see how it can be NULL.  At the beginning
> of __do_proc_doulongvec_minmax, there is a check for NULL pointer with:

CPU0:                                     CPU1:
                                          proc_sys_write
hugetlb_sysctl_handler                      proc_sys_call_handler
hugetlb_sysctl_handler_common                 hugetlb_sysctl_handler
  table->data = &tmp;                           hugetlb_sysctl_handler_common
                                                  table->data = &tmp;
    proc_doulongvec_minmax
      do_proc_doulongvec_minmax             sysctl_head_finish
        __do_proc_doulongvec_minmax
          i = table->data;
          *i = val;     // corrupt CPU1 stack

If the val is 0, you will see the NULL.

>
>         if (!data || !table->maxlen || !*lenp || (*ppos && !write)) {
>                 *lenp = 0;
>                 return 0;
>         }
>
> I looked at the code my compiler produced for __do_proc_doulongvec_minmax.
> It appears to use the same value/register for the pointer throughout the
> routine.  IOW, I do not see how the pointer can be NULL for the assignment
> when the routine does:
>
>                         *i = val;
>
> Again, your analysis/patch points out a real issue.  I just want to get
> a better understanding to make sure there is not another issue causing
> the NULL pointer dereference.

Below is my test script. There are 8 threads to execute the following script.
In my qemu, it is easy to panic. Thanks.

#!/bin/sh

while :
do
        echo 128 > /proc/sys/vm/nr_hugepages
        echo 0 > /proc/sys/vm/nr_hugepages
done

> --
> Mike Kravetz



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [External] Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-26  2:47         ` Muchun Song
@ 2020-08-27 21:51           ` Mike Kravetz
  2020-08-28  2:33             ` Muchun Song
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2020-08-27 21:51 UTC (permalink / raw)
  To: Muchun Song; +Cc: Andrew Morton, ak, Linux Memory Management List, LKML

On 8/25/20 7:47 PM, Muchun Song wrote:
> 
> CPU0:                                     CPU1:
>                                           proc_sys_write
> hugetlb_sysctl_handler                      proc_sys_call_handler
> hugetlb_sysctl_handler_common                 hugetlb_sysctl_handler
>   table->data = &tmp;                           hugetlb_sysctl_handler_common
>                                                   table->data = &tmp;
>     proc_doulongvec_minmax
>       do_proc_doulongvec_minmax             sysctl_head_finish
>         __do_proc_doulongvec_minmax
>           i = table->data;
>           *i = val;     // corrupt CPU1 stack

Thanks Muchun!
Can you please add this to the commit message.

Also, when looking closer at the patch I do not think setting table->maxlen
is necessary in these routines.  maxlen is set when the hugetlb ctl_table
entries are defined and initialized.  This is not something you introduced.
The unnecessary assignments are in the existing code.  However, there is no
need to carry them forward.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [External] Re: [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers
  2020-08-27 21:51           ` Mike Kravetz
@ 2020-08-28  2:33             ` Muchun Song
  0 siblings, 0 replies; 11+ messages in thread
From: Muchun Song @ 2020-08-28  2:33 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Andrew Morton, Andi Kleen, Linux Memory Management List, LKML

On Fri, Aug 28, 2020 at 5:51 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 8/25/20 7:47 PM, Muchun Song wrote:
> >
> > CPU0:                                     CPU1:
> >                                           proc_sys_write
> > hugetlb_sysctl_handler                      proc_sys_call_handler
> > hugetlb_sysctl_handler_common                 hugetlb_sysctl_handler
> >   table->data = &tmp;                           hugetlb_sysctl_handler_common
> >                                                   table->data = &tmp;
> >     proc_doulongvec_minmax
> >       do_proc_doulongvec_minmax             sysctl_head_finish
> >         __do_proc_doulongvec_minmax
> >           i = table->data;
> >           *i = val;     // corrupt CPU1 stack
>
> Thanks Muchun!
> Can you please add this to the commit message.

OK, I will do that. Thanks.

>
> Also, when looking closer at the patch I do not think setting table->maxlen
> is necessary in these routines.  maxlen is set when the hugetlb ctl_table
> entries are defined and initialized.  This is not something you introduced.
> The unnecessary assignments are in the existing code.  However, there is no
> need to carry them forward.

Yeah, I agree with you. I will remove the unnecessary assignment of
table->maxlen.

>
> --
> Mike Kravetz



-- 
Yours,
Muchun

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-08-28  2:34 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-22  9:53 [PATCH] mm/hugetlb: Fix a race between hugetlb sysctl handlers Muchun Song
2020-08-24 20:59 ` Andrew Morton
2020-08-24 21:19   ` Mike Kravetz
2020-08-25  3:01     ` [External] " Muchun Song
2020-08-26  0:01       ` Mike Kravetz
2020-08-26  2:47         ` Muchun Song
2020-08-27 21:51           ` Mike Kravetz
2020-08-28  2:33             ` Muchun Song
2020-08-25  2:42 ` Muchun Song
2020-08-25 15:25 ` Andi Kleen
2020-08-26  2:34   ` [Phishing Risk] [External] " Muchun Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).