linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask()
@ 2022-06-13 11:29 Yi Wang
  2022-06-13 19:20 ` Yury Norov
  0 siblings, 1 reply; 6+ messages in thread
From: Yi Wang @ 2022-06-13 11:29 UTC (permalink / raw)
  To: yury.norov, andriy.shevchenko, linux
  Cc: linux-kernel, xue.zhihong, wang.yi59, wang.liang82

Consider one situation:

The app have two vmas which mbind() to node 1 and node3 respectively,
and its cpuset.mems is 0-3, now set its cpuset.mems to 1,3, according
to current bitmap_remap(), we got:

    1 => 3
    3 => 3

This maybe confused because node 1,3 have already in the new settiing
region but both nodes are binded to the same node 3 now.

Actually we found the situation on a very old libvirt and qemu, but
this can be easily reproduced in the current kernel, so we try to fix
it.

A possible fix way is to ignore the bits in @src have already existed
in @new.

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
---
 lib/bitmap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/bitmap.c b/lib/bitmap.c
index b18e31ea6e66..b77bf1b3852e 100644
--- a/lib/bitmap.c
+++ b/lib/bitmap.c
@@ -1006,8 +1006,8 @@ unsigned int bitmap_ord_to_pos(const unsigned long *buf, unsigned int ord, unsig
  * @dst point to the same location, then this routine copies @src
  * to @dst.
  *
- * The positions of unset bits in @old are mapped to themselves
- * (the identify map).
+ * The positions of unset bits in @old or bits in @src have already
+ * existed in @new are mapped to themselves (the identify map).
  *
  * Apply the above specified mapping to @src, placing the result in
  * @dst, clearing any bits previously set in @dst.
@@ -1033,7 +1033,7 @@ void bitmap_remap(unsigned long *dst, const unsigned long *src,
 	for_each_set_bit(oldbit, src, nbits) {
 		int n = bitmap_pos_to_ord(old, oldbit, nbits);
 
-		if (n < 0 || w == 0)
+		if (n < 0 || w == 0 || test_bit(oldbit, new))
 			set_bit(oldbit, dst);	/* identity map */
 		else
 			set_bit(bitmap_ord_to_pos(new, n % w, nbits), dst);
-- 
2.33.0.rc0.dirty

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask()
  2022-06-13 11:29 [PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask() Yi Wang
@ 2022-06-13 19:20 ` Yury Norov
  2022-06-14  3:45   ` wang.yi59
  0 siblings, 1 reply; 6+ messages in thread
From: Yury Norov @ 2022-06-13 19:20 UTC (permalink / raw)
  To: Yi Wang; +Cc: andriy.shevchenko, linux, linux-kernel, xue.zhihong, wang.liang82

On Mon, Jun 13, 2022 at 4:31 AM Yi Wang <wang.yi59@zte.com.cn> wrote:
>
> Consider one situation:
>
> The app have two vmas which mbind() to node 1 and node3 respectively,
> and its cpuset.mems is 0-3, now set its cpuset.mems to 1,3, according
> to current bitmap_remap(), we got:
>
>     1 => 3
>     3 => 3
>
> This maybe confused because node 1,3 have already in the new settiing
> region but both nodes are binded to the same node 3 now.
>
> Actually we found the situation on a very old libvirt and qemu, but
> this can be easily reproduced in the current kernel, so we try to fix
> it.
>
> A possible fix way is to ignore the bits in @src have already existed
> in @new.
>
> Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
> ---
>  lib/bitmap.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/lib/bitmap.c b/lib/bitmap.c
> index b18e31ea6e66..b77bf1b3852e 100644
> --- a/lib/bitmap.c
> +++ b/lib/bitmap.c
> @@ -1006,8 +1006,8 @@ unsigned int bitmap_ord_to_pos(const unsigned long *buf, unsigned int ord, unsig
>   * @dst point to the same location, then this routine copies @src
>   * to @dst.
>   *
> - * The positions of unset bits in @old are mapped to themselves
> - * (the identify map).
> + * The positions of unset bits in @old or bits in @src have already
> + * existed in @new are mapped to themselves (the identify map).
>   *
>   * Apply the above specified mapping to @src, placing the result in
>   * @dst, clearing any bits previously set in @dst.
> @@ -1033,7 +1033,7 @@ void bitmap_remap(unsigned long *dst, const unsigned long *src,
>         for_each_set_bit(oldbit, src, nbits) {
>                 int n = bitmap_pos_to_ord(old, oldbit, nbits);
>
> -               if (n < 0 || w == 0)
> +               if (n < 0 || w == 0 || test_bit(oldbit, new))
>                         set_bit(oldbit, dst);   /* identity map */
>                 else
>                         set_bit(bitmap_ord_to_pos(new, n % w, nbits), dst);
> --
> 2.33.0.rc0.dirty

Regarding the original problem - can you please confirm that
it's reproduced on current kernels, show the execution path etc.
From what I see on modern kernel, the only user of nodes_remap()
is mpol_rebind_nodemask(). Is that the correct path?

Anyways, as per name, bitmap_remap() is intended to change bit
positions, and it doesn't look wrong if it does so.

This is not how the function is supposed to work. For example,
        old: 00111000
        new: 00011100

means:
        old: 00111 000
             || \\\|||
        new: 000 11100

And after this patch it would be:
        old: 001 11000
             || \|||||
        new: 000 11100

Which is not the same, right?

If mpol_rebind() wants to keep previous relations, then according to
the comment:
 * The positions of unset bits in @old are mapped to themselves
 * (the identify map).

, you can just clear @old bits that already have good relations
you'd like to preserve.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re:[PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask()
  2022-06-13 19:20 ` Yury Norov
@ 2022-06-14  3:45   ` wang.yi59
  2022-06-14 16:40     ` [PATCH] " Yury Norov
  0 siblings, 1 reply; 6+ messages in thread
From: wang.yi59 @ 2022-06-14  3:45 UTC (permalink / raw)
  To: yury.norov
  Cc: andriy.shevchenko, linux, linux-kernel, xue.zhihong,
	wang.liang82, Liu.Jianjun3

Hi Yury,                                                                                                                                                                                                            
                                                                                                                                                                                                                    
Thanks for your quick and clear response!                                                                                                                                                                           
                                                                                                                                                                                                                    
> On Mon, Jun 13, 2022 at 4:31 AM Yi Wang <wang.yi59@zte.com.cn> wrote:                                                                                                                                             
> >                                                                                                                                                                                                                 
> > Consider one situation:                                                                                                                                                                                         
> >                                     
> > The app have two vmas which mbind() to node 1 and node3 respectively,
> > and its cpuset.mems is 0-3, now set its cpuset.mems to 1,3, according
> > to current bitmap_remap(), we got:                         
>                                                       
> Regarding the original problem - can you please confirm that
> it's reproduced on current kernels, show the execution path etc.
> From what I see on modern kernel, the only user of nodes_remap()
> is mpol_rebind_nodemask(). Is that the correct path?       
                                                                
Yes, it's mpol_rebind_nodemask() calls nodes_remap() from
mpol_rebind_policy(). The stacks are as follow:     
[  290.836747]  bitmap_remap+0x84/0xe0
[  290.836753]  mpol_rebind_nodemask+0x64/0x2a0
[  290.836764]  mpol_rebind_mm+0x3a/0x90
[  290.836768]  update_tasks_nodemask+0x8a/0x1e0
[  290.836774]  cpuset_write_resmask+0x563/0xa00
[  290.836780]  cgroup_file_write+0x81/0x150
[  290.836784]  kernfs_fop_write_iter+0x12d/0x1c0
[  290.836791]  new_sync_write+0x109/0x190
[  290.836800]  vfs_write+0x218/0x2a0
[  290.836809]  ksys_write+0x59/0xd0
[  290.836812]  do_syscall_64+0x37/0x80
[  290.836818]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
                                          
To reproduce this situation, I write a program which seems like this:
    unsigned int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    unsigned long size = 262144 << 12;                            
    unsigned long node1 = 2;   // node 1
    unsigned long node2 = 8;   // node 3
                                                                         
    p1 = vma1 = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    p2 = vma2 = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
                       
    assert(!mbind(vma1, size, MPOL_BIND, &node1, MAX_NODES, MPOL_MF_STRICT | MPOL_MF_MOVE));
    assert(!mbind(vma2, size, MPOL_BIND, &node2, MAX_NODES, MPOL_MF_STRICT | MPOL_MF_MOVE));
                                                                      
Start the program whos name is mbind_tester, and do follow steps:
                                                                 
        mkdir && cd /sys/fs/cgroup/cpuset/mbind
        echo 0-31 > cpuset.cpus     
        echo 0-3 > cpuset.mems                                   
                                    
        cat /proc/`pidof mbind_tester`/numa_maps |grep bind -w                
        7ff73e200000 bind:3 anon=262144 dirty=262144 active=0 N3=262144 kernelpagesize_kB=4                                                                                                                        
        7ff77e200000 bind:1 anon=262144 dirty=262144 active=0 N1=262144 kernelpagesize_kB=4

        echo 1,3 > cpuset.mems
        cat /proc/`pidof mbind_tester`/numa_maps |grep bind -w
        7ff73e200000 bind:3 anon=262144 dirty=262144 active=0 N3=262144 kernelpagesize_kB=4
        7ff77e200000 bind:3 anon=262144 dirty=262144 active=0 N1=262144 kernelpagesize_kB=4

As you see, after set cpuset.mems to 1,3, the nodes which one of vma
binded to changed from 1 to 3.

This maybe confused, the original nodes binded is 1, after modify
cpuset.mems to 1,3 which include the node 3, it changed to 3...

>
> Anyways, as per name, bitmap_remap() is intended to change bit
> positions, and it doesn't look wrong if it does so.
>
> This is not how the function is supposed to work. For example,
>         old: 00111000
>         new: 00011100
>
> means:
>         old: 00111 000
>              || \\\|||
>         new: 000 11100
>
> And after this patch it would be:
>         old: 001 11000
>              || \|||||
>         new: 000 11100
>
> Which is not the same, right?

Right.

Actually this is what makes me embarrassed. If we want to fix this
situtation, we can:

- change the bitmap_remap() as this patch did, but this changed the
behavior of this routine which looks does the right thing. One good
news is this function is only called by mpol_rebind_nodemask().

- don't change the bitmap_remap(), to be honest, I didn't figure out
a way. Any suggestions?

>
> If mpol_rebind() wants to keep previous relations, then according to
> the comment:
>  * The positions of unset bits in @old are mapped to themselves
>  * (the identify map).
>
> , you can just clear @old bits that already have good relations
> you'd like to preserve.

Actually this does not work for me :)

In the example above, if set cpuset.mems to 0,2 firstly, the nodes
binds will change from 1 to 2. And then set cpuset.mems to 1,3, it will
change from 2 to 3 again.


---
Best wishes
Yi Wang

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask()
  2022-06-14  3:45   ` wang.yi59
@ 2022-06-14 16:40     ` Yury Norov
  2022-06-14 16:50       ` Andy Shevchenko
  2022-06-15  8:54       ` wang.yi59
  0 siblings, 2 replies; 6+ messages in thread
From: Yury Norov @ 2022-06-14 16:40 UTC (permalink / raw)
  To: wang.yi59, Andrew Morton, linux-mm
  Cc: andriy.shevchenko, linux, linux-kernel, xue.zhihong,
	wang.liang82, Liu.Jianjun3, Yury Norov

+ Andrew Morton <akpm@linux-foundation.org>
+ linux-mm@kvack.org

On Mon, Jun 13, 2022 at 8:45 PM <wang.yi59@zte.com.cn> wrote:
>
> Hi Yury,
>
> Thanks for your quick and clear response!
>
> > On Mon, Jun 13, 2022 at 4:31 AM Yi Wang <wang.yi59@zte.com.cn> wrote:
> > >
> > > Consider one situation:
> > >
> > > The app have two vmas which mbind() to node 1 and node3 respectively,
> > > and its cpuset.mems is 0-3, now set its cpuset.mems to 1,3, according
> > > to current bitmap_remap(), we got:
> >
> > Regarding the original problem - can you please confirm that
> > it's reproduced on current kernels, show the execution path etc.
> > From what I see on modern kernel, the only user of nodes_remap()
> > is mpol_rebind_nodemask(). Is that the correct path?
>
> Yes, it's mpol_rebind_nodemask() calls nodes_remap() from
> mpol_rebind_policy(). The stacks are as follow:
> [  290.836747]  bitmap_remap+0x84/0xe0
> [  290.836753]  mpol_rebind_nodemask+0x64/0x2a0
> [  290.836764]  mpol_rebind_mm+0x3a/0x90
> [  290.836768]  update_tasks_nodemask+0x8a/0x1e0
> [  290.836774]  cpuset_write_resmask+0x563/0xa00
> [  290.836780]  cgroup_file_write+0x81/0x150
> [  290.836784]  kernfs_fop_write_iter+0x12d/0x1c0
> [  290.836791]  new_sync_write+0x109/0x190
> [  290.836800]  vfs_write+0x218/0x2a0
> [  290.836809]  ksys_write+0x59/0xd0
> [  290.836812]  do_syscall_64+0x37/0x80
> [  290.836818]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> To reproduce this situation, I write a program which seems like this:
>     unsigned int flags = MAP_PRIVATE | MAP_ANONYMOUS;
>     unsigned long size = 262144 << 12;
>     unsigned long node1 = 2;   // node 1
>     unsigned long node2 = 8;   // node 3
>
>     p1 = vma1 = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
>     p2 = vma2 = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
>
>     assert(!mbind(vma1, size, MPOL_BIND, &node1, MAX_NODES, MPOL_MF_STRICT | MPOL_MF_MOVE));
>     assert(!mbind(vma2, size, MPOL_BIND, &node2, MAX_NODES, MPOL_MF_STRICT | MPOL_MF_MOVE));
>
> Start the program whos name is mbind_tester, and do follow steps:
>
>         mkdir && cd /sys/fs/cgroup/cpuset/mbind
>         echo 0-31 > cpuset.cpus
>         echo 0-3 > cpuset.mems
>
>         cat /proc/`pidof mbind_tester`/numa_maps |grep bind -w
>         7ff73e200000 bind:3 anon=262144 dirty=262144 active=0 N3=262144 kernelpagesize_kB=4
>         7ff77e200000 bind:1 anon=262144 dirty=262144 active=0 N1=262144 kernelpagesize_kB=4
>
>         echo 1,3 > cpuset.mems
>         cat /proc/`pidof mbind_tester`/numa_maps |grep bind -w
>         7ff73e200000 bind:3 anon=262144 dirty=262144 active=0 N3=262144 kernelpagesize_kB=4
>         7ff77e200000 bind:3 anon=262144 dirty=262144 active=0 N1=262144 kernelpagesize_kB=4
>
> As you see, after set cpuset.mems to 1,3, the nodes which one of vma
> binded to changed from 1 to 3.
>
> This maybe confused, the original nodes binded is 1, after modify
> cpuset.mems to 1,3 which include the node 3, it changed to 3...

Ok, thanks for the reproducer. I'll take a look at it closer to the weekend.

> > Anyways, as per name, bitmap_remap() is intended to change bit
> > positions, and it doesn't look wrong if it does so.
> >
> > This is not how the function is supposed to work. For example,
> >         old: 00111000
> >         new: 00011100
> >
> > means:
> >         old: 00111 000
> >              || \\\|||
> >         new: 000 11100
> >
> > And after this patch it would be:
> >         old: 001 11000
> >              || \|||||
> >         new: 000 11100
> >
> > Which is not the same, right?
>
> Right.

So, we both agree that bitmap_remap() works as advertised. This is
good. Let's try figuring out a solution without touching it.

> Actually this is what makes me embarrassed. If we want to fix this
> situtation, we can:
>
> - change the bitmap_remap() as this patch did, but this changed the
> behavior of this routine which looks does the right thing. One good
> news is this function is only called by mpol_rebind_nodemask().

There are users of bitmap_remap() in drivers/gpio/gpio-xilinx.c

> - don't change the bitmap_remap(), to be honest, I didn't figure out
> a way. Any suggestions?

I haven't had a chance to play with it (because of my dayjob), but I
have a strong feeling that the proper solution should come from
existing functionality.

Did you experiment with MPOL_F_{STATIC,RELATIVE}_NODES?
Those flags enable nodes_and() and mpol_relative_nodemask()
paths correspondingly.

> > If mpol_rebind() wants to keep previous relations, then according to
> > the comment:
> >  * The positions of unset bits in @old are mapped to themselves
> >  * (the identify map).
> >
> > , you can just clear @old bits that already have good relations
> > you'd like to preserve.
>
> Actually this does not work for me :)

What I suggested is:

 328                 node_clear(1, pol->w.cpuset_mems_allowed);
 329                 node_clear(3, pol->w.cpuset_mems_allowed);
 330                 nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed,
 331                                                                 *nodes);
 332                 pol->w.cpuset_mems_allowed = *nodes;

> In the example above, if set cpuset.mems to 0,2 firstly, the nodes
> binds will change from 1 to 2. And then set cpuset.mems to 1,3, it will
> change from 2 to 3 again.

I bet you can find a sequence that will finally give you the desired binding.
And probably somebody does this black magic in production. For me it
looks just scary. Can you try those static/relative flags, and if it doesn't
work, we'll have to invent another policy for nodes binding.

Adding Andrew and linux-mm, as it's definitely beyond bitmaps scope.

Thanks,
Yury

> ---
> Best wishes
> Yi Wang

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask()
  2022-06-14 16:40     ` [PATCH] " Yury Norov
@ 2022-06-14 16:50       ` Andy Shevchenko
  2022-06-15  8:54       ` wang.yi59
  1 sibling, 0 replies; 6+ messages in thread
From: Andy Shevchenko @ 2022-06-14 16:50 UTC (permalink / raw)
  To: Yury Norov
  Cc: wang.yi59, Andrew Morton, linux-mm, linux, linux-kernel,
	xue.zhihong, wang.liang82, Liu.Jianjun3

On Tue, Jun 14, 2022 at 09:40:08AM -0700, Yury Norov wrote:
> On Mon, Jun 13, 2022 at 8:45 PM <wang.yi59@zte.com.cn> wrote:
> > > On Mon, Jun 13, 2022 at 4:31 AM Yi Wang <wang.yi59@zte.com.cn> wrote:

...

> > > Anyways, as per name, bitmap_remap() is intended to change bit
> > > positions, and it doesn't look wrong if it does so.

Haven't read the discussion in full, but saw the function here and my 2 cents
are the following:
 1) if we don't have tests for bitmap_remap() / bitmap_bitremap(), we should;
 2) these functions are used at least in one GPIO driver, so it would be nice
    to not touch them if they are not broken (see 1 above).

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re:[PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask()
  2022-06-14 16:40     ` [PATCH] " Yury Norov
  2022-06-14 16:50       ` Andy Shevchenko
@ 2022-06-15  8:54       ` wang.yi59
  1 sibling, 0 replies; 6+ messages in thread
From: wang.yi59 @ 2022-06-15  8:54 UTC (permalink / raw)
  To: yury.norov
  Cc: akpm, linux-mm, andriy.shevchenko, linux, linux-kernel,
	xue.zhihong, wang.liang82, Liu.Jianjun3, yury.norov

> On Mon, Jun 13, 2022 at 8:45 PM <wang.yi59@zte.com.cn> wrote:
> >
> >
> > - change the bitmap_remap() as this patch did, but this changed the
> > behavior of this routine which looks does the right thing. One good
> > news is this function is only called by mpol_rebind_nodemask().
> 
> There are users of bitmap_remap() in drivers/gpio/gpio-xilinx.c
> 
> > - don't change the bitmap_remap(), to be honest, I didn't figure out
> > a way. Any suggestions?
> 
> I haven't had a chance to play with it (because of my dayjob), but I
> have a strong feeling that the proper solution should come from
> existing functionality.
> 
> Did you experiment with MPOL_F_{STATIC,RELATIVE}_NODES?
> Those flags enable nodes_and() and mpol_relative_nodemask()
> paths correspondingly.
> 

You are right, adding MPOL_F_STATIC_NODES can fix this situation. So it
seems that we don't need to do anything in kernel now.

Many thanks for your time :)


> What I suggested is:
> 
>  328                 node_clear(1, pol->w.cpuset_mems_allowed);
>  329                 node_clear(3, pol->w.cpuset_mems_allowed);
>  330                 nodes_remap(tmp, pol->nodes, pol->w.cpuset_mems_allowed,
>  331                                                                 *nodes);
>  332                 pol->w.cpuset_mems_allowed = *nodes;
> 
> > In the example above, if set cpuset.mems to 0,2 firstly, the nodes
> > binds will change from 1 to 2. And then set cpuset.mems to 1,3, it will
> > change from 2 to 3 again.


---
Best wishes
Yi Wang

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-06-15  8:55 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-13 11:29 [PATCH] bitmap: fix a unproper remap when mpol_rebind_nodemask() Yi Wang
2022-06-13 19:20 ` Yury Norov
2022-06-14  3:45   ` wang.yi59
2022-06-14 16:40     ` [PATCH] " Yury Norov
2022-06-14 16:50       ` Andy Shevchenko
2022-06-15  8:54       ` wang.yi59

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).