linux-mm.kvack.org archive mirror
* unexpected -ENOMEM from percpu_counter_init()
@ 2021-04-01 10:51 Wang Yugui
  2021-04-02  1:49 ` Wang Yugui
  2021-04-07 12:35 ` Vlastimil Babka
  0 siblings, 2 replies; 27+ messages in thread
From: Wang Yugui @ 2021-04-01 10:51 UTC
  To: linux-mm

Hi,

An unexpected -ENOMEM from percpu_counter_init() occurred while running
xfstests on kernels 5.11.10 and 5.10.27.

direct caller:
int btrfs_drew_lock_init(struct btrfs_drew_lock *lock)
{
    int ret;

    ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL);
    if (ret)
        return ret;

    atomic_set(&lock->readers, 0);
    init_waitqueue_head(&lock->pending_readers);
    init_waitqueue_head(&lock->pending_writers);

    return 0;
}

upper caller:
    nofs_flag = memalloc_nofs_save();
    ret = btrfs_drew_lock_init(&root->snapshot_lock);
    memalloc_nofs_restore(nofs_flag);
    if (ret == -ENOMEM) printk("ENOMEM btrfs_drew_lock_init\n");
    if (ret)
        goto fail;

The hardware of this server:
CPU:  Xeon(R) CPU E5-2660 v2(10 core)  *2
memory:  192G, no swap

Only one xfstests job is running on this server, and about 7% of memory
is in use.

Any advice would be appreciated.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/01




* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-01 10:51 unexpected -ENOMEM from percpu_counter_init() Wang Yugui
@ 2021-04-02  1:49 ` Wang Yugui
  2021-04-07 12:35 ` Vlastimil Babka
  1 sibling, 0 replies; 27+ messages in thread
From: Wang Yugui @ 2021-04-02  1:49 UTC
  To: linux-mm, Wang Yugui

Hi,

This problem also happens with 5.12.0-0.rc5.20210331git2bb25b3a748a.
https://kojipkgs.fedoraproject.org/packages/kernel/5.12.0/0.rc5.20210331git2bb25b3a748a.181.eln110/

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/02





* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-01 10:51 unexpected -ENOMEM from percpu_counter_init() Wang Yugui
  2021-04-02  1:49 ` Wang Yugui
@ 2021-04-07 12:35 ` Vlastimil Babka
  2021-04-07 13:09   ` Wang Yugui
  2021-04-09  9:52   ` Wang Yugui
  1 sibling, 2 replies; 27+ messages in thread
From: Vlastimil Babka @ 2021-04-07 12:35 UTC
  To: Wang Yugui, linux-mm; +Cc: linux-btrfs

+CC btrfs

On 4/1/21 12:51 PM, Wang Yugui wrote:
> Hi,
> 
> an unexpected -ENOMEM from percpu_counter_init() happened when xfstest 
> with kernel 5.11.10 and 5.10.27

Is there a dmesg log showing allocation failure or something?





* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-07 12:35 ` Vlastimil Babka
@ 2021-04-07 13:09   ` Wang Yugui
  2021-04-07 14:56     ` Dennis Zhou
  2021-04-09  9:52   ` Wang Yugui
  1 sibling, 1 reply; 27+ messages in thread
From: Wang Yugui @ 2021-04-07 13:09 UTC
  To: Vlastimil Babka; +Cc: linux-mm, linux-btrfs, wangyugui

Hi,

> +CC btrfs
> 
> On 4/1/21 12:51 PM, Wang Yugui wrote:
> > Hi,
> > 
> > an unexpected -ENOMEM from percpu_counter_init() happened when xfstest 
> > with kernel 5.11.10 and 5.10.27
> 
> Is there a dmesg log showing allocation failure or something?

When percpu_counter_init() returns the unexpected -ENOMEM, btrfs, as the
upper caller, eventually prints an error to dmesg.

We also added one trace line to the btrfs source to confirm it:
>     if (ret == -ENOMEM) printk("ENOMEM btrfs_drew_lock_init\n");


With the following change, the reproduction rate dropped from >50% to
very rare or not at all.

diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a..0127be1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -104,8 +104,8 @@
 /* chunks in slots below this are subject to being sidelined on failed alloc */
 #define PCPU_SLOT_FAIL_THRESHOLD	3
 
-#define PCPU_EMPTY_POP_PAGES_LOW	2
-#define PCPU_EMPTY_POP_PAGES_HIGH	4
+#define PCPU_EMPTY_POP_PAGES_LOW	8
+#define PCPU_EMPTY_POP_PAGES_HIGH	16
 
 #ifdef CONFIG_SMP
 /* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 5e76af7..8cc091b 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -14,7 +14,7 @@
 
 /* enough to cover all DEFINE_PER_CPUs in modules */
 #ifdef CONFIG_MODULES
-#define PERCPU_MODULE_RESERVE		(8 << 10)
+#define PERCPU_MODULE_RESERVE		(32 << 10)
 #else
 #define PERCPU_MODULE_RESERVE		0
 #endif


Just some guesses:
1) It may be related to hitting the 'vm.dirty_bytes=10737418240' threshold.

This problem happens on
server/T7610 with E5-2660v2 *2 and SSD/SAS(6Gb/s) and 192G memory
but does not happen on
server/T620 with E5-2680v2 *2 and SSD/NVMe and 192G memory.

2) It may be related to NUMA:
128G memory in node1(CPU1), and 64G in node2(CPU2)

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/07







* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-07 13:09   ` Wang Yugui
@ 2021-04-07 14:56     ` Dennis Zhou
  2021-04-07 23:28       ` Wang Yugui
  0 siblings, 1 reply; 27+ messages in thread
From: Dennis Zhou @ 2021-04-07 14:56 UTC
  To: Wang Yugui; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

Hello,

On Wed, Apr 07, 2021 at 09:09:07PM +0800, Wang Yugui wrote:
> Hi,
> 
> > +CC btrfs
> > 
> > On 4/1/21 12:51 PM, Wang Yugui wrote:
> > > Hi,
> > > 
> > > an unexpected -ENOMEM from percpu_counter_init() happened when xfstest 
> > > with kernel 5.11.10 and 5.10.27
> > 
> > Is there a dmesg log showing allocation failure or something?
> 
> When unexpected -ENOMEM of percpu_counter_init(), btrfs as upper caller
> finally output something to dmesg.
> 
> And we add one trace to btrfs source to make sure that.
> >     if (ret == -ENOMEM) printk("ENOMEM btrfs_drew_lock_init\n");
> 
> 
> Now the reproduce frequency become from >50% to not happen or very slow
> with the flowing change.
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 6596a0a..0127be1 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -104,8 +104,8 @@
>  /* chunks in slots below this are subject to being sidelined on failed alloc */
>  #define PCPU_SLOT_FAIL_THRESHOLD	3
>  
> -#define PCPU_EMPTY_POP_PAGES_LOW	2
> -#define PCPU_EMPTY_POP_PAGES_HIGH	4
> +#define PCPU_EMPTY_POP_PAGES_LOW	8
> +#define PCPU_EMPTY_POP_PAGES_HIGH	16
>  

These settings are from 2014 when Tejun initially implemented the atomic
allocation float. It is probably time to think about increasing the
number of pages. I'd prefer to do it in a dynamic way though (some X% of
a chunk instead of a fixed number increase).
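
One way to read the "X% of a chunk" idea is sketched below. The function
name, the struct, and the 2%/4% figures used in the note after the code are
all hypothetical; nothing here is claimed to exist in mm/percpu.c. The
sketch only shows how the low and high watermarks for empty populated pages
could scale with chunk size while never dropping below the current fixed
values of 2 and 4.

```c
#include <assert.h>

/* Current fixed values, taken from the diff quoted in this thread. */
#define PCPU_EMPTY_POP_PAGES_LOW   2
#define PCPU_EMPTY_POP_PAGES_HIGH  4

/* Hypothetical pair of watermarks, counted in pages. */
struct pop_watermarks {
    int low;
    int high;
};

/*
 * Hypothetical helper: derive the watermarks as a percentage of the
 * pages in one chunk, using the fixed values as a floor.
 */
static struct pop_watermarks pop_watermarks_for(int chunk_pages,
                                                int low_pct, int high_pct)
{
    struct pop_watermarks wm;

    assert(chunk_pages > 0);
    assert(low_pct >= 0 && low_pct <= high_pct && high_pct <= 100);

    wm.low = chunk_pages * low_pct / 100;
    wm.high = chunk_pages * high_pct / 100;

    if (wm.low < PCPU_EMPTY_POP_PAGES_LOW)
        wm.low = PCPU_EMPTY_POP_PAGES_LOW;
    if (wm.high < PCPU_EMPTY_POP_PAGES_HIGH)
        wm.high = PCPU_EMPTY_POP_PAGES_HIGH;

    return wm;
}
```

With these made-up numbers, a 512-page chunk at 2%/4% would keep 10 to 20
empty populated pages around, while a 64-page chunk would fall back to the
existing 2 and 4.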

>  #ifdef CONFIG_SMP
>  /* default addr <-> pcpu_ptr mapping, override in asm/percpu.h if necessary */
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index 5e76af7..8cc091b 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -14,7 +14,7 @@
>  
>  /* enough to cover all DEFINE_PER_CPUs in modules */
>  #ifdef CONFIG_MODULES
> -#define PERCPU_MODULE_RESERVE		(8 << 10)
> +#define PERCPU_MODULE_RESERVE		(32 << 10)
>  #else
>  #define PERCPU_MODULE_RESERVE		0
>  #endif
> 

This is a reserved region purely for module static inits.
btrfs_drew_lock_init() is a dynamic init.

> 
> Just some guess,
> 1) maybe some releationship to the trigger of 'vm.dirty_bytes=10737418240'.
> 
> this problem happen in 
> server/T7610 with E5-2660v2 *2 and SSD/SAS(6Gb/s) and 192G memory
> but not happen in
> server/T620 with E5-2680v2 *2 and SSD/NVMe and 192G memory.
> 
> 2) maybe some releationship to numa.
> 128G memory in node1(CPU1), and 64G in node2(CPU2)
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/04/07
> 
> 
> > > direct caller:
> > > int btrfs_drew_lock_init(struct btrfs_drew_lock *lock)
> > > {
> > >     int ret;
> > > 
> > >     ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL);
> > >     if (ret)
> > >         return ret;
> > > 
> > >     atomic_set(&lock->readers, 0);
> > >     init_waitqueue_head(&lock->pending_readers);
> > >     init_waitqueue_head(&lock->pending_writers);
> > > 
> > >     return 0;
> > > }
> > > 
> > > upper caller:
> > >     nofs_flag = memalloc_nofs_save();
> > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > >     memalloc_nofs_restore(nofs_flag);

The issue is here. nofs is set, which means percpu attempts an atomic
allocation. If it cannot find anything already allocated, the allocation
fails. This behavior predates memalloc_nofs_{save/restore}() becoming
pervasive.

Percpu should probably try to allocate some pages if possible even if
nofs is set.
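
As a rough illustration of the point above: as far as I understand
mm/percpu.c in kernels of this era, pcpu_alloc() first filters the caller's
flags through the task's scoped allocation context and then treats any
request that lacks the full GFP_KERNEL bits as atomic. The userspace sketch
below models only that decision. The SK_* flag values, struct sk_task and
the sk_* helpers are made up for illustration and are not the kernel's
definitions; the "no populated pages means failure" rule is also a
simplification.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits only; NOT the kernel's real values. */
#define SK_GFP_IO       0x1u
#define SK_GFP_FS       0x2u
#define SK_GFP_RECLAIM  0x4u
#define SK_GFP_KERNEL   (SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS)

/* Hypothetical stand-in for the per-task scoped allocation state. */
struct sk_task {
    bool nofs; /* true between memalloc_nofs_save() and _restore() */
};

/* Models the scoped-context filter: a NOFS scope strips the FS bit. */
static unsigned int sk_gfp_context(const struct sk_task *t, unsigned int gfp)
{
    if (t->nofs)
        gfp &= ~SK_GFP_FS;
    return gfp;
}

/* Models the percpu check: anything short of full GFP_KERNEL is atomic. */
static bool sk_percpu_is_atomic(const struct sk_task *t, unsigned int gfp)
{
    gfp = sk_gfp_context(t, gfp);
    return (gfp & SK_GFP_KERNEL) != SK_GFP_KERNEL;
}

/*
 * Simplified outcome: an atomic request can only be served from pages
 * that are already populated; with none left it fails with -ENOMEM (-12).
 */
static int sk_percpu_alloc(const struct sk_task *t, unsigned int gfp,
                           int empty_pop_pages)
{
    assert(empty_pop_pages >= 0);

    if (sk_percpu_is_atomic(t, gfp) && empty_pop_pages == 0)
        return -12;
    return 0;
}
```

In this model the caller passes GFP_KERNEL, as btrfs_drew_lock_init() does,
yet inside a NOFS scope the request still ends up on the atomic path.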


Thanks,
Dennis



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-07 14:56     ` Dennis Zhou
@ 2021-04-07 23:28       ` Wang Yugui
  2021-04-08  2:44         ` Dennis Zhou
  0 siblings, 1 reply; 27+ messages in thread
From: Wang Yugui @ 2021-04-07 23:28 UTC
  To: Dennis Zhou; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

Hi,

> > > > upper caller:
> > > >     nofs_flag = memalloc_nofs_save();
> > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > >     memalloc_nofs_restore(nofs_flag);
> 
> The issue is here. nofs is set which means percpu attempts an atomic
> allocation. If it cannot find anything already allocated it isn't happy.
> This was done before memalloc_nofs_{save/restore}() were pervasive.
> 
> Percpu should probably try to allocate some pages if possible even if
> nofs is set.

Thanks.

I will wait for the patch, and then test it.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/08





* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-07 23:28       ` Wang Yugui
@ 2021-04-08  2:44         ` Dennis Zhou
  2021-04-08  9:20           ` Wang Yugui
  0 siblings, 1 reply; 27+ messages in thread
From: Dennis Zhou @ 2021-04-08  2:44 UTC
  To: Wang Yugui; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

On Thu, Apr 08, 2021 at 07:28:01AM +0800, Wang Yugui wrote:
> Hi,
> 
> > > > > upper caller:
> > > > >     nofs_flag = memalloc_nofs_save();
> > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > >     memalloc_nofs_restore(nofs_flag);
> > 
> > The issue is here. nofs is set which means percpu attempts an atomic
> > allocation. If it cannot find anything already allocated it isn't happy.
> > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > 
> > Percpu should probably try to allocate some pages if possible even if
> > nofs is set.
> 
> Thanks.
> 
> I will wait for the patch, and then test it.
> 

I'm currently a bit busy with some other things. I don't think adding
support will be much work; it's just a little bit tricky.

I recommend carrying what you have minus the change to reserved percpu
memory for now. If I'm the one to write it, I'll cc you.

Thanks,
Dennis



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-08  2:44         ` Dennis Zhou
@ 2021-04-08  9:20           ` Wang Yugui
  2021-04-08 13:48             ` Dennis Zhou
  0 siblings, 1 reply; 27+ messages in thread
From: Wang Yugui @ 2021-04-08  9:20 UTC
  To: Dennis Zhou; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

Hi,

> On Thu, Apr 08, 2021 at 07:28:01AM +0800, Wang Yugui wrote:
> > Hi,
> > 
> > > > > > upper caller:
> > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > >     memalloc_nofs_restore(nofs_flag);
> > > 
> > > The issue is here. nofs is set which means percpu attempts an atomic
> > > allocation. If it cannot find anything already allocated it isn't happy.
> > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > 
> > > Percpu should probably try to allocate some pages if possible even if
> > > nofs is set.
> > 
> > Thanks.
> > 
> > I will wait for the patch, and then test it.
> > 
> 
> I'm currently a bit busy with some other things. Adding support I don't
> think will be much work, just a little bit tricky.
> 
> I recommend carrying what you have minus the change to reserved percpu
> memory for now. If I'm the one to write it, I'll cc you.
> 
> Thanks,
> Dennis


In recent tests, another problem was also triggered with my patch that
extends the percpu buffer size. Maybe this info is helpful.

Problem:
The OS/VGA console freezes and no call trace is printed.
Only some info is logged to IPMI/Dell iDRAC:
   2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
   3 | Linux kernel panic: Fatal excep
   4 | Linux kernel panic: tion
   5 | 04/05/2021 | 19:09:14 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
   6 | Linux kernel panic: Fatal excep
   7 | Linux kernel panic: tion
   8 | 04/06/2021 | 13:08:42 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
   9 | Linux kernel panic: Fatal excep
   a | Linux kernel panic: tion
   b | 04/08/2021 | 02:12:46 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
   c | Linux kernel panic: Fatal excep
   d | Linux kernel panic: tion
kernel: at least 5.10.26/5.10.27/5.10.28

This problem is triggered by our application, NOT xfstests.
But our application has a heavy write load, just like xfstests generic/476.
Our application uses at most 75% of memory; if that is still not enough,
it writes out all buffered info to the filesystem.

This problem happens on Linux kernel 5.10.x, but not on Linux
kernel 5.4.x. It also reproduces frequently.

If there is any guidance on getting more info for troubleshooting, I will
follow it and test.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/08





* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-08  9:20           ` Wang Yugui
@ 2021-04-08 13:48             ` Dennis Zhou
  2021-04-08 14:28               ` Filipe Manana
  2021-04-09  0:08               ` Wang Yugui
  0 siblings, 2 replies; 27+ messages in thread
From: Dennis Zhou @ 2021-04-08 13:48 UTC
  To: Wang Yugui; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

On Thu, Apr 08, 2021 at 05:20:00PM +0800, Wang Yugui wrote:
> Hi,
> 
> > On Thu, Apr 08, 2021 at 07:28:01AM +0800, Wang Yugui wrote:
> > > Hi,
> > > 
> > > > > > > upper caller:
> > > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > > >     memalloc_nofs_restore(nofs_flag);
> > > > 
> > > > The issue is here. nofs is set which means percpu attempts an atomic
> > > > allocation. If it cannot find anything already allocated it isn't happy.
> > > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > > 
> > > > Percpu should probably try to allocate some pages if possible even if
> > > > nofs is set.
> > > 
> > > Thanks.
> > > 
> > > I will wait for the patch, and then test it.
> > > 
> > 
> > I'm currently a bit busy with some other things. Adding support I don't
> > think will be much work, just a little bit tricky.
> > 
> > I recommend carrying what you have minus the change to reserved percpu
> > memory for now. If I'm the one to write it, I'll cc you.
> > 
> > Thanks,
> > Dennis
> 
> 
> In the recent test, another problem is triggered too with my extended
> percpu buffer size patch. maybe this info is helpful.
> 
> problem:
> OS/VGA console is freezed , and no call stace is outputed.
> Just some info is outputed to IPMI/dell iDRAC
>    2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
>    3 | Linux kernel panic: Fatal excep
>    4 | Linux kernel panic: tion
>    5 | 04/05/2021 | 19:09:14 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
>    6 | Linux kernel panic: Fatal excep
>    7 | Linux kernel panic: tion
>    8 | 04/06/2021 | 13:08:42 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
>    9 | Linux kernel panic: Fatal excep
>    a | Linux kernel panic: tion
>    b | 04/08/2021 | 02:12:46 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
>    c | Linux kernel panic: Fatal excep
>    d | Linux kernel panic: tion

Unfortunately, none of the above is useful to me.

> kernel: at least 5.10.26/5.10.27/5.10.28
> 
> This problem is triggered by our application, NOT xfstests.
> But our applicaiton have some heavy write load just like xfstest/generic/476.
> Our application use at most 75% of memory, if still not enough, 
> it will write out all buffer info to filesystem.

Do you use cgroups at all? If yes, can you describe the workload pattern
a bit?

> This problem is happen in linux kernel 5.10.x, but not happen in linux
> kernel 5.4.x. It have high frequency to repduce too.

Ah. Can you try the following patch?
https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/

Thanks,
Dennis



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-08 13:48             ` Dennis Zhou
@ 2021-04-08 14:28               ` Filipe Manana
  2021-04-08 15:02                 ` Dennis Zhou
  2021-04-09  0:08               ` Wang Yugui
  1 sibling, 1 reply; 27+ messages in thread
From: Filipe Manana @ 2021-04-08 14:28 UTC
  To: Dennis Zhou; +Cc: Wang Yugui, Vlastimil Babka, linux-mm, linux-btrfs

On Thu, Apr 8, 2021 at 2:50 PM Dennis Zhou <dennis@kernel.org> wrote:
>
> On Thu, Apr 08, 2021 at 05:20:00PM +0800, Wang Yugui wrote:
> > Hi,
> >
> > > On Thu, Apr 08, 2021 at 07:28:01AM +0800, Wang Yugui wrote:
> > > > Hi,
> > > >
> > > > > > > > upper caller:
> > > > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > > > >     memalloc_nofs_restore(nofs_flag);
> > > > >
> > > > > The issue is here. nofs is set which means percpu attempts an atomic
> > > > > allocation. If it cannot find anything already allocated it isn't happy.
> > > > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > > >
> > > > > Percpu should probably try to allocate some pages if possible even if
> > > > > nofs is set.
> > > >
> > > > Thanks.
> > > >
> > > > I will wait for the patch, and then test it.
> > > >
> > >
> > > I'm currently a bit busy with some other things. Adding support I don't
> > > think will be much work, just a little bit tricky.
> > >
> > > I recommend carrying what you have minus the change to reserved percpu
> > > memory for now. If I'm the one to write it, I'll cc you.
> > >
> > > Thanks,
> > > Dennis
> >
> >
> > In the recent test, another problem is triggered too with my extended
> > percpu buffer size patch. maybe this info is helpful.
> >
> > problem:
> > OS/VGA console is freezed , and no call stace is outputed.
> > Just some info is outputed to IPMI/dell iDRAC
> >    2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> >    3 | Linux kernel panic: Fatal excep
> >    4 | Linux kernel panic: tion
> >    5 | 04/05/2021 | 19:09:14 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> >    6 | Linux kernel panic: Fatal excep
> >    7 | Linux kernel panic: tion
> >    8 | 04/06/2021 | 13:08:42 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> >    9 | Linux kernel panic: Fatal excep
> >    a | Linux kernel panic: tion
> >    b | 04/08/2021 | 02:12:46 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> >    c | Linux kernel panic: Fatal excep
> >    d | Linux kernel panic: tion
>
> Unfortunately non of the above to me is useful.
>
> > kernel: at least 5.10.26/5.10.27/5.10.28
> >
> > This problem is triggered by our application, NOT xfstests.
> > But our applicaiton have some heavy write load just like xfstest/generic/476.
> > Our application use at most 75% of memory, if still not enough,
> > it will write out all buffer info to filesystem.
>
> Do you use cgroups at all? If yes can you describe the workload pattern
> a bit.
>
> > This problem is happen in linux kernel 5.10.x, but not happen in linux
> > kernel 5.4.x. It have high frequency to repduce too.
>
> Ah. Can you try the following patch?
> https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/

Btw, this has been happening since 5.9.
I never managed to find the time to bisect it, but what changed might be
more obvious to you or anyone else with deep experience of mm/percpu.

It's triggered very frequently with long runs of fsstress on btrfs,
such as with test cases btrfs/078 and generic/476 from fstests.
It produces a trace like the following:

[128063.794597] ------------[ cut here ]------------
[128063.795305] BTRFS: Transaction aborted (error -12)
[128063.795831] WARNING: CPU: 0 PID: 1131545 at fs/btrfs/transaction.c:1683 create_pending_snapshot+0xa2a/0xfd0 [btrfs]
[128063.796235] Modules linked in: dm_snapshot btrfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_log_writes dm_dust dm_flakey dm_mod loop xfs blake2b_generic xor raid6_pq libcrc32c intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass crct10dif_pclmul g>
[128063.798521] CPU: 0 PID: 1131545 Comm: fsstress Tainted: G        W         5.10.0-rc2-btrfs-next-71 #1
[128063.799102] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[128063.800150] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]
[128063.800748] Code: 02 72 30 83 f8 fb 0f 84 38 03 00 00 83 f8 e2 0f 84 2f 03 00 00 89 c6 48 c7 c7 e8 af c5 c0 48 89 85 78 ff ff ff e8 6d 29 6b ca <0f> 0b 48 8b 85 78 ff ff ff 89 c1 ba 93 06 00 00 48 c7 c6 90 a6 c4
[128063.801886] RSP: 0018:ffffaad1444cfd50 EFLAGS: 00010282
[128063.802529] RAX: 0000000000000000 RBX: ffff99c4c0d0b500 RCX: 0000000000000000
[128063.803175] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
[128063.803829] RBP: ffffaad1444cfe20 R08: 0000000000000000 R09: 0000000000000000
[128063.804478] R10: 0000000000000000 R11: 0000000000000000 R12: ffff99c70a1a4c10
[128063.805134] R13: ffff99c59e3d0e00 R14: ffff99c70e935d08 R15: 00000000fffffff4
[128063.805816] FS:  00007f0fdd733240(0000) GS:ffff99c7ebe00000(0000) knlGS:0000000000000000
[128063.806547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[128063.807264] CR2: 00007f0fdd731000 CR3: 00000001ee3b6003 CR4: 00000000003706f0
[128063.807998] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[128063.808707] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[128063.809415] Call Trace:
[128063.810172]  ? create_pending_snapshots+0xaa/0xd0 [btrfs]
[128063.810921]  create_pending_snapshots+0xaa/0xd0 [btrfs]
[128063.811680]  btrfs_commit_transaction+0x2b6/0xb80 [btrfs]
[128063.812429]  ? finish_wait+0x90/0x90
[128063.813176]  ? __ia32_sys_fdatasync+0x20/0x20
[128063.813898]  iterate_supers+0x87/0xf0
[128063.814562]  ksys_sync+0x60/0xb0
[128063.815214]  __do_sys_sync+0xa/0x10
[128063.815879]  do_syscall_64+0x33/0x80
[128063.816539]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[128063.817191] RIP: 0033:0x7f0fdd829bd7
[128063.817907] Code: ff ff ff ff c3 66 0f 1f 44 00 00 48 8b 15 b1 82 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb b8 0f 1f 44 00 00 b8 a2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 82 0c 00 f7 d8 64 89 01 48
[128063.819411] RSP: 002b:00007fff356b8968 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
[128063.820233] RAX: ffffffffffffffda RBX: 000055acc4127560 RCX: 00007f0fdd829bd7
[128063.821028] RDX: 00000000ffffffff RSI: 000000002ecb3555 RDI: 00000000000069f4
[128063.821808] RBP: 000000000000c350 R08: 0000000000000014 R09: 00007fff356b893c
[128063.822632] R10: 00007fff356b8565 R11: 0000000000000206 R12: 00000000000069f4
[128063.823416] R13: 00007fff356b89d0 R14: 00007fff356b8986 R15: 000055acc4115350
[128063.824249] CPU: 5 PID: 1131545 Comm: fsstress Tainted: G        W         5.10.0-rc2-btrfs-next-71 #1
[128063.824931] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[128063.826439] Call Trace:
[128063.827119]  dump_stack+0x8d/0xb5
[128063.827804]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
[128063.828456]  __warn.cold+0x24/0x4b
[128063.829100]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
[128063.829724]  report_bug+0xd1/0x100
[128063.830327]  handle_bug+0x35/0x80
[128063.830910]  exc_invalid_op+0x14/0x70
[128063.831476]  asm_exc_invalid_op+0x12/0x20
[128063.832042] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]

With 5.8 and older, I never got such failures on my test boxes.
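
For readers decoding the trace above: "error -12" is -ENOMEM on Linux, and
the value 0xfffffff4 that appears in R15 in the first register dump is the
same number when read as a signed 32-bit integer. The small check below
shows the arithmetic. That R15 actually holds the failing return value at
that point is an assumption, not something the trace states, and
reg_to_errno() is a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* The decoding below assumes Linux errno numbering. */
static_assert(ENOMEM == 12, "expected Linux value for ENOMEM");

/* Reinterpret the low 32 bits of a register as a signed error code. */
static int32_t reg_to_errno(uint32_t raw)
{
    int32_t v;

    /* memcpy avoids relying on an implementation-defined cast. */
    memcpy(&v, &raw, sizeof(v));
    return v;
}
```

Calling reg_to_errno(0xfffffff4u) gives -12, which matches the "Transaction
aborted (error -12)" line.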

Thanks.


>
> Thanks,
> Dennis



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-08 14:28               ` Filipe Manana
@ 2021-04-08 15:02                 ` Dennis Zhou
  2021-04-09 11:39                   ` Filipe Manana
  0 siblings, 1 reply; 27+ messages in thread
From: Dennis Zhou @ 2021-04-08 15:02 UTC
  To: Filipe Manana; +Cc: Wang Yugui, Vlastimil Babka, linux-mm, linux-btrfs

On Thu, Apr 08, 2021 at 03:28:20PM +0100, Filipe Manana wrote:
> On Thu, Apr 8, 2021 at 2:50 PM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > On Thu, Apr 08, 2021 at 05:20:00PM +0800, Wang Yugui wrote:
> > > Hi,
> > >
> > > > On Thu, Apr 08, 2021 at 07:28:01AM +0800, Wang Yugui wrote:
> > > > > Hi,
> > > > >
> > > > > > > > > upper caller:
> > > > > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > > > > >     memalloc_nofs_restore(nofs_flag);
> > > > > >
> > > > > > The issue is here. nofs is set which means percpu attempts an atomic
> > > > > > allocation. If it cannot find anything already allocated it isn't happy.
> > > > > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > > > >
> > > > > > Percpu should probably try to allocate some pages if possible even if
> > > > > > nofs is set.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > I will wait for the patch, and then test it.
> > > > >
> > > >
> > > > I'm currently a bit busy with some other things. Adding support I don't
> > > > think will be much work, just a little bit tricky.
> > > >
> > > > I recommend carrying what you have minus the change to reserved percpu
> > > > memory for now. If I'm the one to write it, I'll cc you.
> > > >
> > > > Thanks,
> > > > Dennis
> > >
> > >
> > > In the recent test, another problem is triggered too with my extended
> > > percpu buffer size patch. maybe this info is helpful.
> > >
> > > problem:
> > > OS/VGA console is freezed , and no call stace is outputed.
> > > Just some info is outputed to IPMI/dell iDRAC
> > >    2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > >    3 | Linux kernel panic: Fatal excep
> > >    4 | Linux kernel panic: tion
> > >    5 | 04/05/2021 | 19:09:14 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > >    6 | Linux kernel panic: Fatal excep
> > >    7 | Linux kernel panic: tion
> > >    8 | 04/06/2021 | 13:08:42 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > >    9 | Linux kernel panic: Fatal excep
> > >    a | Linux kernel panic: tion
> > >    b | 04/08/2021 | 02:12:46 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > >    c | Linux kernel panic: Fatal excep
> > >    d | Linux kernel panic: tion
> >
> > Unfortunately non of the above to me is useful.
> >
> > > kernel: at least 5.10.26/5.10.27/5.10.28
> > >
> > > This problem is triggered by our application, NOT xfstests.
> > > But our applicaiton have some heavy write load just like xfstest/generic/476.
> > > Our application use at most 75% of memory, if still not enough,
> > > it will write out all buffer info to filesystem.
> >
> > Do you use cgroups at all? If yes can you describe the workload pattern
> > a bit.
> >
> > > This problem is happen in linux kernel 5.10.x, but not happen in linux
> > > kernel 5.4.x. It have high frequency to repduce too.
> >
> > Ah. Can you try the following patch?
> > https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
> 
> Btw, this has been happening since 5.9.
> I never managed to find the time to bisect it, but it might be more
> obvious to you or anyone else with deep experience of mm/percpu of
> what changed.
> 

Ah I'm sorry about that. It wasn't brought to my attention and I don't
frequent the btrfs slack anymore. I can try and pop in more frequently
if that would help with these things.

> It's triggered very frequently with long runs of fsstress on btrfs,
> such as with test cases btrfs/078 and generic/476 from fstests.
> It produces a trace like the following:
> 
> [128063.794597] ------------[ cut here ]------------
> [128063.795305] BTRFS: Transaction aborted (error -12)
> [128063.795831] WARNING: CPU: 0 PID: 1131545 at
> fs/btrfs/transaction.c:1683 create_pending_snapshot+0xa2a/0xfd0
> [btrfs]
> [128063.796235] Modules linked in: dm_snapshot btrfs dm_thin_pool
> dm_persistent_data dm_bio_prison dm_bufio dm_log_writes dm_dust
> dm_flakey dm_mod loop xfs blake2b_generic xor raid6_pq libcrc32c
> intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass
> crct10dif_pclmul g>
> [128063.798521] CPU: 0 PID: 1131545 Comm: fsstress Tainted: G        W
>         5.10.0-rc2-btrfs-next-71 #1
> [128063.799102] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [128063.800150] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> [128063.800748] Code: 02 72 30 83 f8 fb 0f 84 38 03 00 00 83 f8 e2 0f
> 84 2f 03 00 00 89 c6 48 c7 c7 e8 af c5 c0 48 89 85 78 ff ff ff e8 6d
> 29 6b ca <0f> 0b 48 8b 85 78 ff ff ff 89 c1 ba 93 06 00 00 48 c7 c6 90
> a6 c4
> [128063.801886] RSP: 0018:ffffaad1444cfd50 EFLAGS: 00010282
> [128063.802529] RAX: 0000000000000000 RBX: ffff99c4c0d0b500 RCX:
> 0000000000000000
> [128063.803175] RDX: 0000000000000001 RSI: 0000000000000027 RDI:
> 00000000ffffffff
> [128063.803829] RBP: ffffaad1444cfe20 R08: 0000000000000000 R09:
> 0000000000000000
> [128063.804478] R10: 0000000000000000 R11: 0000000000000000 R12:
> ffff99c70a1a4c10
> [128063.805134] R13: ffff99c59e3d0e00 R14: ffff99c70e935d08 R15:
> 00000000fffffff4
> [128063.805816] FS:  00007f0fdd733240(0000) GS:ffff99c7ebe00000(0000)
> knlGS:0000000000000000
> [128063.806547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [128063.807264] CR2: 00007f0fdd731000 CR3: 00000001ee3b6003 CR4:
> 00000000003706f0
> [128063.807998] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [128063.808707] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [128063.809415] Call Trace:
> [128063.810172]  ? create_pending_snapshots+0xaa/0xd0 [btrfs]
> [128063.810921]  create_pending_snapshots+0xaa/0xd0 [btrfs]
> [128063.811680]  btrfs_commit_transaction+0x2b6/0xb80 [btrfs]
> [128063.812429]  ? finish_wait+0x90/0x90
> [128063.813176]  ? __ia32_sys_fdatasync+0x20/0x20
> [128063.813898]  iterate_supers+0x87/0xf0
> [128063.814562]  ksys_sync+0x60/0xb0
> [128063.815214]  __do_sys_sync+0xa/0x10
> [128063.815879]  do_syscall_64+0x33/0x80
> [128063.816539]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [128063.817191] RIP: 0033:0x7f0fdd829bd7
> [128063.817907] Code: ff ff ff ff c3 66 0f 1f 44 00 00 48 8b 15 b1 82
> 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb b8 0f 1f 44 00 00 b8 a2 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 82 0c 00 f7 d8 64 89
> 01 48
> [128063.819411] RSP: 002b:00007fff356b8968 EFLAGS: 00000206 ORIG_RAX:
> 00000000000000a2
> [128063.820233] RAX: ffffffffffffffda RBX: 000055acc4127560 RCX:
> 00007f0fdd829bd7
> [128063.821028] RDX: 00000000ffffffff RSI: 000000002ecb3555 RDI:
> 00000000000069f4
> [128063.821808] RBP: 000000000000c350 R08: 0000000000000014 R09:
> 00007fff356b893c
> [128063.822632] R10: 00007fff356b8565 R11: 0000000000000206 R12:
> 00000000000069f4
> [128063.823416] R13: 00007fff356b89d0 R14: 00007fff356b8986 R15:
> 000055acc4115350
> [128063.824249] CPU: 5 PID: 1131545 Comm: fsstress Tainted: G        W
>         5.10.0-rc2-btrfs-next-71 #1
> [128063.824931] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> [128063.826439] Call Trace:
> [128063.827119]  dump_stack+0x8d/0xb5
> [128063.827804]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> [128063.828456]  __warn.cold+0x24/0x4b
> [128063.829100]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> [128063.829724]  report_bug+0xd1/0x100
> [128063.830327]  handle_bug+0x35/0x80
> [128063.830910]  exc_invalid_op+0x14/0x70
> [128063.831476]  asm_exc_invalid_op+0x12/0x20
> [128063.832042] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> 
> With 5.8 and older, I never got such failures on my test boxes.
> 

Ah. Roman's cgroup percpu changes went in for 5.9. Can you please try this patch:
https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/

It will most likely have to be cc'ed to stable for 5.9+.

Thanks,
Dennis



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-08 13:48             ` Dennis Zhou
  2021-04-08 14:28               ` Filipe Manana
@ 2021-04-09  0:08               ` Wang Yugui
  2021-04-09  2:14                 ` Dennis Zhou
  1 sibling, 1 reply; 27+ messages in thread
From: Wang Yugui @ 2021-04-09  0:08 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

Hi,

> > kernel: at least 5.10.26/5.10.27/5.10.28
> > 
> > This problem is triggered by our application, NOT xfstests.
> > But our applicaiton have some heavy write load just like xfstest/generic/476.
> > Our application use at most 75% of memory, if still not enough, 
> > it will write out all buffer info to filesystem.
> 
> Do you use cgroups at all? If yes can you describe the workload pattern
> a bit.

cgroups are enabled by default, so yes, cgroups are in use.

This is the output of systemd-cgls. "samtools.nipt sort -m 60G" is one
of our applications, but our application is NOT cgroups-aware and does NOT
call any cgroup interface directly.

Control group /:
-.slice
├─user.slice
│ └─user-0.slice
│   ├─session-55.scope
│   │ ├─48747 sshd: root [priv]
│   │ ├─48788 sshd: root@notty
│   │ ├─48795 perl -e @GNU_Parallel=split/_/,"use_IPC::Open3;_use_MIME::Base6...
│   │ ├─48943 samtools.nipt sort -m 60G -T /nodetmp//nfs/biowrk/baseline.wgs2...
│   │ ├─....
│   └─user@0.service
│     └─init.scope
│       ├─48775 /usr/lib/systemd/systemd --user
│       └─48781 (sd-pam)
├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
└─system.slice
  ├─rngd.service
  │ └─1577 /sbin/rngd -f --fill-watermark=0
  ├─irqbalance.service
  │ └─1543 /usr/sbin/irqbalance --foreground
....


> > This problem is happen in linux kernel 5.10.x, but not happen in linux
> > kernel 5.4.x. It have high frequency to repduce too.
> 
> Ah. Can you try the following patch?
> https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
> 
> Thanks,
> Dennis

kernel: 5.10.28 + this patch
result: has not happened yet after 4 test runs.
          Without this patch, the reproduction rate is >50%.

And a question about this,
> > > > upper caller:
> > > >     nofs_flag = memalloc_nofs_save();
> > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > >     memalloc_nofs_restore(nofs_flag);
> 
> The issue is here. nofs is set which means percpu attempts an atomic
> allocation. If it cannot find anything already allocated it isn't happy.
> This was done before memalloc_nofs_{save/restore}() were pervasive.
> 
> Percpu should probably try to allocate some pages if possible even if
> nofs is set.

Should we check and pre-allocate memory inside memalloc_nofs_restore()?
Another memalloc_nofs_save() may come soon.

Something like this in memalloc_nofs_save()?
	if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
 		pcpu_schedule_balance_work();
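
For readers following the quoted explanation ("nofs is set which means percpu
attempts an atomic allocation"), the mechanism can be sketched in userspace C.
This is only an illustrative model, not kernel code: the SK_ flag values, the
sk_task_nofs variable and all sk_ functions are hypothetical stand-ins. The
sketch assumes that a nofs scope masks the FS bit out of the caller's flags,
and that percpu treats any request weaker than the full GFP_KERNEL set as
atomic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the relationships between the flags matter. */
#define SK_GFP_RECLAIM 0x1u
#define SK_GFP_IO      0x2u
#define SK_GFP_FS      0x4u
#define SK_GFP_KERNEL  (SK_GFP_RECLAIM | SK_GFP_IO | SK_GFP_FS)

/* Stand-in for the per-task state toggled by memalloc_nofs_save()/restore(). */
static bool sk_task_nofs;

/* Models the scope API: set the task flag and return the previous value. */
static bool sk_nofs_save(void)
{
    bool old = sk_task_nofs;
    sk_task_nofs = true;
    return old;
}

static void sk_nofs_restore(bool old)
{
    sk_task_nofs = old;
}

/* Inside a nofs scope the FS bit is masked out of whatever the caller passed. */
static unsigned int sk_effective_gfp(unsigned int gfp)
{
    return sk_task_nofs ? (gfp & ~SK_GFP_FS) : gfp;
}

/* Anything weaker than the full GFP_KERNEL set is treated as atomic. */
static bool sk_percpu_is_atomic(unsigned int gfp)
{
    return (sk_effective_gfp(gfp) & SK_GFP_KERNEL) != SK_GFP_KERNEL;
}

/* Usage example: the same GFP_KERNEL request becomes atomic inside the scope. */
static void sk_gfp_demo(void)
{
    assert(!sk_percpu_is_atomic(SK_GFP_KERNEL));
    bool old = sk_nofs_save();
    assert(sk_percpu_is_atomic(SK_GFP_KERNEL));
    sk_nofs_restore(old);
    assert(!sk_percpu_is_atomic(SK_GFP_KERNEL));
}
```

In this model the caller still passes GFP_KERNEL, as btrfs_drew_lock_init()
does, yet the request is classified as atomic purely because of the enclosing
scope.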


By the way, this problem still happens with kernel 5.10.28 + this patch.
Is this a PANIC without an OOPS? Any guidance for troubleshooting, please?
> problem:
> OS/VGA console is freezed , and no call trace is outputed.
> Just some info is outputed to IPMI/dell iDRAC
>    2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
>    3 | Linux kernel panic: Fatal excep
>    4 | Linux kernel panic: tion

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/08




* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-09  0:08               ` Wang Yugui
@ 2021-04-09  2:14                 ` Dennis Zhou
  2021-04-09  4:02                   ` Wang Yugui
  0 siblings, 1 reply; 27+ messages in thread
From: Dennis Zhou @ 2021-04-09  2:14 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

On Fri, Apr 09, 2021 at 08:08:00AM +0800, Wang Yugui wrote:
> Hi,
> 
> > > kernel: at least 5.10.26/5.10.27/5.10.28
> > > 
> > > This problem is triggered by our application, NOT xfstests.
> > > But our applicaiton have some heavy write load just like xfstest/generic/476.
> > > Our application use at most 75% of memory, if still not enough, 
> > > it will write out all buffer info to filesystem.
> > 
> > Do you use cgroups at all? If yes can you describe the workload pattern
> > a bit.
> 
> cgroups is enabled defaultly, so cgroups is used.
> 
> This is the output of systemd-cgls, ''samtools.nipt sort -m 60G" is one
> of our application.  but our application is NOT cgroups-aware, and it NOT
> call any cgroup interface directly.
> 
> Control group /:
> -.slice
> ├─user.slice
> │ └─user-0.slice
> │   ├─session-55.scope
> │   │ ├─48747 sshd: root [priv]
> │   │ ├─48788 sshd: root@notty
> │   │ ├─48795 perl -e @GNU_Parallel=split/_/,"use_IPC::Open3;_use_MIME::Base6...
> │   │ ├─48943 samtools.nipt sort -m 60G -T /nodetmp//nfs/biowrk/baseline.wgs2...
> │   │ ├─....
> │   └─user@0.service
> │     └─init.scope
> │       ├─48775 /usr/lib/systemd/systemd --user
> │       └─48781 (sd-pam)
> ├─init.scope
> │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
> └─system.slice
>   ├─rngd.service
>   │ └─1577 /sbin/rngd -f --fill-watermark=0
>   ├─irqbalance.service
>   │ └─1543 /usr/sbin/irqbalance --foreground
> ....
> 
> 
> > > This problem is happen in linux kernel 5.10.x, but not happen in linux
> > > kernel 5.4.x. It have high frequency to repduce too.
> > 
> > Ah. Can you try the following patch?
> > https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
> > 
> > Thanks,
> > Dennis
> 
> kernel: kernel 5.10.28+this patch
> result: yet not happen after 4 times test.
>           without this path, the reproduce frequency is >50%
> 
> And a question about this,
> > > > > upper caller:
> > > > >     nofs_flag = memalloc_nofs_save();
> > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > >     memalloc_nofs_restore(nofs_flag);
> > 
> > The issue is here. nofs is set which means percpu attempts an atomic
> > allocation. If it cannot find anything already allocated it isn't happy.
> > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > 
> > Percpu should probably try to allocate some pages if possible even if
> > nofs is set.
> 
> Should we check and pre-alloc memory inside memalloc_nofs_restore()?
> another memalloc_nofs_save() may come soon.
> 
> something like this in memalloc_nofs_save()?
> 	if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
>  		pcpu_schedule_balance_work();
> 

Percpu does do this via a workqueue item. The issue is in v5.9 we
introduced 2 types of chunks. However, the free float page number was
for the total. So even if 1 chunk type dropped below, the other chunk
type might have enough pages. I'm queuing this for 5.12 and will send it
out assuming it does fix your problem.
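
The accounting problem described above can be illustrated with a small
userspace model. It is a sketch under stated assumptions, not the actual fix:
the two chunk type names, the watermark value and every cnt_/CNT_ identifier
are hypothetical. Only the idea of comparing an empty-populated-page count
against a low watermark is taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the two chunk types. */
enum cnt_chunk_type { CNT_CHUNK_ROOT, CNT_CHUNK_MEMCG, CNT_NR_CHUNK_TYPES };

/* Assumed low watermark below which a refill should be scheduled. */
#define CNT_EMPTY_POP_PAGES_LOW 2

/* Empty populated pages currently available, tracked per chunk type. */
static int cnt_empty_pop_pages[CNT_NR_CHUNK_TYPES];

/* Shared accounting: a single total covers both chunk types. */
static bool cnt_needs_refill_shared(void)
{
    int total = 0;

    for (int t = 0; t < CNT_NR_CHUNK_TYPES; t++)
        total += cnt_empty_pop_pages[t];
    return total < CNT_EMPTY_POP_PAGES_LOW;
}

/* Per-type accounting: each chunk type is compared on its own. */
static bool cnt_needs_refill_per_type(enum cnt_chunk_type type)
{
    return cnt_empty_pop_pages[type] < CNT_EMPTY_POP_PAGES_LOW;
}

/* An atomic request can only take a page that is already populated. */
static bool cnt_atomic_alloc(enum cnt_chunk_type type)
{
    if (cnt_empty_pop_pages[type] == 0)
        return false;   /* this is what would surface as -ENOMEM */
    cnt_empty_pop_pages[type]--;
    return true;
}

/* Usage example: one type is drained while the other still has pages. */
static void cnt_demo(void)
{
    cnt_empty_pop_pages[CNT_CHUNK_ROOT] = 8;
    cnt_empty_pop_pages[CNT_CHUNK_MEMCG] = 0;

    assert(!cnt_needs_refill_shared());                 /* total hides the shortage */
    assert(cnt_needs_refill_per_type(CNT_CHUNK_MEMCG)); /* per-type check sees it */
    assert(!cnt_atomic_alloc(CNT_CHUNK_MEMCG));         /* drained type fails */
    assert(cnt_atomic_alloc(CNT_CHUNK_ROOT));           /* other type still works */
}
```

With the shared total, no refill is requested even though one chunk type has
nothing left, so an atomic allocation from that type fails despite plenty of
free memory on the machine.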

> 
> by the way, this problem still happen in kernel 5.10.28+this patch.
> Is this is a PANIC without OOPS?  any guide for troubleshooting please.

Sorry I don't follow. Above you said the problem hasn't reproed. But now
you're saying it does? Does your issue still reproduce with the patch
above?

> > problem:
> > OS/VGA console is freezed , and no call trace is outputed.
> > Just some info is outputed to IPMI/dell iDRAC
> >    2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> >    3 | Linux kernel panic: Fatal excep
> >    4 | Linux kernel panic: tion
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/04/08
> 

Thanks,
Dennis



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-09  2:14                 ` Dennis Zhou
@ 2021-04-09  4:02                   ` Wang Yugui
  2021-04-09  7:36                     ` Wang Yugui
  0 siblings, 1 reply; 27+ messages in thread
From: Wang Yugui @ 2021-04-09  4:02 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

Hi,

> On Fri, Apr 09, 2021 at 08:08:00AM +0800, Wang Yugui wrote:
> > Hi,
> > 
> > > > kernel: at least 5.10.26/5.10.27/5.10.28
> > > > 
> > > > This problem is triggered by our application, NOT xfstests.
> > > > But our applicaiton have some heavy write load just like xfstest/generic/476.
> > > > Our application use at most 75% of memory, if still not enough, 
> > > > it will write out all buffer info to filesystem.
> > > 
> > > Do you use cgroups at all? If yes can you describe the workload pattern
> > > a bit.
> > 
> > cgroups is enabled defaultly, so cgroups is used.
> > 
> > This is the output of systemd-cgls, ''samtools.nipt sort -m 60G" is one
> > of our application.  but our application is NOT cgroups-aware, and it NOT
> > call any cgroup interface directly.
> > 
> > Control group /:
> > -.slice
> > ├─user.slice
> > │ └─user-0.slice
> > │   ├─session-55.scope
> > │   │ ├─48747 sshd: root [priv]
> > │   │ ├─48788 sshd: root@notty
> > │   │ ├─48795 perl -e @GNU_Parallel=split/_/,"use_IPC::Open3;_use_MIME::Base6...
> > │   │ ├─48943 samtools.nipt sort -m 60G -T /nodetmp//nfs/biowrk/baseline.wgs2...
> > │   │ ├─....
> > │   └─user@0.service
> > │     └─init.scope
> > │       ├─48775 /usr/lib/systemd/systemd --user
> > │       └─48781 (sd-pam)
> > ├─init.scope
> > │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 18
> > └─system.slice
> >   ├─rngd.service
> >   │ └─1577 /sbin/rngd -f --fill-watermark=0
> >   ├─irqbalance.service
> >   │ └─1543 /usr/sbin/irqbalance --foreground
> > ....
> > 
> > 
> > > > This problem is happen in linux kernel 5.10.x, but not happen in linux
> > > > kernel 5.4.x. It have high frequency to repduce too.
> > > 
> > > Ah. Can you try the following patch?
> > > https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
> > > 
> > > Thanks,
> > > Dennis
> > 
> > kernel: kernel 5.10.28+this patch
> > result: yet not happen after 4 times test.
> >           without this path, the reproduce frequency is >50%
> > 
> > And a question about this,
> > > > > > upper caller:
> > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > >     memalloc_nofs_restore(nofs_flag);
> > > 
> > > The issue is here. nofs is set which means percpu attempts an atomic
> > > allocation. If it cannot find anything already allocated it isn't happy.
> > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > 
> > > Percpu should probably try to allocate some pages if possible even if
> > > nofs is set.
> > 
> > Should we check and pre-alloc memory inside memalloc_nofs_restore()?
> > another memalloc_nofs_save() may come soon.
> > 
> > something like this in memalloc_nofs_save()?
> > 	if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
> >  		pcpu_schedule_balance_work();
> > 
> 
> Percpu does do this via a workqueue item. The issue is in v5.9 we
> introduced 2 types of chunks. However, the free float page number was
> for the total. So even if 1 chunk type dropped below, the other chunk
> type might have enough pages. I'm queuing this for 5.12 and will send it
> out assuming it does fix your problem.
> 
> > 
> > by the way, this problem still happen in kernel 5.10.28+this patch.
> > Is this is a PANIC without OOPS?  any guide for troubleshooting please.
> 
> Sorry I don't follow. Above you said the problem hasn't reproed. But now
> you're saying it does? Does your issue still reproduce with the patch
> above?

I'm sorry.

The first problem (-ENOMEM from percpu_counter_init) has not happened yet with
the patch (https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/).

But another problem (OS frozen without a call trace; a PANIC without an OOPS?
The cause is still unknown) still happens.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/09


> 
> > > problem:
> > > OS/VGA console is freezed , and no call trace is outputed.
> > > Just some info is outputed to IPMI/dell iDRAC
> > >    2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > >    3 | Linux kernel panic: Fatal excep
> > >    4 | Linux kernel panic: tion
> > 
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2021/04/08
> > 
> 
> Thanks,
> Dennis





* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-09  4:02                   ` Wang Yugui
@ 2021-04-09  7:36                     ` Wang Yugui
  2021-04-09  7:48                       ` Wang Yugui
  2021-04-09 13:56                       ` Dennis Zhou
  0 siblings, 2 replies; 27+ messages in thread
From: Wang Yugui @ 2021-04-09  7:36 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Dennis Zhou, Vlastimil Babka, linux-mm, linux-btrfs

Hi,

Some questions about the workqueue for percpu.

> > > 
> > > And a question about this,
> > > > > > > upper caller:
> > > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > > >     memalloc_nofs_restore(nofs_flag);
> > > > 
> > > > The issue is here. nofs is set which means percpu attempts an atomic
> > > > allocation. If it cannot find anything already allocated it isn't happy.
> > > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > > 
> > > > Percpu should probably try to allocate some pages if possible even if
> > > > nofs is set.
> > > 
> > > Should we check and pre-alloc memory inside memalloc_nofs_restore()?
> > > another memalloc_nofs_save() may come soon.
> > > 
> > > something like this in memalloc_nofs_save()?
> > > 	if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
> > >  		pcpu_schedule_balance_work();
> > > 
> > 
> > Percpu does do this via a workqueue item. The issue is in v5.9 we
> > introduced 2 types of chunks. However, the free float page number was
> > for the total. So even if 1 chunk type dropped below, the other chunk
> > type might have enough pages. I'm queuing this for 5.12 and will send it
> > out assuming it does fix your problem.

Maybe the workqueue for percpu is not strong enough (not scheduled?) under high
CPU load?

This is our application pipeline:
	file_pre_process |
	bwa.nipt xx |
	samtools.nipt sort xx |
	file_post_process

file_pre_process/file_post_process are fast, so they are often blocked on
pipe input/output.

'bwa.nipt xx' is a high-CPU load; it uses almost all of the CPU cores.

'samtools.nipt sort xx' is a high-memory load; it keeps the input in memory.
If the memory is not enough, it saves the whole buffer to a temp file,
so it is sometimes a high-I/O load too (writing 60G or more to file).


xfstests (generic/476) is just a high-I/O load; its CPU/memory load is NOT high.
So xfstests (generic/476) may be easier than our application pipeline.

Although there is not yet a simple reproducer for the other problem
that happened here, there is a fairly high chance that something is wrong
in btrfs/mm/fs-buffer.
> but another problem(os freezed without call trace, PANIC without OOPS?,
> the reason is yet unkown) still happen.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/09





* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-09  7:36                     ` Wang Yugui
@ 2021-04-09  7:48                       ` Wang Yugui
  2021-04-09 13:56                       ` Dennis Zhou
  1 sibling, 0 replies; 27+ messages in thread
From: Wang Yugui @ 2021-04-09  7:48 UTC (permalink / raw)
  To: Dennis Zhou, Vlastimil Babka, linux-mm, linux-btrfs

Hi,

Adding top/free output captured while our application pipeline is running.

> Hi,
> 
> some question about workqueue for percpu.
> 
> > > > 
> > > > And a question about this,
> > > > > > > > upper caller:
> > > > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > > > >     memalloc_nofs_restore(nofs_flag);
> > > > > 
> > > > > The issue is here. nofs is set which means percpu attempts an atomic
> > > > > allocation. If it cannot find anything already allocated it isn't happy.
> > > > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > > > 
> > > > > Percpu should probably try to allocate some pages if possible even if
> > > > > nofs is set.
> > > > 
> > > > Should we check and pre-alloc memory inside memalloc_nofs_restore()?
> > > > another memalloc_nofs_save() may come soon.
> > > > 
> > > > something like this in memalloc_nofs_save()?
> > > > 	if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
> > > >  		pcpu_schedule_balance_work();
> > > > 
> > > 
> > > Percpu does do this via a workqueue item. The issue is in v5.9 we
> > > introduced 2 types of chunks. However, the free float page number was
> > > for the total. So even if 1 chunk type dropped below, the other chunk
> > > type might have enough pages. I'm queuing this for 5.12 and will send it
> > > out assuming it does fix your problem.
> 
> workqueue for percpu maybe not strong enough( not scheduled?) when high
> CPU load?
> 
> this is our application pipeline.
> 	file_pre_process |
> 	bwa.nipt xx |
> 	samtools.nipt sort xx |
> 	file_post_process
> 
> file_pre_process/file_post_process is fast, so often are blocked by
> pipe input/output.
> 
> 'bwa.nipt xx' is a high-cpu-load, almost all of CPU cores.
> 
> 'samtools.nipt sort xx' is a high-mem-load, it keep the input in memory.
> if the memory is not enough, it will save all the buffer to temp file,
> so it is sometimes high-IO-load too(write 60G or more to file).
> 
> 
> xfstests(generic/476) is just high-IO-load, cpu/memory load is NOT high.
> so xfstests(generic/476) maybe easy than our application pipeline.


# nproc
40
# top
top - 15:43:06 up 10:16,  1 user,  load average: 41.39, 37.90, 35.98
Tasks: 488 total,   3 running, 485 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.6 us,  0.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.3 hi,  0.0 si,  0.0 st
MiB Mem : 58.3/193384.1 [||||||||||||||||||||||||||||||||||||||||||||||||||||||                                       ]
MiB Swap:  0.0/0.0      [                                                                                             ]


# free -h
              total        used        free      shared  buff/cache   available
Mem:          188Gi        98Gi       5.8Gi        17Mi        84Gi        78Gi
Swap:            0B          0B          0B

Memory reclaim from 'buff/cache' happens easily.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/09


> Although there is yet not a simple reproducer for another problem
> happend here, but there is a little high chance that something is wrong
> in btrfs/mm/fs-buffer.
> > but another problem(os freezed without call trace, PANIC without OOPS?,
> > the reason is yet unkown) still happen.
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/04/09
> 





* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-07 12:35 ` Vlastimil Babka
  2021-04-07 13:09   ` Wang Yugui
@ 2021-04-09  9:52   ` Wang Yugui
  1 sibling, 0 replies; 27+ messages in thread
From: Wang Yugui @ 2021-04-09  9:52 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: linux-mm, linux-btrfs

Hi, Dennis Zhou, Vlastimil Babka, Filipe Manana

The root cause of this problem may be the design of
'memalloc_nofs_restore()/memalloc_nofs_save()'.

When some job such as memory pre-allocation or reclaim is needed, it is
done in a workqueue now.

This is a problem under high load and overload. In that case, we need to
do these jobs in the current task/process, so that the current task/process
is blocked until the necessary job is done.

If we let these jobs be done in a workqueue, and the current task/process is
not blocked, failure is very near, and we cannot work
stably under high load and overload.

Under high load and overload, failure is not expected; we expect some
jobs to block properly.
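
The difference between the two behaviours argued about above can be shown with
a toy userspace model. This is a sketch of the argument only, not of the percpu
allocator: struct wq_pool, the wq_ functions and the batch size are all
hypothetical, and the "worker" is simply a function the test calls by hand to
stand in for a work item that runs whenever it gets CPU time.

```c
#include <assert.h>
#include <stdbool.h>

struct wq_pool {
    int populated;        /* pages ready for immediate use */
    bool refill_pending;  /* stands in for a queued work item */
};

/* Deferred path: never waits; on an empty pool it fails and requests a refill. */
static bool wq_alloc_deferred(struct wq_pool *p)
{
    if (p->populated == 0) {
        p->refill_pending = true;
        return false;
    }
    p->populated--;
    return true;
}

/* Runs the queued refill, as a worker would once it is scheduled. */
static void wq_run_worker(struct wq_pool *p, int batch)
{
    if (p->refill_pending) {
        p->populated += batch;
        p->refill_pending = false;
    }
}

/* Inline path: the caller performs the refill itself and would block meanwhile. */
static bool wq_alloc_inline(struct wq_pool *p, int batch)
{
    if (p->populated == 0)
        p->populated += batch;   /* the caller waits here instead of failing */
    if (p->populated <= 0)
        return false;            /* the refill produced nothing */
    p->populated--;
    return true;
}

/* Usage example: the same empty pool, handled both ways. */
static void wq_demo(void)
{
    struct wq_pool deferred = { 0, false };
    struct wq_pool inline_pool = { 0, false };

    assert(!wq_alloc_deferred(&deferred));    /* fails now, refill only requested */
    assert(deferred.refill_pending);
    wq_run_worker(&deferred, 4);
    assert(wq_alloc_deferred(&deferred));     /* succeeds only after the worker ran */

    assert(wq_alloc_inline(&inline_pool, 4)); /* succeeds on the first call */
    assert(inline_pool.populated == 3);
}
```

In the deferred variant the first caller always sees a failure when the pool is
empty, and how soon later callers succeed depends on when the worker runs; in
the inline variant the caller pays the latency itself.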

> > Percpu does do this via a workqueue item. The issue is in v5.9 we
> > introduced 2 types of chunks. However, the free float page number was
> > for the total. So even if 1 chunk type dropped below, the other chunk
> > type might have enough pages. I'm queuing this for 5.12 and will send it
> > out assuming it does fix your problem.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/09

> +CC btrfs
> 
> On 4/1/21 12:51 PM, Wang Yugui wrote:
> > Hi,
> > 
> > an unexpected -ENOMEM from percpu_counter_init() happened when xfstest 
> > with kernel 5.11.10 and 5.10.27
> 
> Is there a dmesg log showing allocation failure or something?
> 
> > direct caller:
> > int btrfs_drew_lock_init(struct btrfs_drew_lock *lock)
> > {
> >     int ret;
> > 
> >     ret = percpu_counter_init(&lock->writers, 0, GFP_KERNEL);
> >     if (ret)
> >         return ret;
> > 
> >     atomic_set(&lock->readers, 0);
> >     init_waitqueue_head(&lock->pending_readers);
> >     init_waitqueue_head(&lock->pending_writers);
> > 
> >     return 0;
> > }
> > 
> > upper caller:
> >     nofs_flag = memalloc_nofs_save();
> >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> >     memalloc_nofs_restore(nofs_flag);
> >     if (ret == -ENOMEM) printk("ENOMEM btrfs_drew_lock_init\n");
> >     if (ret)
> >         goto fail;
> > 
> > The hardware of this server:
> > CPU:  Xeon(R) CPU E5-2660 v2(10 core)  *2
> > memory:  192G, no swap
> > 
> > Only one xfstests job is running in this server, and about 7% of memory
> > is used.
> > 
> > Any advice please.
> > 
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2021/04/01
> > 
> > 





* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-08 15:02                 ` Dennis Zhou
@ 2021-04-09 11:39                   ` Filipe Manana
  2021-04-09 13:39                     ` Dennis Zhou
  0 siblings, 1 reply; 27+ messages in thread
From: Filipe Manana @ 2021-04-09 11:39 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Wang Yugui, Vlastimil Babka, linux-mm, linux-btrfs

On Thu, Apr 8, 2021 at 4:02 PM Dennis Zhou <dennis@kernel.org> wrote:
>
> On Thu, Apr 08, 2021 at 03:28:20PM +0100, Filipe Manana wrote:
> > On Thu, Apr 8, 2021 at 2:50 PM Dennis Zhou <dennis@kernel.org> wrote:
> > >
> > > On Thu, Apr 08, 2021 at 05:20:00PM +0800, Wang Yugui wrote:
> > > > Hi,
> > > >
> > > > > On Thu, Apr 08, 2021 at 07:28:01AM +0800, Wang Yugui wrote:
> > > > > > Hi,
> > > > > >
> > > > > > > > > > upper caller:
> > > > > > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > > > > > >     memalloc_nofs_restore(nofs_flag);
> > > > > > >
> > > > > > > The issue is here. nofs is set which means percpu attempts an atomic
> > > > > > > allocation. If it cannot find anything already allocated it isn't happy.
> > > > > > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > > > > >
> > > > > > > Percpu should probably try to allocate some pages if possible even if
> > > > > > > nofs is set.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > I will wait for the patch, and then test it.
> > > > > >
> > > > >
> > > > > I'm currently a bit busy with some other things. Adding support I don't
> > > > > think will be much work, just a little bit tricky.
> > > > >
> > > > > I recommend carrying what you have minus the change to reserved percpu
> > > > > memory for now. If I'm the one to write it, I'll cc you.
> > > > >
> > > > > Thanks,
> > > > > Dennis
> > > >
> > > >
> > > > In the recent test, another problem is triggered too with my extended
> > > > percpu buffer size patch. maybe this info is helpful.
> > > >
> > > > problem:
> > > > OS/VGA console is freezed , and no call stace is outputed.
> > > > Just some info is outputed to IPMI/dell iDRAC
> > > >    2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > >    3 | Linux kernel panic: Fatal excep
> > > >    4 | Linux kernel panic: tion
> > > >    5 | 04/05/2021 | 19:09:14 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > >    6 | Linux kernel panic: Fatal excep
> > > >    7 | Linux kernel panic: tion
> > > >    8 | 04/06/2021 | 13:08:42 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > >    9 | Linux kernel panic: Fatal excep
> > > >    a | Linux kernel panic: tion
> > > >    b | 04/08/2021 | 02:12:46 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > >    c | Linux kernel panic: Fatal excep
> > > >    d | Linux kernel panic: tion
> > >
> > > Unfortunately non of the above to me is useful.
> > >
> > > > kernel: at least 5.10.26/5.10.27/5.10.28
> > > >
> > > > This problem is triggered by our application, NOT xfstests.
> > > > But our applicaiton have some heavy write load just like xfstest/generic/476.
> > > > Our application use at most 75% of memory, if still not enough,
> > > > it will write out all buffer info to filesystem.
> > >
> > > Do you use cgroups at all? If yes can you describe the workload pattern
> > > a bit.
> > >
> > > > This problem is happen in linux kernel 5.10.x, but not happen in linux
> > > > kernel 5.4.x. It have high frequency to repduce too.
> > >
> > > Ah. Can you try the following patch?
> > > https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
> >
> > Btw, this has been happening since 5.9.
> > I never managed to find the time to bisect it, but it might be more
> > obvious to you or anyone else with deep experience of mm/percpu of
> > what changed.
> >
>
> Ah I'm sorry about that. It wasn't brought to my attention and I don't
> frequent the btrfs slack anymore. I can try and pop in more frequently
> if that would help with these things.

No worries, I don't think anyone reported it before.

>
> > It's triggered very frequently with long runs of fsstress on btrfs,
> > such as with test cases btrfs/078 and generic/476 from fstests.
> > It produces a trace like the following:
> >
> > [128063.794597] ------------[ cut here ]------------
> > [128063.795305] BTRFS: Transaction aborted (error -12)
> > [128063.795831] WARNING: CPU: 0 PID: 1131545 at
> > fs/btrfs/transaction.c:1683 create_pending_snapshot+0xa2a/0xfd0
> > [btrfs]
> > [128063.796235] Modules linked in: dm_snapshot btrfs dm_thin_pool
> > dm_persistent_data dm_bio_prison dm_bufio dm_log_writes dm_dust
> > dm_flakey dm_mod loop xfs blake2b_generic xor raid6_pq libcrc32c
> > intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass
> > crct10dif_pclmul g>
> > [128063.798521] CPU: 0 PID: 1131545 Comm: fsstress Tainted: G        W
> >         5.10.0-rc2-btrfs-next-71 #1
> > [128063.799102] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > [128063.800150] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > [128063.800748] Code: 02 72 30 83 f8 fb 0f 84 38 03 00 00 83 f8 e2 0f
> > 84 2f 03 00 00 89 c6 48 c7 c7 e8 af c5 c0 48 89 85 78 ff ff ff e8 6d
> > 29 6b ca <0f> 0b 48 8b 85 78 ff ff ff 89 c1 ba 93 06 00 00 48 c7 c6 90
> > a6 c4
> > [128063.801886] RSP: 0018:ffffaad1444cfd50 EFLAGS: 00010282
> > [128063.802529] RAX: 0000000000000000 RBX: ffff99c4c0d0b500 RCX:
> > 0000000000000000
> > [128063.803175] RDX: 0000000000000001 RSI: 0000000000000027 RDI:
> > 00000000ffffffff
> > [128063.803829] RBP: ffffaad1444cfe20 R08: 0000000000000000 R09:
> > 0000000000000000
> > [128063.804478] R10: 0000000000000000 R11: 0000000000000000 R12:
> > ffff99c70a1a4c10
> > [128063.805134] R13: ffff99c59e3d0e00 R14: ffff99c70e935d08 R15:
> > 00000000fffffff4
> > [128063.805816] FS:  00007f0fdd733240(0000) GS:ffff99c7ebe00000(0000)
> > knlGS:0000000000000000
> > [128063.806547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [128063.807264] CR2: 00007f0fdd731000 CR3: 00000001ee3b6003 CR4:
> > 00000000003706f0
> > [128063.807998] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [128063.808707] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [128063.809415] Call Trace:
> > [128063.810172]  ? create_pending_snapshots+0xaa/0xd0 [btrfs]
> > [128063.810921]  create_pending_snapshots+0xaa/0xd0 [btrfs]
> > [128063.811680]  btrfs_commit_transaction+0x2b6/0xb80 [btrfs]
> > [128063.812429]  ? finish_wait+0x90/0x90
> > [128063.813176]  ? __ia32_sys_fdatasync+0x20/0x20
> > [128063.813898]  iterate_supers+0x87/0xf0
> > [128063.814562]  ksys_sync+0x60/0xb0
> > [128063.815214]  __do_sys_sync+0xa/0x10
> > [128063.815879]  do_syscall_64+0x33/0x80
> > [128063.816539]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [128063.817191] RIP: 0033:0x7f0fdd829bd7
> > [128063.817907] Code: ff ff ff ff c3 66 0f 1f 44 00 00 48 8b 15 b1 82
> > 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb b8 0f 1f 44 00 00 b8 a2 00 00
> > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 82 0c 00 f7 d8 64 89
> > 01 48
> > [128063.819411] RSP: 002b:00007fff356b8968 EFLAGS: 00000206 ORIG_RAX:
> > 00000000000000a2
> > [128063.820233] RAX: ffffffffffffffda RBX: 000055acc4127560 RCX:
> > 00007f0fdd829bd7
> > [128063.821028] RDX: 00000000ffffffff RSI: 000000002ecb3555 RDI:
> > 00000000000069f4
> > [128063.821808] RBP: 000000000000c350 R08: 0000000000000014 R09:
> > 00007fff356b893c
> > [128063.822632] R10: 00007fff356b8565 R11: 0000000000000206 R12:
> > 00000000000069f4
> > [128063.823416] R13: 00007fff356b89d0 R14: 00007fff356b8986 R15:
> > 000055acc4115350
> > [128063.824249] CPU: 5 PID: 1131545 Comm: fsstress Tainted: G        W
> >         5.10.0-rc2-btrfs-next-71 #1
> > [128063.824931] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > [128063.826439] Call Trace:
> > [128063.827119]  dump_stack+0x8d/0xb5
> > [128063.827804]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > [128063.828456]  __warn.cold+0x24/0x4b
> > [128063.829100]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > [128063.829724]  report_bug+0xd1/0x100
> > [128063.830327]  handle_bug+0x35/0x80
> > [128063.830910]  exc_invalid_op+0x14/0x70
> > [128063.831476]  asm_exc_invalid_op+0x12/0x20
> > [128063.832042] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> >
> > With 5.8 and older, I never got such failures on my test boxes.
> >
>
> Ah. Roman's cgroup percpu changes went in for 5.9. Can you please patch:
> https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
>
> That most likely will have to be cced to stable for 5.9+.

With that patch applied, +12 hours runs of heavy fsstress and fstests
did not trigger the issue anymore here.

Thanks Dennis.

>
> Thanks,
> Dennis



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-09 11:39                   ` Filipe Manana
@ 2021-04-09 13:39                     ` Dennis Zhou
  2021-04-09 13:42                       ` Filipe Manana
  0 siblings, 1 reply; 27+ messages in thread
From: Dennis Zhou @ 2021-04-09 13:39 UTC (permalink / raw)
  To: Filipe Manana; +Cc: Wang Yugui, Vlastimil Babka, linux-mm, linux-btrfs

On Fri, Apr 09, 2021 at 12:39:38PM +0100, Filipe Manana wrote:
> On Thu, Apr 8, 2021 at 4:02 PM Dennis Zhou <dennis@kernel.org> wrote:
> >
> > On Thu, Apr 08, 2021 at 03:28:20PM +0100, Filipe Manana wrote:
> > > On Thu, Apr 8, 2021 at 2:50 PM Dennis Zhou <dennis@kernel.org> wrote:
> > > >
> > > > On Thu, Apr 08, 2021 at 05:20:00PM +0800, Wang Yugui wrote:
> > > > > Hi,
> > > > >
> > > > > > On Thu, Apr 08, 2021 at 07:28:01AM +0800, Wang Yugui wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > > > > > upper caller:
> > > > > > > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > > > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > > > > > > >     memalloc_nofs_restore(nofs_flag);
> > > > > > > >
> > > > > > > > The issue is here. nofs is set which means percpu attempts an atomic
> > > > > > > > allocation. If it cannot find anything already allocated it isn't happy.
> > > > > > > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > > > > > >
> > > > > > > > Percpu should probably try to allocate some pages if possible even if
> > > > > > > > nofs is set.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > I will wait for the patch, and then test it.
> > > > > > >
> > > > > >
> > > > > > I'm currently a bit busy with some other things. Adding support I don't
> > > > > > think will be much work, just a little bit tricky.
> > > > > >
> > > > > > I recommend carrying what you have minus the change to reserved percpu
> > > > > > memory for now. If I'm the one to write it, I'll cc you.
> > > > > >
> > > > > > Thanks,
> > > > > > Dennis
> > > > >
> > > > >
> > > > > In recent tests, another problem was also triggered with my extended
> > > > > percpu buffer size patch. Maybe this info is helpful.
> > > > >
> > > > > problem:
> > > > > The OS/VGA console is frozen, and no call trace is output.
> > > > > Only some info is output to IPMI/Dell iDRAC:
> > > > >    2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > > >    3 | Linux kernel panic: Fatal excep
> > > > >    4 | Linux kernel panic: tion
> > > > >    5 | 04/05/2021 | 19:09:14 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > > >    6 | Linux kernel panic: Fatal excep
> > > > >    7 | Linux kernel panic: tion
> > > > >    8 | 04/06/2021 | 13:08:42 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > > >    9 | Linux kernel panic: Fatal excep
> > > > >    a | Linux kernel panic: tion
> > > > >    b | 04/08/2021 | 02:12:46 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > > >    c | Linux kernel panic: Fatal excep
> > > > >    d | Linux kernel panic: tion
> > > >
> > > > Unfortunately, none of the above is useful to me.
> > > >
> > > > > kernel: at least 5.10.26/5.10.27/5.10.28
> > > > >
> > > > > This problem is triggered by our application, NOT xfstests.
> > > > > But our application has a heavy write load, just like xfstests generic/476.
> > > > > Our application uses at most 75% of memory; if that is still not enough,
> > > > > it writes out all buffered data to the filesystem.
> > > >
> > > > Do you use cgroups at all? If yes can you describe the workload pattern
> > > > a bit.
> > > >
> > > > > This problem happens in Linux kernel 5.10.x, but does not happen in Linux
> > > > > kernel 5.4.x. It also reproduces with high frequency.
> > > >
> > > > Ah. Can you try the following patch?
> > > > https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
> > >
> > > Btw, this has been happening since 5.9.
> > > I never managed to find the time to bisect it, but it might be more
> > > obvious to you or anyone else with deep experience of mm/percpu of
> > > what changed.
> > >
> >
> > Ah I'm sorry about that. It wasn't brought to my attention and I don't
> > frequent the btrfs slack anymore. I can try and pop in more frequently
> > if that would help with these things.
> 
> No worries, I don't think anyone reported it before.
> 
> >
> > > It's triggered very frequently with long runs of fsstress on btrfs,
> > > such as with test cases btrfs/078 and generic/476 from fstests.
> > > It produces a trace like the following:
> > >
> > > [128063.794597] ------------[ cut here ]------------
> > > [128063.795305] BTRFS: Transaction aborted (error -12)
> > > [128063.795831] WARNING: CPU: 0 PID: 1131545 at
> > > fs/btrfs/transaction.c:1683 create_pending_snapshot+0xa2a/0xfd0
> > > [btrfs]
> > > [128063.796235] Modules linked in: dm_snapshot btrfs dm_thin_pool
> > > dm_persistent_data dm_bio_prison dm_bufio dm_log_writes dm_dust
> > > dm_flakey dm_mod loop xfs blake2b_generic xor raid6_pq libcrc32c
> > > intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass
> > > crct10dif_pclmul g>
> > > [128063.798521] CPU: 0 PID: 1131545 Comm: fsstress Tainted: G        W
> > >         5.10.0-rc2-btrfs-next-71 #1
> > > [128063.799102] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > > [128063.800150] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > > [128063.800748] Code: 02 72 30 83 f8 fb 0f 84 38 03 00 00 83 f8 e2 0f
> > > 84 2f 03 00 00 89 c6 48 c7 c7 e8 af c5 c0 48 89 85 78 ff ff ff e8 6d
> > > 29 6b ca <0f> 0b 48 8b 85 78 ff ff ff 89 c1 ba 93 06 00 00 48 c7 c6 90
> > > a6 c4
> > > [128063.801886] RSP: 0018:ffffaad1444cfd50 EFLAGS: 00010282
> > > [128063.802529] RAX: 0000000000000000 RBX: ffff99c4c0d0b500 RCX:
> > > 0000000000000000
> > > [128063.803175] RDX: 0000000000000001 RSI: 0000000000000027 RDI:
> > > 00000000ffffffff
> > > [128063.803829] RBP: ffffaad1444cfe20 R08: 0000000000000000 R09:
> > > 0000000000000000
> > > [128063.804478] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > ffff99c70a1a4c10
> > > [128063.805134] R13: ffff99c59e3d0e00 R14: ffff99c70e935d08 R15:
> > > 00000000fffffff4
> > > [128063.805816] FS:  00007f0fdd733240(0000) GS:ffff99c7ebe00000(0000)
> > > knlGS:0000000000000000
> > > [128063.806547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [128063.807264] CR2: 00007f0fdd731000 CR3: 00000001ee3b6003 CR4:
> > > 00000000003706f0
> > > [128063.807998] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > 0000000000000000
> > > [128063.808707] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > 0000000000000400
> > > [128063.809415] Call Trace:
> > > [128063.810172]  ? create_pending_snapshots+0xaa/0xd0 [btrfs]
> > > [128063.810921]  create_pending_snapshots+0xaa/0xd0 [btrfs]
> > > [128063.811680]  btrfs_commit_transaction+0x2b6/0xb80 [btrfs]
> > > [128063.812429]  ? finish_wait+0x90/0x90
> > > [128063.813176]  ? __ia32_sys_fdatasync+0x20/0x20
> > > [128063.813898]  iterate_supers+0x87/0xf0
> > > [128063.814562]  ksys_sync+0x60/0xb0
> > > [128063.815214]  __do_sys_sync+0xa/0x10
> > > [128063.815879]  do_syscall_64+0x33/0x80
> > > [128063.816539]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > [128063.817191] RIP: 0033:0x7f0fdd829bd7
> > > [128063.817907] Code: ff ff ff ff c3 66 0f 1f 44 00 00 48 8b 15 b1 82
> > > 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb b8 0f 1f 44 00 00 b8 a2 00 00
> > > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 82 0c 00 f7 d8 64 89
> > > 01 48
> > > [128063.819411] RSP: 002b:00007fff356b8968 EFLAGS: 00000206 ORIG_RAX:
> > > 00000000000000a2
> > > [128063.820233] RAX: ffffffffffffffda RBX: 000055acc4127560 RCX:
> > > 00007f0fdd829bd7
> > > [128063.821028] RDX: 00000000ffffffff RSI: 000000002ecb3555 RDI:
> > > 00000000000069f4
> > > [128063.821808] RBP: 000000000000c350 R08: 0000000000000014 R09:
> > > 00007fff356b893c
> > > [128063.822632] R10: 00007fff356b8565 R11: 0000000000000206 R12:
> > > 00000000000069f4
> > > [128063.823416] R13: 00007fff356b89d0 R14: 00007fff356b8986 R15:
> > > 000055acc4115350
> > > [128063.824249] CPU: 5 PID: 1131545 Comm: fsstress Tainted: G        W
> > >         5.10.0-rc2-btrfs-next-71 #1
> > > [128063.824931] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > > [128063.826439] Call Trace:
> > > [128063.827119]  dump_stack+0x8d/0xb5
> > > [128063.827804]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > > [128063.828456]  __warn.cold+0x24/0x4b
> > > [128063.829100]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > > [128063.829724]  report_bug+0xd1/0x100
> > > [128063.830327]  handle_bug+0x35/0x80
> > > [128063.830910]  exc_invalid_op+0x14/0x70
> > > [128063.831476]  asm_exc_invalid_op+0x12/0x20
> > > [128063.832042] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > >
> > > With 5.8 and older, I never got such failures on my test boxes.
> > >
> >
> > Ah. Roman's cgroup percpu changes went in for 5.9. Can you please patch:
> > https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
> >
> > That most likely will have to be cced to stable for 5.9+.
> 
> With that patch applied, +12 hours runs of heavy fsstress and fstests
> did not trigger the issue anymore here.

Wonderful! Is it okay if I throw your Tested-by: on it? I'm going to
send this up tomorrow and have cced stable on the patch as well.

> 
> Thanks Dennis.
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”

Thanks,
Dennis



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-09 13:39                     ` Dennis Zhou
@ 2021-04-09 13:42                       ` Filipe Manana
  0 siblings, 0 replies; 27+ messages in thread
From: Filipe Manana @ 2021-04-09 13:42 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Wang Yugui, Vlastimil Babka, linux-mm, linux-btrfs

On Fri, Apr 9, 2021 at 2:39 PM Dennis Zhou <dennis@kernel.org> wrote:
>
> On Fri, Apr 09, 2021 at 12:39:38PM +0100, Filipe Manana wrote:
> > On Thu, Apr 8, 2021 at 4:02 PM Dennis Zhou <dennis@kernel.org> wrote:
> > >
> > > On Thu, Apr 08, 2021 at 03:28:20PM +0100, Filipe Manana wrote:
> > > > On Thu, Apr 8, 2021 at 2:50 PM Dennis Zhou <dennis@kernel.org> wrote:
> > > > >
> > > > > On Thu, Apr 08, 2021 at 05:20:00PM +0800, Wang Yugui wrote:
> > > > > > Hi,
> > > > > >
> > > > > > > On Thu, Apr 08, 2021 at 07:28:01AM +0800, Wang Yugui wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > > > > > upper caller:
> > > > > > > > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > > > > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > > > > > > > >     memalloc_nofs_restore(nofs_flag);
> > > > > > > > >
> > > > > > > > > The issue is here. nofs is set which means percpu attempts an atomic
> > > > > > > > > allocation. If it cannot find anything already allocated it isn't happy.
> > > > > > > > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > > > > > > >
> > > > > > > > > Percpu should probably try to allocate some pages if possible even if
> > > > > > > > > nofs is set.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > I will wait for the patch, and then test it.
> > > > > > > >
> > > > > > >
> > > > > > > I'm currently a bit busy with some other things. Adding support I don't
> > > > > > > think will be much work, just a little bit tricky.
> > > > > > >
> > > > > > > I recommend carrying what you have minus the change to reserved percpu
> > > > > > > memory for now. If I'm the one to write it, I'll cc you.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Dennis
> > > > > >
> > > > > >
> > > > > > In recent tests, another problem was also triggered with my extended
> > > > > > percpu buffer size patch. Maybe this info is helpful.
> > > > > >
> > > > > > problem:
> > > > > > The OS/VGA console is frozen, and no call trace is output.
> > > > > > Only some info is output to IPMI/Dell iDRAC:
> > > > > >    2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > > > >    3 | Linux kernel panic: Fatal excep
> > > > > >    4 | Linux kernel panic: tion
> > > > > >    5 | 04/05/2021 | 19:09:14 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > > > >    6 | Linux kernel panic: Fatal excep
> > > > > >    7 | Linux kernel panic: tion
> > > > > >    8 | 04/06/2021 | 13:08:42 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > > > >    9 | Linux kernel panic: Fatal excep
> > > > > >    a | Linux kernel panic: tion
> > > > > >    b | 04/08/2021 | 02:12:46 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
> > > > > >    c | Linux kernel panic: Fatal excep
> > > > > >    d | Linux kernel panic: tion
> > > > >
> > > > > Unfortunately, none of the above is useful to me.
> > > > >
> > > > > > kernel: at least 5.10.26/5.10.27/5.10.28
> > > > > >
> > > > > > This problem is triggered by our application, NOT xfstests.
> > > > > > But our application has a heavy write load, just like xfstests generic/476.
> > > > > > Our application uses at most 75% of memory; if that is still not enough,
> > > > > > it writes out all buffered data to the filesystem.
> > > > >
> > > > > Do you use cgroups at all? If yes can you describe the workload pattern
> > > > > a bit.
> > > > >
> > > > > > This problem happens in Linux kernel 5.10.x, but does not happen in Linux
> > > > > > kernel 5.4.x. It also reproduces with high frequency.
> > > > >
> > > > > Ah. Can you try the following patch?
> > > > > https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
> > > >
> > > > Btw, this has been happening since 5.9.
> > > > I never managed to find the time to bisect it, but it might be more
> > > > obvious to you or anyone else with deep experience of mm/percpu of
> > > > what changed.
> > > >
> > >
> > > Ah I'm sorry about that. It wasn't brought to my attention and I don't
> > > frequent the btrfs slack anymore. I can try and pop in more frequently
> > > if that would help with these things.
> >
> > No worries, I don't think anyone reported it before.
> >
> > >
> > > > It's triggered very frequently with long runs of fsstress on btrfs,
> > > > such as with test cases btrfs/078 and generic/476 from fstests.
> > > > It produces a trace like the following:
> > > >
> > > > [128063.794597] ------------[ cut here ]------------
> > > > [128063.795305] BTRFS: Transaction aborted (error -12)
> > > > [128063.795831] WARNING: CPU: 0 PID: 1131545 at
> > > > fs/btrfs/transaction.c:1683 create_pending_snapshot+0xa2a/0xfd0
> > > > [btrfs]
> > > > [128063.796235] Modules linked in: dm_snapshot btrfs dm_thin_pool
> > > > dm_persistent_data dm_bio_prison dm_bufio dm_log_writes dm_dust
> > > > dm_flakey dm_mod loop xfs blake2b_generic xor raid6_pq libcrc32c
> > > > intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass
> > > > crct10dif_pclmul g>
> > > > [128063.798521] CPU: 0 PID: 1131545 Comm: fsstress Tainted: G        W
> > > >         5.10.0-rc2-btrfs-next-71 #1
> > > > [128063.799102] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > > BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > > > [128063.800150] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > > > [128063.800748] Code: 02 72 30 83 f8 fb 0f 84 38 03 00 00 83 f8 e2 0f
> > > > 84 2f 03 00 00 89 c6 48 c7 c7 e8 af c5 c0 48 89 85 78 ff ff ff e8 6d
> > > > 29 6b ca <0f> 0b 48 8b 85 78 ff ff ff 89 c1 ba 93 06 00 00 48 c7 c6 90
> > > > a6 c4
> > > > [128063.801886] RSP: 0018:ffffaad1444cfd50 EFLAGS: 00010282
> > > > [128063.802529] RAX: 0000000000000000 RBX: ffff99c4c0d0b500 RCX:
> > > > 0000000000000000
> > > > [128063.803175] RDX: 0000000000000001 RSI: 0000000000000027 RDI:
> > > > 00000000ffffffff
> > > > [128063.803829] RBP: ffffaad1444cfe20 R08: 0000000000000000 R09:
> > > > 0000000000000000
> > > > [128063.804478] R10: 0000000000000000 R11: 0000000000000000 R12:
> > > > ffff99c70a1a4c10
> > > > [128063.805134] R13: ffff99c59e3d0e00 R14: ffff99c70e935d08 R15:
> > > > 00000000fffffff4
> > > > [128063.805816] FS:  00007f0fdd733240(0000) GS:ffff99c7ebe00000(0000)
> > > > knlGS:0000000000000000
> > > > [128063.806547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [128063.807264] CR2: 00007f0fdd731000 CR3: 00000001ee3b6003 CR4:
> > > > 00000000003706f0
> > > > [128063.807998] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > 0000000000000000
> > > > [128063.808707] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > 0000000000000400
> > > > [128063.809415] Call Trace:
> > > > [128063.810172]  ? create_pending_snapshots+0xaa/0xd0 [btrfs]
> > > > [128063.810921]  create_pending_snapshots+0xaa/0xd0 [btrfs]
> > > > [128063.811680]  btrfs_commit_transaction+0x2b6/0xb80 [btrfs]
> > > > [128063.812429]  ? finish_wait+0x90/0x90
> > > > [128063.813176]  ? __ia32_sys_fdatasync+0x20/0x20
> > > > [128063.813898]  iterate_supers+0x87/0xf0
> > > > [128063.814562]  ksys_sync+0x60/0xb0
> > > > [128063.815214]  __do_sys_sync+0xa/0x10
> > > > [128063.815879]  do_syscall_64+0x33/0x80
> > > > [128063.816539]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > [128063.817191] RIP: 0033:0x7f0fdd829bd7
> > > > [128063.817907] Code: ff ff ff ff c3 66 0f 1f 44 00 00 48 8b 15 b1 82
> > > > 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb b8 0f 1f 44 00 00 b8 a2 00 00
> > > > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 82 0c 00 f7 d8 64 89
> > > > 01 48
> > > > [128063.819411] RSP: 002b:00007fff356b8968 EFLAGS: 00000206 ORIG_RAX:
> > > > 00000000000000a2
> > > > [128063.820233] RAX: ffffffffffffffda RBX: 000055acc4127560 RCX:
> > > > 00007f0fdd829bd7
> > > > [128063.821028] RDX: 00000000ffffffff RSI: 000000002ecb3555 RDI:
> > > > 00000000000069f4
> > > > [128063.821808] RBP: 000000000000c350 R08: 0000000000000014 R09:
> > > > 00007fff356b893c
> > > > [128063.822632] R10: 00007fff356b8565 R11: 0000000000000206 R12:
> > > > 00000000000069f4
> > > > [128063.823416] R13: 00007fff356b89d0 R14: 00007fff356b8986 R15:
> > > > 000055acc4115350
> > > > [128063.824249] CPU: 5 PID: 1131545 Comm: fsstress Tainted: G        W
> > > >         5.10.0-rc2-btrfs-next-71 #1
> > > > [128063.824931] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > > > BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> > > > [128063.826439] Call Trace:
> > > > [128063.827119]  dump_stack+0x8d/0xb5
> > > > [128063.827804]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > > > [128063.828456]  __warn.cold+0x24/0x4b
> > > > [128063.829100]  ? create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > > > [128063.829724]  report_bug+0xd1/0x100
> > > > [128063.830327]  handle_bug+0x35/0x80
> > > > [128063.830910]  exc_invalid_op+0x14/0x70
> > > > [128063.831476]  asm_exc_invalid_op+0x12/0x20
> > > > [128063.832042] RIP: 0010:create_pending_snapshot+0xa2a/0xfd0 [btrfs]
> > > >
> > > > With 5.8 and older, I never got such failures on my test boxes.
> > > >
> > >
> > > Ah. Roman's cgroup percpu changes went in for 5.9. Can you please patch:
> > > https://lore.kernel.org/lkml/20210408035736.883861-4-guro@fb.com/
> > >
> > > That most likely will have to be cced to stable for 5.9+.
> >
> > With that patch applied, +12 hours runs of heavy fsstress and fstests
> > did not trigger the issue anymore here.
>
> Wonderful! Is it okay if I throw your Tested-by: on it? I'm going to
> send this up tomorrow and have cced stable on the patch as well.

Sure, please add with my suse address:

Tested-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>
> >
> > Thanks Dennis.
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”
>
> Thanks,
> Dennis



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-09  7:36                     ` Wang Yugui
  2021-04-09  7:48                       ` Wang Yugui
@ 2021-04-09 13:56                       ` Dennis Zhou
  2021-04-10 15:29                         ` Wang Yugui
  1 sibling, 1 reply; 27+ messages in thread
From: Dennis Zhou @ 2021-04-09 13:56 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

On Fri, Apr 09, 2021 at 03:36:39PM +0800, Wang Yugui wrote:
> Hi,
> 
> Some questions about the workqueue for percpu.
> 
> > > > 
> > > > And a question about this,
> > > > > > > > upper caller:
> > > > > > > >     nofs_flag = memalloc_nofs_save();
> > > > > > > >     ret = btrfs_drew_lock_init(&root->snapshot_lock);
> > > > > > > >     memalloc_nofs_restore(nofs_flag);
> > > > > 
> > > > > The issue is here. nofs is set which means percpu attempts an atomic
> > > > > allocation. If it cannot find anything already allocated it isn't happy.
> > > > > This was done before memalloc_nofs_{save/restore}() were pervasive.
> > > > > 
> > > > > Percpu should probably try to allocate some pages if possible even if
> > > > > nofs is set.
> > > > 
> > > > Should we check and pre-alloc memory inside memalloc_nofs_restore()?
> > > > another memalloc_nofs_save() may come soon.
> > > > 
> > > > something like this in memalloc_nofs_save()?
> > > > 	if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
> > > >  		pcpu_schedule_balance_work();
> > > > 
> > > 
> > > Percpu does do this via a workqueue item. The issue is in v5.9 we
> > > introduced 2 types of chunks. However, the free float page number was
> > > for the total. So even if 1 chunk type dropped below, the other chunk
> > > type might have enough pages. I'm queuing this for 5.12 and will send it
> > > out assuming it does fix your problem.
> 
> Maybe the workqueue for percpu is not strong enough (not scheduled?) under
> high CPU load?
> 

Percpu is not really cheap memory to allocate because it has an
amplification factor of NR_CPUS. As a result, percpu on the critical
path is really not something that is expected to be high throughput.
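
To put a rough number on that amplification factor, here is a minimal
sketch. It assumes the per-CPU part of a percpu counter is one 32-bit
value per possible CPU. The helper names are hypothetical, and the real
allocator also applies its own minimum allocation size and alignment,
which this ignores.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: total bytes backing one
 * percpu object when every possible CPU gets its own copy.
 */
static size_t percpu_footprint(size_t obj_size, unsigned int nr_possible_cpus)
{
	return obj_size * (size_t)nr_possible_cpus;
}

/* Per-CPU part of one counter, assumed here to be a single 32-bit value. */
static size_t counter_percpu_footprint(unsigned int nr_possible_cpus)
{
	return percpu_footprint(sizeof(int32_t), nr_possible_cpus);
}
```

So a 4-byte object costs 4 bytes times the CPU count on every
allocation, which is why a burst of allocations on a hot path can eat
through the already populated pages quickly.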

Ideally, things like btrfs snapshots should preallocate a number of these
and not try to do atomic allocations. Those could still fail in theory:
even once percpu goes to the page allocator in the future, it may not
get enough pages without going into reclaim.
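
The preallocation idea can be sketched in plain userspace C as below.
This is only a model under stated assumptions, not btrfs or percpu code:
the pool type, its capacity, and all function names are hypothetical,
and calloc() stands in for an allocation that is allowed to block.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAPACITY 8	/* hypothetical size, chosen for illustration */

struct prealloc_pool {
	void *slots[POOL_CAPACITY];
	size_t count;
};

/*
 * Fill the pool while blocking allocations are still allowed.
 * Returns 0 on success, -1 if an allocation failed.
 */
static int pool_fill(struct prealloc_pool *pool, size_t obj_size)
{
	while (pool->count < POOL_CAPACITY) {
		void *obj = calloc(1, obj_size);

		if (!obj)
			return -1;
		pool->slots[pool->count++] = obj;
	}
	return 0;
}

/*
 * Take an object in the restricted context. This never allocates, so
 * it can only fail when the pool is empty (returns NULL).
 */
static void *pool_take(struct prealloc_pool *pool)
{
	if (pool->count == 0)
		return NULL;
	return pool->slots[--pool->count];
}

/* Release everything still held by the pool. */
static void pool_destroy(struct prealloc_pool *pool)
{
	while (pool->count > 0)
		free(pool->slots[--pool->count]);
}
```

The point of the split is that the failure mode moves out of the
restricted section: pool_fill() runs where waiting is acceptable, and
pool_take() is the only call made where it is not.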

The workqueue approach has been good enough so far. Technically there is
a higher priority workqueue that this work could be scheduled on, but
save for this miss on my part, the system workqueue has worked out fine.
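
The miss referred to here (two chunk types since v5.9, but a free page
count kept for the total) can be modelled roughly as follows. This is a
sketch, not the code in mm/percpu.c: the enum, the function names, and
the watermark value are assumptions made for illustration.

```c
#include <stdbool.h>

/* Low watermark of empty populated pages; value assumed for illustration. */
#define EMPTY_POP_PAGES_LOW 2

/* Hypothetical names for the two chunk types. */
enum chunk_type { CHUNK_ROOT, CHUNK_MEMCG, NR_CHUNK_TYPES };

/* Check against the total across all chunk types. */
static bool needs_refill_total(const int empty_pages[NR_CHUNK_TYPES])
{
	int total = 0;

	for (int t = 0; t < NR_CHUNK_TYPES; t++)
		total += empty_pages[t];
	return total < EMPTY_POP_PAGES_LOW;
}

/* Check against the count of the chunk type being allocated from. */
static bool needs_refill_per_type(const int empty_pages[NR_CHUNK_TYPES],
				  enum chunk_type type)
{
	return empty_pages[type] < EMPTY_POP_PAGES_LOW;
}
```

With counts of 0 for one type and 4 for the other, the total check sees
4 pages and schedules no refill, while an atomic allocation from the
empty type has nothing to use. The per-type check catches that case.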

In the future, as I mentioned above, it would be good to support actually
getting pages, but it's work that needs to be tackled with a bit of
care. I might target the work for v5.14.

> this is our application pipeline.
> 	file_pre_process |
> 	bwa.nipt xx |
> 	samtools.nipt sort xx |
> 	file_post_process
> 
> file_pre_process/file_post_process are fast, so they are often blocked on
> pipe input/output.
> 
> 'bwa.nipt xx' is a high CPU load; it uses almost all of the CPU cores.
> 
> 'samtools.nipt sort xx' is a high memory load; it keeps the input in memory.
> If memory is not enough, it saves the whole buffer to a temp file,
> so it is sometimes a high I/O load too (writing 60G or more to a file).
> 
> 
> xfstests (generic/476) is just a high I/O load; its CPU/memory load is NOT high.
> So xfstests (generic/476) may be easier than our application pipeline.
> 
> Although there is not yet a simple reproducer for the other problem
> that happened here, there is a fairly high chance that something is wrong
> in btrfs/mm/fs-buffer.
> > but another problem (OS frozen without a call trace, PANIC without OOPS?,
> > the reason is yet unknown) still happens.

I do not have an answer for this. I would recommend looking into kdump.

> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/04/09
> 
> 
> 

Thanks,
Dennis



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-09 13:56                       ` Dennis Zhou
@ 2021-04-10 15:29                         ` Wang Yugui
  2021-04-10 15:52                           ` Dennis Zhou
  0 siblings, 1 reply; 27+ messages in thread
From: Wang Yugui @ 2021-04-10 15:29 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

Hi, Dennis Zhou 

Thanks for your nice answer,
but I still have a few questions.

> Percpu is not really cheap memory to allocate because it has an
> amplification factor of NR_CPUS. As a result, percpu on the critical
> path is really not something that is expected to be high throughput.

> Ideally, things like btrfs snapshots should preallocate a number of these
> and not try to do atomic allocations. Those could still fail in theory:
> even once percpu goes to the page allocator in the future, it may not
> get enough pages without going into reclaim.

Pre-allocation inside a module, such as with mempool_t, is only used in a
few places in linux/fs. So do most people prefer system-wide
pre-allocation because it is easier to use?

Can we add more opportunities to manage the system-wide pre-allocation,
like this?

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index dc1f4dc..eb3f592 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -226,6 +226,11 @@ static inline void memalloc_noio_restore(unsigned int flags)
 static inline unsigned int memalloc_nofs_save(void)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+
+	// just like slab_pre_alloc_hook
+	fs_reclaim_acquire(current->flags & gfp_allowed_mask);
+	fs_reclaim_release(current->flags & gfp_allowed_mask);
+
 	current->flags |= PF_MEMALLOC_NOFS;
 	return flags;
 }


> The workqueue approach has been good enough so far. Technically there is
> a higher priority workqueue that this work could be scheduled on, but
> save for this miss on my part, the system workqueue has worked out fine.

> In the future, as I mentioned above, it would be good to support actually
> getting pages, but it's work that needs to be tackled with a bit of
> care. I might target the work for v5.14.
> 
> > this is our application pipeline.
> > 	file_pre_process |
> > 	bwa.nipt xx |
> > 	samtools.nipt sort xx |
> > 	file_post_process
> > 
> > file_pre_process/file_post_process are fast, so they are often blocked on
> > pipe input/output.
> > 
> > 'bwa.nipt xx' is a high CPU load; it uses almost all of the CPU cores.
> > 
> > 'samtools.nipt sort xx' is a high memory load; it keeps the input in memory.
> > If memory is not enough, it saves the whole buffer to a temp file,
> > so it is sometimes a high I/O load too (writing 60G or more to a file).
> > 
> > 
> > xfstests (generic/476) is just a high I/O load; its CPU/memory load is NOT high.
> > So xfstests (generic/476) may be easier than our application pipeline.
> > 
> > Although there is not yet a simple reproducer for the other problem
> > that happened here, there is a fairly high chance that something is wrong
> > in btrfs/mm/fs-buffer.
> > > but another problem (OS frozen without a call trace, PANIC without OOPS?,
> > > the reason is yet unknown) still happens.
> 
> I do not have an answer for this. I would recommend looking into kdump.

Perhaps the percpu ENOMEM problem has been blocking heavy-load tests for
quite a long time? I still guess that this system freeze is an mm/btrfs
problem. The OOM killer does not trigger, and no OOPS is printed either.

I am trying to reproduce it with a simple script. I noticed that the value
of 'free' is a little low, although 'available' is large.

# free -h
              total        used        free      shared  buff/cache   available
Mem:          188Gi       1.4Gi       5.5Gi        17Mi       181Gi       175Gi
Swap:            0B          0B          0B

vm.min_free_kbytes is automatically configured to 4Gi (4194304).

# write files with the size >= memory size *3
#for((i=0;i<10;++i));do dd if=/dev/zero bs=1M count=64K of=/nodetmp/${i}.txt; free -h; done

Any advice or patch to make the value of 'free' a little bigger?


Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/10





* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-10 15:29                         ` Wang Yugui
@ 2021-04-10 15:52                           ` Dennis Zhou
  2021-04-10 16:08                             ` Wang Yugui
  0 siblings, 1 reply; 27+ messages in thread
From: Dennis Zhou @ 2021-04-10 15:52 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

On Sat, Apr 10, 2021 at 11:29:17PM +0800, Wang Yugui wrote:
> Hi, Dennis Zhou 
> 
> Thanks for your ncie answer.
> but still a few questions.
> 
> > Percpu is not really cheap memory to allocate because it has a
> > amplification factor of NR_CPUS. As a result, percpu on the critical
> > path is really not something that is expected to be high throughput.
> 
> > Ideally things like btrfs snapshots should preallocate a number of these
> > and not try to do atomic allocations because that in theory could fail
> > because even after we go to the page allocator in the future we can't
> > get enough pages due to needing to go into reclaim.
> 
> pre-allocate in module such as mempool_t is just used in a few place in
> linux/fs.  so most people like system wide pre-allocate, because it is
> more easy to use?
> 
> can we add more chance to management the system wide pre-alloc
> just like this?
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index dc1f4dc..eb3f592 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -226,6 +226,11 @@ static inline void memalloc_noio_restore(unsigned int flags)
>  static inline unsigned int memalloc_nofs_save(void)
>  {
>  	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> +
> +	// just like slab_pre_alloc_hook
> +	fs_reclaim_acquire(current->flags & gfp_allowed_mask);
> +	fs_reclaim_release(current->flags & gfp_allowed_mask);
> +
>  	current->flags |= PF_MEMALLOC_NOFS;
>  	return flags;
>  }
> 
> 
> > The workqueue approach has been good enough so far. Technically there is
> > a higher priority workqueue that this work could be scheduled on, but
> > save for this miss on my part, the system workqueue has worked out fine.
> 
> > In the future as I mentioned above. It would be good to support actually
> > getting pages, but it's work that needs to be tackled with a bit of
> > care. I might target the work for v5.14.
> > 
> > > this is our application pipeline.
> > > 	file_pre_process |
> > > 	bwa.nipt xx |
> > > 	samtools.nipt sort xx |
> > > 	file_post_process
> > > 
> > > file_pre_process/file_post_process is fast, so often are blocked by
> > > pipe input/output.
> > > 
> > > 'bwa.nipt xx' is a high-cpu-load, almost all of CPU cores.
> > > 
> > > 'samtools.nipt sort xx' is a high-mem-load, it keep the input in memory.
> > > if the memory is not enough, it will save all the buffer to temp file,
> > > so it is sometimes high-IO-load too(write 60G or more to file).
> > > 
> > > 
> > > xfstests(generic/476) is just high-IO-load, cpu/memory load is NOT high.
> > > so xfstests(generic/476) maybe easy than our application pipeline.
> > > 
> > > Although there is yet not a simple reproducer for another problem
> > > happend here, but there is a little high chance that something is wrong
> > > in btrfs/mm/fs-buffer.
> > > > but another problem(os freezed without call trace, PANIC without OOPS?,
> > > > the reason is yet unkown) still happen.
> > 
> > I do not have an answer for this. I would recommend looking into kdump.
> 
> percpu ENOMEM problem blocked many heavy load test a little long time?
> I still guess this problem of system freeze is a mm/btrfs problem.
> OOM not work, OOPS not work too.
> 

I don't follow. Is this still a problem after the patch?

> I try to reproduce it with some simple script. I noticed the value of
> 'free' is a little low, although 'available' is big.
> 
> # free -h
>               total        used        free      shared  buff/cache   available
> Mem:          188Gi       1.4Gi       5.5Gi        17Mi       181Gi       175Gi
> Swap:            0B          0B          0B
> 
> vm.min_free_kbytes is auto configed to 4Gi(4194304)
> 
> # write files with the size >= memory size *3
> #for((i=0;i<10;++i));do dd if=/dev/zero bs=1M count=64K of=/nodetmp/${i}.txt; free -h; done
> 
> any advice or patch to let the value of 'free' a little bigger?
> 
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/04/10
> 
> 
> 



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-10 15:52                           ` Dennis Zhou
@ 2021-04-10 16:08                             ` Wang Yugui
  2021-04-11 15:20                               ` Wang Yugui
  0 siblings, 1 reply; 27+ messages in thread
From: Wang Yugui @ 2021-04-10 16:08 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

Hi,

> On Sat, Apr 10, 2021 at 11:29:17PM +0800, Wang Yugui wrote:
> > Hi, Dennis Zhou 
> > 
> > Thanks for your ncie answer.
> > but still a few questions.
> > 
> > > Percpu is not really cheap memory to allocate because it has a
> > > amplification factor of NR_CPUS. As a result, percpu on the critical
> > > path is really not something that is expected to be high throughput.
> > 
> > > Ideally things like btrfs snapshots should preallocate a number of these
> > > and not try to do atomic allocations because that in theory could fail
> > > because even after we go to the page allocator in the future we can't
> > > get enough pages due to needing to go into reclaim.
> > 
> > pre-allocate in module such as mempool_t is just used in a few place in
> > linux/fs.  so most people like system wide pre-allocate, because it is
> > more easy to use?
> > 
> > can we add more chance to management the system wide pre-alloc
> > just like this?
> > 
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index dc1f4dc..eb3f592 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -226,6 +226,11 @@ static inline void memalloc_noio_restore(unsigned int flags)
> >  static inline unsigned int memalloc_nofs_save(void)
> >  {
> >  	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> > +
> > +	// just like slab_pre_alloc_hook
> > +	fs_reclaim_acquire(current->flags & gfp_allowed_mask);
> > +	fs_reclaim_release(current->flags & gfp_allowed_mask);
> > +
> >  	current->flags |= PF_MEMALLOC_NOFS;
> >  	return flags;
> >  }
> > 
> > 
> > > The workqueue approach has been good enough so far. Technically there is
> > > a higher priority workqueue that this work could be scheduled on, but
> > > save for this miss on my part, the system workqueue has worked out fine.
> > 
> > > In the future as I mentioned above. It would be good to support actually
> > > getting pages, but it's work that needs to be tackled with a bit of
> > > care. I might target the work for v5.14.
> > > 
> > > > this is our application pipeline.
> > > > 	file_pre_process |
> > > > 	bwa.nipt xx |
> > > > 	samtools.nipt sort xx |
> > > > 	file_post_process
> > > > 
> > > > file_pre_process/file_post_process is fast, so often are blocked by
> > > > pipe input/output.
> > > > 
> > > > 'bwa.nipt xx' is a high-cpu-load, almost all of CPU cores.
> > > > 
> > > > 'samtools.nipt sort xx' is a high-mem-load, it keep the input in memory.
> > > > if the memory is not enough, it will save all the buffer to temp file,
> > > > so it is sometimes high-IO-load too(write 60G or more to file).
> > > > 
> > > > 
> > > > xfstests(generic/476) is just high-IO-load, cpu/memory load is NOT high.
> > > > so xfstests(generic/476) maybe easy than our application pipeline.
> > > > 
> > > > Although there is yet not a simple reproducer for another problem
> > > > happend here, but there is a little high chance that something is wrong
> > > > in btrfs/mm/fs-buffer.
> > > > > but another problem(os freezed without call trace, PANIC without OOPS?,
> > > > > the reason is yet unkown) still happen.
> > > 
> > > I do not have an answer for this. I would recommend looking into kdump.
> > 
> > percpu ENOMEM problem blocked many heavy load test a little long time?
> > I still guess this problem of system freeze is a mm/btrfs problem.
> > OOM not work, OOPS not work too.
> > 
> 
> I don't follow. Is this still a problem after the patch?


After applying the patch for the percpu ENOMEM problem, our user-space
application triggers the system freeze at a high frequency (>75%).

The system freeze may not be caused by the percpu ENOMEM patch.

The percpu ENOMEM problem may simply be easier to trigger than the system
freeze.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/10








* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-10 16:08                             ` Wang Yugui
@ 2021-04-11 15:20                               ` Wang Yugui
  2021-04-12  4:03                                 ` Dennis Zhou
  0 siblings, 1 reply; 27+ messages in thread
From: Wang Yugui @ 2021-04-11 15:20 UTC (permalink / raw)
  To: Dennis Zhou, Vlastimil Babka, linux-mm, linux-btrfs

Hi, Dennis Zhou

> Hi,
> 
> > On Sat, Apr 10, 2021 at 11:29:17PM +0800, Wang Yugui wrote:
> > > Hi, Dennis Zhou 
> > > 
> > > Thanks for your ncie answer.
> > > but still a few questions.
> > > 
> > > > Percpu is not really cheap memory to allocate because it has a
> > > > amplification factor of NR_CPUS. As a result, percpu on the critical
> > > > path is really not something that is expected to be high throughput.
> > > 
> > > > Ideally things like btrfs snapshots should preallocate a number of these
> > > > and not try to do atomic allocations because that in theory could fail
> > > > because even after we go to the page allocator in the future we can't
> > > > get enough pages due to needing to go into reclaim.
> > > 
> > > pre-allocate in module such as mempool_t is just used in a few place in
> > > linux/fs.  so most people like system wide pre-allocate, because it is
> > > more easy to use?
> > > 
> > > can we add more chance to management the system wide pre-alloc
> > > just like this?
> > > 
> > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > index dc1f4dc..eb3f592 100644
> > > --- a/include/linux/sched/mm.h
> > > +++ b/include/linux/sched/mm.h
> > > @@ -226,6 +226,11 @@ static inline void memalloc_noio_restore(unsigned int flags)
> > >  static inline unsigned int memalloc_nofs_save(void)
> > >  {
> > >  	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> > > +
> > > +	// just like slab_pre_alloc_hook
> > > +	fs_reclaim_acquire(current->flags & gfp_allowed_mask);
> > > +	fs_reclaim_release(current->flags & gfp_allowed_mask);
> > > +
> > >  	current->flags |= PF_MEMALLOC_NOFS;
> > >  	return flags;
> > >  }
> > > 
> > > 
> > > > The workqueue approach has been good enough so far. Technically there is
> > > > a higher priority workqueue that this work could be scheduled on, but
> > > > save for this miss on my part, the system workqueue has worked out fine.
> > > 
> > > > In the future as I mentioned above. It would be good to support actually
> > > > getting pages, but it's work that needs to be tackled with a bit of
> > > > care. I might target the work for v5.14.
> > > > 
> > > > > this is our application pipeline.
> > > > > 	file_pre_process |
> > > > > 	bwa.nipt xx |
> > > > > 	samtools.nipt sort xx |
> > > > > 	file_post_process
> > > > > 
> > > > > file_pre_process/file_post_process is fast, so often are blocked by
> > > > > pipe input/output.
> > > > > 
> > > > > 'bwa.nipt xx' is a high-cpu-load, almost all of CPU cores.
> > > > > 
> > > > > 'samtools.nipt sort xx' is a high-mem-load, it keep the input in memory.
> > > > > if the memory is not enough, it will save all the buffer to temp file,
> > > > > so it is sometimes high-IO-load too(write 60G or more to file).
> > > > > 
> > > > > 
> > > > > xfstests(generic/476) is just high-IO-load, cpu/memory load is NOT high.
> > > > > so xfstests(generic/476) maybe easy than our application pipeline.
> > > > > 
> > > > > Although there is yet not a simple reproducer for another problem
> > > > > happend here, but there is a little high chance that something is wrong
> > > > > in btrfs/mm/fs-buffer.
> > > > > > but another problem(os freezed without call trace, PANIC without OOPS?,
> > > > > > the reason is yet unkown) still happen.
> > > > 
> > > > I do not have an answer for this. I would recommend looking into kdump.
> > > 
> > > percpu ENOMEM problem blocked many heavy load test a little long time?
> > > I still guess this problem of system freeze is a mm/btrfs problem.
> > > OOM not work, OOPS not work too.
> > > 
> > 
> > I don't follow. Is this still a problem after the patch?
> 
> 
> After the patch for percpu ENOMEM,  the problem of system freeze have a high
> frequecy (>75%) to be triggered by our user-space application.
> 
> The problem of system freeze maybe not caused by the percpu ENOMEM patch.
> 
> percpu ENOMEM problem maybe more easy to happen than the problem of
> system freeze.

After raising WMARK_MIN/WMARK_LOW/WMARK_HIGH by 80% for the highmem zone and
by 40% for the other zones, we worked around the system freeze, or at least
reduced how often it reproduces.

So this is a linux-mm problem.

The use case of our user-space application:
1) Write files with a total size > 3 * memory size;
   the memory size is > 128G.
2) btrfs on SSD/SAS, SSD/SATA, or btrfs RAID6 on HDD.
   SSD/NVMe may be too fast, which makes the problem difficult to reproduce.
3) Some CPU load and some memory load.

btrfs and other filesystems do not seem to favor mempool_t with
pre-allocation, so the difficult job is left to the system-wide
reclaim/pre-allocation of linux-mm.

Maybe memalloc_nofs_save() or memalloc_nofs_restore() is a good place to
add some sync/async memory reclaim/pre-allocation operations for WMARK_MIN/
WMARK_LOW/WMARK_HIGH and percpu PCPU_EMPTY_POP_PAGES_LOW.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/11







* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-11 15:20                               ` Wang Yugui
@ 2021-04-12  4:03                                 ` Dennis Zhou
  2021-04-12  5:24                                   ` Wang Yugui
  0 siblings, 1 reply; 27+ messages in thread
From: Dennis Zhou @ 2021-04-12  4:03 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

On Sun, Apr 11, 2021 at 11:20:00PM +0800, Wang Yugui wrote:
> Hi, Dennis Zhou
> 
> > Hi,
> > 
> > > On Sat, Apr 10, 2021 at 11:29:17PM +0800, Wang Yugui wrote:
> > > > Hi, Dennis Zhou 
> > > > 
> > > > Thanks for your ncie answer.
> > > > but still a few questions.
> > > > 
> > > > > Percpu is not really cheap memory to allocate because it has a
> > > > > amplification factor of NR_CPUS. As a result, percpu on the critical
> > > > > path is really not something that is expected to be high throughput.
> > > > 
> > > > > Ideally things like btrfs snapshots should preallocate a number of these
> > > > > and not try to do atomic allocations because that in theory could fail
> > > > > because even after we go to the page allocator in the future we can't
> > > > > get enough pages due to needing to go into reclaim.
> > > > 
> > > > pre-allocate in module such as mempool_t is just used in a few place in
> > > > linux/fs.  so most people like system wide pre-allocate, because it is
> > > > more easy to use?
> > > > 
> > > > can we add more chance to management the system wide pre-alloc
> > > > just like this?
> > > > 
> > > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > > index dc1f4dc..eb3f592 100644
> > > > --- a/include/linux/sched/mm.h
> > > > +++ b/include/linux/sched/mm.h
> > > > @@ -226,6 +226,11 @@ static inline void memalloc_noio_restore(unsigned int flags)
> > > >  static inline unsigned int memalloc_nofs_save(void)
> > > >  {
> > > >  	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> > > > +
> > > > +	// just like slab_pre_alloc_hook
> > > > +	fs_reclaim_acquire(current->flags & gfp_allowed_mask);
> > > > +	fs_reclaim_release(current->flags & gfp_allowed_mask);
> > > > +
> > > >  	current->flags |= PF_MEMALLOC_NOFS;
> > > >  	return flags;
> > > >  }
> > > > 
> > > > 
> > > > > The workqueue approach has been good enough so far. Technically there is
> > > > > a higher priority workqueue that this work could be scheduled on, but
> > > > > save for this miss on my part, the system workqueue has worked out fine.
> > > > 
> > > > > In the future as I mentioned above. It would be good to support actually
> > > > > getting pages, but it's work that needs to be tackled with a bit of
> > > > > care. I might target the work for v5.14.
> > > > > 
> > > > > > this is our application pipeline.
> > > > > > 	file_pre_process |
> > > > > > 	bwa.nipt xx |
> > > > > > 	samtools.nipt sort xx |
> > > > > > 	file_post_process
> > > > > > 
> > > > > > file_pre_process/file_post_process is fast, so often are blocked by
> > > > > > pipe input/output.
> > > > > > 
> > > > > > 'bwa.nipt xx' is a high-cpu-load, almost all of CPU cores.
> > > > > > 
> > > > > > 'samtools.nipt sort xx' is a high-mem-load, it keep the input in memory.
> > > > > > if the memory is not enough, it will save all the buffer to temp file,
> > > > > > so it is sometimes high-IO-load too(write 60G or more to file).
> > > > > > 
> > > > > > 
> > > > > > xfstests(generic/476) is just high-IO-load, cpu/memory load is NOT high.
> > > > > > so xfstests(generic/476) maybe easy than our application pipeline.
> > > > > > 
> > > > > > Although there is yet not a simple reproducer for another problem
> > > > > > happend here, but there is a little high chance that something is wrong
> > > > > > in btrfs/mm/fs-buffer.
> > > > > > > but another problem(os freezed without call trace, PANIC without OOPS?,
> > > > > > > the reason is yet unkown) still happen.
> > > > > 
> > > > > I do not have an answer for this. I would recommend looking into kdump.
> > > > 
> > > > percpu ENOMEM problem blocked many heavy load test a little long time?
> > > > I still guess this problem of system freeze is a mm/btrfs problem.
> > > > OOM not work, OOPS not work too.
> > > > 
> > > 
> > > I don't follow. Is this still a problem after the patch?
> > 
> > 
> > After the patch for percpu ENOMEM,  the problem of system freeze have a high
> > frequecy (>75%) to be triggered by our user-space application.
> > 
> > The problem of system freeze maybe not caused by the percpu ENOMEM patch.
> > 
> > percpu ENOMEM problem maybe more easy to happen than the problem of
> > system freeze.
> 
> After highmem zone +80% / otherzone +40% of WMARK_MIN/ WMARK_LOW/
> WMARK_HIGH, we walked around or reduced the reproduce frequency of the
> problem of system freeze.
> 
> so this is a problem of linux-mm.
> 
> the user case of our user-space application.
> 1)  write the files with the total size > 3 * memory size.
>      the memory size > 128G
> 2)  btrfs with SSD/SAS, SSD/SATA, or btrfs RAID6 hdd
>     SSD/NVMe maybe too fast, so difficult to reproduce.
> 3) some CPU load, and some memory load.
> 

To me it just sounds like writeback is slow. It's hard to debug a system
without actually observing it as well. You might want to limit the
memory allotted to the workload's cgroup, possibly via memory.high. This
may help kick reclaim in earlier.
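
A minimal sketch of the memory.high suggestion, assuming cgroup v2 is
mounted at /sys/fs/cgroup and the workload already runs in its own group.
The helper name set_memory_high and the group name in the usage comment are
hypothetical; memory.high itself is the cgroup v2 throttling limit and takes
a byte count or "max".

```shell
# set_memory_high: hypothetical helper that writes a limit to memory.high.
set_memory_high() {
    # $1: cgroup v2 directory of the workload (assumed to exist already)
    # $2: limit in bytes, or "max" to remove the limit
    [ -d "$1" ] || { echo "no such cgroup: $1" >&2; return 1; }
    printf '%s\n' "$2" > "$1/memory.high"
}

# Example (needs root and a real cgroup; the path is an assumption):
#   set_memory_high /sys/fs/cgroup/workload $((128 * 1024 * 1024 * 1024))
```

Above memory.high the group is throttled and pushed into reclaim rather than
OOM-killed, which is why it fits the goal of starting reclaim earlier.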

> btrfs and other fs seem not like mempool_t wiht pre-alloc, so difficult
> job is left to the system-wide reclaim/pre-alloc of linux-mm.
> 
> maye memalloc_nofs_save() or memalloc_nofs_restore() is a good place to
>  add some sync/aysnc memory reclaim/pre-alloc operations for WMARK_MIN/
> WMARK_LOW/WMARK_HIGH and percpu PCPU_EMPTY_POP_PAGES_LOW.
> 

It's not that simple. Memory reclaim is a balancing act and these places
mark where reclaim cannot trigger writeback and thus oom-killer is the
only way out. I'm sorry, but beyond the above, I don't really have any
additional advice besides retuning your workload to use less memory and
give the system more headroom.

I appreciate the bug report though and if it's anything percpu related I
will always be available.

> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/04/11
> 

Thanks,
Dennis



* Re: unexpected -ENOMEM from percpu_counter_init()
  2021-04-12  4:03                                 ` Dennis Zhou
@ 2021-04-12  5:24                                   ` Wang Yugui
  0 siblings, 0 replies; 27+ messages in thread
From: Wang Yugui @ 2021-04-12  5:24 UTC (permalink / raw)
  To: Dennis Zhou; +Cc: Vlastimil Babka, linux-mm, linux-btrfs

Hi,

> On Sun, Apr 11, 2021 at 11:20:00PM +0800, Wang Yugui wrote:
> > Hi, Dennis Zhou
> > 
> > > Hi,
> > > 
> > > > On Sat, Apr 10, 2021 at 11:29:17PM +0800, Wang Yugui wrote:
> > > > > Hi, Dennis Zhou 
> > > > > 
> > > > > Thanks for your ncie answer.
> > > > > but still a few questions.
> > > > > 
> > > > > > Percpu is not really cheap memory to allocate because it has a
> > > > > > amplification factor of NR_CPUS. As a result, percpu on the critical
> > > > > > path is really not something that is expected to be high throughput.
> > > > > 
> > > > > > Ideally things like btrfs snapshots should preallocate a number of these
> > > > > > and not try to do atomic allocations because that in theory could fail
> > > > > > because even after we go to the page allocator in the future we can't
> > > > > > get enough pages due to needing to go into reclaim.
> > > > > 
> > > > > pre-allocate in module such as mempool_t is just used in a few place in
> > > > > linux/fs.  so most people like system wide pre-allocate, because it is
> > > > > more easy to use?
> > > > > 
> > > > > can we add more chance to management the system wide pre-alloc
> > > > > just like this?
> > > > > 
> > > > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > > > index dc1f4dc..eb3f592 100644
> > > > > --- a/include/linux/sched/mm.h
> > > > > +++ b/include/linux/sched/mm.h
> > > > > @@ -226,6 +226,11 @@ static inline void memalloc_noio_restore(unsigned int flags)
> > > > >  static inline unsigned int memalloc_nofs_save(void)
> > > > >  {
> > > > >  	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> > > > > +
> > > > > +	// just like slab_pre_alloc_hook
> > > > > +	fs_reclaim_acquire(current->flags & gfp_allowed_mask);
> > > > > +	fs_reclaim_release(current->flags & gfp_allowed_mask);
> > > > > +
> > > > >  	current->flags |= PF_MEMALLOC_NOFS;
> > > > >  	return flags;
> > > > >  }
> > > > > 
> > > > > 
> > > > > > The workqueue approach has been good enough so far. Technically there is
> > > > > > a higher priority workqueue that this work could be scheduled on, but
> > > > > > save for this miss on my part, the system workqueue has worked out fine.
> > > > > 
> > > > > > In the future as I mentioned above. It would be good to support actually
> > > > > > getting pages, but it's work that needs to be tackled with a bit of
> > > > > > care. I might target the work for v5.14.
> > > > > > 
> > > > > > > this is our application pipeline.
> > > > > > > 	file_pre_process |
> > > > > > > 	bwa.nipt xx |
> > > > > > > 	samtools.nipt sort xx |
> > > > > > > 	file_post_process
> > > > > > > 
> > > > > > > file_pre_process/file_post_process is fast, so often are blocked by
> > > > > > > pipe input/output.
> > > > > > > 
> > > > > > > 'bwa.nipt xx' is a high-cpu-load, almost all of CPU cores.
> > > > > > > 
> > > > > > > 'samtools.nipt sort xx' is a high-mem-load, it keep the input in memory.
> > > > > > > if the memory is not enough, it will save all the buffer to temp file,
> > > > > > > so it is sometimes high-IO-load too(write 60G or more to file).
> > > > > > > 
> > > > > > > 
> > > > > > > xfstests(generic/476) is just high-IO-load, cpu/memory load is NOT high.
> > > > > > > so xfstests(generic/476) maybe easy than our application pipeline.
> > > > > > > 
> > > > > > > Although there is yet not a simple reproducer for another problem
> > > > > > > happend here, but there is a little high chance that something is wrong
> > > > > > > in btrfs/mm/fs-buffer.
> > > > > > > > but another problem(os freezed without call trace, PANIC without OOPS?,
> > > > > > > > the reason is yet unkown) still happen.
> > > > > > 
> > > > > > I do not have an answer for this. I would recommend looking into kdump.
> > > > > 
> > > > > percpu ENOMEM problem blocked many heavy load test a little long time?
> > > > > I still guess this problem of system freeze is a mm/btrfs problem.
> > > > > OOM not work, OOPS not work too.
> > > > > 
> > > > 
> > > > I don't follow. Is this still a problem after the patch?
> > > 
> > > 
> > > After the patch for percpu ENOMEM,  the problem of system freeze have a high
> > > frequecy (>75%) to be triggered by our user-space application.
> > > 
> > > The problem of system freeze maybe not caused by the percpu ENOMEM patch.
> > > 
> > > percpu ENOMEM problem maybe more easy to happen than the problem of
> > > system freeze.
> > 
> > After highmem zone +80% / otherzone +40% of WMARK_MIN/ WMARK_LOW/
> > WMARK_HIGH, we walked around or reduced the reproduce frequency of the
> > problem of system freeze.
> > 
> > so this is a problem of linux-mm.
> > 
> > the user case of our user-space application.
> > 1)  write the files with the total size > 3 * memory size.
> >      the memory size > 128G
> > 2)  btrfs with SSD/SAS, SSD/SATA, or btrfs RAID6 hdd
> >     SSD/NVMe maybe too fast, so difficult to reproduce.
> > 3) some CPU load, and some memory load.
> > 
> 
> To me it just sounds like writeback is slow. It's hard to debug a system
> without actually observing it as well. You might want to limit the
> memory allotted to the workload cgroup possibly memory.high. This may
> help kick reclaim in earlier.

It is a system panic (but with no OOPS info).
Some info is output to the IPMI/Dell iDRAC log:
   2 | 04/03/2021 | 11:35:01 | OS Critical Stop #0x46 | Run-time critical stop () | Asserted
   3 | Linux kernel panic: Fatal excep
   4 | Linux kernel panic: tion

Now I use vm.watermark_scale_factor=40 (default 10) to help kick reclaim
in earlier; the reproduce frequency is reduced but is still >25%.
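
A rough sketch of how this sysctl feeds into the watermarks, assuming the
per-zone setup in mm/page_alloc.c for these kernel versions takes the
distance between WMARK_MIN, WMARK_LOW and WMARK_HIGH as the larger of
min/4 and managed_pages * watermark_scale_factor / 10000. The helper name
wmark_gap_pages is hypothetical, and treating the whole machine as a single
zone with 4 KiB pages in the usage line is a simplification.

```shell
# wmark_gap_pages: hypothetical helper mirroring
#   gap = max(min / 4, managed * watermark_scale_factor / 10000)
# so that WMARK_LOW = min + gap and WMARK_HIGH = min + 2 * gap.
wmark_gap_pages() {
    # $1: zone min watermark in pages
    # $2: managed pages in the zone
    # $3: vm.watermark_scale_factor
    by_min=$(( $1 / 4 ))
    by_scale=$(( $2 * $3 / 10000 ))
    if [ "$by_scale" -gt "$by_min" ]; then
        echo "$by_scale"
    else
        echo "$by_min"
    fi
}
```

For example, "wmark_gap_pages 1048576 49283072 40" models a 4 GiB min
watermark and 188 GiB of managed memory. Under this simplified model the
min/4 term can dominate for small factors, so /proc/zoneinfo is the place to
check whether the low and high watermarks really moved.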

I will start a new mail thread because this is not related to
percpu ENOMEM.

> > btrfs and other fs seem not like mempool_t wiht pre-alloc, so difficult
> > job is left to the system-wide reclaim/pre-alloc of linux-mm.
> > 
> > maye memalloc_nofs_save() or memalloc_nofs_restore() is a good place to
> >  add some sync/aysnc memory reclaim/pre-alloc operations for WMARK_MIN/
> > WMARK_LOW/WMARK_HIGH and percpu PCPU_EMPTY_POP_PAGES_LOW.
> > 
> 
> It's not that simple. Memory reclaim is a balancing act and these places
> mark where reclaim cannot trigger writeback and thus oom-killer is the
> only way out. I'm sorry, but beyond the above, I don't really have any
> additional advice besides retuning your workload to use less memory and
> give the system more headroom.

Thanks for this info.

> I appreciate the bug report though and if its anything percpu related I
> will always be available.

The patch that fixed percpu ENOMEM already helps a lot. Thanks.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/12




end of thread, other threads:[~2021-04-12  5:24 UTC | newest]

Thread overview: 27+ messages
2021-04-01 10:51 unexpected -ENOMEM from percpu_counter_init() Wang Yugui
2021-04-02  1:49 ` Wang Yugui
2021-04-07 12:35 ` Vlastimil Babka
2021-04-07 13:09   ` Wang Yugui
2021-04-07 14:56     ` Dennis Zhou
2021-04-07 23:28       ` Wang Yugui
2021-04-08  2:44         ` Dennis Zhou
2021-04-08  9:20           ` Wang Yugui
2021-04-08 13:48             ` Dennis Zhou
2021-04-08 14:28               ` Filipe Manana
2021-04-08 15:02                 ` Dennis Zhou
2021-04-09 11:39                   ` Filipe Manana
2021-04-09 13:39                     ` Dennis Zhou
2021-04-09 13:42                       ` Filipe Manana
2021-04-09  0:08               ` Wang Yugui
2021-04-09  2:14                 ` Dennis Zhou
2021-04-09  4:02                   ` Wang Yugui
2021-04-09  7:36                     ` Wang Yugui
2021-04-09  7:48                       ` Wang Yugui
2021-04-09 13:56                       ` Dennis Zhou
2021-04-10 15:29                         ` Wang Yugui
2021-04-10 15:52                           ` Dennis Zhou
2021-04-10 16:08                             ` Wang Yugui
2021-04-11 15:20                               ` Wang Yugui
2021-04-12  4:03                                 ` Dennis Zhou
2021-04-12  5:24                                   ` Wang Yugui
2021-04-09  9:52   ` Wang Yugui
