From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: mhocko@kernel.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org,
	rientjes@google.com, mgorman@suse.de, hillf.zj@alibaba-inc.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/3 -v3] GFP_NOFAIL cleanups
Date: Tue, 3 Jan 2017 23:38:30 +0900
Message-ID: <201701032338.EFH69294.VOMSHFLOFOtQFJ@I-love.SAKURA.ne.jp>
In-Reply-To: <20170103084211.GB30111@dhcp22.suse.cz>

Michal Hocko wrote:
> On Tue 03-01-17 10:36:31, Tetsuo Handa wrote:
> [...]
> > I'm OK with "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator
> > slowpath" given that we describe that we make __GFP_NOFAIL stronger than
> > __GFP_NORETRY with this patch in the changelog.
> 
> Again. __GFP_NORETRY | __GFP_NOFAIL is nonsense! I do not really see any
> reason to describe all the nonsense combinations of gfp flags.

Before [PATCH 1/3]:

  __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail the allocation
  request, even if __GFP_NOFAIL is specified, when direct reclaim/compaction
  did not help."

  __GFP_NOFAIL is used as "Never fail the allocation request, even if direct
  reclaim/compaction did not help, unless __GFP_NORETRY is specified."

After [PATCH 1/3]:

  __GFP_NORETRY is used as "Do not invoke the OOM killer. Fail the allocation
  request unless __GFP_NOFAIL is specified."

  __GFP_NOFAIL is used as "Never fail the allocation request, even if direct
  reclaim/compaction did not help. Invoke the OOM killer unless __GFP_NORETRY
  is specified."

Thus, __GFP_NORETRY | __GFP_NOFAIL makes perfect sense as "Never fail the
allocation request, even if direct reclaim/compaction did not help, but do
not invoke the OOM killer either."
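
To make that ordering concrete, here is a minimal sketch of the two decisions
after [PATCH 1/3] (names and structure are illustrative only; this is not the
actual __alloc_pages_slowpath() code):

	/*
	 * Minimal sketch of the retry/OOM decisions after [PATCH 1/3].
	 * Illustrative only; not the actual __alloc_pages_slowpath() code.
	 */
	static bool should_retry(gfp_t gfp_mask, bool reclaim_progress)
	{
		if (reclaim_progress)
			return true;	/* reclaim/compaction helped; keep going */
		if (gfp_mask & __GFP_NOFAIL)
			return true;	/* never give up, with or without __GFP_NORETRY */
		if (gfp_mask & __GFP_NORETRY)
			return false;	/* fail now; remaining cases omitted here */
		return false;
	}

	static bool may_invoke_oom_killer(gfp_t gfp_mask)
	{
		/* __GFP_NORETRY suppresses the OOM killer even for __GFP_NOFAIL */
		return !(gfp_mask & __GFP_NORETRY);
	}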

> 
> > But I don't think "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
> > automatically" is correct. Firstly, we need to confirm that
> > 
> >   "The pre-mature OOM killer is a real issue as reported by Nils Holland"
> > 
> > in the changelog is still true, because we haven't tested with "[PATCH] mm, memcg:
> > fix the active list aging for lowmem requests when memcg is enabled" applied but
> > without "[PATCH 2/3] mm, oom: do not enfore OOM killer for __GFP_NOFAIL
> > automatically" and "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which do not
> > trigger OOM killer" applied.
> 
> Yes I have dropped the reference to this report already in my local
> patch because in this particular case the issue was somewhere else
> indeed!

OK.

> 
> > Secondly, as you are using __GFP_NORETRY in "[PATCH] mm: introduce kv[mz]alloc
> > helpers" as a means of ensuring that the OOM killer is not invoked
> > 
> > 	/*
> > 	 * Make sure that larger requests are not too disruptive - no OOM
> > 	 * killer and no allocation failure warnings as we have a fallback
> > 	 */
> > 	if (size > PAGE_SIZE)
> > 		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
> > 
> > , we can use __GFP_NORETRY as a means of ensuring that the OOM killer is not
> > invoked, rather than applying "[PATCH 2/3] mm, oom: do not enfore OOM killer for
> > __GFP_NOFAIL automatically".
> > 
> > 

As I wrote above, __GFP_NORETRY | __GFP_NOFAIL makes perfect sense.

> > Additionally, although currently there seems to be no
> > kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) user, kvmalloc_node() in
> > "[PATCH] mm: introduce kv[mz]alloc helpers" will be confused if a
> > kv[mz]alloc(GFP_KERNEL | __GFP_NOFAIL) user appears in the future, because
> > "[PATCH 1/3] mm: consolidate GFP_NOFAIL checks in the allocator slowpath" makes
> > __GFP_NOFAIL stronger than __GFP_NORETRY.
> 
> Using NOFAIL in kv[mz]alloc simply makes no sense at all. The vmalloc
> fallback would be simply unreachable!

My intention is shown below:

 void *kvmalloc_node(size_t size, gfp_t flags, int node)
 {
 	gfp_t kmalloc_flags = flags;
 	void *ret;
 
 	/*
 	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
 	 * so the given set of flags has to be compatible.
 	 */
 	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
 
 	/*
 	 * Make sure that larger requests are not too disruptive - no OOM
 	 * killer and no allocation failure warnings as we have a fallback
 	 */
-	if (size > PAGE_SIZE)
+	if (size > PAGE_SIZE) {
 		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
+		/*
+		 * Let the kmalloc attempt fail: __vmalloc_node_flags() below
+		 * still receives __GFP_NOFAIL via the original flags.
+		 */
+		kmalloc_flags &= ~__GFP_NOFAIL;
+	}
 
 	ret = kmalloc_node(size, kmalloc_flags, node);
 
 	/*
 	 * It doesn't really make sense to fallback to vmalloc for sub page
 	 * requests
 	 */
 	if (ret || size <= PAGE_SIZE)
 		return ret;
 
 	return __vmalloc_node_flags(size, node, flags);
 }
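
For illustration, a hypothetical future caller (as noted above, no such user
exists today) would then behave sensibly:

	/* Hypothetical caller; kv[mz]alloc currently has no __GFP_NOFAIL user. */
	static void *alloc_big_table(size_t size)
	{
		/*
		 * With the change above, the kmalloc attempt may fail quietly
		 * (__GFP_NOFAIL is cleared for it), and the request falls
		 * through to vmalloc, which still honors __GFP_NOFAIL from
		 * the original flags.
		 */
		return kvmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
	}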

> 
> > My concern with "[PATCH 3/3] mm: help __GFP_NOFAIL allocations which
> > do not trigger OOM killer" is
> > 
> >   "AFAIU, this is an allocation path which doesn't block a forward progress
> >    on a regular IO. It is merely a check whether there is a new medium in
> >    the CDROM (aka regular polling of the device). I really fail to see any
> >    reason why this one should get any access to memory reserves at all."
> > 
> > in http://lkml.kernel.org/r/20161218163727.GC8440@dhcp22.suse.cz .
> > in http://lkml.kernel.org/r/20161218163727.GC8440@dhcp22.suse.cz .
> > While that trace is a __GFP_DIRECT_RECLAIM allocation and might not be blocking
> > other workqueue items which regular I/O depends on, I think there are
> > !__GFP_DIRECT_RECLAIM memory allocation requests for issuing SCSI commands
> > which could start failing as a result of helping GFP_NOFS | __GFP_NOFAIL
> > allocations with memory reserves. If a SCSI disk I/O request fails due to
> > GFP_ATOMIC memory allocation failures because we allowed an FS I/O request to
> > use memory reserves, that adds a new problem.
> 
> Do you have any example of such a request? Anything that requires
> a forward progress during IO should be using mempools otherwise it
> is broken pretty much by design already. Also IO depending on NOFS
> allocations sounds pretty much broken already. So I suspect the above
> reasoning is just bogus.

You are missing my point. My question is "who needs memory reserves?".
I'm not saying that disk I/O depends on GFP_NOFS allocations. My worry is
that [PATCH 3/3] consumes memory reserves while disk I/O also depends on
those same reserves.

My understanding is that accessing SCSI disks involves the SCSI protocol, and
that the SCSI driver allocates memory at runtime with GFP_ATOMIC in order to
issue SCSI commands. GFP_ATOMIC allocations dip into part of the memory
reserves, and [PATCH 3/3] draws from the same reserves. If [PATCH 3/3]
depletes the reserves to the point where GFP_ATOMIC allocations can no longer
succeed, I think that causes trouble.
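
To illustrate why they compete, here is a simplified sketch of the watermark
check, modeled on __zone_watermark_ok() (the function below is illustrative,
not the kernel's code):

	/*
	 * Simplified sketch modeled on __zone_watermark_ok(): GFP_ATOMIC
	 * callers and reserve-assisted __GFP_NOFAIL callers lower the same
	 * watermark, so they drain one shared pool.
	 */
	static bool watermark_ok(long free_pages, long min_wmark,
				 unsigned int alloc_flags)
	{
		long min = min_wmark;

		if (alloc_flags & ALLOC_HIGH)	/* __GFP_HIGH, e.g. GFP_ATOMIC */
			min -= min / 2;
		if (alloc_flags & ALLOC_HARDER)	/* atomic or realtime requests */
			min -= min / 4;

		return free_pages > min;
	}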

I'm unable to obtain nice backtraces, but I think we can confirm that
GFP_ATOMIC allocations do occur while SCSI disks are in use (e.g.
sg_alloc_table_chained() calls __sg_alloc_table(GFP_ATOMIC)).
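
For reference, a condensed sketch of that call chain (simplified from
drivers/scsi/scsi_lib.c and lib/sg_pool.c; arguments elided) is:

	/*
	 * Condensed call chain (simplified; arguments elided):
	 *
	 *   scsi_init_sgtable()
	 *     -> sg_alloc_table_chained()
	 *          -> __sg_alloc_table(..., GFP_ATOMIC, sg_pool_alloc)
	 *
	 * i.e. setting up a SCSI command can allocate with GFP_ATOMIC.
	 */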

stap -DSTAP_NO_OVERLOAD=1 -DMAXSKIPPED=10000000 -e 'global in_scope; global traces;
probe begin { println("Ready."); }
# Count how deep the current thread is inside scsi_* functions.
probe kernel.function("scsi_*").call { in_scope[tid()]++; }
probe kernel.function("scsi_*").return { in_scope[tid()]--; }
# Report each unique (backtrace, gfp_mask) pair for page allocations
# performed while some scsi_* function is on the call stack.
probe kernel.function("__alloc_pages_nodemask") {
  if (in_scope[tid()] != 0) {
    bt = backtrace();
    if (!([bt,$gfp_mask] in traces)) {
      traces[bt,$gfp_mask] = 1;
      printf("mode=0x%x,order=%u by %s(%u)\n",
             $gfp_mask, $order, execname(), pid());
      print_backtrace();
      println();
    }
  }
}'

[  183.887841] ------------[ cut here ]------------
[  183.888847] WARNING: CPU: 0 PID: 0 at /root/systemtap.tmp/share/systemtap/runtime/linux/addr-map.c:42 lookup_bad_addr.isra.33+0x84/0x90 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.892011] Modules linked in: stap_40159b816dfa3e3814a532dc982f551b_1_4182(O) nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper vmw_vsock_vmci_transport ppdev vmw_balloon vsock pcspkr sg parport_pc parport vmw_vmci i2c_piix4 shpchp ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel
[  183.905779]  serio_raw mptspi scsi_transport_spi vmwgfx mptscsih ahci drm_kms_helper libahci syscopyarea sysfillrect mptbase sysimgblt fb_sys_fops ttm ata_piix drm e1000 i2c_core libata
[  183.909128] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O    4.10.0-rc2-next-20170103 #486
[  183.910839] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  183.912815] Call Trace:
[  183.913326]  <IRQ>
[  183.913729]  dump_stack+0x85/0xc9
[  183.914366]  __warn+0xd1/0xf0
[  183.914930]  warn_slowpath_null+0x1d/0x20
[  183.915691]  lookup_bad_addr.isra.33+0x84/0x90 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.917539]  unwind_frame.constprop.79+0xbed/0x1270 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.919238]  _stp_stack_kernel_get+0x16e/0x4d0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.920870]  ? gfp_pfmemalloc_allowed+0x80/0x80
[  183.921737]  _stp_stack_kernel_print+0x3e/0xc0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.923583]  probe_3838+0x1f6/0x8c0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.925005]  ? __alloc_pages_nodemask+0x1/0x4e0
[  183.925875]  ? __alloc_pages_nodemask+0x1/0x4e0
[  183.926724]  ? __alloc_pages_nodemask+0x1/0x4e0
[  183.927598]  enter_kprobe_probe+0x1e5/0x310 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.929164]  kprobe_ftrace_handler+0xe9/0x150
[  183.930260]  ? __alloc_pages_nodemask+0x5/0x4e0
[  183.931123]  ftrace_ops_assist_func+0xbf/0x110
[  183.932012]  ? sched_clock+0x9/0x10
[  183.932815]  ? sched_clock_cpu+0x84/0xb0
[  183.933550]  0xffffffffa00430d5
[  183.934947]  ? gfp_pfmemalloc_allowed+0x80/0x80
[  183.936581]  __alloc_pages_nodemask+0x5/0x4e0
[  183.938409]  alloc_pages_current+0x97/0x1b0
[  183.939971]  ? __alloc_pages_nodemask+0x5/0x4e0
[  183.941567]  ? alloc_pages_current+0x97/0x1b0
[  183.943173]  new_slab+0x3a8/0x6a0
[  183.944702]  ___slab_alloc+0x3a3/0x620
[  183.946157]  ? stp_lock_probe+0x4a/0xb0 [stap_40159b816dfa3e3814a532dc982f551b_1_4182]
[  183.948375]  ? mempool_alloc_slab+0x1c/0x20
[  183.950135]  ? mempool_alloc_slab+0x1c/0x20
[  183.951957]  __slab_alloc+0x46/0x7d
[  183.953415]  ? mempool_alloc_slab+0x1c/0x20
[  183.955001]  kmem_cache_alloc+0x2e8/0x370
[  183.956552]  ? aggr_pre_handler+0x3f/0x80
[  183.958428]  mempool_alloc_slab+0x1c/0x20
[  183.960042]  mempool_alloc+0x79/0x1b0
[  183.961512]  ? kprobe_ftrace_handler+0xa3/0x150
[  183.963082]  ? scsi_init_io+0x5/0x1d0
[  183.964501]  sg_pool_alloc+0x45/0x50
[  183.966064]  __sg_alloc_table+0xdf/0x150
[  183.967402]  ? sg_free_table_chained+0x30/0x30
[  183.969030]  sg_alloc_table_chained+0x3f/0x90
[  183.970405]  scsi_init_sgtable+0x31/0x70
[  183.971783]  copy_oldmem_page+0xd0/0xd0
[  183.973506]  copy_oldmem_page+0xd0/0xd0
[  183.974802]  sd_init_command+0x3c/0xb0 [sd_mod]
[  183.976187]  scsi_setup_cmnd+0xf0/0x150
[  183.977617]  copy_oldmem_page+0xd0/0xd0
[  183.978849]  ? scsi_prep_fn+0x5/0x170
[  183.980119]  copy_oldmem_page+0xd0/0xd0
[  183.981406]  scsi_request_fn+0x42/0x740
[  183.982760]  ? scsi_request_fn+0x5/0x740
[  183.984049]  copy_oldmem_page+0xd0/0xd0
[  183.985438]  blk_run_queue+0x26/0x40
[  183.986751]  scsi_kick_queue+0x25/0x30
[  183.987891]  copy_oldmem_page+0xd0/0xd0
[  183.989073]  copy_oldmem_page+0xd0/0xd0
[  183.990175]  copy_oldmem_page+0xd0/0xd0
[  183.991348]  copy_oldmem_page+0xd0/0xd0
[  183.992450]  copy_oldmem_page+0xd0/0xd0
[  183.993677]  blk_done_softirq+0xa8/0xd0
[  183.994766]  __do_softirq+0xc0/0x52d
[  183.995792]  irq_exit+0xf5/0x110
[  183.996748]  smp_call_function_single_interrupt+0x33/0x40
[  183.998230]  call_function_single_interrupt+0x9d/0xb0
[  183.999521] RIP: 0010:native_safe_halt+0x6/0x10
[  184.000791] RSP: 0018:ffffffff81c03dd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff04
[  184.002749] RAX: ffffffff81c18500 RBX: 0000000000000000 RCX: 0000000000000000
[  184.004440] RDX: ffffffff81c18500 RSI: 0000000000000001 RDI: ffffffff81c18500
[  184.006281] RBP: ffffffff81c03dd0 R08: 0000000000000000 R09: 0000000000000000
[  184.008063] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
[  184.009842] R13: ffffffff81c18500 R14: ffffffff81c18500 R15: 0000000000000000
[  184.011758]  </IRQ>
[  184.012952]  default_idle+0x23/0x1d0
[  184.014085]  arch_cpu_idle+0xf/0x20
[  184.015204]  default_idle_call+0x23/0x40
[  184.016347]  do_idle+0x162/0x230
[  184.017353]  cpu_startup_entry+0x71/0x80
[  184.018686]  rest_init+0x138/0x140
[  184.019961]  ? rest_init+0x5/0x140
[  184.021034]  start_kernel+0x4ba/0x4db
[  184.022305]  ? set_init_arg+0x55/0x55
[  184.023405]  ? early_idt_handler_array+0x120/0x120
[  184.024730]  x86_64_start_reservations+0x2a/0x2c
[  184.026121]  x86_64_start_kernel+0x14c/0x16f
[  184.027329]  start_cpu+0x14/0x14
[  184.028391] ---[ end trace cbfebb3ae93a99b9 ]---

> 
> That being said, to summarize your arguments again. 1) you do not like
> that a combination of __GFP_NORETRY | __GFP_NOFAIL is not documented
> to never fail,

Correct.

>                2) based on that you argue that kv[mz]alloc with
> __GFP_NOFAIL will never reach vmalloc

Wrong.

>                                       and 3) that there might be some IO
> paths depending on NOFS|NOFAIL allocation which would have harder time
> to make forward progress.

Wrong.

> 
> I would call 1 and 2 just bogus and 3 highly dubious at best. Do not
> get me wrong, but this is not what I call useful review feedback, let
> alone a reason to block these patches. If there are any reasons to not
> merge them, these are not those.
