All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6 v3] kvmalloc
@ 2017-01-30  9:49 Michal Hocko
  2017-01-30  9:49 ` [PATCH 1/9] mm: introduce kv[mz]alloc helpers Michal Hocko
                   ` (9 more replies)
  0 siblings, 10 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Andreas Dilger,
	Andreas Dilger, Andrey Konovalov, Anton Vorontsov, Ben Skeggs,
	Boris Ostrovsky, Christian Borntraeger, Colin Cross,
	Daniel Borkmann, Dan Williams, David Sterba, Eric Dumazet,
	Eric Dumazet, Hariprasad S, Heiko Carstens, Herbert Xu,
	Ilya Dryomov, John Hubbard, Kees Cook, Kent Overstreet,
	Marcelo Ricardo Leitner, Martin Schwidefsky, Michael S. Tsirkin,
	Michal Hocko, Mike Snitzer, Mikulas Patocka, Oleg Drokin,
	Pablo Neira Ayuso, Rafael J. Wysocki, Santosh Raspatur,
	Tariq Toukan, Tom Herbert, Tony Luck, Yan, Zheng, Yishai Hadas

Hi,
this has been previously posted here [1] and it received quite some
feedback. As a result the number of patches has grown again. We are at
9 patches right now. I have rebased the series on top of the current
next-20170130. There were some changes since the last posting, namely
a7f6c1b63b86 ("AppArmor: Use GFP_KERNEL for __aa_kvmalloc().") which
dropped GFP_NOIO from __aa_kvmalloc and d407bd25a204 ("bpf: don't
trigger OOM killer under pressure with map alloc") which has created a
kvmalloc alternative for bpf code. Both have been changed to use the mm
kvmalloc but it is worth noting this dependency during the merge window.

I hope there are no further obstacles to have this merged into the mmotm
tree and go in in the next merge window.

Original cover:

There are many open coded kmalloc with vmalloc fallback instances in
the tree.  Most of them are not careful enough or simply do not care
about the underlying semantic of the kmalloc/page allocator which means
that a) some vmalloc fallbacks are basically unreachable because the
kmalloc part will keep retrying until it succeeds b) the page allocator
can invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead.  This is patch 5. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170112153717.28943-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

Michal Hocko (9):
      mm: introduce kv[mz]alloc helpers
      mm: support __GFP_REPEAT in kvmalloc_node for >32kB
      rhashtable: simplify a strange allocation pattern
      ila: simplify a strange allocation pattern
      treewide: use kv[mz]alloc* rather than opencoded variants
      net: use kvmalloc with __GFP_REPEAT rather than open coded variant
      md: use kvmalloc rather than opencoded variant
      bcache: use kvmalloc
      net, bpf: use kvzalloc helper

 arch/s390/kvm/kvm-s390.c                           | 10 +---
 arch/x86/kvm/lapic.c                               |  4 +-
 arch/x86/kvm/page_track.c                          |  4 +-
 arch/x86/kvm/x86.c                                 |  4 +-
 crypto/lzo.c                                       |  4 +-
 drivers/acpi/apei/erst.c                           |  8 +--
 drivers/char/agp/generic.c                         |  8 +--
 drivers/gpu/drm/nouveau/nouveau_gem.c              |  4 +-
 drivers/md/bcache/super.c                          |  8 +--
 drivers/md/bcache/util.h                           | 12 +----
 drivers/md/dm-ioctl.c                              | 13 ++---
 drivers/md/dm-stats.c                              |  7 +--
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h    |  3 --
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c | 29 ++---------
 drivers/net/ethernet/chelsio/cxgb3/l2t.c           |  8 +--
 drivers/net/ethernet/chelsio/cxgb3/l2t.h           |  1 -
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c      | 12 ++---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |  3 --
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 10 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c |  8 +--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 31 ++----------
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c  | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/l2t.c           |  2 +-
 drivers/net/ethernet/chelsio/cxgb4/sched.c         | 12 ++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |  9 ++--
 drivers/net/ethernet/mellanox/mlx4/mr.c            |  9 ++--
 drivers/nvdimm/dimm_devs.c                         |  5 +-
 .../staging/lustre/lnet/libcfs/linux/linux-mem.c   | 11 +----
 drivers/vhost/net.c                                |  9 ++--
 drivers/vhost/vhost.c                              | 15 ++----
 drivers/vhost/vsock.c                              |  9 ++--
 drivers/xen/evtchn.c                               | 14 +-----
 fs/btrfs/ctree.c                                   |  9 ++--
 fs/btrfs/ioctl.c                                   |  9 ++--
 fs/btrfs/send.c                                    | 27 ++++------
 fs/ceph/file.c                                     |  9 ++--
 fs/ext4/mballoc.c                                  |  2 +-
 fs/ext4/super.c                                    |  4 +-
 fs/f2fs/f2fs.h                                     | 20 --------
 fs/f2fs/file.c                                     |  4 +-
 fs/f2fs/segment.c                                  | 14 +++---
 fs/select.c                                        |  5 +-
 fs/seq_file.c                                      | 16 +-----
 fs/xattr.c                                         | 27 ++++------
 include/linux/kvm_host.h                           |  2 -
 include/linux/mlx5/driver.h                        |  7 +--
 include/linux/mm.h                                 | 22 +++++++++
 include/linux/vmalloc.h                            |  1 +
 ipc/util.c                                         |  7 +--
 kernel/bpf/syscall.c                               | 19 ++------
 lib/iov_iter.c                                     |  5 +-
 lib/rhashtable.c                                   | 13 ++---
 mm/frame_vector.c                                  |  5 +-
 mm/nommu.c                                         |  5 ++
 mm/util.c                                          | 57 ++++++++++++++++++++++
 mm/vmalloc.c                                       |  9 +++-
 net/core/dev.c                                     | 24 ++++-----
 net/ipv4/inet_hashtables.c                         |  6 +--
 net/ipv4/tcp_metrics.c                             |  5 +-
 net/ipv6/ila/ila_xlat.c                            |  8 +--
 net/mpls/af_mpls.c                                 |  5 +-
 net/netfilter/x_tables.c                           | 37 ++++----------
 net/netfilter/xt_recent.c                          |  5 +-
 net/sched/sch_choke.c                              |  5 +-
 net/sched/sch_fq.c                                 | 12 +----
 net/sched/sch_fq_codel.c                           | 26 +++-------
 net/sched/sch_hhf.c                                | 33 ++++---------
 net/sched/sch_netem.c                              |  6 +--
 net/sched/sch_sfq.c                                |  6 +--
 security/apparmor/apparmorfs.c                     |  2 +-
 security/apparmor/include/lib.h                    | 11 -----
 security/apparmor/lib.c                            | 30 ------------
 security/apparmor/match.c                          |  2 +-
 security/apparmor/policy_unpack.c                  |  2 +-
 security/keys/keyctl.c                             | 22 +++------
 virt/kvm/kvm_main.c                                | 18 ++-----
 76 files changed, 279 insertions(+), 583 deletions(-)

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 0/6 v3] kvmalloc
@ 2017-01-30  9:49 Michal Hocko
  2017-01-30  9:49 ` [PATCH 1/9] mm: introduce kv[mz]alloc helpers Michal Hocko
                   ` (9 more replies)
  0 siblings, 10 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Andreas Dilger,
	Andreas Dilger, Andrey Konovalov, Anton Vorontsov, Ben Skeggs,
	Boris Ostrovsky, Christian Borntraeger, Colin Cross,
	Daniel Borkmann, Dan Williams, David Sterba, Eric Dumazet,
	Eric Dumazet, Hariprasad S, Heiko Carstens, Herbert Xu,
	Ilya Dryomov, John Hubbard, Kees Cook, Kent Overstreet,
	Marcelo Ricardo Leitner, Martin Schwidefsky, Michael S. Tsirkin,
	Michal Hocko, Mike Snitzer, Mikulas Patocka, Oleg Drokin,
	Pablo Neira Ayuso, Rafael J. Wysocki, Santosh Raspatur,
	Tariq Toukan, Tom Herbert, Tony Luck, Yan, Zheng, Yishai Hadas

Hi,
this has been previously posted here [1] and it received quite some
feedback. As a result the number of patches has grown again. We are at
9 patches right now. I have rebased the series on top of the current
next-20170130. There were some changes since the last posting, namely
a7f6c1b63b86 ("AppArmor: Use GFP_KERNEL for __aa_kvmalloc().") which
dropped GFP_NOIO from __aa_kvmalloc and d407bd25a204 ("bpf: don't
trigger OOM killer under pressure with map alloc") which has created a
kvmalloc alternative for bpf code. Both have been changed to use the mm
kvmalloc but it is worth noting this dependency during the merge window.

I hope there are no further obstacles to have this merged into the mmotm
tree and go in in the next merge window.

Original cover:

There are many open coded kmalloc with vmalloc fallback instances in
the tree.  Most of them are not careful enough or simply do not care
about the underlying semantic of the kmalloc/page allocator which means
that a) some vmalloc fallbacks are basically unreachable because the
kmalloc part will keep retrying until it succeeds b) the page allocator
can invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead.  This is patch 5. There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170112153717.28943-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

Michal Hocko (9):
      mm: introduce kv[mz]alloc helpers
      mm: support __GFP_REPEAT in kvmalloc_node for >32kB
      rhashtable: simplify a strange allocation pattern
      ila: simplify a strange allocation pattern
      treewide: use kv[mz]alloc* rather than opencoded variants
      net: use kvmalloc with __GFP_REPEAT rather than open coded variant
      md: use kvmalloc rather than opencoded variant
      bcache: use kvmalloc
      net, bpf: use kvzalloc helper

 arch/s390/kvm/kvm-s390.c                           | 10 +---
 arch/x86/kvm/lapic.c                               |  4 +-
 arch/x86/kvm/page_track.c                          |  4 +-
 arch/x86/kvm/x86.c                                 |  4 +-
 crypto/lzo.c                                       |  4 +-
 drivers/acpi/apei/erst.c                           |  8 +--
 drivers/char/agp/generic.c                         |  8 +--
 drivers/gpu/drm/nouveau/nouveau_gem.c              |  4 +-
 drivers/md/bcache/super.c                          |  8 +--
 drivers/md/bcache/util.h                           | 12 +----
 drivers/md/dm-ioctl.c                              | 13 ++---
 drivers/md/dm-stats.c                              |  7 +--
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h    |  3 --
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c | 29 ++---------
 drivers/net/ethernet/chelsio/cxgb3/l2t.c           |  8 +--
 drivers/net/ethernet/chelsio/cxgb3/l2t.h           |  1 -
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c      | 12 ++---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |  3 --
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 10 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c |  8 +--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 31 ++----------
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c  | 13 +++--
 drivers/net/ethernet/chelsio/cxgb4/l2t.c           |  2 +-
 drivers/net/ethernet/chelsio/cxgb4/sched.c         | 12 ++---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |  9 ++--
 drivers/net/ethernet/mellanox/mlx4/mr.c            |  9 ++--
 drivers/nvdimm/dimm_devs.c                         |  5 +-
 .../staging/lustre/lnet/libcfs/linux/linux-mem.c   | 11 +----
 drivers/vhost/net.c                                |  9 ++--
 drivers/vhost/vhost.c                              | 15 ++----
 drivers/vhost/vsock.c                              |  9 ++--
 drivers/xen/evtchn.c                               | 14 +-----
 fs/btrfs/ctree.c                                   |  9 ++--
 fs/btrfs/ioctl.c                                   |  9 ++--
 fs/btrfs/send.c                                    | 27 ++++------
 fs/ceph/file.c                                     |  9 ++--
 fs/ext4/mballoc.c                                  |  2 +-
 fs/ext4/super.c                                    |  4 +-
 fs/f2fs/f2fs.h                                     | 20 --------
 fs/f2fs/file.c                                     |  4 +-
 fs/f2fs/segment.c                                  | 14 +++---
 fs/select.c                                        |  5 +-
 fs/seq_file.c                                      | 16 +-----
 fs/xattr.c                                         | 27 ++++------
 include/linux/kvm_host.h                           |  2 -
 include/linux/mlx5/driver.h                        |  7 +--
 include/linux/mm.h                                 | 22 +++++++++
 include/linux/vmalloc.h                            |  1 +
 ipc/util.c                                         |  7 +--
 kernel/bpf/syscall.c                               | 19 ++------
 lib/iov_iter.c                                     |  5 +-
 lib/rhashtable.c                                   | 13 ++---
 mm/frame_vector.c                                  |  5 +-
 mm/nommu.c                                         |  5 ++
 mm/util.c                                          | 57 ++++++++++++++++++++++
 mm/vmalloc.c                                       |  9 +++-
 net/core/dev.c                                     | 24 ++++-----
 net/ipv4/inet_hashtables.c                         |  6 +--
 net/ipv4/tcp_metrics.c                             |  5 +-
 net/ipv6/ila/ila_xlat.c                            |  8 +--
 net/mpls/af_mpls.c                                 |  5 +-
 net/netfilter/x_tables.c                           | 37 ++++----------
 net/netfilter/xt_recent.c                          |  5 +-
 net/sched/sch_choke.c                              |  5 +-
 net/sched/sch_fq.c                                 | 12 +----
 net/sched/sch_fq_codel.c                           | 26 +++-------
 net/sched/sch_hhf.c                                | 33 ++++---------
 net/sched/sch_netem.c                              |  6 +--
 net/sched/sch_sfq.c                                |  6 +--
 security/apparmor/apparmorfs.c                     |  2 +-
 security/apparmor/include/lib.h                    | 11 -----
 security/apparmor/lib.c                            | 30 ------------
 security/apparmor/match.c                          |  2 +-
 security/apparmor/policy_unpack.c                  |  2 +-
 security/keys/keyctl.c                             | 22 +++------
 virt/kvm/kvm_main.c                                | 18 ++-----
 76 files changed, 279 insertions(+), 583 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 1/9] mm: introduce kv[mz]alloc helpers
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30  9:49 ` [PATCH 2/9] mm: support __GFP_REPEAT in kvmalloc_node for >32kB Michal Hocko
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, John Hubbard

From: Michal Hocko <mhocko@suse.com>

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures. This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them. In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be
broken and require GFP_NO{FS,IO} context which is not vmalloc compatible
in general (note that the page table allocation is GFP_KERNEL). Those
need to be fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported
gfp mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers. Existing ones would have to die
slowly.

Changes since v3
- add __GFP_HIGHMEM for the vmalloc fallback
- document gfp_mask in __vmalloc_node
- change ipc_alloc to use the library kvmalloc
- __aa_kvmalloc doesn't rely on GFP_NOIO anymore so we can drop and
use the library kvmalloc directly

Changes since v2
- s@WARN_ON@WARN_ON_ONCE@ as per Vlastimil
- do not fallback to vmalloc for size = PAGE_SIZE as per Vlastimil

Changes since v1
- define __vmalloc_node_flags for CONFIG_MMU=n

Cc: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca> # ext4 part
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/kvm/lapic.c              |  4 ++--
 arch/x86/kvm/page_track.c         |  4 ++--
 arch/x86/kvm/x86.c                |  4 ++--
 drivers/md/dm-stats.c             |  7 +-----
 fs/ext4/mballoc.c                 |  2 +-
 fs/ext4/super.c                   |  4 ++--
 fs/f2fs/f2fs.h                    | 20 -----------------
 fs/f2fs/file.c                    |  4 ++--
 fs/f2fs/segment.c                 | 14 ++++++------
 fs/seq_file.c                     | 16 +-------------
 include/linux/kvm_host.h          |  2 --
 include/linux/mm.h                | 14 ++++++++++++
 include/linux/vmalloc.h           |  1 +
 ipc/util.c                        |  7 +-----
 mm/nommu.c                        |  5 +++++
 mm/util.c                         | 45 +++++++++++++++++++++++++++++++++++++++
 mm/vmalloc.c                      |  9 +++++++-
 security/apparmor/apparmorfs.c    |  2 +-
 security/apparmor/include/lib.h   | 11 ----------
 security/apparmor/lib.c           | 30 --------------------------
 security/apparmor/match.c         |  2 +-
 security/apparmor/policy_unpack.c |  2 +-
 virt/kvm/kvm_main.c               | 18 +++-------------
 23 files changed, 100 insertions(+), 127 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 33b799fd3a6e..42562348bed2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -177,8 +177,8 @@ static void recalculate_apic_map(struct kvm *kvm)
 		if (kvm_apic_present(vcpu))
 			max_id = max(max_id, kvm_x2apic_id(vcpu->arch.apic));
 
-	new = kvm_kvzalloc(sizeof(struct kvm_apic_map) +
-	                   sizeof(struct kvm_lapic *) * ((u64)max_id + 1));
+	new = kvzalloc(sizeof(struct kvm_apic_map) +
+	                   sizeof(struct kvm_lapic *) * ((u64)max_id + 1), GFP_KERNEL);
 
 	if (!new)
 		goto out;
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 4a1c13eaa518..d46663e655b0 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -38,8 +38,8 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
 	int  i;
 
 	for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
-		slot->arch.gfn_track[i] = kvm_kvzalloc(npages *
-					    sizeof(*slot->arch.gfn_track[i]));
+		slot->arch.gfn_track[i] = kvzalloc(npages *
+					    sizeof(*slot->arch.gfn_track[i]), GFP_KERNEL);
 		if (!slot->arch.gfn_track[i])
 			goto track_free;
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 313f2cecbc57..07b0d17df9ea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8121,13 +8121,13 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 				      slot->base_gfn, level) + 1;
 
 		slot->arch.rmap[i] =
-			kvm_kvzalloc(lpages * sizeof(*slot->arch.rmap[i]));
+			kvzalloc(lpages * sizeof(*slot->arch.rmap[i]), GFP_KERNEL);
 		if (!slot->arch.rmap[i])
 			goto out_free;
 		if (i == 0)
 			continue;
 
-		linfo = kvm_kvzalloc(lpages * sizeof(*linfo));
+		linfo = kvzalloc(lpages * sizeof(*linfo), GFP_KERNEL);
 		if (!linfo)
 			goto out_free;
 
diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c
index 38b05f23b96c..674f9a1686f7 100644
--- a/drivers/md/dm-stats.c
+++ b/drivers/md/dm-stats.c
@@ -146,12 +146,7 @@ static void *dm_kvzalloc(size_t alloc_size, int node)
 	if (!claim_shared_memory(alloc_size))
 		return NULL;
 
-	if (alloc_size <= KMALLOC_MAX_SIZE) {
-		p = kzalloc_node(alloc_size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN, node);
-		if (p)
-			return p;
-	}
-	p = vzalloc_node(alloc_size, node);
+	p = kvzalloc_node(alloc_size, GFP_KERNEL | __GFP_NOMEMALLOC, node);
 	if (p)
 		return p;
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d9fd184b049e..31a761dd76f5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2381,7 +2381,7 @@ int ext4_mb_alloc_groupinfo(struct super_block *sb, ext4_group_t ngroups)
 		return 0;
 
 	size = roundup_pow_of_two(sizeof(*sbi->s_group_info) * size);
-	new_groupinfo = ext4_kvzalloc(size, GFP_KERNEL);
+	new_groupinfo = kvzalloc(size, GFP_KERNEL);
 	if (!new_groupinfo) {
 		ext4_msg(sb, KERN_ERR, "can't allocate buddy meta group");
 		return -ENOMEM;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9d15a6293124..e3f1ff04a85f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2110,7 +2110,7 @@ int ext4_alloc_flex_bg_array(struct super_block *sb, ext4_group_t ngroup)
 		return 0;
 
 	size = roundup_pow_of_two(size * sizeof(struct flex_groups));
-	new_groups = ext4_kvzalloc(size, GFP_KERNEL);
+	new_groups = kvzalloc(size, GFP_KERNEL);
 	if (!new_groups) {
 		ext4_msg(sb, KERN_ERR, "not enough memory for %d flex groups",
 			 size / (int) sizeof(struct flex_groups));
@@ -3844,7 +3844,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			goto failed_mount;
 		}
 	}
-	sbi->s_group_desc = ext4_kvmalloc(db_count *
+	sbi->s_group_desc = kvmalloc(db_count *
 					  sizeof(struct buffer_head *),
 					  GFP_KERNEL);
 	if (sbi->s_group_desc == NULL) {
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 4cce13d0a1a4..83dc25025277 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -1935,26 +1935,6 @@ static inline void *f2fs_kmalloc(struct f2fs_sb_info *sbi,
 	return kmalloc(size, flags);
 }
 
-static inline void *f2fs_kvmalloc(size_t size, gfp_t flags)
-{
-	void *ret;
-
-	ret = kmalloc(size, flags | __GFP_NOWARN);
-	if (!ret)
-		ret = __vmalloc(size, flags, PAGE_KERNEL);
-	return ret;
-}
-
-static inline void *f2fs_kvzalloc(size_t size, gfp_t flags)
-{
-	void *ret;
-
-	ret = kzalloc(size, flags | __GFP_NOWARN);
-	if (!ret)
-		ret = __vmalloc(size, flags | __GFP_ZERO, PAGE_KERNEL);
-	return ret;
-}
-
 #define get_inode_mode(i) \
 	((is_inode_flag_set(i, FI_ACL_MODE)) ? \
 	 (F2FS_I(i)->i_acl_mode) : ((i)->i_mode))
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 2752bcf98f95..82ca8c038ecf 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1014,11 +1014,11 @@ static int __exchange_data_block(struct inode *src_inode,
 	while (len) {
 		olen = min((pgoff_t)4 * ADDRS_PER_BLOCK, len);
 
-		src_blkaddr = f2fs_kvzalloc(sizeof(block_t) * olen, GFP_KERNEL);
+		src_blkaddr = kvzalloc(sizeof(block_t) * olen, GFP_KERNEL);
 		if (!src_blkaddr)
 			return -ENOMEM;
 
-		do_replace = f2fs_kvzalloc(sizeof(int) * olen, GFP_KERNEL);
+		do_replace = kvzalloc(sizeof(int) * olen, GFP_KERNEL);
 		if (!do_replace) {
 			kvfree(src_blkaddr);
 			return -ENOMEM;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index fb57ab9f6aa6..127d875a79f7 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -2351,13 +2351,13 @@ static int build_sit_info(struct f2fs_sb_info *sbi)
 
 	SM_I(sbi)->sit_info = sit_i;
 
-	sit_i->sentries = f2fs_kvzalloc(MAIN_SEGS(sbi) *
+	sit_i->sentries = kvzalloc(MAIN_SEGS(sbi) *
 					sizeof(struct seg_entry), GFP_KERNEL);
 	if (!sit_i->sentries)
 		return -ENOMEM;
 
 	bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi));
-	sit_i->dirty_sentries_bitmap = f2fs_kvzalloc(bitmap_size, GFP_KERNEL);
+	sit_i->dirty_sentries_bitmap = kvzalloc(bitmap_size, GFP_KERNEL);
 	if (!sit_i->dirty_sentries_bitmap)
 		return -ENOMEM;
 
@@ -2390,7 +2390,7 @@ static int build_sit_info(struct f2fs_sb_info *sbi)
 		return -ENOMEM;
 
 	if (sbi->segs_per_sec > 1) {
-		sit_i->sec_entries = f2fs_kvzalloc(MAIN_SECS(sbi) *
+		sit_i->sec_entries = kvzalloc(MAIN_SECS(sbi) *
 					sizeof(struct sec_entry), GFP_KERNEL);
 		if (!sit_i->sec_entries)
 			return -ENOMEM;
@@ -2441,12 +2441,12 @@ static int build_free_segmap(struct f2fs_sb_info *sbi)
 	SM_I(sbi)->free_info = free_i;
 
 	bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi));
-	free_i->free_segmap = f2fs_kvmalloc(bitmap_size, GFP_KERNEL);
+	free_i->free_segmap = kvmalloc(bitmap_size, GFP_KERNEL);
 	if (!free_i->free_segmap)
 		return -ENOMEM;
 
 	sec_bitmap_size = f2fs_bitmap_size(MAIN_SECS(sbi));
-	free_i->free_secmap = f2fs_kvmalloc(sec_bitmap_size, GFP_KERNEL);
+	free_i->free_secmap = kvmalloc(sec_bitmap_size, GFP_KERNEL);
 	if (!free_i->free_secmap)
 		return -ENOMEM;
 
@@ -2614,7 +2614,7 @@ static int init_victim_secmap(struct f2fs_sb_info *sbi)
 	struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
 	unsigned int bitmap_size = f2fs_bitmap_size(MAIN_SECS(sbi));
 
-	dirty_i->victim_secmap = f2fs_kvzalloc(bitmap_size, GFP_KERNEL);
+	dirty_i->victim_secmap = kvzalloc(bitmap_size, GFP_KERNEL);
 	if (!dirty_i->victim_secmap)
 		return -ENOMEM;
 	return 0;
@@ -2636,7 +2636,7 @@ static int build_dirty_segmap(struct f2fs_sb_info *sbi)
 	bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi));
 
 	for (i = 0; i < NR_DIRTY_TYPE; i++) {
-		dirty_i->dirty_segmap[i] = f2fs_kvzalloc(bitmap_size, GFP_KERNEL);
+		dirty_i->dirty_segmap[i] = kvzalloc(bitmap_size, GFP_KERNEL);
 		if (!dirty_i->dirty_segmap[i])
 			return -ENOMEM;
 	}
diff --git a/fs/seq_file.c b/fs/seq_file.c
index ca69fb99e41a..dc7c2be963ed 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -25,21 +25,7 @@ static void seq_set_overflow(struct seq_file *m)
 
 static void *seq_buf_alloc(unsigned long size)
 {
-	void *buf;
-	gfp_t gfp = GFP_KERNEL;
-
-	/*
-	 * For high order allocations, use __GFP_NORETRY to avoid oom-killing -
-	 * it's better to fall back to vmalloc() than to kill things.  For small
-	 * allocations, just use GFP_KERNEL which will oom kill, thus no need
-	 * for vmalloc fallback.
-	 */
-	if (size > PAGE_SIZE)
-		gfp |= __GFP_NORETRY | __GFP_NOWARN;
-	buf = kmalloc(size, gfp);
-	if (!buf && size > PAGE_SIZE)
-		buf = vmalloc(size);
-	return buf;
+	return kvmalloc(size, GFP_KERNEL);
 }
 
 /**
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c5190dab2c1..00e6f93d1ee0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -768,8 +768,6 @@ void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
 
-void *kvm_kvzalloc(unsigned long size);
-
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 static inline struct kvm *kvm_arch_alloc_vm(void)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a6b9ab0945c4..0d9fdc0a2a7b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -498,6 +498,20 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 }
 #endif
 
+extern void *kvmalloc_node(size_t size, gfp_t flags, int node);
+static inline void *kvmalloc(size_t size, gfp_t flags)
+{
+	return kvmalloc_node(size, flags, NUMA_NO_NODE);
+}
+static inline void *kvzalloc_node(size_t size, gfp_t flags, int node)
+{
+	return kvmalloc_node(size, flags | __GFP_ZERO, node);
+}
+static inline void *kvzalloc(size_t size, gfp_t flags)
+{
+	return kvmalloc(size, flags | __GFP_ZERO);
+}
+
 extern void kvfree(const void *addr);
 
 static inline atomic_t *compound_mapcount_ptr(struct page *page)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index d68edffbf142..46991ad3ddd5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -80,6 +80,7 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller);
+extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
 
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
diff --git a/ipc/util.c b/ipc/util.c
index 798cad18dd87..74c2adc62086 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -403,12 +403,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
  */
 void *ipc_alloc(int size)
 {
-	void *out;
-	if (size > PAGE_SIZE)
-		out = vmalloc(size);
-	else
-		out = kmalloc(size, GFP_KERNEL);
-	return out;
+	return kvmalloc(size, GFP_KERNEL);
 }
 
 /**
diff --git a/mm/nommu.c b/mm/nommu.c
index 215c62296028..bee76e6cd4e5 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -236,6 +236,11 @@ void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot)
 }
 EXPORT_SYMBOL(__vmalloc);
 
+void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags)
+{
+	return __vmalloc(size, flags, PAGE_KERNEL);
+}
+
 void *vmalloc_user(unsigned long size)
 {
 	void *ret;
diff --git a/mm/util.c b/mm/util.c
index 3cb2164f4099..ef72e2554edb 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -324,6 +324,51 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_mmap);
 
+/**
+ * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
+ * failure, fall back to non-contiguous (vmalloc) allocation.
+ * @size: size of the request.
+ * @flags: gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
+ * @node: numa node to allocate from
+ *
+ * Uses kmalloc to get the memory but if the allocation fails then falls back
+ * to the vmalloc allocator. Use kvfree for freeing the memory.
+ *
+ * Reclaim modifiers - __GFP_NORETRY, __GFP_REPEAT and __GFP_NOFAIL are not supported
+ *
+ * Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.
+ */
+void *kvmalloc_node(size_t size, gfp_t flags, int node)
+{
+	gfp_t kmalloc_flags = flags;
+	void *ret;
+
+	/*
+	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
+	 * so the given set of flags has to be compatible.
+	 */
+	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
+
+	/*
+	 * Make sure that larger requests are not too disruptive - no OOM
+	 * killer and no allocation failure warnings as we have a fallback
+	 */
+	if (size > PAGE_SIZE)
+		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
+
+	ret = kmalloc_node(size, kmalloc_flags, node);
+
+	/*
+	 * It doesn't really make sense to fallback to vmalloc for sub page
+	 * requests
+	 */
+	if (ret || size <= PAGE_SIZE)
+		return ret;
+
+	return __vmalloc_node_flags(size, node, flags | __GFP_HIGHMEM);
+}
+EXPORT_SYMBOL(kvmalloc_node);
+
 void kvfree(const void *addr)
 {
 	if (is_vmalloc_addr(addr))
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d89034a393f2..6c1aa2c68887 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *	Allocate enough pages to cover @size from the page level
  *	allocator with @gfp_mask flags.  Map them into contiguous
  *	kernel virtual space, using a pagetable protection of @prot.
+ *
+ *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ *	and __GFP_NOFAIL are not supported
+ *
+ *	Any use of gfp flags outside of GFP_KERNEL should be consulted
+ *	with mm people.
+ *
  */
 static void *__vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, pgprot_t prot,
@@ -1757,7 +1764,7 @@ void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot)
 }
 EXPORT_SYMBOL(__vmalloc);
 
-static inline void *__vmalloc_node_flags(unsigned long size,
+void *__vmalloc_node_flags(unsigned long size,
 					int node, gfp_t flags)
 {
 	return __vmalloc_node(size, 1, flags, PAGE_KERNEL,
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 41073f70eb41..be0b49897a67 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -98,7 +98,7 @@ static struct aa_loaddata *aa_simple_write_to_buffer(const char __user *userbuf,
 		return ERR_PTR(-ESPIPE);
 
 	/* freed by caller to simple_write_to_buffer */
-	data = kvmalloc(sizeof(*data) + alloc_size);
+	data = kvmalloc(sizeof(*data) + alloc_size, GFP_KERNEL);
 	if (data == NULL)
 		return ERR_PTR(-ENOMEM);
 	kref_init(&data->count);
diff --git a/security/apparmor/include/lib.h b/security/apparmor/include/lib.h
index 65ff492a9807..75733baa6702 100644
--- a/security/apparmor/include/lib.h
+++ b/security/apparmor/include/lib.h
@@ -64,17 +64,6 @@ char *aa_split_fqname(char *args, char **ns_name);
 const char *aa_splitn_fqname(const char *fqname, size_t n, const char **ns_name,
 			     size_t *ns_len);
 void aa_info_message(const char *str);
-void *__aa_kvmalloc(size_t size, gfp_t flags);
-
-static inline void *kvmalloc(size_t size)
-{
-	return __aa_kvmalloc(size, 0);
-}
-
-static inline void *kvzalloc(size_t size)
-{
-	return __aa_kvmalloc(size, __GFP_ZERO);
-}
 
 /**
  * aa_strneq - compare null terminated @str to a non null terminated substring
diff --git a/security/apparmor/lib.c b/security/apparmor/lib.c
index 66475bda6f72..1a13494bc7c7 100644
--- a/security/apparmor/lib.c
+++ b/security/apparmor/lib.c
@@ -129,36 +129,6 @@ void aa_info_message(const char *str)
 }
 
 /**
- * __aa_kvmalloc - do allocation preferring kmalloc but falling back to vmalloc
- * @size: how many bytes of memory are required
- * @flags: the type of memory to allocate (see kmalloc).
- *
- * Return: allocated buffer or NULL if failed
- *
- * It is possible that policy being loaded from the user is larger than
- * what can be allocated by kmalloc, in those cases fall back to vmalloc.
- */
-void *__aa_kvmalloc(size_t size, gfp_t flags)
-{
-	void *buffer = NULL;
-
-	if (size == 0)
-		return NULL;
-
-	/* do not attempt kmalloc if we need more than 16 pages at once */
-	if (size <= (16*PAGE_SIZE))
-		buffer = kmalloc(size, flags | GFP_KERNEL | __GFP_NORETRY |
-				 __GFP_NOWARN);
-	if (!buffer) {
-		if (flags & __GFP_ZERO)
-			buffer = vzalloc(size);
-		else
-			buffer = vmalloc(size);
-	}
-	return buffer;
-}
-
-/**
  * aa_policy_init - initialize a policy structure
  * @policy: policy to initialize  (NOT NULL)
  * @prefix: prefix name if any is required.  (MAYBE NULL)
diff --git a/security/apparmor/match.c b/security/apparmor/match.c
index eb0efef746f5..960c913381e2 100644
--- a/security/apparmor/match.c
+++ b/security/apparmor/match.c
@@ -88,7 +88,7 @@ static struct table_header *unpack_table(char *blob, size_t bsize)
 	if (bsize < tsize)
 		goto out;
 
-	table = kvzalloc(tsize);
+	table = kvzalloc(tsize, GFP_KERNEL);
 	if (table) {
 		table->td_id = th.td_id;
 		table->td_flags = th.td_flags;
diff --git a/security/apparmor/policy_unpack.c b/security/apparmor/policy_unpack.c
index 2e37c9c26bbd..f3422a91353c 100644
--- a/security/apparmor/policy_unpack.c
+++ b/security/apparmor/policy_unpack.c
@@ -487,7 +487,7 @@ static bool unpack_rlimits(struct aa_ext *e, struct aa_profile *profile)
 
 static void *kvmemdup(const void *src, size_t len)
 {
-	void *p = kvmalloc(len);
+	void *p = kvmalloc(len, GFP_KERNEL);
 
 	if (p)
 		memcpy(p, src, len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dcd1c12940e6..795c8269ef63 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -502,7 +502,7 @@ static struct kvm_memslots *kvm_alloc_memslots(void)
 	int i;
 	struct kvm_memslots *slots;
 
-	slots = kvm_kvzalloc(sizeof(struct kvm_memslots));
+	slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!slots)
 		return NULL;
 
@@ -685,18 +685,6 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	return ERR_PTR(r);
 }
 
-/*
- * Avoid using vmalloc for a small buffer.
- * Should not be used when the size is statically known.
- */
-void *kvm_kvzalloc(unsigned long size)
-{
-	if (size > PAGE_SIZE)
-		return vzalloc(size);
-	else
-		return kzalloc(size, GFP_KERNEL);
-}
-
 static void kvm_destroy_devices(struct kvm *kvm)
 {
 	struct kvm_device *dev, *tmp;
@@ -775,7 +763,7 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 {
 	unsigned long dirty_bytes = 2 * kvm_dirty_bitmap_bytes(memslot);
 
-	memslot->dirty_bitmap = kvm_kvzalloc(dirty_bytes);
+	memslot->dirty_bitmap = kvzalloc(dirty_bytes, GFP_KERNEL);
 	if (!memslot->dirty_bitmap)
 		return -ENOMEM;
 
@@ -995,7 +983,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			goto out_free;
 	}
 
-	slots = kvm_kvzalloc(sizeof(struct kvm_memslots));
+	slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!slots)
 		goto out_free;
 	memcpy(slots, __kvm_memslots(kvm, as_id), sizeof(struct kvm_memslots));
-- 
2.11.0

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 1/9] mm: introduce kv[mz]alloc helpers
@ 2017-01-30  9:49 ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, John Hubbard

From: Michal Hocko <mhocko@suse.com>

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures. This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them. In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be
broken and require GFP_NO{FS,IO} context which is not vmalloc compatible
in general (note that the page table allocation is GFP_KERNEL). Those
need to be fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported
gfp mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers. Existing ones would have to die
slowly.

Changes since v3
- add __GFP_HIGHMEM for the vmalloc fallback
- document gfp_mask in __vmalloc_node
- change ipc_alloc to use the library kvmalloc
- __aa_kvmalloc doesn't rely on GFP_NOIO anymore so we can drop and
use the library kvmalloc directly

Changes since v2
- s@WARN_ON@WARN_ON_ONCE@ as per Vlastimil
- do not fallback to vmalloc for size = PAGE_SIZE as per Vlastimil

Changes since v1
- define __vmalloc_node_flags for CONFIG_MMU=n

Cc: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca> # ext4 part
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/kvm/lapic.c              |  4 ++--
 arch/x86/kvm/page_track.c         |  4 ++--
 arch/x86/kvm/x86.c                |  4 ++--
 drivers/md/dm-stats.c             |  7 +-----
 fs/ext4/mballoc.c                 |  2 +-
 fs/ext4/super.c                   |  4 ++--
 fs/f2fs/f2fs.h                    | 20 -----------------
 fs/f2fs/file.c                    |  4 ++--
 fs/f2fs/segment.c                 | 14 ++++++------
 fs/seq_file.c                     | 16 +-------------
 include/linux/kvm_host.h          |  2 --
 include/linux/mm.h                | 14 ++++++++++++
 include/linux/vmalloc.h           |  1 +
 ipc/util.c                        |  7 +-----
 mm/nommu.c                        |  5 +++++
 mm/util.c                         | 45 +++++++++++++++++++++++++++++++++++++++
 mm/vmalloc.c                      |  9 +++++++-
 security/apparmor/apparmorfs.c    |  2 +-
 security/apparmor/include/lib.h   | 11 ----------
 security/apparmor/lib.c           | 30 --------------------------
 security/apparmor/match.c         |  2 +-
 security/apparmor/policy_unpack.c |  2 +-
 virt/kvm/kvm_main.c               | 18 +++-------------
 23 files changed, 100 insertions(+), 127 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 33b799fd3a6e..42562348bed2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -177,8 +177,8 @@ static void recalculate_apic_map(struct kvm *kvm)
 		if (kvm_apic_present(vcpu))
 			max_id = max(max_id, kvm_x2apic_id(vcpu->arch.apic));
 
-	new = kvm_kvzalloc(sizeof(struct kvm_apic_map) +
-	                   sizeof(struct kvm_lapic *) * ((u64)max_id + 1));
+	new = kvzalloc(sizeof(struct kvm_apic_map) +
+	                   sizeof(struct kvm_lapic *) * ((u64)max_id + 1), GFP_KERNEL);
 
 	if (!new)
 		goto out;
diff --git a/arch/x86/kvm/page_track.c b/arch/x86/kvm/page_track.c
index 4a1c13eaa518..d46663e655b0 100644
--- a/arch/x86/kvm/page_track.c
+++ b/arch/x86/kvm/page_track.c
@@ -38,8 +38,8 @@ int kvm_page_track_create_memslot(struct kvm_memory_slot *slot,
 	int  i;
 
 	for (i = 0; i < KVM_PAGE_TRACK_MAX; i++) {
-		slot->arch.gfn_track[i] = kvm_kvzalloc(npages *
-					    sizeof(*slot->arch.gfn_track[i]));
+		slot->arch.gfn_track[i] = kvzalloc(npages *
+					    sizeof(*slot->arch.gfn_track[i]), GFP_KERNEL);
 		if (!slot->arch.gfn_track[i])
 			goto track_free;
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 313f2cecbc57..07b0d17df9ea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8121,13 +8121,13 @@ int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
 				      slot->base_gfn, level) + 1;
 
 		slot->arch.rmap[i] =
-			kvm_kvzalloc(lpages * sizeof(*slot->arch.rmap[i]));
+			kvzalloc(lpages * sizeof(*slot->arch.rmap[i]), GFP_KERNEL);
 		if (!slot->arch.rmap[i])
 			goto out_free;
 		if (i == 0)
 			continue;
 
-		linfo = kvm_kvzalloc(lpages * sizeof(*linfo));
+		linfo = kvzalloc(lpages * sizeof(*linfo), GFP_KERNEL);
 		if (!linfo)
 			goto out_free;
 
diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c
index 38b05f23b96c..674f9a1686f7 100644
--- a/drivers/md/dm-stats.c
+++ b/drivers/md/dm-stats.c
@@ -146,12 +146,7 @@ static void *dm_kvzalloc(size_t alloc_size, int node)
 	if (!claim_shared_memory(alloc_size))
 		return NULL;
 
-	if (alloc_size <= KMALLOC_MAX_SIZE) {
-		p = kzalloc_node(alloc_size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN, node);
-		if (p)
-			return p;
-	}
-	p = vzalloc_node(alloc_size, node);
+	p = kvzalloc_node(alloc_size, GFP_KERNEL | __GFP_NOMEMALLOC, node);
 	if (p)
 		return p;
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d9fd184b049e..31a761dd76f5 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2381,7 +2381,7 @@ int ext4_mb_alloc_groupinfo(struct super_block *sb, ext4_group_t ngroups)
 		return 0;
 
 	size = roundup_pow_of_two(sizeof(*sbi->s_group_info) * size);
-	new_groupinfo = ext4_kvzalloc(size, GFP_KERNEL);
+	new_groupinfo = kvzalloc(size, GFP_KERNEL);
 	if (!new_groupinfo) {
 		ext4_msg(sb, KERN_ERR, "can't allocate buddy meta group");
 		return -ENOMEM;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9d15a6293124..e3f1ff04a85f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2110,7 +2110,7 @@ int ext4_alloc_flex_bg_array(struct super_block *sb, ext4_group_t ngroup)
 		return 0;
 
 	size = roundup_pow_of_two(size * sizeof(struct flex_groups));
-	new_groups = ext4_kvzalloc(size, GFP_KERNEL);
+	new_groups = kvzalloc(size, GFP_KERNEL);
 	if (!new_groups) {
 		ext4_msg(sb, KERN_ERR, "not enough memory for %d flex groups",
 			 size / (int) sizeof(struct flex_groups));
@@ -3844,7 +3844,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			goto failed_mount;
 		}
 	}
-	sbi->s_group_desc = ext4_kvmalloc(db_count *
+	sbi->s_group_desc = kvmalloc(db_count *
 					  sizeof(struct buffer_head *),
 					  GFP_KERNEL);
 	if (sbi->s_group_desc == NULL) {
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 4cce13d0a1a4..83dc25025277 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -1935,26 +1935,6 @@ static inline void *f2fs_kmalloc(struct f2fs_sb_info *sbi,
 	return kmalloc(size, flags);
 }
 
-static inline void *f2fs_kvmalloc(size_t size, gfp_t flags)
-{
-	void *ret;
-
-	ret = kmalloc(size, flags | __GFP_NOWARN);
-	if (!ret)
-		ret = __vmalloc(size, flags, PAGE_KERNEL);
-	return ret;
-}
-
-static inline void *f2fs_kvzalloc(size_t size, gfp_t flags)
-{
-	void *ret;
-
-	ret = kzalloc(size, flags | __GFP_NOWARN);
-	if (!ret)
-		ret = __vmalloc(size, flags | __GFP_ZERO, PAGE_KERNEL);
-	return ret;
-}
-
 #define get_inode_mode(i) \
 	((is_inode_flag_set(i, FI_ACL_MODE)) ? \
 	 (F2FS_I(i)->i_acl_mode) : ((i)->i_mode))
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 2752bcf98f95..82ca8c038ecf 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1014,11 +1014,11 @@ static int __exchange_data_block(struct inode *src_inode,
 	while (len) {
 		olen = min((pgoff_t)4 * ADDRS_PER_BLOCK, len);
 
-		src_blkaddr = f2fs_kvzalloc(sizeof(block_t) * olen, GFP_KERNEL);
+		src_blkaddr = kvzalloc(sizeof(block_t) * olen, GFP_KERNEL);
 		if (!src_blkaddr)
 			return -ENOMEM;
 
-		do_replace = f2fs_kvzalloc(sizeof(int) * olen, GFP_KERNEL);
+		do_replace = kvzalloc(sizeof(int) * olen, GFP_KERNEL);
 		if (!do_replace) {
 			kvfree(src_blkaddr);
 			return -ENOMEM;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index fb57ab9f6aa6..127d875a79f7 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -2351,13 +2351,13 @@ static int build_sit_info(struct f2fs_sb_info *sbi)
 
 	SM_I(sbi)->sit_info = sit_i;
 
-	sit_i->sentries = f2fs_kvzalloc(MAIN_SEGS(sbi) *
+	sit_i->sentries = kvzalloc(MAIN_SEGS(sbi) *
 					sizeof(struct seg_entry), GFP_KERNEL);
 	if (!sit_i->sentries)
 		return -ENOMEM;
 
 	bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi));
-	sit_i->dirty_sentries_bitmap = f2fs_kvzalloc(bitmap_size, GFP_KERNEL);
+	sit_i->dirty_sentries_bitmap = kvzalloc(bitmap_size, GFP_KERNEL);
 	if (!sit_i->dirty_sentries_bitmap)
 		return -ENOMEM;
 
@@ -2390,7 +2390,7 @@ static int build_sit_info(struct f2fs_sb_info *sbi)
 		return -ENOMEM;
 
 	if (sbi->segs_per_sec > 1) {
-		sit_i->sec_entries = f2fs_kvzalloc(MAIN_SECS(sbi) *
+		sit_i->sec_entries = kvzalloc(MAIN_SECS(sbi) *
 					sizeof(struct sec_entry), GFP_KERNEL);
 		if (!sit_i->sec_entries)
 			return -ENOMEM;
@@ -2441,12 +2441,12 @@ static int build_free_segmap(struct f2fs_sb_info *sbi)
 	SM_I(sbi)->free_info = free_i;
 
 	bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi));
-	free_i->free_segmap = f2fs_kvmalloc(bitmap_size, GFP_KERNEL);
+	free_i->free_segmap = kvmalloc(bitmap_size, GFP_KERNEL);
 	if (!free_i->free_segmap)
 		return -ENOMEM;
 
 	sec_bitmap_size = f2fs_bitmap_size(MAIN_SECS(sbi));
-	free_i->free_secmap = f2fs_kvmalloc(sec_bitmap_size, GFP_KERNEL);
+	free_i->free_secmap = kvmalloc(sec_bitmap_size, GFP_KERNEL);
 	if (!free_i->free_secmap)
 		return -ENOMEM;
 
@@ -2614,7 +2614,7 @@ static int init_victim_secmap(struct f2fs_sb_info *sbi)
 	struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
 	unsigned int bitmap_size = f2fs_bitmap_size(MAIN_SECS(sbi));
 
-	dirty_i->victim_secmap = f2fs_kvzalloc(bitmap_size, GFP_KERNEL);
+	dirty_i->victim_secmap = kvzalloc(bitmap_size, GFP_KERNEL);
 	if (!dirty_i->victim_secmap)
 		return -ENOMEM;
 	return 0;
@@ -2636,7 +2636,7 @@ static int build_dirty_segmap(struct f2fs_sb_info *sbi)
 	bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi));
 
 	for (i = 0; i < NR_DIRTY_TYPE; i++) {
-		dirty_i->dirty_segmap[i] = f2fs_kvzalloc(bitmap_size, GFP_KERNEL);
+		dirty_i->dirty_segmap[i] = kvzalloc(bitmap_size, GFP_KERNEL);
 		if (!dirty_i->dirty_segmap[i])
 			return -ENOMEM;
 	}
diff --git a/fs/seq_file.c b/fs/seq_file.c
index ca69fb99e41a..dc7c2be963ed 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -25,21 +25,7 @@ static void seq_set_overflow(struct seq_file *m)
 
 static void *seq_buf_alloc(unsigned long size)
 {
-	void *buf;
-	gfp_t gfp = GFP_KERNEL;
-
-	/*
-	 * For high order allocations, use __GFP_NORETRY to avoid oom-killing -
-	 * it's better to fall back to vmalloc() than to kill things.  For small
-	 * allocations, just use GFP_KERNEL which will oom kill, thus no need
-	 * for vmalloc fallback.
-	 */
-	if (size > PAGE_SIZE)
-		gfp |= __GFP_NORETRY | __GFP_NOWARN;
-	buf = kmalloc(size, gfp);
-	if (!buf && size > PAGE_SIZE)
-		buf = vmalloc(size);
-	return buf;
+	return kvmalloc(size, GFP_KERNEL);
 }
 
 /**
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1c5190dab2c1..00e6f93d1ee0 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -768,8 +768,6 @@ void kvm_arch_check_processor_compat(void *rtn);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
 
-void *kvm_kvzalloc(unsigned long size);
-
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 static inline struct kvm *kvm_arch_alloc_vm(void)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a6b9ab0945c4..0d9fdc0a2a7b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -498,6 +498,20 @@ static inline int is_vmalloc_or_module_addr(const void *x)
 }
 #endif
 
+extern void *kvmalloc_node(size_t size, gfp_t flags, int node);
+static inline void *kvmalloc(size_t size, gfp_t flags)
+{
+	return kvmalloc_node(size, flags, NUMA_NO_NODE);
+}
+static inline void *kvzalloc_node(size_t size, gfp_t flags, int node)
+{
+	return kvmalloc_node(size, flags | __GFP_ZERO, node);
+}
+static inline void *kvzalloc(size_t size, gfp_t flags)
+{
+	return kvmalloc(size, flags | __GFP_ZERO);
+}
+
 extern void kvfree(const void *addr);
 
 static inline atomic_t *compound_mapcount_ptr(struct page *page)
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index d68edffbf142..46991ad3ddd5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -80,6 +80,7 @@ extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
 			pgprot_t prot, unsigned long vm_flags, int node,
 			const void *caller);
+extern void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags);
 
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
diff --git a/ipc/util.c b/ipc/util.c
index 798cad18dd87..74c2adc62086 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -403,12 +403,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
  */
 void *ipc_alloc(int size)
 {
-	void *out;
-	if (size > PAGE_SIZE)
-		out = vmalloc(size);
-	else
-		out = kmalloc(size, GFP_KERNEL);
-	return out;
+	return kvmalloc(size, GFP_KERNEL);
 }
 
 /**
diff --git a/mm/nommu.c b/mm/nommu.c
index 215c62296028..bee76e6cd4e5 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -236,6 +236,11 @@ void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot)
 }
 EXPORT_SYMBOL(__vmalloc);
 
+void *__vmalloc_node_flags(unsigned long size, int node, gfp_t flags)
+{
+	return __vmalloc(size, flags, PAGE_KERNEL);
+}
+
 void *vmalloc_user(unsigned long size)
 {
 	void *ret;
diff --git a/mm/util.c b/mm/util.c
index 3cb2164f4099..ef72e2554edb 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -324,6 +324,51 @@ unsigned long vm_mmap(struct file *file, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_mmap);
 
+/**
+ * kvmalloc_node - attempt to allocate physically contiguous memory, but upon
+ * failure, fall back to non-contiguous (vmalloc) allocation.
+ * @size: size of the request.
+ * @flags: gfp mask for the allocation - must be compatible (superset) with GFP_KERNEL.
+ * @node: numa node to allocate from
+ *
+ * Uses kmalloc to get the memory but if the allocation fails then falls back
+ * to the vmalloc allocator. Use kvfree for freeing the memory.
+ *
+ * Reclaim modifiers - __GFP_NORETRY, __GFP_REPEAT and __GFP_NOFAIL are not supported
+ *
+ * Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.
+ */
+void *kvmalloc_node(size_t size, gfp_t flags, int node)
+{
+	gfp_t kmalloc_flags = flags;
+	void *ret;
+
+	/*
+	 * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
+	 * so the given set of flags has to be compatible.
+	 */
+	WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
+
+	/*
+	 * Make sure that larger requests are not too disruptive - no OOM
+	 * killer and no allocation failure warnings as we have a fallback
+	 */
+	if (size > PAGE_SIZE)
+		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
+
+	ret = kmalloc_node(size, kmalloc_flags, node);
+
+	/*
+	 * It doesn't really make sense to fallback to vmalloc for sub page
+	 * requests
+	 */
+	if (ret || size <= PAGE_SIZE)
+		return ret;
+
+	return __vmalloc_node_flags(size, node, flags | __GFP_HIGHMEM);
+}
+EXPORT_SYMBOL(kvmalloc_node);
+
 void kvfree(const void *addr)
 {
 	if (is_vmalloc_addr(addr))
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d89034a393f2..6c1aa2c68887 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *	Allocate enough pages to cover @size from the page level
  *	allocator with @gfp_mask flags.  Map them into contiguous
  *	kernel virtual space, using a pagetable protection of @prot.
+ *
+ *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ *	and __GFP_NOFAIL are not supported
+ *
+ *	Any use of gfp flags outside of GFP_KERNEL should be consulted
+ *	with mm people.
+ *
  */
 static void *__vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, pgprot_t prot,
@@ -1757,7 +1764,7 @@ void *__vmalloc(unsigned long size, gfp_t gfp_mask, pgprot_t prot)
 }
 EXPORT_SYMBOL(__vmalloc);
 
-static inline void *__vmalloc_node_flags(unsigned long size,
+void *__vmalloc_node_flags(unsigned long size,
 					int node, gfp_t flags)
 {
 	return __vmalloc_node(size, 1, flags, PAGE_KERNEL,
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index 41073f70eb41..be0b49897a67 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -98,7 +98,7 @@ static struct aa_loaddata *aa_simple_write_to_buffer(const char __user *userbuf,
 		return ERR_PTR(-ESPIPE);
 
 	/* freed by caller to simple_write_to_buffer */
-	data = kvmalloc(sizeof(*data) + alloc_size);
+	data = kvmalloc(sizeof(*data) + alloc_size, GFP_KERNEL);
 	if (data == NULL)
 		return ERR_PTR(-ENOMEM);
 	kref_init(&data->count);
diff --git a/security/apparmor/include/lib.h b/security/apparmor/include/lib.h
index 65ff492a9807..75733baa6702 100644
--- a/security/apparmor/include/lib.h
+++ b/security/apparmor/include/lib.h
@@ -64,17 +64,6 @@ char *aa_split_fqname(char *args, char **ns_name);
 const char *aa_splitn_fqname(const char *fqname, size_t n, const char **ns_name,
 			     size_t *ns_len);
 void aa_info_message(const char *str);
-void *__aa_kvmalloc(size_t size, gfp_t flags);
-
-static inline void *kvmalloc(size_t size)
-{
-	return __aa_kvmalloc(size, 0);
-}
-
-static inline void *kvzalloc(size_t size)
-{
-	return __aa_kvmalloc(size, __GFP_ZERO);
-}
 
 /**
  * aa_strneq - compare null terminated @str to a non null terminated substring
diff --git a/security/apparmor/lib.c b/security/apparmor/lib.c
index 66475bda6f72..1a13494bc7c7 100644
--- a/security/apparmor/lib.c
+++ b/security/apparmor/lib.c
@@ -129,36 +129,6 @@ void aa_info_message(const char *str)
 }
 
 /**
- * __aa_kvmalloc - do allocation preferring kmalloc but falling back to vmalloc
- * @size: how many bytes of memory are required
- * @flags: the type of memory to allocate (see kmalloc).
- *
- * Return: allocated buffer or NULL if failed
- *
- * It is possible that policy being loaded from the user is larger than
- * what can be allocated by kmalloc, in those cases fall back to vmalloc.
- */
-void *__aa_kvmalloc(size_t size, gfp_t flags)
-{
-	void *buffer = NULL;
-
-	if (size == 0)
-		return NULL;
-
-	/* do not attempt kmalloc if we need more than 16 pages at once */
-	if (size <= (16*PAGE_SIZE))
-		buffer = kmalloc(size, flags | GFP_KERNEL | __GFP_NORETRY |
-				 __GFP_NOWARN);
-	if (!buffer) {
-		if (flags & __GFP_ZERO)
-			buffer = vzalloc(size);
-		else
-			buffer = vmalloc(size);
-	}
-	return buffer;
-}
-
-/**
  * aa_policy_init - initialize a policy structure
  * @policy: policy to initialize  (NOT NULL)
  * @prefix: prefix name if any is required.  (MAYBE NULL)
diff --git a/security/apparmor/match.c b/security/apparmor/match.c
index eb0efef746f5..960c913381e2 100644
--- a/security/apparmor/match.c
+++ b/security/apparmor/match.c
@@ -88,7 +88,7 @@ static struct table_header *unpack_table(char *blob, size_t bsize)
 	if (bsize < tsize)
 		goto out;
 
-	table = kvzalloc(tsize);
+	table = kvzalloc(tsize, GFP_KERNEL);
 	if (table) {
 		table->td_id = th.td_id;
 		table->td_flags = th.td_flags;
diff --git a/security/apparmor/policy_unpack.c b/security/apparmor/policy_unpack.c
index 2e37c9c26bbd..f3422a91353c 100644
--- a/security/apparmor/policy_unpack.c
+++ b/security/apparmor/policy_unpack.c
@@ -487,7 +487,7 @@ static bool unpack_rlimits(struct aa_ext *e, struct aa_profile *profile)
 
 static void *kvmemdup(const void *src, size_t len)
 {
-	void *p = kvmalloc(len);
+	void *p = kvmalloc(len, GFP_KERNEL);
 
 	if (p)
 		memcpy(p, src, len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dcd1c12940e6..795c8269ef63 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -502,7 +502,7 @@ static struct kvm_memslots *kvm_alloc_memslots(void)
 	int i;
 	struct kvm_memslots *slots;
 
-	slots = kvm_kvzalloc(sizeof(struct kvm_memslots));
+	slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!slots)
 		return NULL;
 
@@ -685,18 +685,6 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	return ERR_PTR(r);
 }
 
-/*
- * Avoid using vmalloc for a small buffer.
- * Should not be used when the size is statically known.
- */
-void *kvm_kvzalloc(unsigned long size)
-{
-	if (size > PAGE_SIZE)
-		return vzalloc(size);
-	else
-		return kzalloc(size, GFP_KERNEL);
-}
-
 static void kvm_destroy_devices(struct kvm *kvm)
 {
 	struct kvm_device *dev, *tmp;
@@ -775,7 +763,7 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot)
 {
 	unsigned long dirty_bytes = 2 * kvm_dirty_bitmap_bytes(memslot);
 
-	memslot->dirty_bitmap = kvm_kvzalloc(dirty_bytes);
+	memslot->dirty_bitmap = kvzalloc(dirty_bytes, GFP_KERNEL);
 	if (!memslot->dirty_bitmap)
 		return -ENOMEM;
 
@@ -995,7 +983,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 			goto out_free;
 	}
 
-	slots = kvm_kvzalloc(sizeof(struct kvm_memslots));
+	slots = kvzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 	if (!slots)
 		goto out_free;
 	memcpy(slots, __kvm_memslots(kvm, as_id), sizeof(struct kvm_memslots));
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 2/9] mm: support __GFP_REPEAT in kvmalloc_node for >32kB
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
  2017-01-30  9:49 ` [PATCH 1/9] mm: introduce kv[mz]alloc helpers Michal Hocko
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30  9:49 ` [PATCH 3/9] rhashtable: simplify a strange allocation pattern Michal Hocko
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
vhost_vsock because it would really like to prefer kmalloc to the
vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
allocation to vmalloc") for more context. Michael Tsirkin has also
noted:
"
__GFP_REPEAT overhead is during allocation time.  Using vmalloc means all
accesses are slowed down.  Allocation is not on data path, accesses are.
"

The similar applies to other vhost_kvzalloc users.

Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are two
things to be careful about. First we should prevent from the OOM killer
and so have to involve __GFP_NORETRY by default and secondly override
__GFP_REPEAT for !costly order requests as the __GFP_REPEAT is ignored
for !costly orders.

Supporting __GFP_REPEAT like semantic for !costly request is possible
it would require changes in the page allocator. This is out of scope of
this patch.

This patch shouldn't introduce any functional change.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/vhost/net.c   |  9 +++------
 drivers/vhost/vhost.c | 15 +++------------
 drivers/vhost/vsock.c |  9 +++------
 mm/util.c             | 20 ++++++++++++++++----
 4 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c42e9c305134..f40e0150ca37 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -814,12 +814,9 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int i;
 
-	n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!n) {
-		n = vmalloc(sizeof *n);
-		if (!n)
-			return -ENOMEM;
-	}
+	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_REPEAT);
+	if (!n)
+		return -ENOMEM;
 	vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
 	if (!vqs) {
 		kvfree(n);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9f118388a5b7..596099c645ff 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -515,18 +515,9 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
 
-static void *vhost_kvzalloc(unsigned long size)
-{
-	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-
-	if (!n)
-		n = vzalloc(size);
-	return n;
-}
-
 struct vhost_umem *vhost_dev_reset_owner_prepare(void)
 {
-	return vhost_kvzalloc(sizeof(struct vhost_umem));
+	return kvzalloc(sizeof(struct vhost_umem), GFP_KERNEL);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
 
@@ -1190,7 +1181,7 @@ EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
 static struct vhost_umem *vhost_umem_alloc(void)
 {
-	struct vhost_umem *umem = vhost_kvzalloc(sizeof(*umem));
+	struct vhost_umem *umem = kvzalloc(sizeof(*umem), GFP_KERNEL);
 
 	if (!umem)
 		return NULL;
@@ -1216,7 +1207,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		return -EOPNOTSUPP;
 	if (mem.nregions > max_mem_regions)
 		return -E2BIG;
-	newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
+	newmem = kvzalloc(size + mem.nregions * sizeof(*m->regions), GFP_KERNEL);
 	if (!newmem)
 		return -ENOMEM;
 
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index ce5e63d2c66a..d403c647ba56 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -460,12 +460,9 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 	/* This struct is large and allocation could fail, fall back to vmalloc
 	 * if there is no other way.
 	 */
-	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!vsock) {
-		vsock = vmalloc(sizeof(*vsock));
-		if (!vsock)
-			return -ENOMEM;
-	}
+	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_REPEAT);
+	if (!vsock)
+		return -ENOMEM;
 
 	vqs = kmalloc_array(ARRAY_SIZE(vsock->vqs), sizeof(*vqs), GFP_KERNEL);
 	if (!vqs) {
diff --git a/mm/util.c b/mm/util.c
index ef72e2554edb..f23cf264e21d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -333,8 +333,10 @@ EXPORT_SYMBOL(vm_mmap);
  *
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
- *
- * Reclaim modifiers - __GFP_NORETRY, __GFP_REPEAT and __GFP_NOFAIL are not supported
+ * 
+ * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_REPEAT
+ * is supported only for large (>32kB) allocations, and it should be used only if
+ * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
  * Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.
  */
@@ -353,8 +355,18 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	 * Make sure that larger requests are not too disruptive - no OOM
 	 * killer and no allocation failure warnings as we have a fallback
 	 */
-	if (size > PAGE_SIZE)
-		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
+	if (size > PAGE_SIZE) {
+		kmalloc_flags |= __GFP_NOWARN;
+
+		/*
+		 * We have to override __GFP_REPEAT by __GFP_NORETRY for !costly
+		 * requests because there is no other way to tell the allocator
+		 * that we want to fail rather than retry endlessly.
+		 */
+		if (!(kmalloc_flags & __GFP_REPEAT) ||
+				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+			kmalloc_flags |= __GFP_NORETRY;
+	}
 
 	ret = kmalloc_node(size, kmalloc_flags, node);
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 2/9] mm: support __GFP_REPEAT in kvmalloc_node for >32kB
@ 2017-01-30  9:49 ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
vhost_vsock because it would really like to prefer kmalloc to the
vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
allocation to vmalloc") for more context. Michael Tsirkin has also
noted:
"
__GFP_REPEAT overhead is during allocation time.  Using vmalloc means all
accesses are slowed down.  Allocation is not on data path, accesses are.
"

The similar applies to other vhost_kvzalloc users.

Let's teach kvmalloc_node to handle __GFP_REPEAT properly. There are two
things to be careful about. First we should prevent from the OOM killer
and so have to involve __GFP_NORETRY by default and secondly override
__GFP_REPEAT for !costly order requests as the __GFP_REPEAT is ignored
for !costly orders.

Supporting __GFP_REPEAT like semantic for !costly request is possible
it would require changes in the page allocator. This is out of scope of
this patch.

This patch shouldn't introduce any functional change.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/vhost/net.c   |  9 +++------
 drivers/vhost/vhost.c | 15 +++------------
 drivers/vhost/vsock.c |  9 +++------
 mm/util.c             | 20 ++++++++++++++++----
 4 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index c42e9c305134..f40e0150ca37 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -814,12 +814,9 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int i;
 
-	n = kmalloc(sizeof *n, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!n) {
-		n = vmalloc(sizeof *n);
-		if (!n)
-			return -ENOMEM;
-	}
+	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_REPEAT);
+	if (!n)
+		return -ENOMEM;
 	vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
 	if (!vqs) {
 		kvfree(n);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9f118388a5b7..596099c645ff 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -515,18 +515,9 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_set_owner);
 
-static void *vhost_kvzalloc(unsigned long size)
-{
-	void *n = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-
-	if (!n)
-		n = vzalloc(size);
-	return n;
-}
-
 struct vhost_umem *vhost_dev_reset_owner_prepare(void)
 {
-	return vhost_kvzalloc(sizeof(struct vhost_umem));
+	return kvzalloc(sizeof(struct vhost_umem), GFP_KERNEL);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_reset_owner_prepare);
 
@@ -1190,7 +1181,7 @@ EXPORT_SYMBOL_GPL(vhost_vq_access_ok);
 
 static struct vhost_umem *vhost_umem_alloc(void)
 {
-	struct vhost_umem *umem = vhost_kvzalloc(sizeof(*umem));
+	struct vhost_umem *umem = kvzalloc(sizeof(*umem), GFP_KERNEL);
 
 	if (!umem)
 		return NULL;
@@ -1216,7 +1207,7 @@ static long vhost_set_memory(struct vhost_dev *d, struct vhost_memory __user *m)
 		return -EOPNOTSUPP;
 	if (mem.nregions > max_mem_regions)
 		return -E2BIG;
-	newmem = vhost_kvzalloc(size + mem.nregions * sizeof(*m->regions));
+	newmem = kvzalloc(size + mem.nregions * sizeof(*m->regions), GFP_KERNEL);
 	if (!newmem)
 		return -ENOMEM;
 
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index ce5e63d2c66a..d403c647ba56 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -460,12 +460,9 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 	/* This struct is large and allocation could fail, fall back to vmalloc
 	 * if there is no other way.
 	 */
-	vsock = kzalloc(sizeof(*vsock), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!vsock) {
-		vsock = vmalloc(sizeof(*vsock));
-		if (!vsock)
-			return -ENOMEM;
-	}
+	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_REPEAT);
+	if (!vsock)
+		return -ENOMEM;
 
 	vqs = kmalloc_array(ARRAY_SIZE(vsock->vqs), sizeof(*vqs), GFP_KERNEL);
 	if (!vqs) {
diff --git a/mm/util.c b/mm/util.c
index ef72e2554edb..f23cf264e21d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -333,8 +333,10 @@ EXPORT_SYMBOL(vm_mmap);
  *
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
- *
- * Reclaim modifiers - __GFP_NORETRY, __GFP_REPEAT and __GFP_NOFAIL are not supported
+ * 
+ * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_REPEAT
+ * is supported only for large (>32kB) allocations, and it should be used only if
+ * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
  * Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.
  */
@@ -353,8 +355,18 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	 * Make sure that larger requests are not too disruptive - no OOM
 	 * killer and no allocation failure warnings as we have a fallback
 	 */
-	if (size > PAGE_SIZE)
-		kmalloc_flags |= __GFP_NORETRY | __GFP_NOWARN;
+	if (size > PAGE_SIZE) {
+		kmalloc_flags |= __GFP_NOWARN;
+
+		/*
+		 * We have to override __GFP_REPEAT by __GFP_NORETRY for !costly
+		 * requests because there is no other way to tell the allocator
+		 * that we want to fail rather than retry endlessly.
+		 */
+		if (!(kmalloc_flags & __GFP_REPEAT) ||
+				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+			kmalloc_flags |= __GFP_NORETRY;
+	}
 
 	ret = kmalloc_node(size, kmalloc_flags, node);
 
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 3/9] rhashtable: simplify a strange allocation pattern
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
  2017-01-30  9:49 ` [PATCH 1/9] mm: introduce kv[mz]alloc helpers Michal Hocko
  2017-01-30  9:49 ` [PATCH 2/9] mm: support __GFP_REPEAT in kvmalloc_node for >32kB Michal Hocko
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 14:04   ` Vlastimil Babka
  2017-01-30  9:49 ` [PATCH 4/9] ila: " Michal Hocko
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Tom Herbert, Eric Dumazet

From: Michal Hocko <mhocko@suse.com>

alloc_bucket_locks allocation pattern is quite unusual. We are
preferring vmalloc when CONFIG_NUMA is enabled. The rationale is that
vmalloc will respect the memory policy of the current process and so the
backing memory will get distributed over multiple nodes if the requester
is configured properly. At least that is the intention, in reality
rhastable is shrunk and expanded from a kernel worker so no mempolicy
can be assumed.

Let's just simplify the code and use kvmalloc helper, which is a
transparent way to use kmalloc with vmalloc fallback, if the caller
is allowed to block and use the flag otherwise.

Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 lib/rhashtable.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 32d0ad058380..1a487ea70829 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -77,16 +77,9 @@ static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl,
 	size = min_t(unsigned int, size, tbl->size >> 1);
 
 	if (sizeof(spinlock_t) != 0) {
-		tbl->locks = NULL;
-#ifdef CONFIG_NUMA
-		if (size * sizeof(spinlock_t) > PAGE_SIZE &&
-		    gfp == GFP_KERNEL)
-			tbl->locks = vmalloc(size * sizeof(spinlock_t));
-#endif
-		if (gfp != GFP_KERNEL)
-			gfp |= __GFP_NOWARN | __GFP_NORETRY;
-
-		if (!tbl->locks)
+		if (gfpflags_allow_blocking(gfp))
+			tbl->locks = kvmalloc(size * sizeof(spinlock_t), gfp);
+		else
 			tbl->locks = kmalloc_array(size, sizeof(spinlock_t),
 						   gfp);
 		if (!tbl->locks)
-- 
2.11.0

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 3/9] rhashtable: simplify a strange allocation pattern
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 14:04   ` Vlastimil Babka
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Tom Herbert, Eric Dumazet

From: Michal Hocko <mhocko@suse.com>

alloc_bucket_locks allocation pattern is quite unusual. We are
preferring vmalloc when CONFIG_NUMA is enabled. The rationale is that
vmalloc will respect the memory policy of the current process and so the
backing memory will get distributed over multiple nodes if the requester
is configured properly. At least that is the intention, in reality
rhastable is shrunk and expanded from a kernel worker so no mempolicy
can be assumed.

Let's just simplify the code and use kvmalloc helper, which is a
transparent way to use kmalloc with vmalloc fallback, if the caller
is allowed to block and use the flag otherwise.

Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 lib/rhashtable.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 32d0ad058380..1a487ea70829 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -77,16 +77,9 @@ static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl,
 	size = min_t(unsigned int, size, tbl->size >> 1);
 
 	if (sizeof(spinlock_t) != 0) {
-		tbl->locks = NULL;
-#ifdef CONFIG_NUMA
-		if (size * sizeof(spinlock_t) > PAGE_SIZE &&
-		    gfp == GFP_KERNEL)
-			tbl->locks = vmalloc(size * sizeof(spinlock_t));
-#endif
-		if (gfp != GFP_KERNEL)
-			gfp |= __GFP_NOWARN | __GFP_NORETRY;
-
-		if (!tbl->locks)
+		if (gfpflags_allow_blocking(gfp))
+			tbl->locks = kvmalloc(size * sizeof(spinlock_t), gfp);
+		else
 			tbl->locks = kmalloc_array(size, sizeof(spinlock_t),
 						   gfp);
 		if (!tbl->locks)
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 4/9] ila: simplify a strange allocation pattern
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
                   ` (2 preceding siblings ...)
  2017-01-30  9:49 ` [PATCH 3/9] rhashtable: simplify a strange allocation pattern Michal Hocko
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 15:21   ` Vlastimil Babka
  2017-01-30  9:49 ` [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants Michal Hocko
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Tom Herbert, Eric Dumazet

From: Michal Hocko <mhocko@suse.com>

alloc_ila_locks seemed to c&p from alloc_bucket_locks allocation
pattern which is quite unusual. The default allocation size is 320 *
sizeof(spinlock_t) which is sub page unless lockdep is enabled when the
performance benefit is really questionable and not worth the subtle code
IMHO. Also note that the context when we call ila_init_net (modprobe or
a task creating a net namespace) has to be properly configured.

Let's just simplify the code and use kvmalloc helper which is a
transparent way to use kmalloc with vmalloc fallback.

Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 net/ipv6/ila/ila_xlat.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index af8f52ee7180..2fd5ca151dcf 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -41,13 +41,7 @@ static int alloc_ila_locks(struct ila_net *ilan)
 	size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU);
 
 	if (sizeof(spinlock_t) != 0) {
-#ifdef CONFIG_NUMA
-		if (size * sizeof(spinlock_t) > PAGE_SIZE)
-			ilan->locks = vmalloc(size * sizeof(spinlock_t));
-		else
-#endif
-		ilan->locks = kmalloc_array(size, sizeof(spinlock_t),
-					    GFP_KERNEL);
+		ilan->locks = kvmalloc(size * sizeof(spinlock_t), GFP_KERNEL);
 		if (!ilan->locks)
 			return -ENOMEM;
 		for (i = 0; i < size; i++)
-- 
2.11.0

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 4/9] ila: simplify a strange allocation pattern
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 15:21   ` Vlastimil Babka
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Tom Herbert, Eric Dumazet

From: Michal Hocko <mhocko@suse.com>

alloc_ila_locks seemed to c&p from alloc_bucket_locks allocation
pattern which is quite unusual. The default allocation size is 320 *
sizeof(spinlock_t) which is sub page unless lockdep is enabled when the
performance benefit is really questionable and not worth the subtle code
IMHO. Also note that the context when we call ila_init_net (modprobe or
a task creating a net namespace) has to be properly configured.

Let's just simplify the code and use kvmalloc helper which is a
transparent way to use kmalloc with vmalloc fallback.

Cc: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 net/ipv6/ila/ila_xlat.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index af8f52ee7180..2fd5ca151dcf 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -41,13 +41,7 @@ static int alloc_ila_locks(struct ila_net *ilan)
 	size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU);
 
 	if (sizeof(spinlock_t) != 0) {
-#ifdef CONFIG_NUMA
-		if (size * sizeof(spinlock_t) > PAGE_SIZE)
-			ilan->locks = vmalloc(size * sizeof(spinlock_t));
-		else
-#endif
-		ilan->locks = kmalloc_array(size, sizeof(spinlock_t),
-					    GFP_KERNEL);
+		ilan->locks = kvmalloc(size * sizeof(spinlock_t), GFP_KERNEL);
 		if (!ilan->locks)
 			return -ENOMEM;
 		for (i = 0; i < size; i++)
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
                   ` (3 preceding siblings ...)
  2017-01-30  9:49 ` [PATCH 4/9] ila: " Michal Hocko
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 10:21   ` Leon Romanovsky
                     ` (2 more replies)
  2017-01-30  9:49 ` [PATCH 6/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant Michal Hocko
                   ` (4 subsequent siblings)
  9 siblings, 3 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Martin Schwidefsky,
	Heiko Carstens, Herbert Xu, Anton Vorontsov, Colin Cross,
	Kees Cook, Tony Luck, Rafael J. Wysocki, Ben Skeggs,
	Kent Overstreet, Santosh Raspatur, Hariprasad S, Yishai Hadas,
	Oleg Drokin, Yan, Zheng, Alexei Starovoitov, Eric Dumazet,
	netdev

From: Michal Hocko <mhocko@suse.com>

There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are usually
not considering all the aspects of the memory allocator. E.g. allocation
requests <= 32kB (with 4kB pages) are basically never failing and invoke
OOM killer to satisfy the allocation. This sounds too disruptive for
something that has a reasonable fallback - the vmalloc. On the other
hand those requests might fallback to vmalloc even when the memory
allocator would succeed after several more reclaim/compaction attempts
previously. There is no guarantee something like that happens though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Changes since v1
- add kvmalloc_array - this might silently fix some overflow issues
  because most users simply didn't check the overflow for the vmalloc
  fallback.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: netdev@vger.kernel.org
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/s390/kvm/kvm-s390.c                           | 10 ++-----
 crypto/lzo.c                                       |  4 +--
 drivers/acpi/apei/erst.c                           |  8 ++----
 drivers/char/agp/generic.c                         |  8 +-----
 drivers/gpu/drm/nouveau/nouveau_gem.c              |  4 +--
 drivers/md/bcache/util.h                           | 12 ++------
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h    |  3 --
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c | 29 +++----------------
 drivers/net/ethernet/chelsio/cxgb3/l2t.c           |  8 +-----
 drivers/net/ethernet/chelsio/cxgb3/l2t.h           |  1 -
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c      | 12 ++++----
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |  3 --
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 10 +++----
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c |  8 +++---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 31 ++++----------------
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c  | 13 ++++-----
 drivers/net/ethernet/chelsio/cxgb4/l2t.c           |  2 +-
 drivers/net/ethernet/chelsio/cxgb4/sched.c         | 12 ++++----
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |  9 ++----
 drivers/net/ethernet/mellanox/mlx4/mr.c            |  9 ++----
 drivers/nvdimm/dimm_devs.c                         |  5 +---
 .../staging/lustre/lnet/libcfs/linux/linux-mem.c   | 11 +-------
 drivers/xen/evtchn.c                               | 14 +--------
 fs/btrfs/ctree.c                                   |  9 ++----
 fs/btrfs/ioctl.c                                   |  9 ++----
 fs/btrfs/send.c                                    | 27 ++++++------------
 fs/ceph/file.c                                     |  9 ++----
 fs/select.c                                        |  5 +---
 fs/xattr.c                                         | 27 ++++++------------
 include/linux/mlx5/driver.h                        |  7 +----
 include/linux/mm.h                                 |  8 ++++++
 lib/iov_iter.c                                     |  5 +---
 mm/frame_vector.c                                  |  5 +---
 net/ipv4/inet_hashtables.c                         |  6 +---
 net/ipv4/tcp_metrics.c                             |  5 +---
 net/mpls/af_mpls.c                                 |  5 +---
 net/netfilter/x_tables.c                           | 21 +++-----------
 net/netfilter/xt_recent.c                          |  5 +---
 net/sched/sch_choke.c                              |  5 +---
 net/sched/sch_fq_codel.c                           | 26 ++++-------------
 net/sched/sch_hhf.c                                | 33 ++++++----------------
 net/sched/sch_netem.c                              |  6 +---
 net/sched/sch_sfq.c                                |  6 +---
 security/keys/keyctl.c                             | 22 ++++-----------
 44 files changed, 127 insertions(+), 350 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index dabd3b15bf11..ecc09b6648f3 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1159,10 +1159,7 @@ static long kvm_s390_get_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 	if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX)
 		return -EINVAL;
 
-	keys = kmalloc_array(args->count, sizeof(uint8_t),
-			     GFP_KERNEL | __GFP_NOWARN);
-	if (!keys)
-		keys = vmalloc(sizeof(uint8_t) * args->count);
+	keys = kvmalloc_array(args->count, sizeof(uint8_t), GFP_KERNEL);
 	if (!keys)
 		return -ENOMEM;
 
@@ -1204,10 +1201,7 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 	if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX)
 		return -EINVAL;
 
-	keys = kmalloc_array(args->count, sizeof(uint8_t),
-			     GFP_KERNEL | __GFP_NOWARN);
-	if (!keys)
-		keys = vmalloc(sizeof(uint8_t) * args->count);
+	keys = kvmalloc_array(args->count, sizeof(uint8_t), GFP_KERNEL);
 	if (!keys)
 		return -ENOMEM;
 
diff --git a/crypto/lzo.c b/crypto/lzo.c
index 168df784da84..218567d717d6 100644
--- a/crypto/lzo.c
+++ b/crypto/lzo.c
@@ -32,9 +32,7 @@ static void *lzo_alloc_ctx(struct crypto_scomp *tfm)
 {
 	void *ctx;
 
-	ctx = kmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL | __GFP_NOWARN);
-	if (!ctx)
-		ctx = vmalloc(LZO1X_MEM_COMPRESS);
+	ctx = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
index ec4f507b524f..a2898df61744 100644
--- a/drivers/acpi/apei/erst.c
+++ b/drivers/acpi/apei/erst.c
@@ -513,7 +513,7 @@ static int __erst_record_id_cache_add_one(void)
 	if (i < erst_record_id_cache.len)
 		goto retry;
 	if (erst_record_id_cache.len >= erst_record_id_cache.size) {
-		int new_size, alloc_size;
+		int new_size;
 		u64 *new_entries;
 
 		new_size = erst_record_id_cache.size * 2;
@@ -524,11 +524,7 @@ static int __erst_record_id_cache_add_one(void)
 				pr_warn(FW_WARN "too many record IDs!\n");
 			return 0;
 		}
-		alloc_size = new_size * sizeof(entries[0]);
-		if (alloc_size < PAGE_SIZE)
-			new_entries = kmalloc(alloc_size, GFP_KERNEL);
-		else
-			new_entries = vmalloc(alloc_size);
+		new_entries = kvmalloc(new_size * sizeof(entries[0]), GFP_KERNEL);
 		if (!new_entries)
 			return -ENOMEM;
 		memcpy(new_entries, entries,
diff --git a/drivers/char/agp/generic.c b/drivers/char/agp/generic.c
index f002fa5d1887..bdf418cac8ef 100644
--- a/drivers/char/agp/generic.c
+++ b/drivers/char/agp/generic.c
@@ -88,13 +88,7 @@ static int agp_get_key(void)
 
 void agp_alloc_page_array(size_t size, struct agp_memory *mem)
 {
-	mem->pages = NULL;
-
-	if (size <= 2*PAGE_SIZE)
-		mem->pages = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (mem->pages == NULL) {
-		mem->pages = vmalloc(size);
-	}
+	mem->pages = kvmalloc(size, GFP_KERNEL);
 }
 EXPORT_SYMBOL(agp_alloc_page_array);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index 201b52b750dd..77dd73ff126f 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -568,9 +568,7 @@ u_memcpya(uint64_t user, unsigned nmemb, unsigned size)
 
 	size *= nmemb;
 
-	mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!mem)
-		mem = vmalloc(size);
+	mem = kvmalloc(size, GFP_KERNEL);
 	if (!mem)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index cf2cbc211d83..d00bcb64d3a8 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -43,11 +43,7 @@ struct closure;
 	(heap)->used = 0;						\
 	(heap)->size = (_size);						\
 	_bytes = (heap)->size * sizeof(*(heap)->data);			\
-	(heap)->data = NULL;						\
-	if (_bytes < KMALLOC_MAX_SIZE)					\
-		(heap)->data = kmalloc(_bytes, (gfp));			\
-	if ((!(heap)->data) && ((gfp) & GFP_KERNEL))			\
-		(heap)->data = vmalloc(_bytes);				\
+	(heap)->data = kvmalloc(_bytes, (gfp) & GFP_KERNEL);		\
 	(heap)->data;							\
 })
 
@@ -136,12 +132,8 @@ do {									\
 									\
 	(fifo)->mask = _allocated_size - 1;				\
 	(fifo)->front = (fifo)->back = 0;				\
-	(fifo)->data = NULL;						\
 									\
-	if (_bytes < KMALLOC_MAX_SIZE)					\
-		(fifo)->data = kmalloc(_bytes, (gfp));			\
-	if ((!(fifo)->data) && ((gfp) & GFP_KERNEL))			\
-		(fifo)->data = vmalloc(_bytes);				\
+	(fifo)->data = kvmalloc(_bytes, (gfp) & GFP_KERNEL);		\
 	(fifo)->data;							\
 })
 
diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h
index 920d918ed193..f04e81f33795 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h
@@ -41,9 +41,6 @@
 
 #define VALIDATE_TID 1
 
-void *cxgb_alloc_mem(unsigned long size);
-void cxgb_free_mem(void *addr);
-
 /*
  * Map an ATID or STID to their entries in the corresponding TID tables.
  */
diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
index 76684dcb874c..fa81445e334c 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
@@ -1152,27 +1152,6 @@ static void cxgb_redirect(struct dst_entry *old, struct dst_entry *new,
 }
 
 /*
- * Allocate a chunk of memory using kmalloc or, if that fails, vmalloc.
- * The allocated memory is cleared.
- */
-void *cxgb_alloc_mem(unsigned long size)
-{
-	void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!p)
-		p = vzalloc(size);
-	return p;
-}
-
-/*
- * Free memory allocated through t3_alloc_mem().
- */
-void cxgb_free_mem(void *addr)
-{
-	kvfree(addr);
-}
-
-/*
  * Allocate and initialize the TID tables.  Returns 0 on success.
  */
 static int init_tid_tabs(struct tid_info *t, unsigned int ntids,
@@ -1182,7 +1161,7 @@ static int init_tid_tabs(struct tid_info *t, unsigned int ntids,
 	unsigned long size = ntids * sizeof(*t->tid_tab) +
 	    natids * sizeof(*t->atid_tab) + nstids * sizeof(*t->stid_tab);
 
-	t->tid_tab = cxgb_alloc_mem(size);
+	t->tid_tab = kvzalloc(size, GFP_KERNEL);
 	if (!t->tid_tab)
 		return -ENOMEM;
 
@@ -1218,7 +1197,7 @@ static int init_tid_tabs(struct tid_info *t, unsigned int ntids,
 
 static void free_tid_maps(struct tid_info *t)
 {
-	cxgb_free_mem(t->tid_tab);
+	kvfree(t->tid_tab);
 }
 
 static inline void add_adapter(struct adapter *adap)
@@ -1293,7 +1272,7 @@ int cxgb3_offload_activate(struct adapter *adapter)
 	return 0;
 
 out_free_l2t:
-	t3_free_l2t(l2td);
+	kvfree(l2td);
 out_free:
 	kfree(t);
 	return err;
@@ -1302,7 +1281,7 @@ int cxgb3_offload_activate(struct adapter *adapter)
 static void clean_l2_data(struct rcu_head *head)
 {
 	struct l2t_data *d = container_of(head, struct l2t_data, rcu_head);
-	t3_free_l2t(d);
+	kvfree(d);
 }
 
 
diff --git a/drivers/net/ethernet/chelsio/cxgb3/l2t.c b/drivers/net/ethernet/chelsio/cxgb3/l2t.c
index 5f226eda8cd6..f93160271917 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/l2t.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/l2t.c
@@ -444,7 +444,7 @@ struct l2t_data *t3_init_l2t(unsigned int l2t_capacity)
 	struct l2t_data *d;
 	int i, size = sizeof(*d) + l2t_capacity * sizeof(struct l2t_entry);
 
-	d = cxgb_alloc_mem(size);
+	d = kvzalloc(size, GFP_KERNEL);
 	if (!d)
 		return NULL;
 
@@ -462,9 +462,3 @@ struct l2t_data *t3_init_l2t(unsigned int l2t_capacity)
 	}
 	return d;
 }
-
-void t3_free_l2t(struct l2t_data *d)
-{
-	cxgb_free_mem(d);
-}
-
diff --git a/drivers/net/ethernet/chelsio/cxgb3/l2t.h b/drivers/net/ethernet/chelsio/cxgb3/l2t.h
index 8cffcdfd5678..c2fd323c4078 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/l2t.h
+++ b/drivers/net/ethernet/chelsio/cxgb3/l2t.h
@@ -115,7 +115,6 @@ int t3_l2t_send_slow(struct t3cdev *dev, struct sk_buff *skb,
 		     struct l2t_entry *e);
 void t3_l2t_send_event(struct t3cdev *dev, struct l2t_entry *e);
 struct l2t_data *t3_init_l2t(unsigned int l2t_capacity);
-void t3_free_l2t(struct l2t_data *d);
 
 int cxgb3_ofld_send(struct t3cdev *dev, struct sk_buff *skb);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
index 7ad43af6bde1..3103ef9b561d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
@@ -290,8 +290,8 @@ struct clip_tbl *t4_init_clip_tbl(unsigned int clipt_start,
 	if (clipt_size < CLIPT_MIN_HASH_BUCKETS)
 		return NULL;
 
-	ctbl = t4_alloc_mem(sizeof(*ctbl) +
-			    clipt_size*sizeof(struct list_head));
+	ctbl = kvzalloc(sizeof(*ctbl) +
+			    clipt_size*sizeof(struct list_head), GFP_KERNEL);
 	if (!ctbl)
 		return NULL;
 
@@ -305,9 +305,9 @@ struct clip_tbl *t4_init_clip_tbl(unsigned int clipt_start,
 	for (i = 0; i < ctbl->clipt_size; ++i)
 		INIT_LIST_HEAD(&ctbl->hash_list[i]);
 
-	cl_list = t4_alloc_mem(clipt_size*sizeof(struct clip_entry));
+	cl_list = kvzalloc(clipt_size*sizeof(struct clip_entry), GFP_KERNEL);
 	if (!cl_list) {
-		t4_free_mem(ctbl);
+		kvfree(ctbl);
 		return NULL;
 	}
 	ctbl->cl_list = (void *)cl_list;
@@ -326,8 +326,8 @@ void t4_cleanup_clip_tbl(struct adapter *adap)
 
 	if (ctbl) {
 		if (ctbl->cl_list)
-			t4_free_mem(ctbl->cl_list);
-		t4_free_mem(ctbl);
+			kvfree(ctbl->cl_list);
+		kvfree(ctbl);
 	}
 }
 EXPORT_SYMBOL(t4_cleanup_clip_tbl);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index ccb455f14d08..129bd41a051c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1296,8 +1296,6 @@ extern const char cxgb4_driver_version[];
 void t4_os_portmod_changed(const struct adapter *adap, int port_id);
 void t4_os_link_changed(struct adapter *adap, int port_id, int link_stat);
 
-void *t4_alloc_mem(size_t size);
-
 void t4_free_sge_resources(struct adapter *adap);
 void t4_free_ofld_rxqs(struct adapter *adap, int n, struct sge_ofld_rxq *q);
 irq_handler_t t4_intr_handler(struct adapter *adap);
@@ -1670,7 +1668,6 @@ int t4_sched_params(struct adapter *adapter, int type, int level, int mode,
 		    int rateunit, int ratemode, int channel, int class,
 		    int minrate, int maxrate, int weight, int pktsize);
 void t4_sge_decode_idma_state(struct adapter *adapter, int state);
-void t4_free_mem(void *addr);
 void t4_idma_monitor_init(struct adapter *adapter,
 			  struct sge_idma_monitor_state *idma);
 void t4_idma_monitor(struct adapter *adapter,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index f6e739da7bb7..1fa34b009891 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2634,7 +2634,7 @@ static ssize_t mem_read(struct file *file, char __user *buf, size_t count,
 	if (count > avail - pos)
 		count = avail - pos;
 
-	data = t4_alloc_mem(count);
+	data = kvzalloc(count, GFP_KERNEL);
 	if (!data)
 		return -ENOMEM;
 
@@ -2642,12 +2642,12 @@ static ssize_t mem_read(struct file *file, char __user *buf, size_t count,
 	ret = t4_memory_rw(adap, 0, mem, pos, count, data, T4_MEMORY_READ);
 	spin_unlock(&adap->win0_lock);
 	if (ret) {
-		t4_free_mem(data);
+		kvfree(data);
 		return ret;
 	}
 	ret = copy_to_user(buf, data, count);
 
-	t4_free_mem(data);
+	kvfree(data);
 	if (ret)
 		return -EFAULT;
 
@@ -2753,7 +2753,7 @@ static ssize_t blocked_fl_read(struct file *filp, char __user *ubuf,
 		       adap->sge.egr_sz, adap->sge.blocked_fl);
 	len += sprintf(buf + len, "\n");
 	size = simple_read_from_buffer(ubuf, count, ppos, buf, len);
-	t4_free_mem(buf);
+	kvfree(buf);
 	return size;
 }
 
@@ -2773,7 +2773,7 @@ static ssize_t blocked_fl_write(struct file *filp, const char __user *ubuf,
 		return err;
 
 	bitmap_copy(adap->sge.blocked_fl, t, adap->sge.egr_sz);
-	t4_free_mem(t);
+	kvfree(t);
 	return count;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
index 02f80febeb91..0ba7866c8259 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
@@ -969,7 +969,7 @@ static int get_eeprom(struct net_device *dev, struct ethtool_eeprom *e,
 {
 	int i, err = 0;
 	struct adapter *adapter = netdev2adap(dev);
-	u8 *buf = t4_alloc_mem(EEPROMSIZE);
+	u8 *buf = kvzalloc(EEPROMSIZE, GFP_KERNEL);
 
 	if (!buf)
 		return -ENOMEM;
@@ -980,7 +980,7 @@ static int get_eeprom(struct net_device *dev, struct ethtool_eeprom *e,
 
 	if (!err)
 		memcpy(data, buf + e->offset, e->len);
-	t4_free_mem(buf);
+	kvfree(buf);
 	return err;
 }
 
@@ -1009,7 +1009,7 @@ static int set_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom,
 	if (aligned_offset != eeprom->offset || aligned_len != eeprom->len) {
 		/* RMW possibly needed for first or last words.
 		 */
-		buf = t4_alloc_mem(aligned_len);
+		buf = kvzalloc(aligned_len, GFP_KERNEL);
 		if (!buf)
 			return -ENOMEM;
 		err = eeprom_rd_phys(adapter, aligned_offset, (u32 *)buf);
@@ -1037,7 +1037,7 @@ static int set_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom,
 		err = t4_seeprom_wp(adapter, true);
 out:
 	if (buf != data)
-		t4_free_mem(buf);
+		kvfree(buf);
 	return err;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 49e000ebd2b9..a9f55ec139bb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -887,27 +887,6 @@ static int setup_sge_queues(struct adapter *adap)
 	return err;
 }
 
-/*
- * Allocate a chunk of memory using kmalloc or, if that fails, vmalloc.
- * The allocated memory is cleared.
- */
-void *t4_alloc_mem(size_t size)
-{
-	void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!p)
-		p = vzalloc(size);
-	return p;
-}
-
-/*
- * Free memory allocated through alloc_mem().
- */
-void t4_free_mem(void *addr)
-{
-	kvfree(addr);
-}
-
 static u16 cxgb_select_queue(struct net_device *dev, struct sk_buff *skb,
 			     void *accel_priv, select_queue_fallback_t fallback)
 {
@@ -1306,7 +1285,7 @@ static int tid_init(struct tid_info *t)
 	       max_ftids * sizeof(*t->ftid_tab) +
 	       ftid_bmap_size * sizeof(long);
 
-	t->tid_tab = t4_alloc_mem(size);
+	t->tid_tab = kvzalloc(size, GFP_KERNEL);
 	if (!t->tid_tab)
 		return -ENOMEM;
 
@@ -3451,7 +3430,7 @@ static int adap_init0(struct adapter *adap)
 		/* allocate memory to read the header of the firmware on the
 		 * card
 		 */
-		card_fw = t4_alloc_mem(sizeof(*card_fw));
+		card_fw = kvzalloc(sizeof(*card_fw), GFP_KERNEL);
 
 		/* Get FW from from /lib/firmware/ */
 		ret = request_firmware(&fw, fw_info->fw_mod_name,
@@ -3471,7 +3450,7 @@ static int adap_init0(struct adapter *adap)
 
 		/* Cleaning up */
 		release_firmware(fw);
-		t4_free_mem(card_fw);
+		kvfree(card_fw);
 
 		if (ret < 0)
 			goto bye;
@@ -4467,9 +4446,9 @@ static void free_some_resources(struct adapter *adapter)
 {
 	unsigned int i;
 
-	t4_free_mem(adapter->l2t);
+	kvfree(adapter->l2t);
 	t4_cleanup_sched(adapter);
-	t4_free_mem(adapter->tids.tid_tab);
+	kvfree(adapter->tids.tid_tab);
 	cxgb4_cleanup_tc_u32(adapter);
 	kfree(adapter->sge.egr_map);
 	kfree(adapter->sge.ingr_map);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
index 52af62e0ecb6..127f91a747b1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
@@ -432,9 +432,9 @@ void cxgb4_cleanup_tc_u32(struct adapter *adap)
 	for (i = 0; i < t->size; i++) {
 		struct cxgb4_link *link = &t->table[i];
 
-		t4_free_mem(link->tid_map);
+		kvfree(link->tid_map);
 	}
-	t4_free_mem(adap->tc_u32);
+	kvfree(adap->tc_u32);
 }
 
 struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
@@ -446,8 +446,7 @@ struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
 	if (!size)
 		return NULL;
 
-	t = t4_alloc_mem(sizeof(*t) +
-			 (size * sizeof(struct cxgb4_link)));
+	t = kvzalloc(sizeof(*t) + (size * sizeof(struct cxgb4_link)), GFP_KERNEL);
 	if (!t)
 		return NULL;
 
@@ -460,7 +459,7 @@ struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
 
 		max_tids = adap->tids.nftids;
 		bmap_size = BITS_TO_LONGS(max_tids);
-		link->tid_map = t4_alloc_mem(sizeof(unsigned long) * bmap_size);
+		link->tid_map = kvzalloc(sizeof(unsigned long) * bmap_size, GFP_KERNEL);
 		if (!link->tid_map)
 			goto out_no_mem;
 		bitmap_zero(link->tid_map, max_tids);
@@ -473,11 +472,11 @@ struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
 		struct cxgb4_link *link = &t->table[i];
 
 		if (link->tid_map)
-			t4_free_mem(link->tid_map);
+			kvfree(link->tid_map);
 	}
 
 	if (t)
-		t4_free_mem(t);
+		kvfree(t);
 
 	return NULL;
 }
diff --git a/drivers/net/ethernet/chelsio/cxgb4/l2t.c b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
index 60a26037a1c6..fb593a1a751e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/l2t.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
@@ -646,7 +646,7 @@ struct l2t_data *t4_init_l2t(unsigned int l2t_start, unsigned int l2t_end)
 	if (l2t_size < L2T_MIN_HASH_BUCKETS)
 		return NULL;
 
-	d = t4_alloc_mem(sizeof(*d) + l2t_size * sizeof(struct l2t_entry));
+	d = kvzalloc(sizeof(*d) + l2t_size * sizeof(struct l2t_entry), GFP_KERNEL);
 	if (!d)
 		return NULL;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.c b/drivers/net/ethernet/chelsio/cxgb4/sched.c
index c9026352a842..02acff741f11 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.c
@@ -177,7 +177,7 @@ static int t4_sched_queue_unbind(struct port_info *pi, struct ch_sched_queue *p)
 		}
 
 		list_del(&qe->list);
-		t4_free_mem(qe);
+		kvfree(qe);
 		if (atomic_dec_and_test(&e->refcnt)) {
 			e->state = SCHED_STATE_UNUSED;
 			memset(&e->info, 0, sizeof(e->info));
@@ -201,7 +201,7 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 	if (p->queue < 0 || p->queue >= pi->nqsets)
 		return -ERANGE;
 
-	qe = t4_alloc_mem(sizeof(struct sched_queue_entry));
+	qe = kvzalloc(sizeof(struct sched_queue_entry), GFP_KERNEL);
 	if (!qe)
 		return -ENOMEM;
 
@@ -211,7 +211,7 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 	/* Unbind queue from any existing class */
 	err = t4_sched_queue_unbind(pi, p);
 	if (err) {
-		t4_free_mem(qe);
+		kvfree(qe);
 		goto out;
 	}
 
@@ -224,7 +224,7 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 	spin_lock(&e->lock);
 	err = t4_sched_bind_unbind_op(pi, (void *)qe, SCHED_QUEUE, true);
 	if (err) {
-		t4_free_mem(qe);
+		kvfree(qe);
 		spin_unlock(&e->lock);
 		goto out;
 	}
@@ -512,7 +512,7 @@ struct sched_table *t4_init_sched(unsigned int sched_size)
 	struct sched_table *s;
 	unsigned int i;
 
-	s = t4_alloc_mem(sizeof(*s) + sched_size * sizeof(struct sched_class));
+	s = kvzalloc(sizeof(*s) + sched_size * sizeof(struct sched_class), GFP_KERNEL);
 	if (!s)
 		return NULL;
 
@@ -548,6 +548,6 @@ void t4_cleanup_sched(struct adapter *adap)
 				t4_sched_class_free(pi, e);
 			write_unlock(&s->rw_lock);
 		}
-		t4_free_mem(s);
+		kvfree(s);
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 5886ad78058f..a5c1b815145e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -70,13 +70,10 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
 
 	tmp = size * sizeof(struct mlx4_en_tx_info);
-	ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
+	ring->tx_info = kvmalloc_node(tmp, GFP_KERNEL, node);
 	if (!ring->tx_info) {
-		ring->tx_info = vmalloc(tmp);
-		if (!ring->tx_info) {
-			err = -ENOMEM;
-			goto err_ring;
-		}
+		err = -ENOMEM;
+		goto err_ring;
 	}
 
 	en_dbg(DRV, priv, "Allocated tx_info ring at addr:%p size:%d\n",
diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c
index 395b5463cfd9..6583d4601480 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -115,12 +115,9 @@ static int mlx4_buddy_init(struct mlx4_buddy *buddy, int max_order)
 
 	for (i = 0; i <= buddy->max_order; ++i) {
 		s = BITS_TO_LONGS(1 << (buddy->max_order - i));
-		buddy->bits[i] = kcalloc(s, sizeof (long), GFP_KERNEL | __GFP_NOWARN);
-		if (!buddy->bits[i]) {
-			buddy->bits[i] = vzalloc(s * sizeof(long));
-			if (!buddy->bits[i])
-				goto err_out_free;
-		}
+		buddy->bits[i] = kvmalloc_array(s, sizeof(long), GFP_KERNEL | __GFP_ZERO);
+		if (!buddy->bits[i])
+			goto err_out_free;
 	}
 
 	set_bit(0, buddy->bits[buddy->max_order]);
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 0eedc49e0d47..3bd332b167d9 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -102,10 +102,7 @@ int nvdimm_init_config_data(struct nvdimm_drvdata *ndd)
 		return -ENXIO;
 	}
 
-	ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
-	if (!ndd->data)
-		ndd->data = vmalloc(ndd->nsarea.config_size);
-
+	ndd->data = kvmalloc(ndd->nsarea.config_size, GFP_KERNEL);
 	if (!ndd->data)
 		return -ENOMEM;
 
diff --git a/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c b/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c
index a6a76a681ea9..8f638267e704 100644
--- a/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c
+++ b/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c
@@ -45,15 +45,6 @@ EXPORT_SYMBOL(libcfs_kvzalloc);
 void *libcfs_kvzalloc_cpt(struct cfs_cpt_table *cptab, int cpt, size_t size,
 			  gfp_t flags)
 {
-	void *ret;
-
-	ret = kzalloc_node(size, flags | __GFP_NOWARN,
-			   cfs_cpt_spread_node(cptab, cpt));
-	if (!ret) {
-		WARN_ON(!(flags & (__GFP_FS | __GFP_HIGH)));
-		ret = vmalloc_node(size, cfs_cpt_spread_node(cptab, cpt));
-	}
-
-	return ret;
+	return kvzalloc_node(size, flags, cfs_cpt_spread_node(cptab, cpt));
 }
 EXPORT_SYMBOL(libcfs_kvzalloc_cpt);
diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 6890897a6f30..10f1ef582659 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -87,18 +87,6 @@ struct user_evtchn {
 	bool enabled;
 };
 
-static evtchn_port_t *evtchn_alloc_ring(unsigned int size)
-{
-	evtchn_port_t *ring;
-	size_t s = size * sizeof(*ring);
-
-	ring = kmalloc(s, GFP_KERNEL);
-	if (!ring)
-		ring = vmalloc(s);
-
-	return ring;
-}
-
 static void evtchn_free_ring(evtchn_port_t *ring)
 {
 	kvfree(ring);
@@ -334,7 +322,7 @@ static int evtchn_resize_ring(struct per_user_data *u)
 	else
 		new_size = 2 * u->ring_size;
 
-	new_ring = evtchn_alloc_ring(new_size);
+	new_ring = kvmalloc(new_size * sizeof(*new_ring), GFP_KERNEL);
 	if (!new_ring)
 		return -ENOMEM;
 
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 72dd200f0478..72f2cf234ce2 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5393,13 +5393,10 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 		goto out;
 	}
 
-	tmp_buf = kmalloc(fs_info->nodesize, GFP_KERNEL | __GFP_NOWARN);
+	tmp_buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
 	if (!tmp_buf) {
-		tmp_buf = vmalloc(fs_info->nodesize);
-		if (!tmp_buf) {
-			ret = -ENOMEM;
-			goto out;
-		}
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	left_path->search_commit_root = 1;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 99b1cce24eff..8f97bcc9e011 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3536,12 +3536,9 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 	u64 last_dest_end = destoff;
 
 	ret = -ENOMEM;
-	buf = kmalloc(fs_info->nodesize, GFP_KERNEL | __GFP_NOWARN);
-	if (!buf) {
-		buf = vmalloc(fs_info->nodesize);
-		if (!buf)
-			return ret;
-	}
+	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
+	if (!buf)
+		return ret;
 
 	path = btrfs_alloc_path();
 	if (!path) {
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 712922ea64d2..abbe7c6ac773 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6271,22 +6271,16 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
 	sctx->clone_roots_cnt = arg->clone_sources_count;
 
 	sctx->send_max_size = BTRFS_SEND_BUF_SIZE;
-	sctx->send_buf = kmalloc(sctx->send_max_size, GFP_KERNEL | __GFP_NOWARN);
+	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
-		sctx->send_buf = vmalloc(sctx->send_max_size);
-		if (!sctx->send_buf) {
-			ret = -ENOMEM;
-			goto out;
-		}
+		ret = -ENOMEM;
+		goto out;
 	}
 
-	sctx->read_buf = kmalloc(BTRFS_SEND_READ_SIZE, GFP_KERNEL | __GFP_NOWARN);
+	sctx->read_buf = kvmalloc(BTRFS_SEND_READ_SIZE, GFP_KERNEL);
 	if (!sctx->read_buf) {
-		sctx->read_buf = vmalloc(BTRFS_SEND_READ_SIZE);
-		if (!sctx->read_buf) {
-			ret = -ENOMEM;
-			goto out;
-		}
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	sctx->pending_dir_moves = RB_ROOT;
@@ -6307,13 +6301,10 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
 	alloc_size = arg->clone_sources_count * sizeof(*arg->clone_sources);
 
 	if (arg->clone_sources_count) {
-		clone_sources_tmp = kmalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
+		clone_sources_tmp = kvmalloc(alloc_size, GFP_KERNEL);
 		if (!clone_sources_tmp) {
-			clone_sources_tmp = vmalloc(alloc_size);
-			if (!clone_sources_tmp) {
-				ret = -ENOMEM;
-				goto out;
-			}
+			ret = -ENOMEM;
+			goto out;
 		}
 
 		ret = copy_from_user(clone_sources_tmp, arg->clone_sources,
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 045d30d26624..78b18acf33ba 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -74,12 +74,9 @@ dio_get_pages_alloc(const struct iov_iter *it, size_t nbytes,
 	align = (unsigned long)(it->iov->iov_base + it->iov_offset) &
 		(PAGE_SIZE - 1);
 	npages = calc_pages_for(align, nbytes);
-	pages = kmalloc(sizeof(*pages) * npages, GFP_KERNEL);
-	if (!pages) {
-		pages = vmalloc(sizeof(*pages) * npages);
-		if (!pages)
-			return ERR_PTR(-ENOMEM);
-	}
+	pages = kvmalloc(sizeof(*pages) * npages, GFP_KERNEL);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
 
 	for (idx = 0; idx < npages; ) {
 		size_t start;
diff --git a/fs/select.c b/fs/select.c
index 305c0daf5d67..9e8e1189eb99 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -586,10 +586,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
-		if (!bits && alloc_size > PAGE_SIZE)
-			bits = vmalloc(alloc_size);
-
+		bits = kvmalloc(alloc_size, GFP_KERNEL);
 		if (!bits)
 			goto out_nofds;
 	}
diff --git a/fs/xattr.c b/fs/xattr.c
index 7e3317cf4045..464c94bf65f9 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -431,12 +431,9 @@ setxattr(struct dentry *d, const char __user *name, const void __user *value,
 	if (size) {
 		if (size > XATTR_SIZE_MAX)
 			return -E2BIG;
-		kvalue = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-		if (!kvalue) {
-			kvalue = vmalloc(size);
-			if (!kvalue)
-				return -ENOMEM;
-		}
+		kvalue = kvmalloc(size, GFP_KERNEL);
+		if (!kvalue)
+			return -ENOMEM;
 		if (copy_from_user(kvalue, value, size)) {
 			error = -EFAULT;
 			goto out;
@@ -528,12 +525,9 @@ getxattr(struct dentry *d, const char __user *name, void __user *value,
 	if (size) {
 		if (size > XATTR_SIZE_MAX)
 			size = XATTR_SIZE_MAX;
-		kvalue = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-		if (!kvalue) {
-			kvalue = vmalloc(size);
-			if (!kvalue)
-				return -ENOMEM;
-		}
+		kvalue = kvzalloc(size, GFP_KERNEL);
+		if (!kvalue)
+			return -ENOMEM;
 	}
 
 	error = vfs_getxattr(d, kname, kvalue, size);
@@ -611,12 +605,9 @@ listxattr(struct dentry *d, char __user *list, size_t size)
 	if (size) {
 		if (size > XATTR_LIST_MAX)
 			size = XATTR_LIST_MAX;
-		klist = kmalloc(size, __GFP_NOWARN | GFP_KERNEL);
-		if (!klist) {
-			klist = vmalloc(size);
-			if (!klist)
-				return -ENOMEM;
-		}
+		klist = kvmalloc(size, GFP_KERNEL);
+		if (!klist)
+			return -ENOMEM;
 	}
 
 	error = vfs_listxattr(d, klist, size);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 2fcff6b4503f..48d8b66a0efe 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -890,12 +890,7 @@ static inline u16 cmdif_rev(struct mlx5_core_dev *dev)
 
 static inline void *mlx5_vzalloc(unsigned long size)
 {
-	void *rtn;
-
-	rtn = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!rtn)
-		rtn = vzalloc(size);
-	return rtn;
+	return kvzalloc(size, GFP_KERNEL);
 }
 
 static inline u32 mlx5_base_mkey(const u32 key)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0d9fdc0a2a7b..df82a2f34bc8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -512,6 +512,14 @@ static inline void *kvzalloc(size_t size, gfp_t flags)
 	return kvmalloc(size, flags | __GFP_ZERO);
 }
 
+static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
+{
+	if (size != 0 && n > SIZE_MAX / size)
+		return NULL;
+
+	return kvmalloc(n * size, flags);
+}
+
 extern void kvfree(const void *addr);
 
 static inline atomic_t *compound_mapcount_ptr(struct page *page)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e68604ae3ced..4b2b3f5b3380 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -965,10 +965,7 @@ EXPORT_SYMBOL(iov_iter_get_pages);
 
 static struct page **get_pages_array(size_t n)
 {
-	struct page **p = kmalloc(n * sizeof(struct page *), GFP_KERNEL);
-	if (!p)
-		p = vmalloc(n * sizeof(struct page *));
-	return p;
+	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
 }
 
 static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index db77dcb38afd..72ebec18629c 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -200,10 +200,7 @@ struct frame_vector *frame_vector_create(unsigned int nr_frames)
 	 * Avoid higher order allocations, use vmalloc instead. It should
 	 * be rare anyway.
 	 */
-	if (size <= PAGE_SIZE)
-		vec = kmalloc(size, GFP_KERNEL);
-	else
-		vec = vmalloc(size);
+	vec = kvmalloc(size, GFP_KERNEL);
 	if (!vec)
 		return NULL;
 	vec->nr_allocated = nr_frames;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 8bea74298173..e9a59d2d91d4 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -678,11 +678,7 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
 		/* no more locks than number of hash buckets */
 		nblocks = min(nblocks, hashinfo->ehash_mask + 1);
 
-		hashinfo->ehash_locks =	kmalloc_array(nblocks, locksz,
-						      GFP_KERNEL | __GFP_NOWARN);
-		if (!hashinfo->ehash_locks)
-			hashinfo->ehash_locks = vmalloc(nblocks * locksz);
-
+		hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
 		if (!hashinfo->ehash_locks)
 			return -ENOMEM;
 
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index b9ed0d50aead..d6992003bb8c 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1153,10 +1153,7 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	tcp_metrics_hash_log = order_base_2(slots);
 	size = sizeof(struct tcpm_hash_bucket) << tcp_metrics_hash_log;
 
-	tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!tcp_metrics_hash)
-		tcp_metrics_hash = vzalloc(size);
-
+	tcp_metrics_hash = kvzalloc(size, GFP_KERNEL);
 	if (!tcp_metrics_hash)
 		return -ENOMEM;
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 64d3bf269a26..1b8dad9dc7fd 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1653,10 +1653,7 @@ static int resize_platform_label_table(struct net *net, size_t limit)
 	unsigned index;
 
 	if (size) {
-		labels = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
-		if (!labels)
-			labels = vzalloc(size);
-
+		labels = kvzalloc(size, GFP_KERNEL);
 		if (!labels)
 			goto nolabels;
 	}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 14857afc9937..d529989f5791 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -763,17 +763,8 @@ EXPORT_SYMBOL(xt_check_entry_offsets);
  */
 unsigned int *xt_alloc_entry_offsets(unsigned int size)
 {
-	unsigned int *off;
+	return kvmalloc_array(size, sizeof(unsigned int), GFP_KERNEL | __GFP_ZERO);
 
-	off = kcalloc(size, sizeof(unsigned int), GFP_KERNEL | __GFP_NOWARN);
-
-	if (off)
-		return off;
-
-	if (size < (SIZE_MAX / sizeof(unsigned int)))
-		off = vmalloc(size * sizeof(unsigned int));
-
-	return off;
 }
 EXPORT_SYMBOL(xt_alloc_entry_offsets);
 
@@ -1114,7 +1105,7 @@ static int xt_jumpstack_alloc(struct xt_table_info *i)
 
 	size = sizeof(void **) * nr_cpu_ids;
 	if (size > PAGE_SIZE)
-		i->jumpstack = vzalloc(size);
+		i->jumpstack = kvzalloc(size, GFP_KERNEL);
 	else
 		i->jumpstack = kzalloc(size, GFP_KERNEL);
 	if (i->jumpstack == NULL)
@@ -1136,12 +1127,8 @@ static int xt_jumpstack_alloc(struct xt_table_info *i)
 	 */
 	size = sizeof(void *) * i->stacksize * 2u;
 	for_each_possible_cpu(cpu) {
-		if (size > PAGE_SIZE)
-			i->jumpstack[cpu] = vmalloc_node(size,
-				cpu_to_node(cpu));
-		else
-			i->jumpstack[cpu] = kmalloc_node(size,
-				GFP_KERNEL, cpu_to_node(cpu));
+		i->jumpstack[cpu] = kvmalloc_node(size, GFP_KERNEL,
+			cpu_to_node(cpu));
 		if (i->jumpstack[cpu] == NULL)
 			/*
 			 * Freeing will be done later on by the callers. The
diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index 1d89a4eaf841..d6aa8f63ed2e 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -388,10 +388,7 @@ static int recent_mt_check(const struct xt_mtchk_param *par,
 	}
 
 	sz = sizeof(*t) + sizeof(t->iphash[0]) * ip_list_hash_size;
-	if (sz <= PAGE_SIZE)
-		t = kzalloc(sz, GFP_KERNEL);
-	else
-		t = vzalloc(sz);
+	t = kvzalloc(sz, GFP_KERNEL);
 	if (t == NULL) {
 		ret = -ENOMEM;
 		goto out;
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index 3b6d5bd69101..47cbfae44898 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -431,10 +431,7 @@ static int choke_change(struct Qdisc *sch, struct nlattr *opt)
 	if (mask != q->tab_mask) {
 		struct sk_buff **ntab;
 
-		ntab = kcalloc(mask + 1, sizeof(struct sk_buff *),
-			       GFP_KERNEL | __GFP_NOWARN);
-		if (!ntab)
-			ntab = vzalloc((mask + 1) * sizeof(struct sk_buff *));
+		ntab = kvmalloc_array((mask + 1), sizeof(struct sk_buff *), GFP_KERNEL | __GFP_ZERO);
 		if (!ntab)
 			return -ENOMEM;
 
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 2f50e4c72fb4..67d8dfc80bbb 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -446,27 +446,13 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt)
 	return 0;
 }
 
-static void *fq_codel_zalloc(size_t sz)
-{
-	void *ptr = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!ptr)
-		ptr = vzalloc(sz);
-	return ptr;
-}
-
-static void fq_codel_free(void *addr)
-{
-	kvfree(addr);
-}
-
 static void fq_codel_destroy(struct Qdisc *sch)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 
 	tcf_destroy_chain(&q->filter_list);
-	fq_codel_free(q->backlogs);
-	fq_codel_free(q->flows);
+	kvfree(q->backlogs);
+	kvfree(q->flows);
 }
 
 static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
@@ -493,13 +479,13 @@ static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
 	}
 
 	if (!q->flows) {
-		q->flows = fq_codel_zalloc(q->flows_cnt *
-					   sizeof(struct fq_codel_flow));
+		q->flows = kvzalloc(q->flows_cnt *
+					   sizeof(struct fq_codel_flow), GFP_KERNEL);
 		if (!q->flows)
 			return -ENOMEM;
-		q->backlogs = fq_codel_zalloc(q->flows_cnt * sizeof(u32));
+		q->backlogs = kvzalloc(q->flows_cnt * sizeof(u32), GFP_KERNEL);
 		if (!q->backlogs) {
-			fq_codel_free(q->flows);
+			kvfree(q->flows);
 			return -ENOMEM;
 		}
 		for (i = 0; i < q->flows_cnt; i++) {
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index e3d0458af17b..2454055c737e 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -467,29 +467,14 @@ static void hhf_reset(struct Qdisc *sch)
 		rtnl_kfree_skbs(skb, skb);
 }
 
-static void *hhf_zalloc(size_t sz)
-{
-	void *ptr = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!ptr)
-		ptr = vzalloc(sz);
-
-	return ptr;
-}
-
-static void hhf_free(void *addr)
-{
-	kvfree(addr);
-}
-
 static void hhf_destroy(struct Qdisc *sch)
 {
 	int i;
 	struct hhf_sched_data *q = qdisc_priv(sch);
 
 	for (i = 0; i < HHF_ARRAYS_CNT; i++) {
-		hhf_free(q->hhf_arrays[i]);
-		hhf_free(q->hhf_valid_bits[i]);
+		kvfree(q->hhf_arrays[i]);
+		kvfree(q->hhf_valid_bits[i]);
 	}
 
 	for (i = 0; i < HH_FLOWS_CNT; i++) {
@@ -503,7 +488,7 @@ static void hhf_destroy(struct Qdisc *sch)
 			kfree(flow);
 		}
 	}
-	hhf_free(q->hh_flows);
+	kvfree(q->hh_flows);
 }
 
 static const struct nla_policy hhf_policy[TCA_HHF_MAX + 1] = {
@@ -609,8 +594,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 
 	if (!q->hh_flows) {
 		/* Initialize heavy-hitter flow table. */
-		q->hh_flows = hhf_zalloc(HH_FLOWS_CNT *
-					 sizeof(struct list_head));
+		q->hh_flows = kvzalloc(HH_FLOWS_CNT *
+					 sizeof(struct list_head), GFP_KERNEL);
 		if (!q->hh_flows)
 			return -ENOMEM;
 		for (i = 0; i < HH_FLOWS_CNT; i++)
@@ -624,8 +609,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 
 		/* Initialize heavy-hitter filter arrays. */
 		for (i = 0; i < HHF_ARRAYS_CNT; i++) {
-			q->hhf_arrays[i] = hhf_zalloc(HHF_ARRAYS_LEN *
-						      sizeof(u32));
+			q->hhf_arrays[i] = kvzalloc(HHF_ARRAYS_LEN *
+						      sizeof(u32), GFP_KERNEL);
 			if (!q->hhf_arrays[i]) {
 				hhf_destroy(sch);
 				return -ENOMEM;
@@ -635,8 +620,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 
 		/* Initialize valid bits of heavy-hitter filter arrays. */
 		for (i = 0; i < HHF_ARRAYS_CNT; i++) {
-			q->hhf_valid_bits[i] = hhf_zalloc(HHF_ARRAYS_LEN /
-							  BITS_PER_BYTE);
+			q->hhf_valid_bits[i] = kvzalloc(HHF_ARRAYS_LEN /
+							  BITS_PER_BYTE, GFP_KERNEL);
 			if (!q->hhf_valid_bits[i]) {
 				hhf_destroy(sch);
 				return -ENOMEM;
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index c8bb62a1e744..7abc01ef537c 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -692,15 +692,11 @@ static int get_dist_table(struct Qdisc *sch, const struct nlattr *attr)
 	spinlock_t *root_lock;
 	struct disttable *d;
 	int i;
-	size_t s;
 
 	if (n > NETEM_DIST_MAX)
 		return -EINVAL;
 
-	s = sizeof(struct disttable) + n * sizeof(s16);
-	d = kmalloc(s, GFP_KERNEL | __GFP_NOWARN);
-	if (!d)
-		d = vmalloc(s);
+	d = kvmalloc(sizeof(struct disttable) + n * sizeof(s16), GFP_KERNEL);
 	if (!d)
 		return -ENOMEM;
 
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 7f195ed4d568..5d70cd6a032d 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -684,11 +684,7 @@ static int sfq_change(struct Qdisc *sch, struct nlattr *opt)
 
 static void *sfq_alloc(size_t sz)
 {
-	void *ptr = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!ptr)
-		ptr = vmalloc(sz);
-	return ptr;
+	return  kvmalloc(sz, GFP_KERNEL);
 }
 
 static void sfq_free(void *addr)
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 38c00e867bda..a5c21f05ece4 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -99,14 +99,9 @@ SYSCALL_DEFINE5(add_key, const char __user *, _type,
 
 	if (_payload) {
 		ret = -ENOMEM;
-		payload = kmalloc(plen, GFP_KERNEL | __GFP_NOWARN);
-		if (!payload) {
-			if (plen <= PAGE_SIZE)
-				goto error2;
-			payload = vmalloc(plen);
-			if (!payload)
-				goto error2;
-		}
+		payload = kvmalloc(plen, GFP_KERNEL);
+		if (!payload)
+			goto error2;
 
 		ret = -EFAULT;
 		if (copy_from_user(payload, _payload, plen) != 0)
@@ -1064,14 +1059,9 @@ long keyctl_instantiate_key_common(key_serial_t id,
 
 	if (from) {
 		ret = -ENOMEM;
-		payload = kmalloc(plen, GFP_KERNEL);
-		if (!payload) {
-			if (plen <= PAGE_SIZE)
-				goto error;
-			payload = vmalloc(plen);
-			if (!payload)
-				goto error;
-		}
+		payload = kvmalloc(plen, GFP_KERNEL);
+		if (!payload)
+			goto error;
 
 		ret = -EFAULT;
 		if (!copy_from_iter_full(payload, plen, from))
-- 
2.11.0

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 10:21   ` Leon Romanovsky
                     ` (2 more replies)
  0 siblings, 3 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Martin Schwidefsky,
	Heiko Carstens, Herbert Xu, Anton Vorontsov, Colin Cross,
	Kees Cook, Tony Luck, Rafael J. Wysocki, Ben Skeggs,
	Kent Overstreet, Santosh Raspatur, Hariprasad S, Yishai Hadas,
	Oleg Drokin, Yan, Zheng, Alexei Starovoitov

From: Michal Hocko <mhocko@suse.com>

There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are usually
not considering all the aspects of the memory allocator. E.g. allocation
requests <= 32kB (with 4kB pages) are basically never failing and invoke
OOM killer to satisfy the allocation. This sounds too disruptive for
something that has a reasonable fallback - the vmalloc. On the other
hand those requests might fallback to vmalloc even when the memory
allocator would succeed after several more reclaim/compaction attempts
previously. There is no guarantee something like that happens though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Changes since v1
- add kvmalloc_array - this might silently fix some overflow issues
  because most users simply didn't check the overflow for the vmalloc
  fallback.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: netdev@vger.kernel.org
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/s390/kvm/kvm-s390.c                           | 10 ++-----
 crypto/lzo.c                                       |  4 +--
 drivers/acpi/apei/erst.c                           |  8 ++----
 drivers/char/agp/generic.c                         |  8 +-----
 drivers/gpu/drm/nouveau/nouveau_gem.c              |  4 +--
 drivers/md/bcache/util.h                           | 12 ++------
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h    |  3 --
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c | 29 +++----------------
 drivers/net/ethernet/chelsio/cxgb3/l2t.c           |  8 +-----
 drivers/net/ethernet/chelsio/cxgb3/l2t.h           |  1 -
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c      | 12 ++++----
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |  3 --
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 10 +++----
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c |  8 +++---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 31 ++++----------------
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c  | 13 ++++-----
 drivers/net/ethernet/chelsio/cxgb4/l2t.c           |  2 +-
 drivers/net/ethernet/chelsio/cxgb4/sched.c         | 12 ++++----
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |  9 ++----
 drivers/net/ethernet/mellanox/mlx4/mr.c            |  9 ++----
 drivers/nvdimm/dimm_devs.c                         |  5 +---
 .../staging/lustre/lnet/libcfs/linux/linux-mem.c   | 11 +-------
 drivers/xen/evtchn.c                               | 14 +--------
 fs/btrfs/ctree.c                                   |  9 ++----
 fs/btrfs/ioctl.c                                   |  9 ++----
 fs/btrfs/send.c                                    | 27 ++++++------------
 fs/ceph/file.c                                     |  9 ++----
 fs/select.c                                        |  5 +---
 fs/xattr.c                                         | 27 ++++++------------
 include/linux/mlx5/driver.h                        |  7 +----
 include/linux/mm.h                                 |  8 ++++++
 lib/iov_iter.c                                     |  5 +---
 mm/frame_vector.c                                  |  5 +---
 net/ipv4/inet_hashtables.c                         |  6 +---
 net/ipv4/tcp_metrics.c                             |  5 +---
 net/mpls/af_mpls.c                                 |  5 +---
 net/netfilter/x_tables.c                           | 21 +++-----------
 net/netfilter/xt_recent.c                          |  5 +---
 net/sched/sch_choke.c                              |  5 +---
 net/sched/sch_fq_codel.c                           | 26 ++++-------------
 net/sched/sch_hhf.c                                | 33 ++++++----------------
 net/sched/sch_netem.c                              |  6 +---
 net/sched/sch_sfq.c                                |  6 +---
 security/keys/keyctl.c                             | 22 ++++-----------
 44 files changed, 127 insertions(+), 350 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index dabd3b15bf11..ecc09b6648f3 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1159,10 +1159,7 @@ static long kvm_s390_get_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 	if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX)
 		return -EINVAL;
 
-	keys = kmalloc_array(args->count, sizeof(uint8_t),
-			     GFP_KERNEL | __GFP_NOWARN);
-	if (!keys)
-		keys = vmalloc(sizeof(uint8_t) * args->count);
+	keys = kvmalloc_array(args->count, sizeof(uint8_t), GFP_KERNEL);
 	if (!keys)
 		return -ENOMEM;
 
@@ -1204,10 +1201,7 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 	if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX)
 		return -EINVAL;
 
-	keys = kmalloc_array(args->count, sizeof(uint8_t),
-			     GFP_KERNEL | __GFP_NOWARN);
-	if (!keys)
-		keys = vmalloc(sizeof(uint8_t) * args->count);
+	keys = kvmalloc_array(args->count, sizeof(uint8_t), GFP_KERNEL);
 	if (!keys)
 		return -ENOMEM;
 
diff --git a/crypto/lzo.c b/crypto/lzo.c
index 168df784da84..218567d717d6 100644
--- a/crypto/lzo.c
+++ b/crypto/lzo.c
@@ -32,9 +32,7 @@ static void *lzo_alloc_ctx(struct crypto_scomp *tfm)
 {
 	void *ctx;
 
-	ctx = kmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL | __GFP_NOWARN);
-	if (!ctx)
-		ctx = vmalloc(LZO1X_MEM_COMPRESS);
+	ctx = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
index ec4f507b524f..a2898df61744 100644
--- a/drivers/acpi/apei/erst.c
+++ b/drivers/acpi/apei/erst.c
@@ -513,7 +513,7 @@ static int __erst_record_id_cache_add_one(void)
 	if (i < erst_record_id_cache.len)
 		goto retry;
 	if (erst_record_id_cache.len >= erst_record_id_cache.size) {
-		int new_size, alloc_size;
+		int new_size;
 		u64 *new_entries;
 
 		new_size = erst_record_id_cache.size * 2;
@@ -524,11 +524,7 @@ static int __erst_record_id_cache_add_one(void)
 				pr_warn(FW_WARN "too many record IDs!\n");
 			return 0;
 		}
-		alloc_size = new_size * sizeof(entries[0]);
-		if (alloc_size < PAGE_SIZE)
-			new_entries = kmalloc(alloc_size, GFP_KERNEL);
-		else
-			new_entries = vmalloc(alloc_size);
+		new_entries = kvmalloc(new_size * sizeof(entries[0]), GFP_KERNEL);
 		if (!new_entries)
 			return -ENOMEM;
 		memcpy(new_entries, entries,
diff --git a/drivers/char/agp/generic.c b/drivers/char/agp/generic.c
index f002fa5d1887..bdf418cac8ef 100644
--- a/drivers/char/agp/generic.c
+++ b/drivers/char/agp/generic.c
@@ -88,13 +88,7 @@ static int agp_get_key(void)
 
 void agp_alloc_page_array(size_t size, struct agp_memory *mem)
 {
-	mem->pages = NULL;
-
-	if (size <= 2*PAGE_SIZE)
-		mem->pages = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (mem->pages == NULL) {
-		mem->pages = vmalloc(size);
-	}
+	mem->pages = kvmalloc(size, GFP_KERNEL);
 }
 EXPORT_SYMBOL(agp_alloc_page_array);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index 201b52b750dd..77dd73ff126f 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -568,9 +568,7 @@ u_memcpya(uint64_t user, unsigned nmemb, unsigned size)
 
 	size *= nmemb;
 
-	mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!mem)
-		mem = vmalloc(size);
+	mem = kvmalloc(size, GFP_KERNEL);
 	if (!mem)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index cf2cbc211d83..d00bcb64d3a8 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -43,11 +43,7 @@ struct closure;
 	(heap)->used = 0;						\
 	(heap)->size = (_size);						\
 	_bytes = (heap)->size * sizeof(*(heap)->data);			\
-	(heap)->data = NULL;						\
-	if (_bytes < KMALLOC_MAX_SIZE)					\
-		(heap)->data = kmalloc(_bytes, (gfp));			\
-	if ((!(heap)->data) && ((gfp) & GFP_KERNEL))			\
-		(heap)->data = vmalloc(_bytes);				\
+	(heap)->data = kvmalloc(_bytes, (gfp) & GFP_KERNEL);		\
 	(heap)->data;							\
 })
 
@@ -136,12 +132,8 @@ do {									\
 									\
 	(fifo)->mask = _allocated_size - 1;				\
 	(fifo)->front = (fifo)->back = 0;				\
-	(fifo)->data = NULL;						\
 									\
-	if (_bytes < KMALLOC_MAX_SIZE)					\
-		(fifo)->data = kmalloc(_bytes, (gfp));			\
-	if ((!(fifo)->data) && ((gfp) & GFP_KERNEL))			\
-		(fifo)->data = vmalloc(_bytes);				\
+	(fifo)->data = kvmalloc(_bytes, (gfp) & GFP_KERNEL);		\
 	(fifo)->data;							\
 })
 
diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h
index 920d918ed193..f04e81f33795 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h
@@ -41,9 +41,6 @@
 
 #define VALIDATE_TID 1
 
-void *cxgb_alloc_mem(unsigned long size);
-void cxgb_free_mem(void *addr);
-
 /*
  * Map an ATID or STID to their entries in the corresponding TID tables.
  */
diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
index 76684dcb874c..fa81445e334c 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
@@ -1152,27 +1152,6 @@ static void cxgb_redirect(struct dst_entry *old, struct dst_entry *new,
 }
 
 /*
- * Allocate a chunk of memory using kmalloc or, if that fails, vmalloc.
- * The allocated memory is cleared.
- */
-void *cxgb_alloc_mem(unsigned long size)
-{
-	void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!p)
-		p = vzalloc(size);
-	return p;
-}
-
-/*
- * Free memory allocated through t3_alloc_mem().
- */
-void cxgb_free_mem(void *addr)
-{
-	kvfree(addr);
-}
-
-/*
  * Allocate and initialize the TID tables.  Returns 0 on success.
  */
 static int init_tid_tabs(struct tid_info *t, unsigned int ntids,
@@ -1182,7 +1161,7 @@ static int init_tid_tabs(struct tid_info *t, unsigned int ntids,
 	unsigned long size = ntids * sizeof(*t->tid_tab) +
 	    natids * sizeof(*t->atid_tab) + nstids * sizeof(*t->stid_tab);
 
-	t->tid_tab = cxgb_alloc_mem(size);
+	t->tid_tab = kvzalloc(size, GFP_KERNEL);
 	if (!t->tid_tab)
 		return -ENOMEM;
 
@@ -1218,7 +1197,7 @@ static int init_tid_tabs(struct tid_info *t, unsigned int ntids,
 
 static void free_tid_maps(struct tid_info *t)
 {
-	cxgb_free_mem(t->tid_tab);
+	kvfree(t->tid_tab);
 }
 
 static inline void add_adapter(struct adapter *adap)
@@ -1293,7 +1272,7 @@ int cxgb3_offload_activate(struct adapter *adapter)
 	return 0;
 
 out_free_l2t:
-	t3_free_l2t(l2td);
+	kvfree(l2td);
 out_free:
 	kfree(t);
 	return err;
@@ -1302,7 +1281,7 @@ int cxgb3_offload_activate(struct adapter *adapter)
 static void clean_l2_data(struct rcu_head *head)
 {
 	struct l2t_data *d = container_of(head, struct l2t_data, rcu_head);
-	t3_free_l2t(d);
+	kvfree(d);
 }
 
 
diff --git a/drivers/net/ethernet/chelsio/cxgb3/l2t.c b/drivers/net/ethernet/chelsio/cxgb3/l2t.c
index 5f226eda8cd6..f93160271917 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/l2t.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/l2t.c
@@ -444,7 +444,7 @@ struct l2t_data *t3_init_l2t(unsigned int l2t_capacity)
 	struct l2t_data *d;
 	int i, size = sizeof(*d) + l2t_capacity * sizeof(struct l2t_entry);
 
-	d = cxgb_alloc_mem(size);
+	d = kvzalloc(size, GFP_KERNEL);
 	if (!d)
 		return NULL;
 
@@ -462,9 +462,3 @@ struct l2t_data *t3_init_l2t(unsigned int l2t_capacity)
 	}
 	return d;
 }
-
-void t3_free_l2t(struct l2t_data *d)
-{
-	cxgb_free_mem(d);
-}
-
diff --git a/drivers/net/ethernet/chelsio/cxgb3/l2t.h b/drivers/net/ethernet/chelsio/cxgb3/l2t.h
index 8cffcdfd5678..c2fd323c4078 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/l2t.h
+++ b/drivers/net/ethernet/chelsio/cxgb3/l2t.h
@@ -115,7 +115,6 @@ int t3_l2t_send_slow(struct t3cdev *dev, struct sk_buff *skb,
 		     struct l2t_entry *e);
 void t3_l2t_send_event(struct t3cdev *dev, struct l2t_entry *e);
 struct l2t_data *t3_init_l2t(unsigned int l2t_capacity);
-void t3_free_l2t(struct l2t_data *d);
 
 int cxgb3_ofld_send(struct t3cdev *dev, struct sk_buff *skb);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
index 7ad43af6bde1..3103ef9b561d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
@@ -290,8 +290,8 @@ struct clip_tbl *t4_init_clip_tbl(unsigned int clipt_start,
 	if (clipt_size < CLIPT_MIN_HASH_BUCKETS)
 		return NULL;
 
-	ctbl = t4_alloc_mem(sizeof(*ctbl) +
-			    clipt_size*sizeof(struct list_head));
+	ctbl = kvzalloc(sizeof(*ctbl) +
+			    clipt_size*sizeof(struct list_head), GFP_KERNEL);
 	if (!ctbl)
 		return NULL;
 
@@ -305,9 +305,9 @@ struct clip_tbl *t4_init_clip_tbl(unsigned int clipt_start,
 	for (i = 0; i < ctbl->clipt_size; ++i)
 		INIT_LIST_HEAD(&ctbl->hash_list[i]);
 
-	cl_list = t4_alloc_mem(clipt_size*sizeof(struct clip_entry));
+	cl_list = kvzalloc(clipt_size*sizeof(struct clip_entry), GFP_KERNEL);
 	if (!cl_list) {
-		t4_free_mem(ctbl);
+		kvfree(ctbl);
 		return NULL;
 	}
 	ctbl->cl_list = (void *)cl_list;
@@ -326,8 +326,8 @@ void t4_cleanup_clip_tbl(struct adapter *adap)
 
 	if (ctbl) {
 		if (ctbl->cl_list)
-			t4_free_mem(ctbl->cl_list);
-		t4_free_mem(ctbl);
+			kvfree(ctbl->cl_list);
+		kvfree(ctbl);
 	}
 }
 EXPORT_SYMBOL(t4_cleanup_clip_tbl);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index ccb455f14d08..129bd41a051c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1296,8 +1296,6 @@ extern const char cxgb4_driver_version[];
 void t4_os_portmod_changed(const struct adapter *adap, int port_id);
 void t4_os_link_changed(struct adapter *adap, int port_id, int link_stat);
 
-void *t4_alloc_mem(size_t size);
-
 void t4_free_sge_resources(struct adapter *adap);
 void t4_free_ofld_rxqs(struct adapter *adap, int n, struct sge_ofld_rxq *q);
 irq_handler_t t4_intr_handler(struct adapter *adap);
@@ -1670,7 +1668,6 @@ int t4_sched_params(struct adapter *adapter, int type, int level, int mode,
 		    int rateunit, int ratemode, int channel, int class,
 		    int minrate, int maxrate, int weight, int pktsize);
 void t4_sge_decode_idma_state(struct adapter *adapter, int state);
-void t4_free_mem(void *addr);
 void t4_idma_monitor_init(struct adapter *adapter,
 			  struct sge_idma_monitor_state *idma);
 void t4_idma_monitor(struct adapter *adapter,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index f6e739da7bb7..1fa34b009891 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2634,7 +2634,7 @@ static ssize_t mem_read(struct file *file, char __user *buf, size_t count,
 	if (count > avail - pos)
 		count = avail - pos;
 
-	data = t4_alloc_mem(count);
+	data = kvzalloc(count, GFP_KERNEL);
 	if (!data)
 		return -ENOMEM;
 
@@ -2642,12 +2642,12 @@ static ssize_t mem_read(struct file *file, char __user *buf, size_t count,
 	ret = t4_memory_rw(adap, 0, mem, pos, count, data, T4_MEMORY_READ);
 	spin_unlock(&adap->win0_lock);
 	if (ret) {
-		t4_free_mem(data);
+		kvfree(data);
 		return ret;
 	}
 	ret = copy_to_user(buf, data, count);
 
-	t4_free_mem(data);
+	kvfree(data);
 	if (ret)
 		return -EFAULT;
 
@@ -2753,7 +2753,7 @@ static ssize_t blocked_fl_read(struct file *filp, char __user *ubuf,
 		       adap->sge.egr_sz, adap->sge.blocked_fl);
 	len += sprintf(buf + len, "\n");
 	size = simple_read_from_buffer(ubuf, count, ppos, buf, len);
-	t4_free_mem(buf);
+	kvfree(buf);
 	return size;
 }
 
@@ -2773,7 +2773,7 @@ static ssize_t blocked_fl_write(struct file *filp, const char __user *ubuf,
 		return err;
 
 	bitmap_copy(adap->sge.blocked_fl, t, adap->sge.egr_sz);
-	t4_free_mem(t);
+	kvfree(t);
 	return count;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
index 02f80febeb91..0ba7866c8259 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
@@ -969,7 +969,7 @@ static int get_eeprom(struct net_device *dev, struct ethtool_eeprom *e,
 {
 	int i, err = 0;
 	struct adapter *adapter = netdev2adap(dev);
-	u8 *buf = t4_alloc_mem(EEPROMSIZE);
+	u8 *buf = kvzalloc(EEPROMSIZE, GFP_KERNEL);
 
 	if (!buf)
 		return -ENOMEM;
@@ -980,7 +980,7 @@ static int get_eeprom(struct net_device *dev, struct ethtool_eeprom *e,
 
 	if (!err)
 		memcpy(data, buf + e->offset, e->len);
-	t4_free_mem(buf);
+	kvfree(buf);
 	return err;
 }
 
@@ -1009,7 +1009,7 @@ static int set_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom,
 	if (aligned_offset != eeprom->offset || aligned_len != eeprom->len) {
 		/* RMW possibly needed for first or last words.
 		 */
-		buf = t4_alloc_mem(aligned_len);
+		buf = kvzalloc(aligned_len, GFP_KERNEL);
 		if (!buf)
 			return -ENOMEM;
 		err = eeprom_rd_phys(adapter, aligned_offset, (u32 *)buf);
@@ -1037,7 +1037,7 @@ static int set_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom,
 		err = t4_seeprom_wp(adapter, true);
 out:
 	if (buf != data)
-		t4_free_mem(buf);
+		kvfree(buf);
 	return err;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 49e000ebd2b9..a9f55ec139bb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -887,27 +887,6 @@ static int setup_sge_queues(struct adapter *adap)
 	return err;
 }
 
-/*
- * Allocate a chunk of memory using kmalloc or, if that fails, vmalloc.
- * The allocated memory is cleared.
- */
-void *t4_alloc_mem(size_t size)
-{
-	void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!p)
-		p = vzalloc(size);
-	return p;
-}
-
-/*
- * Free memory allocated through alloc_mem().
- */
-void t4_free_mem(void *addr)
-{
-	kvfree(addr);
-}
-
 static u16 cxgb_select_queue(struct net_device *dev, struct sk_buff *skb,
 			     void *accel_priv, select_queue_fallback_t fallback)
 {
@@ -1306,7 +1285,7 @@ static int tid_init(struct tid_info *t)
 	       max_ftids * sizeof(*t->ftid_tab) +
 	       ftid_bmap_size * sizeof(long);
 
-	t->tid_tab = t4_alloc_mem(size);
+	t->tid_tab = kvzalloc(size, GFP_KERNEL);
 	if (!t->tid_tab)
 		return -ENOMEM;
 
@@ -3451,7 +3430,7 @@ static int adap_init0(struct adapter *adap)
 		/* allocate memory to read the header of the firmware on the
 		 * card
 		 */
-		card_fw = t4_alloc_mem(sizeof(*card_fw));
+		card_fw = kvzalloc(sizeof(*card_fw), GFP_KERNEL);
 
 		/* Get FW from from /lib/firmware/ */
 		ret = request_firmware(&fw, fw_info->fw_mod_name,
@@ -3471,7 +3450,7 @@ static int adap_init0(struct adapter *adap)
 
 		/* Cleaning up */
 		release_firmware(fw);
-		t4_free_mem(card_fw);
+		kvfree(card_fw);
 
 		if (ret < 0)
 			goto bye;
@@ -4467,9 +4446,9 @@ static void free_some_resources(struct adapter *adapter)
 {
 	unsigned int i;
 
-	t4_free_mem(adapter->l2t);
+	kvfree(adapter->l2t);
 	t4_cleanup_sched(adapter);
-	t4_free_mem(adapter->tids.tid_tab);
+	kvfree(adapter->tids.tid_tab);
 	cxgb4_cleanup_tc_u32(adapter);
 	kfree(adapter->sge.egr_map);
 	kfree(adapter->sge.ingr_map);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
index 52af62e0ecb6..127f91a747b1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
@@ -432,9 +432,9 @@ void cxgb4_cleanup_tc_u32(struct adapter *adap)
 	for (i = 0; i < t->size; i++) {
 		struct cxgb4_link *link = &t->table[i];
 
-		t4_free_mem(link->tid_map);
+		kvfree(link->tid_map);
 	}
-	t4_free_mem(adap->tc_u32);
+	kvfree(adap->tc_u32);
 }
 
 struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
@@ -446,8 +446,7 @@ struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
 	if (!size)
 		return NULL;
 
-	t = t4_alloc_mem(sizeof(*t) +
-			 (size * sizeof(struct cxgb4_link)));
+	t = kvzalloc(sizeof(*t) + (size * sizeof(struct cxgb4_link)), GFP_KERNEL);
 	if (!t)
 		return NULL;
 
@@ -460,7 +459,7 @@ struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
 
 		max_tids = adap->tids.nftids;
 		bmap_size = BITS_TO_LONGS(max_tids);
-		link->tid_map = t4_alloc_mem(sizeof(unsigned long) * bmap_size);
+		link->tid_map = kvzalloc(sizeof(unsigned long) * bmap_size, GFP_KERNEL);
 		if (!link->tid_map)
 			goto out_no_mem;
 		bitmap_zero(link->tid_map, max_tids);
@@ -473,11 +472,11 @@ struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
 		struct cxgb4_link *link = &t->table[i];
 
 		if (link->tid_map)
-			t4_free_mem(link->tid_map);
+			kvfree(link->tid_map);
 	}
 
 	if (t)
-		t4_free_mem(t);
+		kvfree(t);
 
 	return NULL;
 }
diff --git a/drivers/net/ethernet/chelsio/cxgb4/l2t.c b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
index 60a26037a1c6..fb593a1a751e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/l2t.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
@@ -646,7 +646,7 @@ struct l2t_data *t4_init_l2t(unsigned int l2t_start, unsigned int l2t_end)
 	if (l2t_size < L2T_MIN_HASH_BUCKETS)
 		return NULL;
 
-	d = t4_alloc_mem(sizeof(*d) + l2t_size * sizeof(struct l2t_entry));
+	d = kvzalloc(sizeof(*d) + l2t_size * sizeof(struct l2t_entry), GFP_KERNEL);
 	if (!d)
 		return NULL;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.c b/drivers/net/ethernet/chelsio/cxgb4/sched.c
index c9026352a842..02acff741f11 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.c
@@ -177,7 +177,7 @@ static int t4_sched_queue_unbind(struct port_info *pi, struct ch_sched_queue *p)
 		}
 
 		list_del(&qe->list);
-		t4_free_mem(qe);
+		kvfree(qe);
 		if (atomic_dec_and_test(&e->refcnt)) {
 			e->state = SCHED_STATE_UNUSED;
 			memset(&e->info, 0, sizeof(e->info));
@@ -201,7 +201,7 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 	if (p->queue < 0 || p->queue >= pi->nqsets)
 		return -ERANGE;
 
-	qe = t4_alloc_mem(sizeof(struct sched_queue_entry));
+	qe = kvzalloc(sizeof(struct sched_queue_entry), GFP_KERNEL);
 	if (!qe)
 		return -ENOMEM;
 
@@ -211,7 +211,7 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 	/* Unbind queue from any existing class */
 	err = t4_sched_queue_unbind(pi, p);
 	if (err) {
-		t4_free_mem(qe);
+		kvfree(qe);
 		goto out;
 	}
 
@@ -224,7 +224,7 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 	spin_lock(&e->lock);
 	err = t4_sched_bind_unbind_op(pi, (void *)qe, SCHED_QUEUE, true);
 	if (err) {
-		t4_free_mem(qe);
+		kvfree(qe);
 		spin_unlock(&e->lock);
 		goto out;
 	}
@@ -512,7 +512,7 @@ struct sched_table *t4_init_sched(unsigned int sched_size)
 	struct sched_table *s;
 	unsigned int i;
 
-	s = t4_alloc_mem(sizeof(*s) + sched_size * sizeof(struct sched_class));
+	s = kvzalloc(sizeof(*s) + sched_size * sizeof(struct sched_class), GFP_KERNEL);
 	if (!s)
 		return NULL;
 
@@ -548,6 +548,6 @@ void t4_cleanup_sched(struct adapter *adap)
 				t4_sched_class_free(pi, e);
 			write_unlock(&s->rw_lock);
 		}
-		t4_free_mem(s);
+		kvfree(s);
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 5886ad78058f..a5c1b815145e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -70,13 +70,10 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
 
 	tmp = size * sizeof(struct mlx4_en_tx_info);
-	ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
+	ring->tx_info = kvmalloc_node(tmp, GFP_KERNEL, node);
 	if (!ring->tx_info) {
-		ring->tx_info = vmalloc(tmp);
-		if (!ring->tx_info) {
-			err = -ENOMEM;
-			goto err_ring;
-		}
+		err = -ENOMEM;
+		goto err_ring;
 	}
 
 	en_dbg(DRV, priv, "Allocated tx_info ring at addr:%p size:%d\n",
diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c
index 395b5463cfd9..6583d4601480 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -115,12 +115,9 @@ static int mlx4_buddy_init(struct mlx4_buddy *buddy, int max_order)
 
 	for (i = 0; i <= buddy->max_order; ++i) {
 		s = BITS_TO_LONGS(1 << (buddy->max_order - i));
-		buddy->bits[i] = kcalloc(s, sizeof (long), GFP_KERNEL | __GFP_NOWARN);
-		if (!buddy->bits[i]) {
-			buddy->bits[i] = vzalloc(s * sizeof(long));
-			if (!buddy->bits[i])
-				goto err_out_free;
-		}
+		buddy->bits[i] = kvmalloc_array(s, sizeof(long), GFP_KERNEL | __GFP_ZERO);
+		if (!buddy->bits[i])
+			goto err_out_free;
 	}
 
 	set_bit(0, buddy->bits[buddy->max_order]);
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 0eedc49e0d47..3bd332b167d9 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -102,10 +102,7 @@ int nvdimm_init_config_data(struct nvdimm_drvdata *ndd)
 		return -ENXIO;
 	}
 
-	ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
-	if (!ndd->data)
-		ndd->data = vmalloc(ndd->nsarea.config_size);
-
+	ndd->data = kvmalloc(ndd->nsarea.config_size, GFP_KERNEL);
 	if (!ndd->data)
 		return -ENOMEM;
 
diff --git a/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c b/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c
index a6a76a681ea9..8f638267e704 100644
--- a/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c
+++ b/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c
@@ -45,15 +45,6 @@ EXPORT_SYMBOL(libcfs_kvzalloc);
 void *libcfs_kvzalloc_cpt(struct cfs_cpt_table *cptab, int cpt, size_t size,
 			  gfp_t flags)
 {
-	void *ret;
-
-	ret = kzalloc_node(size, flags | __GFP_NOWARN,
-			   cfs_cpt_spread_node(cptab, cpt));
-	if (!ret) {
-		WARN_ON(!(flags & (__GFP_FS | __GFP_HIGH)));
-		ret = vmalloc_node(size, cfs_cpt_spread_node(cptab, cpt));
-	}
-
-	return ret;
+	return kvzalloc_node(size, flags, cfs_cpt_spread_node(cptab, cpt));
 }
 EXPORT_SYMBOL(libcfs_kvzalloc_cpt);
diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 6890897a6f30..10f1ef582659 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -87,18 +87,6 @@ struct user_evtchn {
 	bool enabled;
 };
 
-static evtchn_port_t *evtchn_alloc_ring(unsigned int size)
-{
-	evtchn_port_t *ring;
-	size_t s = size * sizeof(*ring);
-
-	ring = kmalloc(s, GFP_KERNEL);
-	if (!ring)
-		ring = vmalloc(s);
-
-	return ring;
-}
-
 static void evtchn_free_ring(evtchn_port_t *ring)
 {
 	kvfree(ring);
@@ -334,7 +322,7 @@ static int evtchn_resize_ring(struct per_user_data *u)
 	else
 		new_size = 2 * u->ring_size;
 
-	new_ring = evtchn_alloc_ring(new_size);
+	new_ring = kvmalloc(new_size * sizeof(*new_ring), GFP_KERNEL);
 	if (!new_ring)
 		return -ENOMEM;
 
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 72dd200f0478..72f2cf234ce2 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5393,13 +5393,10 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 		goto out;
 	}
 
-	tmp_buf = kmalloc(fs_info->nodesize, GFP_KERNEL | __GFP_NOWARN);
+	tmp_buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
 	if (!tmp_buf) {
-		tmp_buf = vmalloc(fs_info->nodesize);
-		if (!tmp_buf) {
-			ret = -ENOMEM;
-			goto out;
-		}
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	left_path->search_commit_root = 1;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 99b1cce24eff..8f97bcc9e011 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3536,12 +3536,9 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 	u64 last_dest_end = destoff;
 
 	ret = -ENOMEM;
-	buf = kmalloc(fs_info->nodesize, GFP_KERNEL | __GFP_NOWARN);
-	if (!buf) {
-		buf = vmalloc(fs_info->nodesize);
-		if (!buf)
-			return ret;
-	}
+	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
+	if (!buf)
+		return ret;
 
 	path = btrfs_alloc_path();
 	if (!path) {
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 712922ea64d2..abbe7c6ac773 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6271,22 +6271,16 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
 	sctx->clone_roots_cnt = arg->clone_sources_count;
 
 	sctx->send_max_size = BTRFS_SEND_BUF_SIZE;
-	sctx->send_buf = kmalloc(sctx->send_max_size, GFP_KERNEL | __GFP_NOWARN);
+	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
-		sctx->send_buf = vmalloc(sctx->send_max_size);
-		if (!sctx->send_buf) {
-			ret = -ENOMEM;
-			goto out;
-		}
+		ret = -ENOMEM;
+		goto out;
 	}
 
-	sctx->read_buf = kmalloc(BTRFS_SEND_READ_SIZE, GFP_KERNEL | __GFP_NOWARN);
+	sctx->read_buf = kvmalloc(BTRFS_SEND_READ_SIZE, GFP_KERNEL);
 	if (!sctx->read_buf) {
-		sctx->read_buf = vmalloc(BTRFS_SEND_READ_SIZE);
-		if (!sctx->read_buf) {
-			ret = -ENOMEM;
-			goto out;
-		}
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	sctx->pending_dir_moves = RB_ROOT;
@@ -6307,13 +6301,10 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
 	alloc_size = arg->clone_sources_count * sizeof(*arg->clone_sources);
 
 	if (arg->clone_sources_count) {
-		clone_sources_tmp = kmalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
+		clone_sources_tmp = kvmalloc(alloc_size, GFP_KERNEL);
 		if (!clone_sources_tmp) {
-			clone_sources_tmp = vmalloc(alloc_size);
-			if (!clone_sources_tmp) {
-				ret = -ENOMEM;
-				goto out;
-			}
+			ret = -ENOMEM;
+			goto out;
 		}
 
 		ret = copy_from_user(clone_sources_tmp, arg->clone_sources,
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 045d30d26624..78b18acf33ba 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -74,12 +74,9 @@ dio_get_pages_alloc(const struct iov_iter *it, size_t nbytes,
 	align = (unsigned long)(it->iov->iov_base + it->iov_offset) &
 		(PAGE_SIZE - 1);
 	npages = calc_pages_for(align, nbytes);
-	pages = kmalloc(sizeof(*pages) * npages, GFP_KERNEL);
-	if (!pages) {
-		pages = vmalloc(sizeof(*pages) * npages);
-		if (!pages)
-			return ERR_PTR(-ENOMEM);
-	}
+	pages = kvmalloc(sizeof(*pages) * npages, GFP_KERNEL);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
 
 	for (idx = 0; idx < npages; ) {
 		size_t start;
diff --git a/fs/select.c b/fs/select.c
index 305c0daf5d67..9e8e1189eb99 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -586,10 +586,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
-		if (!bits && alloc_size > PAGE_SIZE)
-			bits = vmalloc(alloc_size);
-
+		bits = kvmalloc(alloc_size, GFP_KERNEL);
 		if (!bits)
 			goto out_nofds;
 	}
diff --git a/fs/xattr.c b/fs/xattr.c
index 7e3317cf4045..464c94bf65f9 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -431,12 +431,9 @@ setxattr(struct dentry *d, const char __user *name, const void __user *value,
 	if (size) {
 		if (size > XATTR_SIZE_MAX)
 			return -E2BIG;
-		kvalue = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-		if (!kvalue) {
-			kvalue = vmalloc(size);
-			if (!kvalue)
-				return -ENOMEM;
-		}
+		kvalue = kvmalloc(size, GFP_KERNEL);
+		if (!kvalue)
+			return -ENOMEM;
 		if (copy_from_user(kvalue, value, size)) {
 			error = -EFAULT;
 			goto out;
@@ -528,12 +525,9 @@ getxattr(struct dentry *d, const char __user *name, void __user *value,
 	if (size) {
 		if (size > XATTR_SIZE_MAX)
 			size = XATTR_SIZE_MAX;
-		kvalue = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-		if (!kvalue) {
-			kvalue = vmalloc(size);
-			if (!kvalue)
-				return -ENOMEM;
-		}
+		kvalue = kvzalloc(size, GFP_KERNEL);
+		if (!kvalue)
+			return -ENOMEM;
 	}
 
 	error = vfs_getxattr(d, kname, kvalue, size);
@@ -611,12 +605,9 @@ listxattr(struct dentry *d, char __user *list, size_t size)
 	if (size) {
 		if (size > XATTR_LIST_MAX)
 			size = XATTR_LIST_MAX;
-		klist = kmalloc(size, __GFP_NOWARN | GFP_KERNEL);
-		if (!klist) {
-			klist = vmalloc(size);
-			if (!klist)
-				return -ENOMEM;
-		}
+		klist = kvmalloc(size, GFP_KERNEL);
+		if (!klist)
+			return -ENOMEM;
 	}
 
 	error = vfs_listxattr(d, klist, size);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 2fcff6b4503f..48d8b66a0efe 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -890,12 +890,7 @@ static inline u16 cmdif_rev(struct mlx5_core_dev *dev)
 
 static inline void *mlx5_vzalloc(unsigned long size)
 {
-	void *rtn;
-
-	rtn = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!rtn)
-		rtn = vzalloc(size);
-	return rtn;
+	return kvzalloc(size, GFP_KERNEL);
 }
 
 static inline u32 mlx5_base_mkey(const u32 key)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0d9fdc0a2a7b..df82a2f34bc8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -512,6 +512,14 @@ static inline void *kvzalloc(size_t size, gfp_t flags)
 	return kvmalloc(size, flags | __GFP_ZERO);
 }
 
+static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
+{
+	if (size != 0 && n > SIZE_MAX / size)
+		return NULL;
+
+	return kvmalloc(n * size, flags);
+}
+
 extern void kvfree(const void *addr);
 
 static inline atomic_t *compound_mapcount_ptr(struct page *page)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e68604ae3ced..4b2b3f5b3380 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -965,10 +965,7 @@ EXPORT_SYMBOL(iov_iter_get_pages);
 
 static struct page **get_pages_array(size_t n)
 {
-	struct page **p = kmalloc(n * sizeof(struct page *), GFP_KERNEL);
-	if (!p)
-		p = vmalloc(n * sizeof(struct page *));
-	return p;
+	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
 }
 
 static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index db77dcb38afd..72ebec18629c 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -200,10 +200,7 @@ struct frame_vector *frame_vector_create(unsigned int nr_frames)
 	 * Avoid higher order allocations, use vmalloc instead. It should
 	 * be rare anyway.
 	 */
-	if (size <= PAGE_SIZE)
-		vec = kmalloc(size, GFP_KERNEL);
-	else
-		vec = vmalloc(size);
+	vec = kvmalloc(size, GFP_KERNEL);
 	if (!vec)
 		return NULL;
 	vec->nr_allocated = nr_frames;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 8bea74298173..e9a59d2d91d4 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -678,11 +678,7 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
 		/* no more locks than number of hash buckets */
 		nblocks = min(nblocks, hashinfo->ehash_mask + 1);
 
-		hashinfo->ehash_locks =	kmalloc_array(nblocks, locksz,
-						      GFP_KERNEL | __GFP_NOWARN);
-		if (!hashinfo->ehash_locks)
-			hashinfo->ehash_locks = vmalloc(nblocks * locksz);
-
+		hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
 		if (!hashinfo->ehash_locks)
 			return -ENOMEM;
 
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index b9ed0d50aead..d6992003bb8c 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1153,10 +1153,7 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	tcp_metrics_hash_log = order_base_2(slots);
 	size = sizeof(struct tcpm_hash_bucket) << tcp_metrics_hash_log;
 
-	tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!tcp_metrics_hash)
-		tcp_metrics_hash = vzalloc(size);
-
+	tcp_metrics_hash = kvzalloc(size, GFP_KERNEL);
 	if (!tcp_metrics_hash)
 		return -ENOMEM;
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 64d3bf269a26..1b8dad9dc7fd 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1653,10 +1653,7 @@ static int resize_platform_label_table(struct net *net, size_t limit)
 	unsigned index;
 
 	if (size) {
-		labels = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
-		if (!labels)
-			labels = vzalloc(size);
-
+		labels = kvzalloc(size, GFP_KERNEL);
 		if (!labels)
 			goto nolabels;
 	}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 14857afc9937..d529989f5791 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -763,17 +763,8 @@ EXPORT_SYMBOL(xt_check_entry_offsets);
  */
 unsigned int *xt_alloc_entry_offsets(unsigned int size)
 {
-	unsigned int *off;
+	return kvmalloc_array(size, sizeof(unsigned int), GFP_KERNEL | __GFP_ZERO);
 
-	off = kcalloc(size, sizeof(unsigned int), GFP_KERNEL | __GFP_NOWARN);
-
-	if (off)
-		return off;
-
-	if (size < (SIZE_MAX / sizeof(unsigned int)))
-		off = vmalloc(size * sizeof(unsigned int));
-
-	return off;
 }
 EXPORT_SYMBOL(xt_alloc_entry_offsets);
 
@@ -1114,7 +1105,7 @@ static int xt_jumpstack_alloc(struct xt_table_info *i)
 
 	size = sizeof(void **) * nr_cpu_ids;
 	if (size > PAGE_SIZE)
-		i->jumpstack = vzalloc(size);
+		i->jumpstack = kvzalloc(size, GFP_KERNEL);
 	else
 		i->jumpstack = kzalloc(size, GFP_KERNEL);
 	if (i->jumpstack == NULL)
@@ -1136,12 +1127,8 @@ static int xt_jumpstack_alloc(struct xt_table_info *i)
 	 */
 	size = sizeof(void *) * i->stacksize * 2u;
 	for_each_possible_cpu(cpu) {
-		if (size > PAGE_SIZE)
-			i->jumpstack[cpu] = vmalloc_node(size,
-				cpu_to_node(cpu));
-		else
-			i->jumpstack[cpu] = kmalloc_node(size,
-				GFP_KERNEL, cpu_to_node(cpu));
+		i->jumpstack[cpu] = kvmalloc_node(size, GFP_KERNEL,
+			cpu_to_node(cpu));
 		if (i->jumpstack[cpu] == NULL)
 			/*
 			 * Freeing will be done later on by the callers. The
diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index 1d89a4eaf841..d6aa8f63ed2e 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -388,10 +388,7 @@ static int recent_mt_check(const struct xt_mtchk_param *par,
 	}
 
 	sz = sizeof(*t) + sizeof(t->iphash[0]) * ip_list_hash_size;
-	if (sz <= PAGE_SIZE)
-		t = kzalloc(sz, GFP_KERNEL);
-	else
-		t = vzalloc(sz);
+	t = kvzalloc(sz, GFP_KERNEL);
 	if (t == NULL) {
 		ret = -ENOMEM;
 		goto out;
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index 3b6d5bd69101..47cbfae44898 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -431,10 +431,7 @@ static int choke_change(struct Qdisc *sch, struct nlattr *opt)
 	if (mask != q->tab_mask) {
 		struct sk_buff **ntab;
 
-		ntab = kcalloc(mask + 1, sizeof(struct sk_buff *),
-			       GFP_KERNEL | __GFP_NOWARN);
-		if (!ntab)
-			ntab = vzalloc((mask + 1) * sizeof(struct sk_buff *));
+		ntab = kvmalloc_array((mask + 1), sizeof(struct sk_buff *), GFP_KERNEL | __GFP_ZERO);
 		if (!ntab)
 			return -ENOMEM;
 
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 2f50e4c72fb4..67d8dfc80bbb 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -446,27 +446,13 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt)
 	return 0;
 }
 
-static void *fq_codel_zalloc(size_t sz)
-{
-	void *ptr = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!ptr)
-		ptr = vzalloc(sz);
-	return ptr;
-}
-
-static void fq_codel_free(void *addr)
-{
-	kvfree(addr);
-}
-
 static void fq_codel_destroy(struct Qdisc *sch)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 
 	tcf_destroy_chain(&q->filter_list);
-	fq_codel_free(q->backlogs);
-	fq_codel_free(q->flows);
+	kvfree(q->backlogs);
+	kvfree(q->flows);
 }
 
 static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
@@ -493,13 +479,13 @@ static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
 	}
 
 	if (!q->flows) {
-		q->flows = fq_codel_zalloc(q->flows_cnt *
-					   sizeof(struct fq_codel_flow));
+		q->flows = kvzalloc(q->flows_cnt *
+					   sizeof(struct fq_codel_flow), GFP_KERNEL);
 		if (!q->flows)
 			return -ENOMEM;
-		q->backlogs = fq_codel_zalloc(q->flows_cnt * sizeof(u32));
+		q->backlogs = kvzalloc(q->flows_cnt * sizeof(u32), GFP_KERNEL);
 		if (!q->backlogs) {
-			fq_codel_free(q->flows);
+			kvfree(q->flows);
 			return -ENOMEM;
 		}
 		for (i = 0; i < q->flows_cnt; i++) {
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index e3d0458af17b..2454055c737e 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -467,29 +467,14 @@ static void hhf_reset(struct Qdisc *sch)
 		rtnl_kfree_skbs(skb, skb);
 }
 
-static void *hhf_zalloc(size_t sz)
-{
-	void *ptr = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!ptr)
-		ptr = vzalloc(sz);
-
-	return ptr;
-}
-
-static void hhf_free(void *addr)
-{
-	kvfree(addr);
-}
-
 static void hhf_destroy(struct Qdisc *sch)
 {
 	int i;
 	struct hhf_sched_data *q = qdisc_priv(sch);
 
 	for (i = 0; i < HHF_ARRAYS_CNT; i++) {
-		hhf_free(q->hhf_arrays[i]);
-		hhf_free(q->hhf_valid_bits[i]);
+		kvfree(q->hhf_arrays[i]);
+		kvfree(q->hhf_valid_bits[i]);
 	}
 
 	for (i = 0; i < HH_FLOWS_CNT; i++) {
@@ -503,7 +488,7 @@ static void hhf_destroy(struct Qdisc *sch)
 			kfree(flow);
 		}
 	}
-	hhf_free(q->hh_flows);
+	kvfree(q->hh_flows);
 }
 
 static const struct nla_policy hhf_policy[TCA_HHF_MAX + 1] = {
@@ -609,8 +594,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 
 	if (!q->hh_flows) {
 		/* Initialize heavy-hitter flow table. */
-		q->hh_flows = hhf_zalloc(HH_FLOWS_CNT *
-					 sizeof(struct list_head));
+		q->hh_flows = kvzalloc(HH_FLOWS_CNT *
+					 sizeof(struct list_head), GFP_KERNEL);
 		if (!q->hh_flows)
 			return -ENOMEM;
 		for (i = 0; i < HH_FLOWS_CNT; i++)
@@ -624,8 +609,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 
 		/* Initialize heavy-hitter filter arrays. */
 		for (i = 0; i < HHF_ARRAYS_CNT; i++) {
-			q->hhf_arrays[i] = hhf_zalloc(HHF_ARRAYS_LEN *
-						      sizeof(u32));
+			q->hhf_arrays[i] = kvzalloc(HHF_ARRAYS_LEN *
+						      sizeof(u32), GFP_KERNEL);
 			if (!q->hhf_arrays[i]) {
 				hhf_destroy(sch);
 				return -ENOMEM;
@@ -635,8 +620,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 
 		/* Initialize valid bits of heavy-hitter filter arrays. */
 		for (i = 0; i < HHF_ARRAYS_CNT; i++) {
-			q->hhf_valid_bits[i] = hhf_zalloc(HHF_ARRAYS_LEN /
-							  BITS_PER_BYTE);
+			q->hhf_valid_bits[i] = kvzalloc(HHF_ARRAYS_LEN /
+							  BITS_PER_BYTE, GFP_KERNEL);
 			if (!q->hhf_valid_bits[i]) {
 				hhf_destroy(sch);
 				return -ENOMEM;
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index c8bb62a1e744..7abc01ef537c 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -692,15 +692,11 @@ static int get_dist_table(struct Qdisc *sch, const struct nlattr *attr)
 	spinlock_t *root_lock;
 	struct disttable *d;
 	int i;
-	size_t s;
 
 	if (n > NETEM_DIST_MAX)
 		return -EINVAL;
 
-	s = sizeof(struct disttable) + n * sizeof(s16);
-	d = kmalloc(s, GFP_KERNEL | __GFP_NOWARN);
-	if (!d)
-		d = vmalloc(s);
+	d = kvmalloc(sizeof(struct disttable) + n * sizeof(s16), GFP_KERNEL);
 	if (!d)
 		return -ENOMEM;
 
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 7f195ed4d568..5d70cd6a032d 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -684,11 +684,7 @@ static int sfq_change(struct Qdisc *sch, struct nlattr *opt)
 
 static void *sfq_alloc(size_t sz)
 {
-	void *ptr = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!ptr)
-		ptr = vmalloc(sz);
-	return ptr;
+	return  kvmalloc(sz, GFP_KERNEL);
 }
 
 static void sfq_free(void *addr)
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 38c00e867bda..a5c21f05ece4 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -99,14 +99,9 @@ SYSCALL_DEFINE5(add_key, const char __user *, _type,
 
 	if (_payload) {
 		ret = -ENOMEM;
-		payload = kmalloc(plen, GFP_KERNEL | __GFP_NOWARN);
-		if (!payload) {
-			if (plen <= PAGE_SIZE)
-				goto error2;
-			payload = vmalloc(plen);
-			if (!payload)
-				goto error2;
-		}
+		payload = kvmalloc(plen, GFP_KERNEL);
+		if (!payload)
+			goto error2;
 
 		ret = -EFAULT;
 		if (copy_from_user(payload, _payload, plen) != 0)
@@ -1064,14 +1059,9 @@ long keyctl_instantiate_key_common(key_serial_t id,
 
 	if (from) {
 		ret = -ENOMEM;
-		payload = kmalloc(plen, GFP_KERNEL);
-		if (!payload) {
-			if (plen <= PAGE_SIZE)
-				goto error;
-			payload = vmalloc(plen);
-			if (!payload)
-				goto error;
-		}
+		payload = kvmalloc(plen, GFP_KERNEL);
+		if (!payload)
+			goto error;
 
 		ret = -EFAULT;
 		if (!copy_from_iter_full(payload, plen, from))
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 10:21   ` Leon Romanovsky
                     ` (2 more replies)
  0 siblings, 3 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Martin Schwidefsky,
	Heiko Carstens, Herbert Xu, Anton Vorontsov, Colin Cross,
	Kees Cook, Tony Luck, Rafael J. Wysocki, Ben Skeggs,
	Kent Overstreet, Santosh Raspatur, Hariprasad S, Yishai Hadas,
	Oleg Drokin, Yan, Zheng, Alexei Starovoitov, Eric Dumazet,
	netdev

From: Michal Hocko <mhocko@suse.com>

There are many code paths opencoding kvmalloc. Let's use the helper
instead. The main difference to kvmalloc is that those users are usually
not considering all the aspects of the memory allocator. E.g. allocation
requests <= 32kB (with 4kB pages) are basically never failing and invoke
OOM killer to satisfy the allocation. This sounds too disruptive for
something that has a reasonable fallback - the vmalloc. On the other
hand those requests might fallback to vmalloc even when the memory
allocator would succeed after several more reclaim/compaction attempts
previously. There is no guarantee something like that happens though.

This patch converts many of those places to kv[mz]alloc* helpers because
they are more conservative.

Changes since v1
- add kvmalloc_array - this might silently fix some overflow issues
  because most users simply didn't check the overflow for the vmalloc
  fallback.

Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Santosh Raspatur <santosh@chelsio.com>
Cc: Hariprasad S <hariprasad@chelsio.com>
Cc: Yishai Hadas <yishaih@mellanox.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: netdev@vger.kernel.org
Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
Acked-by: David Sterba <dsterba@suse.com> # btrfs
Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/s390/kvm/kvm-s390.c                           | 10 ++-----
 crypto/lzo.c                                       |  4 +--
 drivers/acpi/apei/erst.c                           |  8 ++----
 drivers/char/agp/generic.c                         |  8 +-----
 drivers/gpu/drm/nouveau/nouveau_gem.c              |  4 +--
 drivers/md/bcache/util.h                           | 12 ++------
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h    |  3 --
 drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c | 29 +++----------------
 drivers/net/ethernet/chelsio/cxgb3/l2t.c           |  8 +-----
 drivers/net/ethernet/chelsio/cxgb3/l2t.h           |  1 -
 drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c      | 12 ++++----
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |  3 --
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 10 +++----
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c |  8 +++---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 31 ++++----------------
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c  | 13 ++++-----
 drivers/net/ethernet/chelsio/cxgb4/l2t.c           |  2 +-
 drivers/net/ethernet/chelsio/cxgb4/sched.c         | 12 ++++----
 drivers/net/ethernet/mellanox/mlx4/en_tx.c         |  9 ++----
 drivers/net/ethernet/mellanox/mlx4/mr.c            |  9 ++----
 drivers/nvdimm/dimm_devs.c                         |  5 +---
 .../staging/lustre/lnet/libcfs/linux/linux-mem.c   | 11 +-------
 drivers/xen/evtchn.c                               | 14 +--------
 fs/btrfs/ctree.c                                   |  9 ++----
 fs/btrfs/ioctl.c                                   |  9 ++----
 fs/btrfs/send.c                                    | 27 ++++++------------
 fs/ceph/file.c                                     |  9 ++----
 fs/select.c                                        |  5 +---
 fs/xattr.c                                         | 27 ++++++------------
 include/linux/mlx5/driver.h                        |  7 +----
 include/linux/mm.h                                 |  8 ++++++
 lib/iov_iter.c                                     |  5 +---
 mm/frame_vector.c                                  |  5 +---
 net/ipv4/inet_hashtables.c                         |  6 +---
 net/ipv4/tcp_metrics.c                             |  5 +---
 net/mpls/af_mpls.c                                 |  5 +---
 net/netfilter/x_tables.c                           | 21 +++-----------
 net/netfilter/xt_recent.c                          |  5 +---
 net/sched/sch_choke.c                              |  5 +---
 net/sched/sch_fq_codel.c                           | 26 ++++-------------
 net/sched/sch_hhf.c                                | 33 ++++++----------------
 net/sched/sch_netem.c                              |  6 +---
 net/sched/sch_sfq.c                                |  6 +---
 security/keys/keyctl.c                             | 22 ++++-----------
 44 files changed, 127 insertions(+), 350 deletions(-)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index dabd3b15bf11..ecc09b6648f3 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1159,10 +1159,7 @@ static long kvm_s390_get_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 	if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX)
 		return -EINVAL;
 
-	keys = kmalloc_array(args->count, sizeof(uint8_t),
-			     GFP_KERNEL | __GFP_NOWARN);
-	if (!keys)
-		keys = vmalloc(sizeof(uint8_t) * args->count);
+	keys = kvmalloc_array(args->count, sizeof(uint8_t), GFP_KERNEL);
 	if (!keys)
 		return -ENOMEM;
 
@@ -1204,10 +1201,7 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 	if (args->count < 1 || args->count > KVM_S390_SKEYS_MAX)
 		return -EINVAL;
 
-	keys = kmalloc_array(args->count, sizeof(uint8_t),
-			     GFP_KERNEL | __GFP_NOWARN);
-	if (!keys)
-		keys = vmalloc(sizeof(uint8_t) * args->count);
+	keys = kvmalloc_array(args->count, sizeof(uint8_t), GFP_KERNEL);
 	if (!keys)
 		return -ENOMEM;
 
diff --git a/crypto/lzo.c b/crypto/lzo.c
index 168df784da84..218567d717d6 100644
--- a/crypto/lzo.c
+++ b/crypto/lzo.c
@@ -32,9 +32,7 @@ static void *lzo_alloc_ctx(struct crypto_scomp *tfm)
 {
 	void *ctx;
 
-	ctx = kmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL | __GFP_NOWARN);
-	if (!ctx)
-		ctx = vmalloc(LZO1X_MEM_COMPRESS);
+	ctx = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL);
 	if (!ctx)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/acpi/apei/erst.c b/drivers/acpi/apei/erst.c
index ec4f507b524f..a2898df61744 100644
--- a/drivers/acpi/apei/erst.c
+++ b/drivers/acpi/apei/erst.c
@@ -513,7 +513,7 @@ static int __erst_record_id_cache_add_one(void)
 	if (i < erst_record_id_cache.len)
 		goto retry;
 	if (erst_record_id_cache.len >= erst_record_id_cache.size) {
-		int new_size, alloc_size;
+		int new_size;
 		u64 *new_entries;
 
 		new_size = erst_record_id_cache.size * 2;
@@ -524,11 +524,7 @@ static int __erst_record_id_cache_add_one(void)
 				pr_warn(FW_WARN "too many record IDs!\n");
 			return 0;
 		}
-		alloc_size = new_size * sizeof(entries[0]);
-		if (alloc_size < PAGE_SIZE)
-			new_entries = kmalloc(alloc_size, GFP_KERNEL);
-		else
-			new_entries = vmalloc(alloc_size);
+		new_entries = kvmalloc(new_size * sizeof(entries[0]), GFP_KERNEL);
 		if (!new_entries)
 			return -ENOMEM;
 		memcpy(new_entries, entries,
diff --git a/drivers/char/agp/generic.c b/drivers/char/agp/generic.c
index f002fa5d1887..bdf418cac8ef 100644
--- a/drivers/char/agp/generic.c
+++ b/drivers/char/agp/generic.c
@@ -88,13 +88,7 @@ static int agp_get_key(void)
 
 void agp_alloc_page_array(size_t size, struct agp_memory *mem)
 {
-	mem->pages = NULL;
-
-	if (size <= 2*PAGE_SIZE)
-		mem->pages = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (mem->pages == NULL) {
-		mem->pages = vmalloc(size);
-	}
+	mem->pages = kvmalloc(size, GFP_KERNEL);
 }
 EXPORT_SYMBOL(agp_alloc_page_array);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index 201b52b750dd..77dd73ff126f 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -568,9 +568,7 @@ u_memcpya(uint64_t user, unsigned nmemb, unsigned size)
 
 	size *= nmemb;
 
-	mem = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!mem)
-		mem = vmalloc(size);
+	mem = kvmalloc(size, GFP_KERNEL);
 	if (!mem)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index cf2cbc211d83..d00bcb64d3a8 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -43,11 +43,7 @@ struct closure;
 	(heap)->used = 0;						\
 	(heap)->size = (_size);						\
 	_bytes = (heap)->size * sizeof(*(heap)->data);			\
-	(heap)->data = NULL;						\
-	if (_bytes < KMALLOC_MAX_SIZE)					\
-		(heap)->data = kmalloc(_bytes, (gfp));			\
-	if ((!(heap)->data) && ((gfp) & GFP_KERNEL))			\
-		(heap)->data = vmalloc(_bytes);				\
+	(heap)->data = kvmalloc(_bytes, (gfp) & GFP_KERNEL);		\
 	(heap)->data;							\
 })
 
@@ -136,12 +132,8 @@ do {									\
 									\
 	(fifo)->mask = _allocated_size - 1;				\
 	(fifo)->front = (fifo)->back = 0;				\
-	(fifo)->data = NULL;						\
 									\
-	if (_bytes < KMALLOC_MAX_SIZE)					\
-		(fifo)->data = kmalloc(_bytes, (gfp));			\
-	if ((!(fifo)->data) && ((gfp) & GFP_KERNEL))			\
-		(fifo)->data = vmalloc(_bytes);				\
+	(fifo)->data = kvmalloc(_bytes, (gfp) & GFP_KERNEL);		\
 	(fifo)->data;							\
 })
 
diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h
index 920d918ed193..f04e81f33795 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_defs.h
@@ -41,9 +41,6 @@
 
 #define VALIDATE_TID 1
 
-void *cxgb_alloc_mem(unsigned long size);
-void cxgb_free_mem(void *addr);
-
 /*
  * Map an ATID or STID to their entries in the corresponding TID tables.
  */
diff --git a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
index 76684dcb874c..fa81445e334c 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/cxgb3_offload.c
@@ -1152,27 +1152,6 @@ static void cxgb_redirect(struct dst_entry *old, struct dst_entry *new,
 }
 
 /*
- * Allocate a chunk of memory using kmalloc or, if that fails, vmalloc.
- * The allocated memory is cleared.
- */
-void *cxgb_alloc_mem(unsigned long size)
-{
-	void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!p)
-		p = vzalloc(size);
-	return p;
-}
-
-/*
- * Free memory allocated through t3_alloc_mem().
- */
-void cxgb_free_mem(void *addr)
-{
-	kvfree(addr);
-}
-
-/*
  * Allocate and initialize the TID tables.  Returns 0 on success.
  */
 static int init_tid_tabs(struct tid_info *t, unsigned int ntids,
@@ -1182,7 +1161,7 @@ static int init_tid_tabs(struct tid_info *t, unsigned int ntids,
 	unsigned long size = ntids * sizeof(*t->tid_tab) +
 	    natids * sizeof(*t->atid_tab) + nstids * sizeof(*t->stid_tab);
 
-	t->tid_tab = cxgb_alloc_mem(size);
+	t->tid_tab = kvzalloc(size, GFP_KERNEL);
 	if (!t->tid_tab)
 		return -ENOMEM;
 
@@ -1218,7 +1197,7 @@ static int init_tid_tabs(struct tid_info *t, unsigned int ntids,
 
 static void free_tid_maps(struct tid_info *t)
 {
-	cxgb_free_mem(t->tid_tab);
+	kvfree(t->tid_tab);
 }
 
 static inline void add_adapter(struct adapter *adap)
@@ -1293,7 +1272,7 @@ int cxgb3_offload_activate(struct adapter *adapter)
 	return 0;
 
 out_free_l2t:
-	t3_free_l2t(l2td);
+	kvfree(l2td);
 out_free:
 	kfree(t);
 	return err;
@@ -1302,7 +1281,7 @@ int cxgb3_offload_activate(struct adapter *adapter)
 static void clean_l2_data(struct rcu_head *head)
 {
 	struct l2t_data *d = container_of(head, struct l2t_data, rcu_head);
-	t3_free_l2t(d);
+	kvfree(d);
 }
 
 
diff --git a/drivers/net/ethernet/chelsio/cxgb3/l2t.c b/drivers/net/ethernet/chelsio/cxgb3/l2t.c
index 5f226eda8cd6..f93160271917 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/l2t.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/l2t.c
@@ -444,7 +444,7 @@ struct l2t_data *t3_init_l2t(unsigned int l2t_capacity)
 	struct l2t_data *d;
 	int i, size = sizeof(*d) + l2t_capacity * sizeof(struct l2t_entry);
 
-	d = cxgb_alloc_mem(size);
+	d = kvzalloc(size, GFP_KERNEL);
 	if (!d)
 		return NULL;
 
@@ -462,9 +462,3 @@ struct l2t_data *t3_init_l2t(unsigned int l2t_capacity)
 	}
 	return d;
 }
-
-void t3_free_l2t(struct l2t_data *d)
-{
-	cxgb_free_mem(d);
-}
-
diff --git a/drivers/net/ethernet/chelsio/cxgb3/l2t.h b/drivers/net/ethernet/chelsio/cxgb3/l2t.h
index 8cffcdfd5678..c2fd323c4078 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/l2t.h
+++ b/drivers/net/ethernet/chelsio/cxgb3/l2t.h
@@ -115,7 +115,6 @@ int t3_l2t_send_slow(struct t3cdev *dev, struct sk_buff *skb,
 		     struct l2t_entry *e);
 void t3_l2t_send_event(struct t3cdev *dev, struct l2t_entry *e);
 struct l2t_data *t3_init_l2t(unsigned int l2t_capacity);
-void t3_free_l2t(struct l2t_data *d);
 
 int cxgb3_ofld_send(struct t3cdev *dev, struct sk_buff *skb);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
index 7ad43af6bde1..3103ef9b561d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/clip_tbl.c
@@ -290,8 +290,8 @@ struct clip_tbl *t4_init_clip_tbl(unsigned int clipt_start,
 	if (clipt_size < CLIPT_MIN_HASH_BUCKETS)
 		return NULL;
 
-	ctbl = t4_alloc_mem(sizeof(*ctbl) +
-			    clipt_size*sizeof(struct list_head));
+	ctbl = kvzalloc(sizeof(*ctbl) +
+			    clipt_size*sizeof(struct list_head), GFP_KERNEL);
 	if (!ctbl)
 		return NULL;
 
@@ -305,9 +305,9 @@ struct clip_tbl *t4_init_clip_tbl(unsigned int clipt_start,
 	for (i = 0; i < ctbl->clipt_size; ++i)
 		INIT_LIST_HEAD(&ctbl->hash_list[i]);
 
-	cl_list = t4_alloc_mem(clipt_size*sizeof(struct clip_entry));
+	cl_list = kvzalloc(clipt_size*sizeof(struct clip_entry), GFP_KERNEL);
 	if (!cl_list) {
-		t4_free_mem(ctbl);
+		kvfree(ctbl);
 		return NULL;
 	}
 	ctbl->cl_list = (void *)cl_list;
@@ -326,8 +326,8 @@ void t4_cleanup_clip_tbl(struct adapter *adap)
 
 	if (ctbl) {
 		if (ctbl->cl_list)
-			t4_free_mem(ctbl->cl_list);
-		t4_free_mem(ctbl);
+			kvfree(ctbl->cl_list);
+		kvfree(ctbl);
 	}
 }
 EXPORT_SYMBOL(t4_cleanup_clip_tbl);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index ccb455f14d08..129bd41a051c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1296,8 +1296,6 @@ extern const char cxgb4_driver_version[];
 void t4_os_portmod_changed(const struct adapter *adap, int port_id);
 void t4_os_link_changed(struct adapter *adap, int port_id, int link_stat);
 
-void *t4_alloc_mem(size_t size);
-
 void t4_free_sge_resources(struct adapter *adap);
 void t4_free_ofld_rxqs(struct adapter *adap, int n, struct sge_ofld_rxq *q);
 irq_handler_t t4_intr_handler(struct adapter *adap);
@@ -1670,7 +1668,6 @@ int t4_sched_params(struct adapter *adapter, int type, int level, int mode,
 		    int rateunit, int ratemode, int channel, int class,
 		    int minrate, int maxrate, int weight, int pktsize);
 void t4_sge_decode_idma_state(struct adapter *adapter, int state);
-void t4_free_mem(void *addr);
 void t4_idma_monitor_init(struct adapter *adapter,
 			  struct sge_idma_monitor_state *idma);
 void t4_idma_monitor(struct adapter *adapter,
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index f6e739da7bb7..1fa34b009891 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -2634,7 +2634,7 @@ static ssize_t mem_read(struct file *file, char __user *buf, size_t count,
 	if (count > avail - pos)
 		count = avail - pos;
 
-	data = t4_alloc_mem(count);
+	data = kvzalloc(count, GFP_KERNEL);
 	if (!data)
 		return -ENOMEM;
 
@@ -2642,12 +2642,12 @@ static ssize_t mem_read(struct file *file, char __user *buf, size_t count,
 	ret = t4_memory_rw(adap, 0, mem, pos, count, data, T4_MEMORY_READ);
 	spin_unlock(&adap->win0_lock);
 	if (ret) {
-		t4_free_mem(data);
+		kvfree(data);
 		return ret;
 	}
 	ret = copy_to_user(buf, data, count);
 
-	t4_free_mem(data);
+	kvfree(data);
 	if (ret)
 		return -EFAULT;
 
@@ -2753,7 +2753,7 @@ static ssize_t blocked_fl_read(struct file *filp, char __user *ubuf,
 		       adap->sge.egr_sz, adap->sge.blocked_fl);
 	len += sprintf(buf + len, "\n");
 	size = simple_read_from_buffer(ubuf, count, ppos, buf, len);
-	t4_free_mem(buf);
+	kvfree(buf);
 	return size;
 }
 
@@ -2773,7 +2773,7 @@ static ssize_t blocked_fl_write(struct file *filp, const char __user *ubuf,
 		return err;
 
 	bitmap_copy(adap->sge.blocked_fl, t, adap->sge.egr_sz);
-	t4_free_mem(t);
+	kvfree(t);
 	return count;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
index 02f80febeb91..0ba7866c8259 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
@@ -969,7 +969,7 @@ static int get_eeprom(struct net_device *dev, struct ethtool_eeprom *e,
 {
 	int i, err = 0;
 	struct adapter *adapter = netdev2adap(dev);
-	u8 *buf = t4_alloc_mem(EEPROMSIZE);
+	u8 *buf = kvzalloc(EEPROMSIZE, GFP_KERNEL);
 
 	if (!buf)
 		return -ENOMEM;
@@ -980,7 +980,7 @@ static int get_eeprom(struct net_device *dev, struct ethtool_eeprom *e,
 
 	if (!err)
 		memcpy(data, buf + e->offset, e->len);
-	t4_free_mem(buf);
+	kvfree(buf);
 	return err;
 }
 
@@ -1009,7 +1009,7 @@ static int set_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom,
 	if (aligned_offset != eeprom->offset || aligned_len != eeprom->len) {
 		/* RMW possibly needed for first or last words.
 		 */
-		buf = t4_alloc_mem(aligned_len);
+		buf = kvzalloc(aligned_len, GFP_KERNEL);
 		if (!buf)
 			return -ENOMEM;
 		err = eeprom_rd_phys(adapter, aligned_offset, (u32 *)buf);
@@ -1037,7 +1037,7 @@ static int set_eeprom(struct net_device *dev, struct ethtool_eeprom *eeprom,
 		err = t4_seeprom_wp(adapter, true);
 out:
 	if (buf != data)
-		t4_free_mem(buf);
+		kvfree(buf);
 	return err;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 49e000ebd2b9..a9f55ec139bb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -887,27 +887,6 @@ static int setup_sge_queues(struct adapter *adap)
 	return err;
 }
 
-/*
- * Allocate a chunk of memory using kmalloc or, if that fails, vmalloc.
- * The allocated memory is cleared.
- */
-void *t4_alloc_mem(size_t size)
-{
-	void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!p)
-		p = vzalloc(size);
-	return p;
-}
-
-/*
- * Free memory allocated through alloc_mem().
- */
-void t4_free_mem(void *addr)
-{
-	kvfree(addr);
-}
-
 static u16 cxgb_select_queue(struct net_device *dev, struct sk_buff *skb,
 			     void *accel_priv, select_queue_fallback_t fallback)
 {
@@ -1306,7 +1285,7 @@ static int tid_init(struct tid_info *t)
 	       max_ftids * sizeof(*t->ftid_tab) +
 	       ftid_bmap_size * sizeof(long);
 
-	t->tid_tab = t4_alloc_mem(size);
+	t->tid_tab = kvzalloc(size, GFP_KERNEL);
 	if (!t->tid_tab)
 		return -ENOMEM;
 
@@ -3451,7 +3430,7 @@ static int adap_init0(struct adapter *adap)
 		/* allocate memory to read the header of the firmware on the
 		 * card
 		 */
-		card_fw = t4_alloc_mem(sizeof(*card_fw));
+		card_fw = kvzalloc(sizeof(*card_fw), GFP_KERNEL);
 
 		/* Get FW from from /lib/firmware/ */
 		ret = request_firmware(&fw, fw_info->fw_mod_name,
@@ -3471,7 +3450,7 @@ static int adap_init0(struct adapter *adap)
 
 		/* Cleaning up */
 		release_firmware(fw);
-		t4_free_mem(card_fw);
+		kvfree(card_fw);
 
 		if (ret < 0)
 			goto bye;
@@ -4467,9 +4446,9 @@ static void free_some_resources(struct adapter *adapter)
 {
 	unsigned int i;
 
-	t4_free_mem(adapter->l2t);
+	kvfree(adapter->l2t);
 	t4_cleanup_sched(adapter);
-	t4_free_mem(adapter->tids.tid_tab);
+	kvfree(adapter->tids.tid_tab);
 	cxgb4_cleanup_tc_u32(adapter);
 	kfree(adapter->sge.egr_map);
 	kfree(adapter->sge.ingr_map);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
index 52af62e0ecb6..127f91a747b1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_u32.c
@@ -432,9 +432,9 @@ void cxgb4_cleanup_tc_u32(struct adapter *adap)
 	for (i = 0; i < t->size; i++) {
 		struct cxgb4_link *link = &t->table[i];
 
-		t4_free_mem(link->tid_map);
+		kvfree(link->tid_map);
 	}
-	t4_free_mem(adap->tc_u32);
+	kvfree(adap->tc_u32);
 }
 
 struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
@@ -446,8 +446,7 @@ struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
 	if (!size)
 		return NULL;
 
-	t = t4_alloc_mem(sizeof(*t) +
-			 (size * sizeof(struct cxgb4_link)));
+	t = kvzalloc(sizeof(*t) + (size * sizeof(struct cxgb4_link)), GFP_KERNEL);
 	if (!t)
 		return NULL;
 
@@ -460,7 +459,7 @@ struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
 
 		max_tids = adap->tids.nftids;
 		bmap_size = BITS_TO_LONGS(max_tids);
-		link->tid_map = t4_alloc_mem(sizeof(unsigned long) * bmap_size);
+		link->tid_map = kvzalloc(sizeof(unsigned long) * bmap_size, GFP_KERNEL);
 		if (!link->tid_map)
 			goto out_no_mem;
 		bitmap_zero(link->tid_map, max_tids);
@@ -473,11 +472,11 @@ struct cxgb4_tc_u32_table *cxgb4_init_tc_u32(struct adapter *adap,
 		struct cxgb4_link *link = &t->table[i];
 
 		if (link->tid_map)
-			t4_free_mem(link->tid_map);
+			kvfree(link->tid_map);
 	}
 
 	if (t)
-		t4_free_mem(t);
+		kvfree(t);
 
 	return NULL;
 }
diff --git a/drivers/net/ethernet/chelsio/cxgb4/l2t.c b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
index 60a26037a1c6..fb593a1a751e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/l2t.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/l2t.c
@@ -646,7 +646,7 @@ struct l2t_data *t4_init_l2t(unsigned int l2t_start, unsigned int l2t_end)
 	if (l2t_size < L2T_MIN_HASH_BUCKETS)
 		return NULL;
 
-	d = t4_alloc_mem(sizeof(*d) + l2t_size * sizeof(struct l2t_entry));
+	d = kvzalloc(sizeof(*d) + l2t_size * sizeof(struct l2t_entry), GFP_KERNEL);
 	if (!d)
 		return NULL;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.c b/drivers/net/ethernet/chelsio/cxgb4/sched.c
index c9026352a842..02acff741f11 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.c
@@ -177,7 +177,7 @@ static int t4_sched_queue_unbind(struct port_info *pi, struct ch_sched_queue *p)
 		}
 
 		list_del(&qe->list);
-		t4_free_mem(qe);
+		kvfree(qe);
 		if (atomic_dec_and_test(&e->refcnt)) {
 			e->state = SCHED_STATE_UNUSED;
 			memset(&e->info, 0, sizeof(e->info));
@@ -201,7 +201,7 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 	if (p->queue < 0 || p->queue >= pi->nqsets)
 		return -ERANGE;
 
-	qe = t4_alloc_mem(sizeof(struct sched_queue_entry));
+	qe = kvzalloc(sizeof(struct sched_queue_entry), GFP_KERNEL);
 	if (!qe)
 		return -ENOMEM;
 
@@ -211,7 +211,7 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 	/* Unbind queue from any existing class */
 	err = t4_sched_queue_unbind(pi, p);
 	if (err) {
-		t4_free_mem(qe);
+		kvfree(qe);
 		goto out;
 	}
 
@@ -224,7 +224,7 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
 	spin_lock(&e->lock);
 	err = t4_sched_bind_unbind_op(pi, (void *)qe, SCHED_QUEUE, true);
 	if (err) {
-		t4_free_mem(qe);
+		kvfree(qe);
 		spin_unlock(&e->lock);
 		goto out;
 	}
@@ -512,7 +512,7 @@ struct sched_table *t4_init_sched(unsigned int sched_size)
 	struct sched_table *s;
 	unsigned int i;
 
-	s = t4_alloc_mem(sizeof(*s) + sched_size * sizeof(struct sched_class));
+	s = kvzalloc(sizeof(*s) + sched_size * sizeof(struct sched_class), GFP_KERNEL);
 	if (!s)
 		return NULL;
 
@@ -548,6 +548,6 @@ void t4_cleanup_sched(struct adapter *adap)
 				t4_sched_class_free(pi, e);
 			write_unlock(&s->rw_lock);
 		}
-		t4_free_mem(s);
+		kvfree(s);
 	}
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 5886ad78058f..a5c1b815145e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -70,13 +70,10 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 	ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
 
 	tmp = size * sizeof(struct mlx4_en_tx_info);
-	ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
+	ring->tx_info = kvmalloc_node(tmp, GFP_KERNEL, node);
 	if (!ring->tx_info) {
-		ring->tx_info = vmalloc(tmp);
-		if (!ring->tx_info) {
-			err = -ENOMEM;
-			goto err_ring;
-		}
+		err = -ENOMEM;
+		goto err_ring;
 	}
 
 	en_dbg(DRV, priv, "Allocated tx_info ring at addr:%p size:%d\n",
diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c
index 395b5463cfd9..6583d4601480 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -115,12 +115,9 @@ static int mlx4_buddy_init(struct mlx4_buddy *buddy, int max_order)
 
 	for (i = 0; i <= buddy->max_order; ++i) {
 		s = BITS_TO_LONGS(1 << (buddy->max_order - i));
-		buddy->bits[i] = kcalloc(s, sizeof (long), GFP_KERNEL | __GFP_NOWARN);
-		if (!buddy->bits[i]) {
-			buddy->bits[i] = vzalloc(s * sizeof(long));
-			if (!buddy->bits[i])
-				goto err_out_free;
-		}
+		buddy->bits[i] = kvmalloc_array(s, sizeof(long), GFP_KERNEL | __GFP_ZERO);
+		if (!buddy->bits[i])
+			goto err_out_free;
 	}
 
 	set_bit(0, buddy->bits[buddy->max_order]);
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 0eedc49e0d47..3bd332b167d9 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -102,10 +102,7 @@ int nvdimm_init_config_data(struct nvdimm_drvdata *ndd)
 		return -ENXIO;
 	}
 
-	ndd->data = kmalloc(ndd->nsarea.config_size, GFP_KERNEL);
-	if (!ndd->data)
-		ndd->data = vmalloc(ndd->nsarea.config_size);
-
+	ndd->data = kvmalloc(ndd->nsarea.config_size, GFP_KERNEL);
 	if (!ndd->data)
 		return -ENOMEM;
 
diff --git a/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c b/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c
index a6a76a681ea9..8f638267e704 100644
--- a/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c
+++ b/drivers/staging/lustre/lnet/libcfs/linux/linux-mem.c
@@ -45,15 +45,6 @@ EXPORT_SYMBOL(libcfs_kvzalloc);
 void *libcfs_kvzalloc_cpt(struct cfs_cpt_table *cptab, int cpt, size_t size,
 			  gfp_t flags)
 {
-	void *ret;
-
-	ret = kzalloc_node(size, flags | __GFP_NOWARN,
-			   cfs_cpt_spread_node(cptab, cpt));
-	if (!ret) {
-		WARN_ON(!(flags & (__GFP_FS | __GFP_HIGH)));
-		ret = vmalloc_node(size, cfs_cpt_spread_node(cptab, cpt));
-	}
-
-	return ret;
+	return kvzalloc_node(size, flags, cfs_cpt_spread_node(cptab, cpt));
 }
 EXPORT_SYMBOL(libcfs_kvzalloc_cpt);
diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 6890897a6f30..10f1ef582659 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -87,18 +87,6 @@ struct user_evtchn {
 	bool enabled;
 };
 
-static evtchn_port_t *evtchn_alloc_ring(unsigned int size)
-{
-	evtchn_port_t *ring;
-	size_t s = size * sizeof(*ring);
-
-	ring = kmalloc(s, GFP_KERNEL);
-	if (!ring)
-		ring = vmalloc(s);
-
-	return ring;
-}
-
 static void evtchn_free_ring(evtchn_port_t *ring)
 {
 	kvfree(ring);
@@ -334,7 +322,7 @@ static int evtchn_resize_ring(struct per_user_data *u)
 	else
 		new_size = 2 * u->ring_size;
 
-	new_ring = evtchn_alloc_ring(new_size);
+	new_ring = kvmalloc(new_size * sizeof(*new_ring), GFP_KERNEL);
 	if (!new_ring)
 		return -ENOMEM;
 
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 72dd200f0478..72f2cf234ce2 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5393,13 +5393,10 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 		goto out;
 	}
 
-	tmp_buf = kmalloc(fs_info->nodesize, GFP_KERNEL | __GFP_NOWARN);
+	tmp_buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
 	if (!tmp_buf) {
-		tmp_buf = vmalloc(fs_info->nodesize);
-		if (!tmp_buf) {
-			ret = -ENOMEM;
-			goto out;
-		}
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	left_path->search_commit_root = 1;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 99b1cce24eff..8f97bcc9e011 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3536,12 +3536,9 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 	u64 last_dest_end = destoff;
 
 	ret = -ENOMEM;
-	buf = kmalloc(fs_info->nodesize, GFP_KERNEL | __GFP_NOWARN);
-	if (!buf) {
-		buf = vmalloc(fs_info->nodesize);
-		if (!buf)
-			return ret;
-	}
+	buf = kvmalloc(fs_info->nodesize, GFP_KERNEL);
+	if (!buf)
+		return ret;
 
 	path = btrfs_alloc_path();
 	if (!path) {
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 712922ea64d2..abbe7c6ac773 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6271,22 +6271,16 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
 	sctx->clone_roots_cnt = arg->clone_sources_count;
 
 	sctx->send_max_size = BTRFS_SEND_BUF_SIZE;
-	sctx->send_buf = kmalloc(sctx->send_max_size, GFP_KERNEL | __GFP_NOWARN);
+	sctx->send_buf = kvmalloc(sctx->send_max_size, GFP_KERNEL);
 	if (!sctx->send_buf) {
-		sctx->send_buf = vmalloc(sctx->send_max_size);
-		if (!sctx->send_buf) {
-			ret = -ENOMEM;
-			goto out;
-		}
+		ret = -ENOMEM;
+		goto out;
 	}
 
-	sctx->read_buf = kmalloc(BTRFS_SEND_READ_SIZE, GFP_KERNEL | __GFP_NOWARN);
+	sctx->read_buf = kvmalloc(BTRFS_SEND_READ_SIZE, GFP_KERNEL);
 	if (!sctx->read_buf) {
-		sctx->read_buf = vmalloc(BTRFS_SEND_READ_SIZE);
-		if (!sctx->read_buf) {
-			ret = -ENOMEM;
-			goto out;
-		}
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	sctx->pending_dir_moves = RB_ROOT;
@@ -6307,13 +6301,10 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)
 	alloc_size = arg->clone_sources_count * sizeof(*arg->clone_sources);
 
 	if (arg->clone_sources_count) {
-		clone_sources_tmp = kmalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN);
+		clone_sources_tmp = kvmalloc(alloc_size, GFP_KERNEL);
 		if (!clone_sources_tmp) {
-			clone_sources_tmp = vmalloc(alloc_size);
-			if (!clone_sources_tmp) {
-				ret = -ENOMEM;
-				goto out;
-			}
+			ret = -ENOMEM;
+			goto out;
 		}
 
 		ret = copy_from_user(clone_sources_tmp, arg->clone_sources,
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 045d30d26624..78b18acf33ba 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -74,12 +74,9 @@ dio_get_pages_alloc(const struct iov_iter *it, size_t nbytes,
 	align = (unsigned long)(it->iov->iov_base + it->iov_offset) &
 		(PAGE_SIZE - 1);
 	npages = calc_pages_for(align, nbytes);
-	pages = kmalloc(sizeof(*pages) * npages, GFP_KERNEL);
-	if (!pages) {
-		pages = vmalloc(sizeof(*pages) * npages);
-		if (!pages)
-			return ERR_PTR(-ENOMEM);
-	}
+	pages = kvmalloc(sizeof(*pages) * npages, GFP_KERNEL);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
 
 	for (idx = 0; idx < npages; ) {
 		size_t start;
diff --git a/fs/select.c b/fs/select.c
index 305c0daf5d67..9e8e1189eb99 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -586,10 +586,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kmalloc(alloc_size, GFP_KERNEL|__GFP_NOWARN);
-		if (!bits && alloc_size > PAGE_SIZE)
-			bits = vmalloc(alloc_size);
-
+		bits = kvmalloc(alloc_size, GFP_KERNEL);
 		if (!bits)
 			goto out_nofds;
 	}
diff --git a/fs/xattr.c b/fs/xattr.c
index 7e3317cf4045..464c94bf65f9 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -431,12 +431,9 @@ setxattr(struct dentry *d, const char __user *name, const void __user *value,
 	if (size) {
 		if (size > XATTR_SIZE_MAX)
 			return -E2BIG;
-		kvalue = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
-		if (!kvalue) {
-			kvalue = vmalloc(size);
-			if (!kvalue)
-				return -ENOMEM;
-		}
+		kvalue = kvmalloc(size, GFP_KERNEL);
+		if (!kvalue)
+			return -ENOMEM;
 		if (copy_from_user(kvalue, value, size)) {
 			error = -EFAULT;
 			goto out;
@@ -528,12 +525,9 @@ getxattr(struct dentry *d, const char __user *name, void __user *value,
 	if (size) {
 		if (size > XATTR_SIZE_MAX)
 			size = XATTR_SIZE_MAX;
-		kvalue = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-		if (!kvalue) {
-			kvalue = vmalloc(size);
-			if (!kvalue)
-				return -ENOMEM;
-		}
+		kvalue = kvzalloc(size, GFP_KERNEL);
+		if (!kvalue)
+			return -ENOMEM;
 	}
 
 	error = vfs_getxattr(d, kname, kvalue, size);
@@ -611,12 +605,9 @@ listxattr(struct dentry *d, char __user *list, size_t size)
 	if (size) {
 		if (size > XATTR_LIST_MAX)
 			size = XATTR_LIST_MAX;
-		klist = kmalloc(size, __GFP_NOWARN | GFP_KERNEL);
-		if (!klist) {
-			klist = vmalloc(size);
-			if (!klist)
-				return -ENOMEM;
-		}
+		klist = kvmalloc(size, GFP_KERNEL);
+		if (!klist)
+			return -ENOMEM;
 	}
 
 	error = vfs_listxattr(d, klist, size);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 2fcff6b4503f..48d8b66a0efe 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -890,12 +890,7 @@ static inline u16 cmdif_rev(struct mlx5_core_dev *dev)
 
 static inline void *mlx5_vzalloc(unsigned long size)
 {
-	void *rtn;
-
-	rtn = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!rtn)
-		rtn = vzalloc(size);
-	return rtn;
+	return kvzalloc(size, GFP_KERNEL);
 }
 
 static inline u32 mlx5_base_mkey(const u32 key)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0d9fdc0a2a7b..df82a2f34bc8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -512,6 +512,14 @@ static inline void *kvzalloc(size_t size, gfp_t flags)
 	return kvmalloc(size, flags | __GFP_ZERO);
 }
 
+static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
+{
+	if (size != 0 && n > SIZE_MAX / size)
+		return NULL;
+
+	return kvmalloc(n * size, flags);
+}
+
 extern void kvfree(const void *addr);
 
 static inline atomic_t *compound_mapcount_ptr(struct page *page)
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e68604ae3ced..4b2b3f5b3380 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -965,10 +965,7 @@ EXPORT_SYMBOL(iov_iter_get_pages);
 
 static struct page **get_pages_array(size_t n)
 {
-	struct page **p = kmalloc(n * sizeof(struct page *), GFP_KERNEL);
-	if (!p)
-		p = vmalloc(n * sizeof(struct page *));
-	return p;
+	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
 }
 
 static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
diff --git a/mm/frame_vector.c b/mm/frame_vector.c
index db77dcb38afd..72ebec18629c 100644
--- a/mm/frame_vector.c
+++ b/mm/frame_vector.c
@@ -200,10 +200,7 @@ struct frame_vector *frame_vector_create(unsigned int nr_frames)
 	 * Avoid higher order allocations, use vmalloc instead. It should
 	 * be rare anyway.
 	 */
-	if (size <= PAGE_SIZE)
-		vec = kmalloc(size, GFP_KERNEL);
-	else
-		vec = vmalloc(size);
+	vec = kvmalloc(size, GFP_KERNEL);
 	if (!vec)
 		return NULL;
 	vec->nr_allocated = nr_frames;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 8bea74298173..e9a59d2d91d4 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -678,11 +678,7 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
 		/* no more locks than number of hash buckets */
 		nblocks = min(nblocks, hashinfo->ehash_mask + 1);
 
-		hashinfo->ehash_locks =	kmalloc_array(nblocks, locksz,
-						      GFP_KERNEL | __GFP_NOWARN);
-		if (!hashinfo->ehash_locks)
-			hashinfo->ehash_locks = vmalloc(nblocks * locksz);
-
+		hashinfo->ehash_locks = kvmalloc_array(nblocks, locksz, GFP_KERNEL);
 		if (!hashinfo->ehash_locks)
 			return -ENOMEM;
 
diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index b9ed0d50aead..d6992003bb8c 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -1153,10 +1153,7 @@ static int __net_init tcp_net_metrics_init(struct net *net)
 	tcp_metrics_hash_log = order_base_2(slots);
 	size = sizeof(struct tcpm_hash_bucket) << tcp_metrics_hash_log;
 
-	tcp_metrics_hash = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
-	if (!tcp_metrics_hash)
-		tcp_metrics_hash = vzalloc(size);
-
+	tcp_metrics_hash = kvzalloc(size, GFP_KERNEL);
 	if (!tcp_metrics_hash)
 		return -ENOMEM;
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 64d3bf269a26..1b8dad9dc7fd 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1653,10 +1653,7 @@ static int resize_platform_label_table(struct net *net, size_t limit)
 	unsigned index;
 
 	if (size) {
-		labels = kzalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
-		if (!labels)
-			labels = vzalloc(size);
-
+		labels = kvzalloc(size, GFP_KERNEL);
 		if (!labels)
 			goto nolabels;
 	}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 14857afc9937..d529989f5791 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -763,17 +763,8 @@ EXPORT_SYMBOL(xt_check_entry_offsets);
  */
 unsigned int *xt_alloc_entry_offsets(unsigned int size)
 {
-	unsigned int *off;
+	return kvmalloc_array(size, sizeof(unsigned int), GFP_KERNEL | __GFP_ZERO);
 
-	off = kcalloc(size, sizeof(unsigned int), GFP_KERNEL | __GFP_NOWARN);
-
-	if (off)
-		return off;
-
-	if (size < (SIZE_MAX / sizeof(unsigned int)))
-		off = vmalloc(size * sizeof(unsigned int));
-
-	return off;
 }
 EXPORT_SYMBOL(xt_alloc_entry_offsets);
 
@@ -1114,7 +1105,7 @@ static int xt_jumpstack_alloc(struct xt_table_info *i)
 
 	size = sizeof(void **) * nr_cpu_ids;
 	if (size > PAGE_SIZE)
-		i->jumpstack = vzalloc(size);
+		i->jumpstack = kvzalloc(size, GFP_KERNEL);
 	else
 		i->jumpstack = kzalloc(size, GFP_KERNEL);
 	if (i->jumpstack == NULL)
@@ -1136,12 +1127,8 @@ static int xt_jumpstack_alloc(struct xt_table_info *i)
 	 */
 	size = sizeof(void *) * i->stacksize * 2u;
 	for_each_possible_cpu(cpu) {
-		if (size > PAGE_SIZE)
-			i->jumpstack[cpu] = vmalloc_node(size,
-				cpu_to_node(cpu));
-		else
-			i->jumpstack[cpu] = kmalloc_node(size,
-				GFP_KERNEL, cpu_to_node(cpu));
+		i->jumpstack[cpu] = kvmalloc_node(size, GFP_KERNEL,
+			cpu_to_node(cpu));
 		if (i->jumpstack[cpu] == NULL)
 			/*
 			 * Freeing will be done later on by the callers. The
diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index 1d89a4eaf841..d6aa8f63ed2e 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -388,10 +388,7 @@ static int recent_mt_check(const struct xt_mtchk_param *par,
 	}
 
 	sz = sizeof(*t) + sizeof(t->iphash[0]) * ip_list_hash_size;
-	if (sz <= PAGE_SIZE)
-		t = kzalloc(sz, GFP_KERNEL);
-	else
-		t = vzalloc(sz);
+	t = kvzalloc(sz, GFP_KERNEL);
 	if (t == NULL) {
 		ret = -ENOMEM;
 		goto out;
diff --git a/net/sched/sch_choke.c b/net/sched/sch_choke.c
index 3b6d5bd69101..47cbfae44898 100644
--- a/net/sched/sch_choke.c
+++ b/net/sched/sch_choke.c
@@ -431,10 +431,7 @@ static int choke_change(struct Qdisc *sch, struct nlattr *opt)
 	if (mask != q->tab_mask) {
 		struct sk_buff **ntab;
 
-		ntab = kcalloc(mask + 1, sizeof(struct sk_buff *),
-			       GFP_KERNEL | __GFP_NOWARN);
-		if (!ntab)
-			ntab = vzalloc((mask + 1) * sizeof(struct sk_buff *));
+		ntab = kvmalloc_array((mask + 1), sizeof(struct sk_buff *), GFP_KERNEL | __GFP_ZERO);
 		if (!ntab)
 			return -ENOMEM;
 
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 2f50e4c72fb4..67d8dfc80bbb 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -446,27 +446,13 @@ static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt)
 	return 0;
 }
 
-static void *fq_codel_zalloc(size_t sz)
-{
-	void *ptr = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!ptr)
-		ptr = vzalloc(sz);
-	return ptr;
-}
-
-static void fq_codel_free(void *addr)
-{
-	kvfree(addr);
-}
-
 static void fq_codel_destroy(struct Qdisc *sch)
 {
 	struct fq_codel_sched_data *q = qdisc_priv(sch);
 
 	tcf_destroy_chain(&q->filter_list);
-	fq_codel_free(q->backlogs);
-	fq_codel_free(q->flows);
+	kvfree(q->backlogs);
+	kvfree(q->flows);
 }
 
 static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
@@ -493,13 +479,13 @@ static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
 	}
 
 	if (!q->flows) {
-		q->flows = fq_codel_zalloc(q->flows_cnt *
-					   sizeof(struct fq_codel_flow));
+		q->flows = kvzalloc(q->flows_cnt *
+					   sizeof(struct fq_codel_flow), GFP_KERNEL);
 		if (!q->flows)
 			return -ENOMEM;
-		q->backlogs = fq_codel_zalloc(q->flows_cnt * sizeof(u32));
+		q->backlogs = kvzalloc(q->flows_cnt * sizeof(u32), GFP_KERNEL);
 		if (!q->backlogs) {
-			fq_codel_free(q->flows);
+			kvfree(q->flows);
 			return -ENOMEM;
 		}
 		for (i = 0; i < q->flows_cnt; i++) {
diff --git a/net/sched/sch_hhf.c b/net/sched/sch_hhf.c
index e3d0458af17b..2454055c737e 100644
--- a/net/sched/sch_hhf.c
+++ b/net/sched/sch_hhf.c
@@ -467,29 +467,14 @@ static void hhf_reset(struct Qdisc *sch)
 		rtnl_kfree_skbs(skb, skb);
 }
 
-static void *hhf_zalloc(size_t sz)
-{
-	void *ptr = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!ptr)
-		ptr = vzalloc(sz);
-
-	return ptr;
-}
-
-static void hhf_free(void *addr)
-{
-	kvfree(addr);
-}
-
 static void hhf_destroy(struct Qdisc *sch)
 {
 	int i;
 	struct hhf_sched_data *q = qdisc_priv(sch);
 
 	for (i = 0; i < HHF_ARRAYS_CNT; i++) {
-		hhf_free(q->hhf_arrays[i]);
-		hhf_free(q->hhf_valid_bits[i]);
+		kvfree(q->hhf_arrays[i]);
+		kvfree(q->hhf_valid_bits[i]);
 	}
 
 	for (i = 0; i < HH_FLOWS_CNT; i++) {
@@ -503,7 +488,7 @@ static void hhf_destroy(struct Qdisc *sch)
 			kfree(flow);
 		}
 	}
-	hhf_free(q->hh_flows);
+	kvfree(q->hh_flows);
 }
 
 static const struct nla_policy hhf_policy[TCA_HHF_MAX + 1] = {
@@ -609,8 +594,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 
 	if (!q->hh_flows) {
 		/* Initialize heavy-hitter flow table. */
-		q->hh_flows = hhf_zalloc(HH_FLOWS_CNT *
-					 sizeof(struct list_head));
+		q->hh_flows = kvzalloc(HH_FLOWS_CNT *
+					 sizeof(struct list_head), GFP_KERNEL);
 		if (!q->hh_flows)
 			return -ENOMEM;
 		for (i = 0; i < HH_FLOWS_CNT; i++)
@@ -624,8 +609,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 
 		/* Initialize heavy-hitter filter arrays. */
 		for (i = 0; i < HHF_ARRAYS_CNT; i++) {
-			q->hhf_arrays[i] = hhf_zalloc(HHF_ARRAYS_LEN *
-						      sizeof(u32));
+			q->hhf_arrays[i] = kvzalloc(HHF_ARRAYS_LEN *
+						      sizeof(u32), GFP_KERNEL);
 			if (!q->hhf_arrays[i]) {
 				hhf_destroy(sch);
 				return -ENOMEM;
@@ -635,8 +620,8 @@ static int hhf_init(struct Qdisc *sch, struct nlattr *opt)
 
 		/* Initialize valid bits of heavy-hitter filter arrays. */
 		for (i = 0; i < HHF_ARRAYS_CNT; i++) {
-			q->hhf_valid_bits[i] = hhf_zalloc(HHF_ARRAYS_LEN /
-							  BITS_PER_BYTE);
+			q->hhf_valid_bits[i] = kvzalloc(HHF_ARRAYS_LEN /
+							  BITS_PER_BYTE, GFP_KERNEL);
 			if (!q->hhf_valid_bits[i]) {
 				hhf_destroy(sch);
 				return -ENOMEM;
diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index c8bb62a1e744..7abc01ef537c 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -692,15 +692,11 @@ static int get_dist_table(struct Qdisc *sch, const struct nlattr *attr)
 	spinlock_t *root_lock;
 	struct disttable *d;
 	int i;
-	size_t s;
 
 	if (n > NETEM_DIST_MAX)
 		return -EINVAL;
 
-	s = sizeof(struct disttable) + n * sizeof(s16);
-	d = kmalloc(s, GFP_KERNEL | __GFP_NOWARN);
-	if (!d)
-		d = vmalloc(s);
+	d = kvmalloc(sizeof(struct disttable) + n * sizeof(s16), GFP_KERNEL);
 	if (!d)
 		return -ENOMEM;
 
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 7f195ed4d568..5d70cd6a032d 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -684,11 +684,7 @@ static int sfq_change(struct Qdisc *sch, struct nlattr *opt)
 
 static void *sfq_alloc(size_t sz)
 {
-	void *ptr = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN);
-
-	if (!ptr)
-		ptr = vmalloc(sz);
-	return ptr;
+	return  kvmalloc(sz, GFP_KERNEL);
 }
 
 static void sfq_free(void *addr)
diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 38c00e867bda..a5c21f05ece4 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -99,14 +99,9 @@ SYSCALL_DEFINE5(add_key, const char __user *, _type,
 
 	if (_payload) {
 		ret = -ENOMEM;
-		payload = kmalloc(plen, GFP_KERNEL | __GFP_NOWARN);
-		if (!payload) {
-			if (plen <= PAGE_SIZE)
-				goto error2;
-			payload = vmalloc(plen);
-			if (!payload)
-				goto error2;
-		}
+		payload = kvmalloc(plen, GFP_KERNEL);
+		if (!payload)
+			goto error2;
 
 		ret = -EFAULT;
 		if (copy_from_user(payload, _payload, plen) != 0)
@@ -1064,14 +1059,9 @@ long keyctl_instantiate_key_common(key_serial_t id,
 
 	if (from) {
 		ret = -ENOMEM;
-		payload = kmalloc(plen, GFP_KERNEL);
-		if (!payload) {
-			if (plen <= PAGE_SIZE)
-				goto error;
-			payload = vmalloc(plen);
-			if (!payload)
-				goto error;
-		}
+		payload = kvmalloc(plen, GFP_KERNEL);
+		if (!payload)
+			goto error;
 
 		ret = -EFAULT;
 		if (!copy_from_iter_full(payload, plen, from))
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 6/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
                   ` (4 preceding siblings ...)
  2017-01-30  9:49 ` [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants Michal Hocko
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 16:36   ` Vlastimil Babka
  2017-01-30  9:49 ` [PATCH 7/9] md: use kvmalloc rather than opencoded variant Michal Hocko
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Eric Dumazet, netdev

From: Michal Hocko <mhocko@suse.com>

fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc
with vmalloc fallback. Use the kvmalloc variant instead. Keep the
__GFP_REPEAT flag based on explanation from Eric:
"
At the time, tests on the hardware I had in my labs showed that
vmalloc() could deliver pages spread all over the memory and that was a
small penalty (once memory is fragmented enough, not at boot time)
"

The way how the code is constructed means, however, that we prefer to go
and hit the OOM killer before we fall back to the vmalloc for requests
<=32kB (with 4kB pages) in the current code. This is rather disruptive for
something that can be achived with the fallback. On the other hand
__GFP_REPEAT doesn't have any useful semantic for these requests. So the
effect of this patch is that requests smaller than 64kB will fallback to
vmalloc esier now.

Cc: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 net/core/dev.c     | 24 +++++++++---------------
 net/sched/sch_fq.c | 12 +-----------
 2 files changed, 10 insertions(+), 26 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index be11abac89b3..707e730821a6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7081,12 +7081,10 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!rx) {
-		rx = vzalloc(sz);
-		if (!rx)
-			return -ENOMEM;
-	}
+	rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	if (!rx)
+		return -ENOMEM;
+
 	dev->_rx = rx;
 
 	for (i = 0; i < count; i++)
@@ -7123,12 +7121,10 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!tx) {
-		tx = vzalloc(sz);
-		if (!tx)
-			return -ENOMEM;
-	}
+	tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	if (!tx)
+		return -ENOMEM;
+
 	dev->_tx = tx;
 
 	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
@@ -7661,9 +7657,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!p)
-		p = vzalloc(alloc_size);
+	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT);
 	if (!p)
 		return NULL;
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index a4f738ac7728..594f77d89f6c 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -624,16 +624,6 @@ static void fq_rehash(struct fq_sched_data *q,
 	q->stat_gc_flows += fcnt;
 }
 
-static void *fq_alloc_node(size_t sz, int node)
-{
-	void *ptr;
-
-	ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN, node);
-	if (!ptr)
-		ptr = vmalloc_node(sz, node);
-	return ptr;
-}
-
 static void fq_free(void *addr)
 {
 	kvfree(addr);
@@ -650,7 +640,7 @@ static int fq_resize(struct Qdisc *sch, u32 log)
 		return 0;
 
 	/* If XPS was setup, we can allocate memory on right NUMA node */
-	array = fq_alloc_node(sizeof(struct rb_root) << log,
+	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT,
 			      netdev_queue_numa_node_read(sch->dev_queue));
 	if (!array)
 		return -ENOMEM;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 6/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 16:36   ` Vlastimil Babka
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Eric Dumazet, netdev

From: Michal Hocko <mhocko@suse.com>

fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc
with vmalloc fallback. Use the kvmalloc variant instead. Keep the
__GFP_REPEAT flag based on explanation from Eric:
"
At the time, tests on the hardware I had in my labs showed that
vmalloc() could deliver pages spread all over the memory and that was a
small penalty (once memory is fragmented enough, not at boot time)
"

The way how the code is constructed means, however, that we prefer to go
and hit the OOM killer before we fall back to the vmalloc for requests
<=32kB (with 4kB pages) in the current code. This is rather disruptive for
something that can be achived with the fallback. On the other hand
__GFP_REPEAT doesn't have any useful semantic for these requests. So the
effect of this patch is that requests smaller than 64kB will fallback to
vmalloc esier now.

Cc: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 net/core/dev.c     | 24 +++++++++---------------
 net/sched/sch_fq.c | 12 +-----------
 2 files changed, 10 insertions(+), 26 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index be11abac89b3..707e730821a6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7081,12 +7081,10 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!rx) {
-		rx = vzalloc(sz);
-		if (!rx)
-			return -ENOMEM;
-	}
+	rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	if (!rx)
+		return -ENOMEM;
+
 	dev->_rx = rx;
 
 	for (i = 0; i < count; i++)
@@ -7123,12 +7121,10 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!tx) {
-		tx = vzalloc(sz);
-		if (!tx)
-			return -ENOMEM;
-	}
+	tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	if (!tx)
+		return -ENOMEM;
+
 	dev->_tx = tx;
 
 	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
@@ -7661,9 +7657,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!p)
-		p = vzalloc(alloc_size);
+	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT);
 	if (!p)
 		return NULL;
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index a4f738ac7728..594f77d89f6c 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -624,16 +624,6 @@ static void fq_rehash(struct fq_sched_data *q,
 	q->stat_gc_flows += fcnt;
 }
 
-static void *fq_alloc_node(size_t sz, int node)
-{
-	void *ptr;
-
-	ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN, node);
-	if (!ptr)
-		ptr = vmalloc_node(sz, node);
-	return ptr;
-}
-
 static void fq_free(void *addr)
 {
 	kvfree(addr);
@@ -650,7 +640,7 @@ static int fq_resize(struct Qdisc *sch, u32 log)
 		return 0;
 
 	/* If XPS was setup, we can allocate memory on right NUMA node */
-	array = fq_alloc_node(sizeof(struct rb_root) << log,
+	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT,
 			      netdev_queue_numa_node_read(sch->dev_queue));
 	if (!array)
 		return -ENOMEM;
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 6/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 16:36   ` Vlastimil Babka
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Eric Dumazet, netdev

From: Michal Hocko <mhocko@suse.com>

fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc
with vmalloc fallback. Use the kvmalloc variant instead. Keep the
__GFP_REPEAT flag based on explanation from Eric:
"
At the time, tests on the hardware I had in my labs showed that
vmalloc() could deliver pages spread all over the memory and that was a
small penalty (once memory is fragmented enough, not at boot time)
"

The way how the code is constructed means, however, that we prefer to go
and hit the OOM killer before we fall back to the vmalloc for requests
<=32kB (with 4kB pages) in the current code. This is rather disruptive for
something that can be achived with the fallback. On the other hand
__GFP_REPEAT doesn't have any useful semantic for these requests. So the
effect of this patch is that requests smaller than 64kB will fallback to
vmalloc esier now.

Cc: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 net/core/dev.c     | 24 +++++++++---------------
 net/sched/sch_fq.c | 12 +-----------
 2 files changed, 10 insertions(+), 26 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index be11abac89b3..707e730821a6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7081,12 +7081,10 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!rx) {
-		rx = vzalloc(sz);
-		if (!rx)
-			return -ENOMEM;
-	}
+	rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	if (!rx)
+		return -ENOMEM;
+
 	dev->_rx = rx;
 
 	for (i = 0; i < count; i++)
@@ -7123,12 +7121,10 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!tx) {
-		tx = vzalloc(sz);
-		if (!tx)
-			return -ENOMEM;
-	}
+	tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	if (!tx)
+		return -ENOMEM;
+
 	dev->_tx = tx;
 
 	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
@@ -7661,9 +7657,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
-	if (!p)
-		p = vzalloc(alloc_size);
+	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT);
 	if (!p)
 		return NULL;
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index a4f738ac7728..594f77d89f6c 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -624,16 +624,6 @@ static void fq_rehash(struct fq_sched_data *q,
 	q->stat_gc_flows += fcnt;
 }
 
-static void *fq_alloc_node(size_t sz, int node)
-{
-	void *ptr;
-
-	ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN, node);
-	if (!ptr)
-		ptr = vmalloc_node(sz, node);
-	return ptr;
-}
-
 static void fq_free(void *addr)
 {
 	kvfree(addr);
@@ -650,7 +640,7 @@ static int fq_resize(struct Qdisc *sch, u32 log)
 		return 0;
 
 	/* If XPS was setup, we can allocate memory on right NUMA node */
-	array = fq_alloc_node(sizeof(struct rb_root) << log,
+	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT,
 			      netdev_queue_numa_node_read(sch->dev_queue));
 	if (!array)
 		return -ENOMEM;
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 7/9] md: use kvmalloc rather than opencoded variant
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
                   ` (5 preceding siblings ...)
  2017-01-30  9:49 ` [PATCH 6/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant Michal Hocko
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 16:42   ` Vlastimil Babka
  2017-02-01 17:29   ` Mikulas Patocka
  2017-01-30  9:49 ` [PATCH 8/9] bcache: use kvmalloc Michal Hocko
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Mikulas Patocka,
	Mike Snitzer

From: Michal Hocko <mhocko@suse.com>

copy_params uses kmalloc with vmalloc fallback. We already have a helper
for that - kvmalloc. This caller requires GFP_NOIO semantic so it hasn't
been converted with many others by previous patches. All we need to
achieve this semantic is to use the scope memalloc_noio_{save,restore}
around kvmalloc.

Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/md/dm-ioctl.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index a5a9b17f0f7f..dbf5b981f7d7 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1698,6 +1698,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 	struct dm_ioctl *dmi;
 	int secure_data;
 	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
+	unsigned noio_flag;
 
 	if (copy_from_user(param_kernel, user, minimum_data_size))
 		return -EFAULT;
@@ -1720,15 +1721,9 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 	 * Use kmalloc() rather than vmalloc() when we can.
 	 */
 	dmi = NULL;
-	if (param_kernel->data_size <= KMALLOC_MAX_SIZE)
-		dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
-
-	if (!dmi) {
-		unsigned noio_flag;
-		noio_flag = memalloc_noio_save();
-		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
-		memalloc_noio_restore(noio_flag);
-	}
+	noio_flag = memalloc_noio_save();
+	dmi = kvmalloc(param_kernel->data_size, GFP_KERNEL);
+	memalloc_noio_restore(noio_flag);
 
 	if (!dmi) {
 		if (secure_data && clear_user(user, param_kernel->data_size))
-- 
2.11.0

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 7/9] md: use kvmalloc rather than opencoded variant
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 16:42   ` Vlastimil Babka
  2017-02-01 17:29   ` Mikulas Patocka
  0 siblings, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Mikulas Patocka,
	Mike Snitzer

From: Michal Hocko <mhocko@suse.com>

copy_params uses kmalloc with vmalloc fallback. We already have a helper
for that - kvmalloc. This caller requires GFP_NOIO semantic so it hasn't
been converted with many others by previous patches. All we need to
achieve this semantic is to use the scope memalloc_noio_{save,restore}
around kvmalloc.

Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/md/dm-ioctl.c | 13 ++++---------
 1 file changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index a5a9b17f0f7f..dbf5b981f7d7 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1698,6 +1698,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 	struct dm_ioctl *dmi;
 	int secure_data;
 	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
+	unsigned noio_flag;
 
 	if (copy_from_user(param_kernel, user, minimum_data_size))
 		return -EFAULT;
@@ -1720,15 +1721,9 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
 	 * Use kmalloc() rather than vmalloc() when we can.
 	 */
 	dmi = NULL;
-	if (param_kernel->data_size <= KMALLOC_MAX_SIZE)
-		dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
-
-	if (!dmi) {
-		unsigned noio_flag;
-		noio_flag = memalloc_noio_save();
-		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
-		memalloc_noio_restore(noio_flag);
-	}
+	noio_flag = memalloc_noio_save();
+	dmi = kvmalloc(param_kernel->data_size, GFP_KERNEL);
+	memalloc_noio_restore(noio_flag);
 
 	if (!dmi) {
 		if (secure_data && clear_user(user, param_kernel->data_size))
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 8/9] bcache: use kvmalloc
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
                   ` (6 preceding siblings ...)
  2017-01-30  9:49 ` [PATCH 7/9] md: use kvmalloc rather than opencoded variant Michal Hocko
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 16:47   ` Vlastimil Babka
  2017-01-30  9:49 ` [PATCH 9/9] net, bpf: use kvzalloc helper Michal Hocko
  2017-02-05 10:23 ` [PATCH 0/6 v3] kvmalloc Michal Hocko
  9 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Kent Overstreet

From: Michal Hocko <mhocko@suse.com>

bcache_device_init uses kmalloc for small requests and vmalloc for those
which are larger than 64 pages. This alone is a strange criterion.
Moreover kmalloc can fallback to vmalloc on the failure. Let's simply
use kvmalloc instead as it knows how to handle the fallback properly

Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/md/bcache/super.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 3a19cbc8b230..4cb6b88a1465 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -767,16 +767,12 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 	}
 
 	n = d->nr_stripes * sizeof(atomic_t);
-	d->stripe_sectors_dirty = n < PAGE_SIZE << 6
-		? kzalloc(n, GFP_KERNEL)
-		: vzalloc(n);
+	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
 	if (!d->stripe_sectors_dirty)
 		return -ENOMEM;
 
 	n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
-	d->full_dirty_stripes = n < PAGE_SIZE << 6
-		? kzalloc(n, GFP_KERNEL)
-		: vzalloc(n);
+	d->full_dirty_stripes = kvzalloc(n, GFP_KERNEL);
 	if (!d->full_dirty_stripes)
 		return -ENOMEM;
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 8/9] bcache: use kvmalloc
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 16:47   ` Vlastimil Babka
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Kent Overstreet

From: Michal Hocko <mhocko@suse.com>

bcache_device_init uses kmalloc for small requests and vmalloc for those
which are larger than 64 pages. This alone is a strange criterion.
Moreover kmalloc can fallback to vmalloc on the failure. Let's simply
use kvmalloc instead as it knows how to handle the fallback properly

Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/md/bcache/super.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 3a19cbc8b230..4cb6b88a1465 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -767,16 +767,12 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 	}
 
 	n = d->nr_stripes * sizeof(atomic_t);
-	d->stripe_sectors_dirty = n < PAGE_SIZE << 6
-		? kzalloc(n, GFP_KERNEL)
-		: vzalloc(n);
+	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
 	if (!d->stripe_sectors_dirty)
 		return -ENOMEM;
 
 	n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
-	d->full_dirty_stripes = n < PAGE_SIZE << 6
-		? kzalloc(n, GFP_KERNEL)
-		: vzalloc(n);
+	d->full_dirty_stripes = kvzalloc(n, GFP_KERNEL);
 	if (!d->full_dirty_stripes)
 		return -ENOMEM;
 
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 9/9] net, bpf: use kvzalloc helper
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
                   ` (7 preceding siblings ...)
  2017-01-30  9:49 ` [PATCH 8/9] bcache: use kvmalloc Michal Hocko
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 17:20   ` Michal Hocko
  2017-02-05 10:23 ` [PATCH 0/6 v3] kvmalloc Michal Hocko
  9 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Alexei Starovoitov,
	Andrey Konovalov, Marcelo Ricardo Leitner, Pablo Neira Ayuso

From: Michal Hocko <mhocko@suse.com>

both bpf_map_area_alloc and xt_alloc_table_info try really hard to
play nicely with large memory requests which can be triggered from
the userspace (by an admin). See 5bad87348c70 ("netfilter: x_tables:
avoid warn and OOM killer on vmalloc call") resp. d407bd25a204 ("bpf:
don't trigger OOM killer under pressure with map alloc").

The current allocation pattern strongly resembles kvmalloc helper except
for one thing __GFP_NORETRY is not used for the vmalloc fallback. The
main reason why kvmalloc doesn't really support __GFP_NORETRY is
because vmalloc doesn't support this flag properly and it is far from
straightforward to make it understand it because there are some hard
coded GFP_KERNEL allocation deep in the call chains. This patch simply
replaces the open coded variants with kvmalloc and puts a note to
push on MM people to support __GFP_NORETRY in kvmalloc it this turns out
to be really needed along with OOM report pointing at vmalloc.

If there is an immediate need and no full support yet then
	kvmalloc(size, gfp | __GFP_NORETRY)
will work as good as __vmalloc(gfp | __GFP_NORETRY) - in other words it
might trigger the OOM in some cases.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/bpf/syscall.c     | 19 +++++--------------
 net/netfilter/x_tables.c | 16 ++++++----------
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 08a4d287226b..3d38c7a51e1a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -54,21 +54,12 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
 	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-	void *area;
-
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
-		if (area != NULL)
-			return area;
-	}
-
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-			 PAGE_KERNEL);
+	return kvzalloc(size, GFP_USER);
 }
 
 void bpf_map_area_free(void *area)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d529989f5791..ba8ba633da72 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -995,16 +995,12 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
 	if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
 		return NULL;
 
-	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
-		info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
-	if (!info) {
-		info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
-				     __GFP_NORETRY | __GFP_HIGHMEM,
-				 PAGE_KERNEL);
-		if (!info)
-			return NULL;
-	}
-	memset(info, 0, sizeof(*info));
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
+	 */
+	info = kvzalloc(sz, GFP_KERNEL);
 	info->size = size;
 	return info;
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 9/9] net, bpf: use kvzalloc helper
@ 2017-01-30  9:49 ` Michal Hocko
  2017-01-30 17:20   ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  9:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Michal Hocko, Alexei Starovoitov,
	Andrey Konovalov, Marcelo Ricardo Leitner, Pablo Neira Ayuso

From: Michal Hocko <mhocko@suse.com>

both bpf_map_area_alloc and xt_alloc_table_info try really hard to
play nicely with large memory requests which can be triggered from
the userspace (by an admin). See 5bad87348c70 ("netfilter: x_tables:
avoid warn and OOM killer on vmalloc call") resp. d407bd25a204 ("bpf:
don't trigger OOM killer under pressure with map alloc").

The current allocation pattern strongly resembles kvmalloc helper except
for one thing __GFP_NORETRY is not used for the vmalloc fallback. The
main reason why kvmalloc doesn't really support __GFP_NORETRY is
because vmalloc doesn't support this flag properly and it is far from
straightforward to make it understand it because there are some hard
coded GFP_KERNEL allocation deep in the call chains. This patch simply
replaces the open coded variants with kvmalloc and puts a note to
push on MM people to support __GFP_NORETRY in kvmalloc it this turns out
to be really needed along with OOM report pointing at vmalloc.

If there is an immediate need and no full support yet then
	kvmalloc(size, gfp | __GFP_NORETRY)
will work as good as __vmalloc(gfp | __GFP_NORETRY) - in other words it
might trigger the OOM in some cases.

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/bpf/syscall.c     | 19 +++++--------------
 net/netfilter/x_tables.c | 16 ++++++----------
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 08a4d287226b..3d38c7a51e1a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -54,21 +54,12 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
 	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-	void *area;
-
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
-		if (area != NULL)
-			return area;
-	}
-
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-			 PAGE_KERNEL);
+	return kvzalloc(size, GFP_USER);
 }
 
 void bpf_map_area_free(void *area)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d529989f5791..ba8ba633da72 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -995,16 +995,12 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
 	if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
 		return NULL;
 
-	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
-		info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
-	if (!info) {
-		info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
-				     __GFP_NORETRY | __GFP_HIGHMEM,
-				 PAGE_KERNEL);
-		if (!info)
-			return NULL;
-	}
-	memset(info, 0, sizeof(*info));
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
+	 */
+	info = kvzalloc(sz, GFP_KERNEL);
 	info->size = size;
 	return info;
 }
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
  2017-01-30  9:49 ` [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants Michal Hocko
@ 2017-01-30 10:21   ` Leon Romanovsky
  2017-01-30 16:14   ` Vlastimil Babka
  2017-01-30 19:24   ` Kees Cook
  2 siblings, 0 replies; 111+ messages in thread
From: Leon Romanovsky @ 2017-01-30 10:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Michal Hocko,
	Martin Schwidefsky, Heiko Carstens, Herbert Xu, Anton Vorontsov,
	Colin Cross, Kees Cook, Tony Luck, Rafael J. Wysocki, Ben Skeggs,
	Kent Overstreet, Santosh Raspatur, Hariprasad S, Yishai Hadas,
	Oleg Drokin, Yan, Zheng, Alexei Starovoitov, Eric Dumazet,
	netdev

[-- Attachment #1: Type: text/plain, Size: 2377 bytes --]

On Mon, Jan 30, 2017 at 10:49:36AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests <= 32kB (with 4kB pages) are basically never failing and invoke
> OOM killer to satisfy the allocation. This sounds too disruptive for
> something that has a reasonable fallback - the vmalloc. On the other
> hand those requests might fallback to vmalloc even when the memory
> allocator would succeed after several more reclaim/compaction attempts
> previously. There is no guarantee something like that happens though.
>
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
>
> Changes since v1
> - add kvmalloc_array - this might silently fix some overflow issues
>   because most users simply didn't check the overflow for the vmalloc
>   fallback.
>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Anton Vorontsov <anton@enomsg.org>
> Cc: Colin Cross <ccross@android.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Kent Overstreet <kent.overstreet@gmail.com>
> Cc: Santosh Raspatur <santosh@chelsio.com>
> Cc: Hariprasad S <hariprasad@chelsio.com>
> Cc: Yishai Hadas <yishaih@mellanox.com>
> Cc: Oleg Drokin <oleg.drokin@intel.com>
> Cc: "Yan, Zheng" <zyan@redhat.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: netdev@vger.kernel.org
> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
> Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
> Acked-by: David Sterba <dsterba@suse.com> # btrfs
> Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
> Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
@ 2017-01-30 10:21   ` Leon Romanovsky
  0 siblings, 0 replies; 111+ messages in thread
From: Leon Romanovsky @ 2017-01-30 10:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Michal Hocko,
	Martin Schwidefsky, Heiko Carstens, Herbert Xu, Anton Vorontsov,
	Colin Cross, Kees Cook, Tony Luck, Rafael J. Wysocki, Ben Skeggs,
	Kent Overstreet, Santosh Raspatur, Hariprasad S, Yishai Hadas,
	Oleg Drokin, Yan, Zheng

[-- Attachment #1: Type: text/plain, Size: 2377 bytes --]

On Mon, Jan 30, 2017 at 10:49:36AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests <= 32kB (with 4kB pages) are basically never failing and invoke
> OOM killer to satisfy the allocation. This sounds too disruptive for
> something that has a reasonable fallback - the vmalloc. On the other
> hand those requests might fallback to vmalloc even when the memory
> allocator would succeed after several more reclaim/compaction attempts
> previously. There is no guarantee something like that happens though.
>
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
>
> Changes since v1
> - add kvmalloc_array - this might silently fix some overflow issues
>   because most users simply didn't check the overflow for the vmalloc
>   fallback.
>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Anton Vorontsov <anton@enomsg.org>
> Cc: Colin Cross <ccross@android.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Kent Overstreet <kent.overstreet@gmail.com>
> Cc: Santosh Raspatur <santosh@chelsio.com>
> Cc: Hariprasad S <hariprasad@chelsio.com>
> Cc: Yishai Hadas <yishaih@mellanox.com>
> Cc: Oleg Drokin <oleg.drokin@intel.com>
> Cc: "Yan, Zheng" <zyan@redhat.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: netdev@vger.kernel.org
> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
> Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
> Acked-by: David Sterba <dsterba@suse.com> # btrfs
> Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
> Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 3/9] rhashtable: simplify a strange allocation pattern
  2017-01-30  9:49 ` [PATCH 3/9] rhashtable: simplify a strange allocation pattern Michal Hocko
@ 2017-01-30 14:04   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 14:04 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Tom Herbert, Eric Dumazet

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> alloc_bucket_locks allocation pattern is quite unusual. We are
> preferring vmalloc when CONFIG_NUMA is enabled. The rationale is that
> vmalloc will respect the memory policy of the current process and so the
> backing memory will get distributed over multiple nodes if the requester
> is configured properly. At least that is the intention, in reality
> rhastable is shrunk and expanded from a kernel worker so no mempolicy
> can be assumed.
>
> Let's just simplify the code and use kvmalloc helper, which is a
> transparent way to use kmalloc with vmalloc fallback, if the caller
> is allowed to block and use the flag otherwise.
>
> Cc: Tom Herbert <tom@herbertland.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  lib/rhashtable.c | 13 +++----------
>  1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 32d0ad058380..1a487ea70829 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -77,16 +77,9 @@ static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl,
>  	size = min_t(unsigned int, size, tbl->size >> 1);
>
>  	if (sizeof(spinlock_t) != 0) {
> -		tbl->locks = NULL;
> -#ifdef CONFIG_NUMA
> -		if (size * sizeof(spinlock_t) > PAGE_SIZE &&
> -		    gfp == GFP_KERNEL)
> -			tbl->locks = vmalloc(size * sizeof(spinlock_t));
> -#endif
> -		if (gfp != GFP_KERNEL)
> -			gfp |= __GFP_NOWARN | __GFP_NORETRY;
> -
> -		if (!tbl->locks)
> +		if (gfpflags_allow_blocking(gfp))
> +			tbl->locks = kvmalloc(size * sizeof(spinlock_t), gfp);
> +		else
>  			tbl->locks = kmalloc_array(size, sizeof(spinlock_t),
>  						   gfp);
>  		if (!tbl->locks)
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 3/9] rhashtable: simplify a strange allocation pattern
@ 2017-01-30 14:04   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 14:04 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Tom Herbert, Eric Dumazet

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> alloc_bucket_locks allocation pattern is quite unusual. We are
> preferring vmalloc when CONFIG_NUMA is enabled. The rationale is that
> vmalloc will respect the memory policy of the current process and so the
> backing memory will get distributed over multiple nodes if the requester
> is configured properly. At least that is the intention, in reality
> rhastable is shrunk and expanded from a kernel worker so no mempolicy
> can be assumed.
>
> Let's just simplify the code and use kvmalloc helper, which is a
> transparent way to use kmalloc with vmalloc fallback, if the caller
> is allowed to block and use the flag otherwise.
>
> Cc: Tom Herbert <tom@herbertland.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  lib/rhashtable.c | 13 +++----------
>  1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/lib/rhashtable.c b/lib/rhashtable.c
> index 32d0ad058380..1a487ea70829 100644
> --- a/lib/rhashtable.c
> +++ b/lib/rhashtable.c
> @@ -77,16 +77,9 @@ static int alloc_bucket_locks(struct rhashtable *ht, struct bucket_table *tbl,
>  	size = min_t(unsigned int, size, tbl->size >> 1);
>
>  	if (sizeof(spinlock_t) != 0) {
> -		tbl->locks = NULL;
> -#ifdef CONFIG_NUMA
> -		if (size * sizeof(spinlock_t) > PAGE_SIZE &&
> -		    gfp == GFP_KERNEL)
> -			tbl->locks = vmalloc(size * sizeof(spinlock_t));
> -#endif
> -		if (gfp != GFP_KERNEL)
> -			gfp |= __GFP_NOWARN | __GFP_NORETRY;
> -
> -		if (!tbl->locks)
> +		if (gfpflags_allow_blocking(gfp))
> +			tbl->locks = kvmalloc(size * sizeof(spinlock_t), gfp);
> +		else
>  			tbl->locks = kmalloc_array(size, sizeof(spinlock_t),
>  						   gfp);
>  		if (!tbl->locks)
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 4/9] ila: simplify a strange allocation pattern
  2017-01-30  9:49 ` [PATCH 4/9] ila: " Michal Hocko
@ 2017-01-30 15:21   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 15:21 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Tom Herbert, Eric Dumazet

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> alloc_ila_locks seemed to c&p from alloc_bucket_locks allocation
> pattern which is quite unusual. The default allocation size is 320 *
> sizeof(spinlock_t) which is sub page unless lockdep is enabled when the
> performance benefit is really questionable and not worth the subtle code
> IMHO. Also note that the context when we call ila_init_net (modprobe or
> a task creating a net namespace) has to be properly configured.
>
> Let's just simplify the code and use kvmalloc helper which is a
> transparent way to use kmalloc with vmalloc fallback.
>
> Cc: Tom Herbert <tom@herbertland.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  net/ipv6/ila/ila_xlat.c | 8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
> index af8f52ee7180..2fd5ca151dcf 100644
> --- a/net/ipv6/ila/ila_xlat.c
> +++ b/net/ipv6/ila/ila_xlat.c
> @@ -41,13 +41,7 @@ static int alloc_ila_locks(struct ila_net *ilan)
>  	size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU);
>
>  	if (sizeof(spinlock_t) != 0) {
> -#ifdef CONFIG_NUMA
> -		if (size * sizeof(spinlock_t) > PAGE_SIZE)
> -			ilan->locks = vmalloc(size * sizeof(spinlock_t));
> -		else
> -#endif
> -		ilan->locks = kmalloc_array(size, sizeof(spinlock_t),
> -					    GFP_KERNEL);
> +		ilan->locks = kvmalloc(size * sizeof(spinlock_t), GFP_KERNEL);
>  		if (!ilan->locks)
>  			return -ENOMEM;
>  		for (i = 0; i < size; i++)
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 4/9] ila: simplify a strange allocation pattern
@ 2017-01-30 15:21   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 15:21 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Tom Herbert, Eric Dumazet

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> alloc_ila_locks seemed to c&p from alloc_bucket_locks allocation
> pattern which is quite unusual. The default allocation size is 320 *
> sizeof(spinlock_t) which is sub page unless lockdep is enabled when the
> performance benefit is really questionable and not worth the subtle code
> IMHO. Also note that the context when we call ila_init_net (modprobe or
> a task creating a net namespace) has to be properly configured.
>
> Let's just simplify the code and use kvmalloc helper which is a
> transparent way to use kmalloc with vmalloc fallback.
>
> Cc: Tom Herbert <tom@herbertland.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  net/ipv6/ila/ila_xlat.c | 8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
> index af8f52ee7180..2fd5ca151dcf 100644
> --- a/net/ipv6/ila/ila_xlat.c
> +++ b/net/ipv6/ila/ila_xlat.c
> @@ -41,13 +41,7 @@ static int alloc_ila_locks(struct ila_net *ilan)
>  	size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU);
>
>  	if (sizeof(spinlock_t) != 0) {
> -#ifdef CONFIG_NUMA
> -		if (size * sizeof(spinlock_t) > PAGE_SIZE)
> -			ilan->locks = vmalloc(size * sizeof(spinlock_t));
> -		else
> -#endif
> -		ilan->locks = kmalloc_array(size, sizeof(spinlock_t),
> -					    GFP_KERNEL);
> +		ilan->locks = kvmalloc(size * sizeof(spinlock_t), GFP_KERNEL);
>  		if (!ilan->locks)
>  			return -ENOMEM;
>  		for (i = 0; i < size; i++)
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
  2017-01-30  9:49 ` [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants Michal Hocko
  2017-01-30 10:21   ` Leon Romanovsky
@ 2017-01-30 16:14   ` Vlastimil Babka
  2017-01-30 19:24   ` Kees Cook
  2 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 16:14 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Martin Schwidefsky, Heiko Carstens,
	Herbert Xu, Anton Vorontsov, Colin Cross, Kees Cook, Tony Luck,
	Rafael J. Wysocki, Ben Skeggs, Kent Overstreet, Santosh Raspatur,
	Hariprasad S, Yishai Hadas, Oleg Drokin, Yan, Zheng,
	Alexei Starovoitov, Eric Dumazet, netdev

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests <= 32kB (with 4kB pages) are basically never failing and invoke
> OOM killer to satisfy the allocation. This sounds too disruptive for
> something that has a reasonable fallback - the vmalloc. On the other
> hand those requests might fallback to vmalloc even when the memory
> allocator would succeed after several more reclaim/compaction attempts
> previously. There is no guarantee something like that happens though.
>
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
>
> Changes since v1
> - add kvmalloc_array - this might silently fix some overflow issues
>   because most users simply didn't check the overflow for the vmalloc
>   fallback.
>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Anton Vorontsov <anton@enomsg.org>
> Cc: Colin Cross <ccross@android.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Kent Overstreet <kent.overstreet@gmail.com>
> Cc: Santosh Raspatur <santosh@chelsio.com>
> Cc: Hariprasad S <hariprasad@chelsio.com>
> Cc: Yishai Hadas <yishaih@mellanox.com>
> Cc: Oleg Drokin <oleg.drokin@intel.com>
> Cc: "Yan, Zheng" <zyan@redhat.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: netdev@vger.kernel.org
> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
> Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
> Acked-by: David Sterba <dsterba@suse.com> # btrfs
> Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
> Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
@ 2017-01-30 16:14   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 16:14 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Martin Schwidefsky, Heiko Carstens,
	Herbert Xu, Anton Vorontsov, Colin Cross, Kees Cook, Tony Luck,
	Rafael J. Wysocki, Ben Skeggs, Kent Overstreet, Santosh Raspatur,
	Hariprasad S, Yishai Hadas, Oleg Drokin, Yan, Zheng,
	Alexei Starovoitov, Eric Dumazet, net

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests <= 32kB (with 4kB pages) are basically never failing and invoke
> OOM killer to satisfy the allocation. This sounds too disruptive for
> something that has a reasonable fallback - the vmalloc. On the other
> hand those requests might fallback to vmalloc even when the memory
> allocator would succeed after several more reclaim/compaction attempts
> previously. There is no guarantee something like that happens though.
>
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
>
> Changes since v1
> - add kvmalloc_array - this might silently fix some overflow issues
>   because most users simply didn't check the overflow for the vmalloc
>   fallback.
>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Anton Vorontsov <anton@enomsg.org>
> Cc: Colin Cross <ccross@android.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Kent Overstreet <kent.overstreet@gmail.com>
> Cc: Santosh Raspatur <santosh@chelsio.com>
> Cc: Hariprasad S <hariprasad@chelsio.com>
> Cc: Yishai Hadas <yishaih@mellanox.com>
> Cc: Oleg Drokin <oleg.drokin@intel.com>
> Cc: "Yan, Zheng" <zyan@redhat.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: netdev@vger.kernel.org
> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
> Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
> Acked-by: David Sterba <dsterba@suse.com> # btrfs
> Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
> Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
@ 2017-01-30 16:14   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 16:14 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Martin Schwidefsky, Heiko Carstens,
	Herbert Xu, Anton Vorontsov, Colin Cross, Kees Cook, Tony Luck,
	Rafael J. Wysocki, Ben Skeggs, Kent Overstreet, Santosh Raspatur,
	Hariprasad S, Yishai Hadas, Oleg Drokin, Yan, Zheng,
	Alexei Starovoitov, Eric Dumazet, netdev

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests <= 32kB (with 4kB pages) are basically never failing and invoke
> OOM killer to satisfy the allocation. This sounds too disruptive for
> something that has a reasonable fallback - the vmalloc. On the other
> hand those requests might fallback to vmalloc even when the memory
> allocator would succeed after several more reclaim/compaction attempts
> previously. There is no guarantee something like that happens though.
>
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
>
> Changes since v1
> - add kvmalloc_array - this might silently fix some overflow issues
>   because most users simply didn't check the overflow for the vmalloc
>   fallback.
>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Anton Vorontsov <anton@enomsg.org>
> Cc: Colin Cross <ccross@android.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Kent Overstreet <kent.overstreet@gmail.com>
> Cc: Santosh Raspatur <santosh@chelsio.com>
> Cc: Hariprasad S <hariprasad@chelsio.com>
> Cc: Yishai Hadas <yishaih@mellanox.com>
> Cc: Oleg Drokin <oleg.drokin@intel.com>
> Cc: "Yan, Zheng" <zyan@redhat.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: netdev@vger.kernel.org
> Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
> Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
> Acked-by: David Sterba <dsterba@suse.com> # btrfs
> Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
> Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 6/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant
  2017-01-30  9:49 ` [PATCH 6/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant Michal Hocko
@ 2017-01-30 16:36   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 16:36 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Eric Dumazet, netdev

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc
> with vmalloc fallback. Use the kvmalloc variant instead. Keep the
> __GFP_REPEAT flag based on explanation from Eric:
> "
> At the time, tests on the hardware I had in my labs showed that
> vmalloc() could deliver pages spread all over the memory and that was a
> small penalty (once memory is fragmented enough, not at boot time)
> "
>
> The way how the code is constructed means, however, that we prefer to go
> and hit the OOM killer before we fall back to the vmalloc for requests
> <=32kB (with 4kB pages) in the current code. This is rather disruptive for
> something that can be achived with the fallback. On the other hand
> __GFP_REPEAT doesn't have any useful semantic for these requests. So the
> effect of this patch is that requests smaller than 64kB will fallback to
> vmalloc esier now.

           easier

Although it's somewhat of an euphemism when compared to "basically never" :)

>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  net/core/dev.c     | 24 +++++++++---------------
>  net/sched/sch_fq.c | 12 +-----------
>  2 files changed, 10 insertions(+), 26 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index be11abac89b3..707e730821a6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7081,12 +7081,10 @@ static int netif_alloc_rx_queues(struct net_device *dev)
>
>  	BUG_ON(count < 1);
>
> -	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> -	if (!rx) {
> -		rx = vzalloc(sz);
> -		if (!rx)
> -			return -ENOMEM;
> -	}
> +	rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
> +	if (!rx)
> +		return -ENOMEM;
> +
>  	dev->_rx = rx;
>
>  	for (i = 0; i < count; i++)
> @@ -7123,12 +7121,10 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
>  	if (count < 1 || count > 0xffff)
>  		return -EINVAL;
>
> -	tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> -	if (!tx) {
> -		tx = vzalloc(sz);
> -		if (!tx)
> -			return -ENOMEM;
> -	}
> +	tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
> +	if (!tx)
> +		return -ENOMEM;
> +
>  	dev->_tx = tx;
>
>  	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
> @@ -7661,9 +7657,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  	/* ensure 32-byte alignment of whole construct */
>  	alloc_size += NETDEV_ALIGN - 1;
>
> -	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> -	if (!p)
> -		p = vzalloc(alloc_size);
> +	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT);
>  	if (!p)
>  		return NULL;
>
> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> index a4f738ac7728..594f77d89f6c 100644
> --- a/net/sched/sch_fq.c
> +++ b/net/sched/sch_fq.c
> @@ -624,16 +624,6 @@ static void fq_rehash(struct fq_sched_data *q,
>  	q->stat_gc_flows += fcnt;
>  }
>
> -static void *fq_alloc_node(size_t sz, int node)
> -{
> -	void *ptr;
> -
> -	ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN, node);
> -	if (!ptr)
> -		ptr = vmalloc_node(sz, node);
> -	return ptr;
> -}
> -
>  static void fq_free(void *addr)
>  {
>  	kvfree(addr);
> @@ -650,7 +640,7 @@ static int fq_resize(struct Qdisc *sch, u32 log)
>  		return 0;
>
>  	/* If XPS was setup, we can allocate memory on right NUMA node */
> -	array = fq_alloc_node(sizeof(struct rb_root) << log,
> +	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT,
>  			      netdev_queue_numa_node_read(sch->dev_queue));
>  	if (!array)
>  		return -ENOMEM;
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 6/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant
@ 2017-01-30 16:36   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 16:36 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Eric Dumazet, netdev

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc
> with vmalloc fallback. Use the kvmalloc variant instead. Keep the
> __GFP_REPEAT flag based on explanation from Eric:
> "
> At the time, tests on the hardware I had in my labs showed that
> vmalloc() could deliver pages spread all over the memory and that was a
> small penalty (once memory is fragmented enough, not at boot time)
> "
>
> The way how the code is constructed means, however, that we prefer to go
> and hit the OOM killer before we fall back to the vmalloc for requests
> <=32kB (with 4kB pages) in the current code. This is rather disruptive for
> something that can be achived with the fallback. On the other hand
> __GFP_REPEAT doesn't have any useful semantic for these requests. So the
> effect of this patch is that requests smaller than 64kB will fallback to
> vmalloc esier now.

           easier

Although it's somewhat of an euphemism when compared to "basically never" :)

>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  net/core/dev.c     | 24 +++++++++---------------
>  net/sched/sch_fq.c | 12 +-----------
>  2 files changed, 10 insertions(+), 26 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index be11abac89b3..707e730821a6 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -7081,12 +7081,10 @@ static int netif_alloc_rx_queues(struct net_device *dev)
>
>  	BUG_ON(count < 1);
>
> -	rx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> -	if (!rx) {
> -		rx = vzalloc(sz);
> -		if (!rx)
> -			return -ENOMEM;
> -	}
> +	rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
> +	if (!rx)
> +		return -ENOMEM;
> +
>  	dev->_rx = rx;
>
>  	for (i = 0; i < count; i++)
> @@ -7123,12 +7121,10 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
>  	if (count < 1 || count > 0xffff)
>  		return -EINVAL;
>
> -	tx = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> -	if (!tx) {
> -		tx = vzalloc(sz);
> -		if (!tx)
> -			return -ENOMEM;
> -	}
> +	tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
> +	if (!tx)
> +		return -ENOMEM;
> +
>  	dev->_tx = tx;
>
>  	netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
> @@ -7661,9 +7657,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>  	/* ensure 32-byte alignment of whole construct */
>  	alloc_size += NETDEV_ALIGN - 1;
>
> -	p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
> -	if (!p)
> -		p = vzalloc(alloc_size);
> +	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT);
>  	if (!p)
>  		return NULL;
>
> diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
> index a4f738ac7728..594f77d89f6c 100644
> --- a/net/sched/sch_fq.c
> +++ b/net/sched/sch_fq.c
> @@ -624,16 +624,6 @@ static void fq_rehash(struct fq_sched_data *q,
>  	q->stat_gc_flows += fcnt;
>  }
>
> -static void *fq_alloc_node(size_t sz, int node)
> -{
> -	void *ptr;
> -
> -	ptr = kmalloc_node(sz, GFP_KERNEL | __GFP_REPEAT | __GFP_NOWARN, node);
> -	if (!ptr)
> -		ptr = vmalloc_node(sz, node);
> -	return ptr;
> -}
> -
>  static void fq_free(void *addr)
>  {
>  	kvfree(addr);
> @@ -650,7 +640,7 @@ static int fq_resize(struct Qdisc *sch, u32 log)
>  		return 0;
>
>  	/* If XPS was setup, we can allocate memory on right NUMA node */
> -	array = fq_alloc_node(sizeof(struct rb_root) << log,
> +	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT,
>  			      netdev_queue_numa_node_read(sch->dev_queue));
>  	if (!array)
>  		return -ENOMEM;
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 7/9] md: use kvmalloc rather than opencoded variant
  2017-01-30  9:49 ` [PATCH 7/9] md: use kvmalloc rather than opencoded variant Michal Hocko
@ 2017-01-30 16:42   ` Vlastimil Babka
  2017-02-01 17:29   ` Mikulas Patocka
  1 sibling, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 16:42 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Mikulas Patocka, Mike Snitzer

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> copy_params uses kmalloc with vmalloc fallback. We already have a helper
> for that - kvmalloc. This caller requires GFP_NOIO semantic so it hasn't
> been converted with many others by previous patches. All we need to
> achieve this semantic is to use the scope memalloc_noio_{save,restore}
> around kvmalloc.
>
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  drivers/md/dm-ioctl.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index a5a9b17f0f7f..dbf5b981f7d7 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1698,6 +1698,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
>  	struct dm_ioctl *dmi;
>  	int secure_data;
>  	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
> +	unsigned noio_flag;
>
>  	if (copy_from_user(param_kernel, user, minimum_data_size))
>  		return -EFAULT;
> @@ -1720,15 +1721,9 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
>  	 * Use kmalloc() rather than vmalloc() when we can.
>  	 */
>  	dmi = NULL;
> -	if (param_kernel->data_size <= KMALLOC_MAX_SIZE)
> -		dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> -
> -	if (!dmi) {
> -		unsigned noio_flag;
> -		noio_flag = memalloc_noio_save();
> -		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
> -		memalloc_noio_restore(noio_flag);
> -	}
> +	noio_flag = memalloc_noio_save();
> +	dmi = kvmalloc(param_kernel->data_size, GFP_KERNEL);
> +	memalloc_noio_restore(noio_flag);
>
>  	if (!dmi) {
>  		if (secure_data && clear_user(user, param_kernel->data_size))
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 7/9] md: use kvmalloc rather than opencoded variant
@ 2017-01-30 16:42   ` Vlastimil Babka
  0 siblings, 0 replies; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 16:42 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Mikulas Patocka, Mike Snitzer

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> copy_params uses kmalloc with vmalloc fallback. We already have a helper
> for that - kvmalloc. This caller requires GFP_NOIO semantic so it hasn't
> been converted with many others by previous patches. All we need to
> achieve this semantic is to use the scope memalloc_noio_{save,restore}
> around kvmalloc.
>
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  drivers/md/dm-ioctl.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index a5a9b17f0f7f..dbf5b981f7d7 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1698,6 +1698,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
>  	struct dm_ioctl *dmi;
>  	int secure_data;
>  	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
> +	unsigned noio_flag;
>
>  	if (copy_from_user(param_kernel, user, minimum_data_size))
>  		return -EFAULT;
> @@ -1720,15 +1721,9 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
>  	 * Use kmalloc() rather than vmalloc() when we can.
>  	 */
>  	dmi = NULL;
> -	if (param_kernel->data_size <= KMALLOC_MAX_SIZE)
> -		dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> -
> -	if (!dmi) {
> -		unsigned noio_flag;
> -		noio_flag = memalloc_noio_save();
> -		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
> -		memalloc_noio_restore(noio_flag);
> -	}
> +	noio_flag = memalloc_noio_save();
> +	dmi = kvmalloc(param_kernel->data_size, GFP_KERNEL);
> +	memalloc_noio_restore(noio_flag);
>
>  	if (!dmi) {
>  		if (secure_data && clear_user(user, param_kernel->data_size))
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 8/9] bcache: use kvmalloc
  2017-01-30  9:49 ` [PATCH 8/9] bcache: use kvmalloc Michal Hocko
@ 2017-01-30 16:47   ` Vlastimil Babka
  2017-01-30 17:25     ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 16:47 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Kent Overstreet

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> bcache_device_init uses kmalloc for small requests and vmalloc for those
> which are larger than 64 pages. This alone is a strange criterion.
> Moreover kmalloc can fallback to vmalloc on the failure. Let's simply
> use kvmalloc instead as it knows how to handle the fallback properly

I don't see why separate patch, some of the conversions in 5/9 were quite 
similar (except comparing with PAGE_SIZE, not 64*PAGE_SIZE), but nevermind.

> Cc: Kent Overstreet <kent.overstreet@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  drivers/md/bcache/super.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 3a19cbc8b230..4cb6b88a1465 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -767,16 +767,12 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
>  	}
>
>  	n = d->nr_stripes * sizeof(atomic_t);
> -	d->stripe_sectors_dirty = n < PAGE_SIZE << 6
> -		? kzalloc(n, GFP_KERNEL)
> -		: vzalloc(n);
> +	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
>  	if (!d->stripe_sectors_dirty)
>  		return -ENOMEM;
>
>  	n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
> -	d->full_dirty_stripes = n < PAGE_SIZE << 6
> -		? kzalloc(n, GFP_KERNEL)
> -		: vzalloc(n);
> +	d->full_dirty_stripes = kvzalloc(n, GFP_KERNEL);
>  	if (!d->full_dirty_stripes)
>  		return -ENOMEM;
>
>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 8/9] bcache: use kvmalloc
@ 2017-01-30 16:47   ` Vlastimil Babka
  2017-01-30 17:25     ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Vlastimil Babka @ 2017-01-30 16:47 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: David Rientjes, Mel Gorman, Johannes Weiner, Al Viro, linux-mm,
	LKML, Michal Hocko, Kent Overstreet

On 01/30/2017 10:49 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> bcache_device_init uses kmalloc for small requests and vmalloc for those
> which are larger than 64 pages. This alone is a strange criterion.
> Moreover kmalloc can fallback to vmalloc on the failure. Let's simply
> use kvmalloc instead as it knows how to handle the fallback properly

I don't see why separate patch, some of the conversions in 5/9 were quite 
similar (except comparing with PAGE_SIZE, not 64*PAGE_SIZE), but nevermind.

> Cc: Kent Overstreet <kent.overstreet@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  drivers/md/bcache/super.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 3a19cbc8b230..4cb6b88a1465 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -767,16 +767,12 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
>  	}
>
>  	n = d->nr_stripes * sizeof(atomic_t);
> -	d->stripe_sectors_dirty = n < PAGE_SIZE << 6
> -		? kzalloc(n, GFP_KERNEL)
> -		: vzalloc(n);
> +	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
>  	if (!d->stripe_sectors_dirty)
>  		return -ENOMEM;
>
>  	n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
> -	d->full_dirty_stripes = n < PAGE_SIZE << 6
> -		? kzalloc(n, GFP_KERNEL)
> -		: vzalloc(n);
> +	d->full_dirty_stripes = kvzalloc(n, GFP_KERNEL);
>  	if (!d->full_dirty_stripes)
>  		return -ENOMEM;
>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 9/9] net, bpf: use kvzalloc helper
  2017-01-30  9:49 ` [PATCH 9/9] net, bpf: use kvzalloc helper Michal Hocko
@ 2017-01-30 17:20   ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30 17:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Andrey Konovalov,
	Marcelo Ricardo Leitner, Pablo Neira Ayuso, Daniel Borkmann

Andrew, please ignore this one.

On Mon 30-01-17 10:49:40, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> both bpf_map_area_alloc and xt_alloc_table_info try really hard to
> play nicely with large memory requests which can be triggered from
> the userspace (by an admin). See 5bad87348c70 ("netfilter: x_tables:
> avoid warn and OOM killer on vmalloc call") resp. d407bd25a204 ("bpf:
> don't trigger OOM killer under pressure with map alloc").
> 
> The current allocation pattern strongly resembles kvmalloc helper except
> for one thing __GFP_NORETRY is not used for the vmalloc fallback. The
> main reason why kvmalloc doesn't really support __GFP_NORETRY is
> because vmalloc doesn't support this flag properly and it is far from
> straightforward to make it understand it because there are some hard
> coded GFP_KERNEL allocation deep in the call chains. This patch simply
> replaces the open coded variants with kvmalloc and puts a note to
> push on MM people to support __GFP_NORETRY in kvmalloc it this turns out
> to be really needed along with OOM report pointing at vmalloc.
> 
> If there is an immediate need and no full support yet then
> 	kvmalloc(size, gfp | __GFP_NORETRY)
> will work as good as __vmalloc(gfp | __GFP_NORETRY) - in other words it
> might trigger the OOM in some cases.
> 
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrey Konovalov <andreyknvl@google.com>
> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  kernel/bpf/syscall.c     | 19 +++++--------------
>  net/netfilter/x_tables.c | 16 ++++++----------
>  2 files changed, 11 insertions(+), 24 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 08a4d287226b..3d38c7a51e1a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -54,21 +54,12 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>  
>  void *bpf_map_area_alloc(size_t size)
>  {
> -	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
> -	 * trigger under memory pressure as we really just want to
> -	 * fail instead.
> +	/*
> +	 * FIXME: we would really like to not trigger the OOM killer and rather
> +	 * fail instead. This is not supported right now. Please nag MM people
> +	 * if these OOM start bothering people.
>  	 */
> -	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> -	void *area;
> -
> -	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> -		area = kmalloc(size, GFP_USER | flags);
> -		if (area != NULL)
> -			return area;
> -	}
> -
> -	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> -			 PAGE_KERNEL);
> +	return kvzalloc(size, GFP_USER);
>  }
>  
>  void bpf_map_area_free(void *area)
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d529989f5791..ba8ba633da72 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -995,16 +995,12 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
>  	if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
>  		return NULL;
>  
> -	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> -		info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> -	if (!info) {
> -		info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
> -				     __GFP_NORETRY | __GFP_HIGHMEM,
> -				 PAGE_KERNEL);
> -		if (!info)
> -			return NULL;
> -	}
> -	memset(info, 0, sizeof(*info));
> +	/*
> +	 * FIXME: we would really like to not trigger the OOM killer and rather
> +	 * fail instead. This is not supported right now. Please nag MM people
> +	 * if these OOM start bothering people.
> +	 */
> +	info = kvzalloc(sz, GFP_KERNEL);
>  	info->size = size;
>  	return info;
>  }
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 9/9] net, bpf: use kvzalloc helper
@ 2017-01-30 17:20   ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30 17:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Andrey Konovalov,
	Marcelo Ricardo Leitner, Pablo Neira Ayuso, Daniel Borkmann

Andrew, please ignore this one.

On Mon 30-01-17 10:49:40, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> both bpf_map_area_alloc and xt_alloc_table_info try really hard to
> play nicely with large memory requests which can be triggered from
> the userspace (by an admin). See 5bad87348c70 ("netfilter: x_tables:
> avoid warn and OOM killer on vmalloc call") resp. d407bd25a204 ("bpf:
> don't trigger OOM killer under pressure with map alloc").
> 
> The current allocation pattern strongly resembles kvmalloc helper except
> for one thing __GFP_NORETRY is not used for the vmalloc fallback. The
> main reason why kvmalloc doesn't really support __GFP_NORETRY is
> because vmalloc doesn't support this flag properly and it is far from
> straightforward to make it understand it because there are some hard
> coded GFP_KERNEL allocation deep in the call chains. This patch simply
> replaces the open coded variants with kvmalloc and puts a note to
> push on MM people to support __GFP_NORETRY in kvmalloc it this turns out
> to be really needed along with OOM report pointing at vmalloc.
> 
> If there is an immediate need and no full support yet then
> 	kvmalloc(size, gfp | __GFP_NORETRY)
> will work as good as __vmalloc(gfp | __GFP_NORETRY) - in other words it
> might trigger the OOM in some cases.
> 
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Andrey Konovalov <andreyknvl@google.com>
> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  kernel/bpf/syscall.c     | 19 +++++--------------
>  net/netfilter/x_tables.c | 16 ++++++----------
>  2 files changed, 11 insertions(+), 24 deletions(-)
> 
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 08a4d287226b..3d38c7a51e1a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -54,21 +54,12 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>  
>  void *bpf_map_area_alloc(size_t size)
>  {
> -	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
> -	 * trigger under memory pressure as we really just want to
> -	 * fail instead.
> +	/*
> +	 * FIXME: we would really like to not trigger the OOM killer and rather
> +	 * fail instead. This is not supported right now. Please nag MM people
> +	 * if these OOM start bothering people.
>  	 */
> -	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> -	void *area;
> -
> -	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> -		area = kmalloc(size, GFP_USER | flags);
> -		if (area != NULL)
> -			return area;
> -	}
> -
> -	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> -			 PAGE_KERNEL);
> +	return kvzalloc(size, GFP_USER);
>  }
>  
>  void bpf_map_area_free(void *area)
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d529989f5791..ba8ba633da72 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -995,16 +995,12 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
>  	if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
>  		return NULL;
>  
> -	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> -		info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> -	if (!info) {
> -		info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
> -				     __GFP_NORETRY | __GFP_HIGHMEM,
> -				 PAGE_KERNEL);
> -		if (!info)
> -			return NULL;
> -	}
> -	memset(info, 0, sizeof(*info));
> +	/*
> +	 * FIXME: we would really like to not trigger the OOM killer and rather
> +	 * fail instead. This is not supported right now. Please nag MM people
> +	 * if these OOM start bothering people.
> +	 */
> +	info = kvzalloc(sz, GFP_KERNEL);
>  	info->size = size;
>  	return info;
>  }
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 8/9] bcache: use kvmalloc
  2017-01-30 16:47   ` Vlastimil Babka
@ 2017-01-30 17:25     ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30 17:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Kent Overstreet

On Mon 30-01-17 17:47:31, Vlastimil Babka wrote:
> On 01/30/2017 10:49 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > bcache_device_init uses kmalloc for small requests and vmalloc for those
> > which are larger than 64 pages. This alone is a strange criterion.
> > Moreover kmalloc can fallback to vmalloc on the failure. Let's simply
> > use kvmalloc instead as it knows how to handle the fallback properly
> 
> I don't see why separate patch, some of the conversions in 5/9 were quite
> similar (except comparing with PAGE_SIZE, not 64*PAGE_SIZE), but nevermind.

I just found it later so I kept it separate. It can be folded to 5/9 if
that makes more sense.
 
> > Cc: Kent Overstreet <kent.overstreet@gmail.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> > ---
> >  drivers/md/bcache/super.c | 8 ++------
> >  1 file changed, 2 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > index 3a19cbc8b230..4cb6b88a1465 100644
> > --- a/drivers/md/bcache/super.c
> > +++ b/drivers/md/bcache/super.c
> > @@ -767,16 +767,12 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
> >  	}
> > 
> >  	n = d->nr_stripes * sizeof(atomic_t);
> > -	d->stripe_sectors_dirty = n < PAGE_SIZE << 6
> > -		? kzalloc(n, GFP_KERNEL)
> > -		: vzalloc(n);
> > +	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
> >  	if (!d->stripe_sectors_dirty)
> >  		return -ENOMEM;
> > 
> >  	n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
> > -	d->full_dirty_stripes = n < PAGE_SIZE << 6
> > -		? kzalloc(n, GFP_KERNEL)
> > -		: vzalloc(n);
> > +	d->full_dirty_stripes = kvzalloc(n, GFP_KERNEL);
> >  	if (!d->full_dirty_stripes)
> >  		return -ENOMEM;
> > 
> > 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 8/9] bcache: use kvmalloc
@ 2017-01-30 17:25     ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-30 17:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Kent Overstreet

On Mon 30-01-17 17:47:31, Vlastimil Babka wrote:
> On 01/30/2017 10:49 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > bcache_device_init uses kmalloc for small requests and vmalloc for those
> > which are larger than 64 pages. This alone is a strange criterion.
> > Moreover kmalloc can fallback to vmalloc on the failure. Let's simply
> > use kvmalloc instead as it knows how to handle the fallback properly
> 
> I don't see why separate patch, some of the conversions in 5/9 were quite
> similar (except comparing with PAGE_SIZE, not 64*PAGE_SIZE), but nevermind.

I just found it later so I kept it separate. It can be folded to 5/9 if
that makes more sense.
 
> > Cc: Kent Overstreet <kent.overstreet@gmail.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> > ---
> >  drivers/md/bcache/super.c | 8 ++------
> >  1 file changed, 2 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> > index 3a19cbc8b230..4cb6b88a1465 100644
> > --- a/drivers/md/bcache/super.c
> > +++ b/drivers/md/bcache/super.c
> > @@ -767,16 +767,12 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
> >  	}
> > 
> >  	n = d->nr_stripes * sizeof(atomic_t);
> > -	d->stripe_sectors_dirty = n < PAGE_SIZE << 6
> > -		? kzalloc(n, GFP_KERNEL)
> > -		: vzalloc(n);
> > +	d->stripe_sectors_dirty = kvzalloc(n, GFP_KERNEL);
> >  	if (!d->stripe_sectors_dirty)
> >  		return -ENOMEM;
> > 
> >  	n = BITS_TO_LONGS(d->nr_stripes) * sizeof(unsigned long);
> > -	d->full_dirty_stripes = n < PAGE_SIZE << 6
> > -		? kzalloc(n, GFP_KERNEL)
> > -		: vzalloc(n);
> > +	d->full_dirty_stripes = kvzalloc(n, GFP_KERNEL);
> >  	if (!d->full_dirty_stripes)
> >  		return -ENOMEM;
> > 
> > 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
  2017-01-30  9:49 ` [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants Michal Hocko
  2017-01-30 10:21   ` Leon Romanovsky
  2017-01-30 16:14   ` Vlastimil Babka
@ 2017-01-30 19:24   ` Kees Cook
  2 siblings, 0 replies; 111+ messages in thread
From: Kees Cook @ 2017-01-30 19:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, Linux-MM, LKML, Michal Hocko,
	Martin Schwidefsky, Heiko Carstens, Herbert Xu, Anton Vorontsov,
	Colin Cross, Tony Luck, Rafael J. Wysocki, Ben Skeggs,
	Kent Overstreet, Santosh Raspatur, Hariprasad S, Yishai Hadas,
	Oleg Drokin, Yan, Zheng, Alexei Starovoitov, Eric Dumazet,
	Network Development

On Mon, Jan 30, 2017 at 1:49 AM, Michal Hocko <mhocko@kernel.org> wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests <= 32kB (with 4kB pages) are basically never failing and invoke
> OOM killer to satisfy the allocation. This sounds too disruptive for
> something that has a reasonable fallback - the vmalloc. On the other
> hand those requests might fallback to vmalloc even when the memory
> allocator would succeed after several more reclaim/compaction attempts
> previously. There is no guarantee something like that happens though.
>
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
>
> Changes since v1
> - add kvmalloc_array - this might silently fix some overflow issues
>   because most users simply didn't check the overflow for the vmalloc
>   fallback.

Awesome, thanks for adding that API. :)

Acked-by: Kees Cook <keescook@chromium.org>

-Kees

-- 
Kees Cook
Nexus Security

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
@ 2017-01-30 19:24   ` Kees Cook
  0 siblings, 0 replies; 111+ messages in thread
From: Kees Cook @ 2017-01-30 19:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, Linux-MM, LKML, Michal Hocko,
	Martin Schwidefsky, Heiko Carstens, Herbert Xu, Anton Vorontsov,
	Colin Cross, Tony Luck, Rafael J. Wysocki, Ben Skeggs,
	Kent Overstreet, Santosh Raspatur, Hariprasad S, Yishai Hadas,
	Oleg Drokin, Yan, Zheng, Alexei

On Mon, Jan 30, 2017 at 1:49 AM, Michal Hocko <mhocko@kernel.org> wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests <= 32kB (with 4kB pages) are basically never failing and invoke
> OOM killer to satisfy the allocation. This sounds too disruptive for
> something that has a reasonable fallback - the vmalloc. On the other
> hand those requests might fallback to vmalloc even when the memory
> allocator would succeed after several more reclaim/compaction attempts
> previously. There is no guarantee something like that happens though.
>
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
>
> Changes since v1
> - add kvmalloc_array - this might silently fix some overflow issues
>   because most users simply didn't check the overflow for the vmalloc
>   fallback.

Awesome, thanks for adding that API. :)

Acked-by: Kees Cook <keescook@chromium.org>

-Kees

-- 
Kees Cook
Nexus Security

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants
@ 2017-01-30 19:24   ` Kees Cook
  0 siblings, 0 replies; 111+ messages in thread
From: Kees Cook @ 2017-01-30 19:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, Linux-MM, LKML, Michal Hocko,
	Martin Schwidefsky, Heiko Carstens, Herbert Xu, Anton Vorontsov,
	Colin Cross, Tony Luck, Rafael J. Wysocki, Ben Skeggs,
	Kent Overstreet, Santosh Raspatur, Hariprasad S, Yishai Hadas,
	Oleg Drokin, Yan, Zheng, Alexei Starovoitov, Eric Dumazet,
	Network Development

On Mon, Jan 30, 2017 at 1:49 AM, Michal Hocko <mhocko@kernel.org> wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> There are many code paths opencoding kvmalloc. Let's use the helper
> instead. The main difference to kvmalloc is that those users are usually
> not considering all the aspects of the memory allocator. E.g. allocation
> requests <= 32kB (with 4kB pages) are basically never failing and invoke
> OOM killer to satisfy the allocation. This sounds too disruptive for
> something that has a reasonable fallback - the vmalloc. On the other
> hand those requests might fallback to vmalloc even when the memory
> allocator would succeed after several more reclaim/compaction attempts
> previously. There is no guarantee something like that happens though.
>
> This patch converts many of those places to kv[mz]alloc* helpers because
> they are more conservative.
>
> Changes since v1
> - add kvmalloc_array - this might silently fix some overflow issues
>   because most users simply didn't check the overflow for the vmalloc
>   fallback.

Awesome, thanks for adding that API. :)

Acked-by: Kees Cook <keescook@chromium.org>

-Kees

-- 
Kees Cook
Nexus Security

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 7/9] md: use kvmalloc rather than opencoded variant
  2017-01-30  9:49 ` [PATCH 7/9] md: use kvmalloc rather than opencoded variant Michal Hocko
  2017-01-30 16:42   ` Vlastimil Babka
@ 2017-02-01 17:29   ` Mikulas Patocka
  2017-02-01 17:58     ` Michal Hocko
  1 sibling, 1 reply; 111+ messages in thread
From: Mikulas Patocka @ 2017-02-01 17:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Michal Hocko,
	Mike Snitzer



On Mon, 30 Jan 2017, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> copy_params uses kmalloc with vmalloc fallback. We already have a helper
> for that - kvmalloc. This caller requires GFP_NOIO semantic so it hasn't
> been converted with many others by previous patches. All we need to
> achieve this semantic is to use the scope memalloc_noio_{save,restore}
> around kvmalloc.
> 
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  drivers/md/dm-ioctl.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index a5a9b17f0f7f..dbf5b981f7d7 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1698,6 +1698,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
>  	struct dm_ioctl *dmi;
>  	int secure_data;
>  	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
> +	unsigned noio_flag;
>  
>  	if (copy_from_user(param_kernel, user, minimum_data_size))
>  		return -EFAULT;
> @@ -1720,15 +1721,9 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
>  	 * Use kmalloc() rather than vmalloc() when we can.
>  	 */
>  	dmi = NULL;
> -	if (param_kernel->data_size <= KMALLOC_MAX_SIZE)
> -		dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> -
> -	if (!dmi) {
> -		unsigned noio_flag;
> -		noio_flag = memalloc_noio_save();
> -		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
> -		memalloc_noio_restore(noio_flag);
> -	}
> +	noio_flag = memalloc_noio_save();
> +	dmi = kvmalloc(param_kernel->data_size, GFP_KERNEL);
> +	memalloc_noio_restore(noio_flag);
>  
>  	if (!dmi) {
>  		if (secure_data && clear_user(user, param_kernel->data_size))
> -- 
> 2.11.0

I would push these memalloc_noio_save/memalloc_noio_restore calls to 
kvmalloc, so that the othe callers can use them too.

Something like
	if ((flags & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS))
		noio_flag = memalloc_noio_save();
	ptr = __vmalloc_node_flags(size, node, flags);
	if ((flags & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS))
		memalloc_noio_restore(noio_flag)

Or perhaps even better - push memalloc_noio_save/memalloc_noio_restore 
directly to __vmalloc, so that __vmalloc respects the gfp flags properly - 
note that there are 14 places in the kernel where __vmalloc is called with 
GFP_NOFS and they are all buggy because __vmalloc doesn't respect the 
GFP_NOFS flag.

Mikulas

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 7/9] md: use kvmalloc rather than opencoded variant
@ 2017-02-01 17:29   ` Mikulas Patocka
  2017-02-01 17:58     ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Mikulas Patocka @ 2017-02-01 17:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Michal Hocko,
	Mike Snitzer



On Mon, 30 Jan 2017, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> copy_params uses kmalloc with vmalloc fallback. We already have a helper
> for that - kvmalloc. This caller requires GFP_NOIO semantic so it hasn't
> been converted with many others by previous patches. All we need to
> achieve this semantic is to use the scope memalloc_noio_{save,restore}
> around kvmalloc.
> 
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  drivers/md/dm-ioctl.c | 13 ++++---------
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> index a5a9b17f0f7f..dbf5b981f7d7 100644
> --- a/drivers/md/dm-ioctl.c
> +++ b/drivers/md/dm-ioctl.c
> @@ -1698,6 +1698,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
>  	struct dm_ioctl *dmi;
>  	int secure_data;
>  	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
> +	unsigned noio_flag;
>  
>  	if (copy_from_user(param_kernel, user, minimum_data_size))
>  		return -EFAULT;
> @@ -1720,15 +1721,9 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
>  	 * Use kmalloc() rather than vmalloc() when we can.
>  	 */
>  	dmi = NULL;
> -	if (param_kernel->data_size <= KMALLOC_MAX_SIZE)
> -		dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> -
> -	if (!dmi) {
> -		unsigned noio_flag;
> -		noio_flag = memalloc_noio_save();
> -		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
> -		memalloc_noio_restore(noio_flag);
> -	}
> +	noio_flag = memalloc_noio_save();
> +	dmi = kvmalloc(param_kernel->data_size, GFP_KERNEL);
> +	memalloc_noio_restore(noio_flag);
>  
>  	if (!dmi) {
>  		if (secure_data && clear_user(user, param_kernel->data_size))
> -- 
> 2.11.0

I would push these memalloc_noio_save/memalloc_noio_restore calls to 
kvmalloc, so that the othe callers can use them too.

Something like
	if ((flags & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS))
		noio_flag = memalloc_noio_save();
	ptr = __vmalloc_node_flags(size, node, flags);
	if ((flags & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS))
		memalloc_noio_restore(noio_flag)

Or perhaps even better - push memalloc_noio_save/memalloc_noio_restore 
directly to __vmalloc, so that __vmalloc respects the gfp flags properly - 
note that there are 14 places in the kernel where __vmalloc is called with 
GFP_NOFS and they are all buggy because __vmalloc doesn't respect the 
GFP_NOFS flag.

Mikulas

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 7/9] md: use kvmalloc rather than opencoded variant
  2017-02-01 17:29   ` Mikulas Patocka
@ 2017-02-01 17:58     ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-02-01 17:58 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Mike Snitzer

On Wed 01-02-17 12:29:56, Mikulas Patocka wrote:
> 
> 
> On Mon, 30 Jan 2017, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > copy_params uses kmalloc with vmalloc fallback. We already have a helper
> > for that - kvmalloc. This caller requires GFP_NOIO semantic so it hasn't
> > been converted with many others by previous patches. All we need to
> > achieve this semantic is to use the scope memalloc_noio_{save,restore}
> > around kvmalloc.
> > 
> > Cc: Mikulas Patocka <mpatocka@redhat.com>
> > Cc: Mike Snitzer <snitzer@redhat.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  drivers/md/dm-ioctl.c | 13 ++++---------
> >  1 file changed, 4 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> > index a5a9b17f0f7f..dbf5b981f7d7 100644
> > --- a/drivers/md/dm-ioctl.c
> > +++ b/drivers/md/dm-ioctl.c
> > @@ -1698,6 +1698,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
> >  	struct dm_ioctl *dmi;
> >  	int secure_data;
> >  	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
> > +	unsigned noio_flag;
> >  
> >  	if (copy_from_user(param_kernel, user, minimum_data_size))
> >  		return -EFAULT;
> > @@ -1720,15 +1721,9 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
> >  	 * Use kmalloc() rather than vmalloc() when we can.
> >  	 */
> >  	dmi = NULL;
> > -	if (param_kernel->data_size <= KMALLOC_MAX_SIZE)
> > -		dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > -
> > -	if (!dmi) {
> > -		unsigned noio_flag;
> > -		noio_flag = memalloc_noio_save();
> > -		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
> > -		memalloc_noio_restore(noio_flag);
> > -	}
> > +	noio_flag = memalloc_noio_save();
> > +	dmi = kvmalloc(param_kernel->data_size, GFP_KERNEL);
> > +	memalloc_noio_restore(noio_flag);
> >  
> >  	if (!dmi) {
> >  		if (secure_data && clear_user(user, param_kernel->data_size))
> > -- 
> > 2.11.0
> 
> I would push these memalloc_noio_save/memalloc_noio_restore calls to 
> kvmalloc, so that the othe callers can use them too.
> 
> Something like
> 	if ((flags & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS))
> 		noio_flag = memalloc_noio_save();
> 	ptr = __vmalloc_node_flags(size, node, flags);
> 	if ((flags & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS))
> 		memalloc_noio_restore(noio_flag)
> 
> Or perhaps even better - push memalloc_noio_save/memalloc_noio_restore 
> directly to __vmalloc, so that __vmalloc respects the gfp flags properly - 
> note that there are 14 places in the kernel where __vmalloc is called with 
> GFP_NOFS and they are all buggy because __vmalloc doesn't respect the 
> GFP_NOFS flag.

That is out of scope of this patch series. I would like to deal with
NOIO an NOFS contexts separately.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 7/9] md: use kvmalloc rather than opencoded variant
@ 2017-02-01 17:58     ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-02-01 17:58 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Mike Snitzer

On Wed 01-02-17 12:29:56, Mikulas Patocka wrote:
> 
> 
> On Mon, 30 Jan 2017, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > copy_params uses kmalloc with vmalloc fallback. We already have a helper
> > for that - kvmalloc. This caller requires GFP_NOIO semantic so it hasn't
> > been converted with many others by previous patches. All we need to
> > achieve this semantic is to use the scope memalloc_noio_{save,restore}
> > around kvmalloc.
> > 
> > Cc: Mikulas Patocka <mpatocka@redhat.com>
> > Cc: Mike Snitzer <snitzer@redhat.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  drivers/md/dm-ioctl.c | 13 ++++---------
> >  1 file changed, 4 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
> > index a5a9b17f0f7f..dbf5b981f7d7 100644
> > --- a/drivers/md/dm-ioctl.c
> > +++ b/drivers/md/dm-ioctl.c
> > @@ -1698,6 +1698,7 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
> >  	struct dm_ioctl *dmi;
> >  	int secure_data;
> >  	const size_t minimum_data_size = offsetof(struct dm_ioctl, data);
> > +	unsigned noio_flag;
> >  
> >  	if (copy_from_user(param_kernel, user, minimum_data_size))
> >  		return -EFAULT;
> > @@ -1720,15 +1721,9 @@ static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl *param_kern
> >  	 * Use kmalloc() rather than vmalloc() when we can.
> >  	 */
> >  	dmi = NULL;
> > -	if (param_kernel->data_size <= KMALLOC_MAX_SIZE)
> > -		dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > -
> > -	if (!dmi) {
> > -		unsigned noio_flag;
> > -		noio_flag = memalloc_noio_save();
> > -		dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH | __GFP_HIGHMEM, PAGE_KERNEL);
> > -		memalloc_noio_restore(noio_flag);
> > -	}
> > +	noio_flag = memalloc_noio_save();
> > +	dmi = kvmalloc(param_kernel->data_size, GFP_KERNEL);
> > +	memalloc_noio_restore(noio_flag);
> >  
> >  	if (!dmi) {
> >  		if (secure_data && clear_user(user, param_kernel->data_size))
> > -- 
> > 2.11.0
> 
> I would push these memalloc_noio_save/memalloc_noio_restore calls to 
> kvmalloc, so that the othe callers can use them too.
> 
> Something like
> 	if ((flags & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS))
> 		noio_flag = memalloc_noio_save();
> 	ptr = __vmalloc_node_flags(size, node, flags);
> 	if ((flags & (__GFP_IO | __GFP_FS)) != (__GFP_IO | __GFP_FS))
> 		memalloc_noio_restore(noio_flag)
> 
> Or perhaps even better - push memalloc_noio_save/memalloc_noio_restore 
> directly to __vmalloc, so that __vmalloc respects the gfp flags properly - 
> note that there are 14 places in the kernel where __vmalloc is called with 
> GFP_NOFS and they are all buggy because __vmalloc doesn't respect the 
> GFP_NOFS flag.

That is out of scope of this patch series. I would like to deal with
NOIO an NOFS contexts separately.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
                   ` (8 preceding siblings ...)
  2017-01-30  9:49 ` [PATCH 9/9] net, bpf: use kvzalloc helper Michal Hocko
@ 2017-02-05 10:23 ` Michal Hocko
  9 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-02-05 10:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Andreas Dilger,
	Andreas Dilger, Andrey Konovalov, Anton Vorontsov, Ben Skeggs,
	Boris Ostrovsky, Christian Borntraeger, Colin Cross,
	Daniel Borkmann, Dan Williams, David Sterba, Eric Dumazet,
	Eric Dumazet, Hariprasad S, Heiko Carstens, Herbert Xu,
	Ilya Dryomov, John Hubbard, Kees Cook, Kent Overstreet,
	Marcelo Ricardo Leitner, Martin Schwidefsky, Michael S. Tsirkin,
	Mike Snitzer, Mikulas Patocka, Oleg Drokin, Pablo Neira Ayuso,
	Rafael J. Wysocki, Santosh Raspatur, Tariq Toukan, Tom Herbert,
	Tony Luck, Yan, Zheng, Yishai Hadas

Is there anything more to be done before this can get merged? I would
relly like to target this to the next merge window. I already have some
more changes which depend on this.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-02-05 10:23 ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-02-05 10:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Andreas Dilger,
	Andreas Dilger, Andrey Konovalov, Anton Vorontsov, Ben Skeggs,
	Boris Ostrovsky, Christian Borntraeger, Colin Cross,
	Daniel Borkmann, Dan Williams, David Sterba, Eric Dumazet,
	Eric Dumazet, Hariprasad S, Heiko Carstens, Herbert Xu,
	Ilya Dryomov, John Hubbard, Kees Cook, Kent Overstreet,
	Marcelo Ricardo Leitner, Martin Schwidefsky, Michael S. Tsirkin,
	Mike Snitzer, Mikulas Patocka, Oleg Drokin, Pablo Neira Ayuso,
	Rafael J. Wysocki, Santosh Raspatur, Tariq Toukan, Tom Herbert,
	Tony Luck, Yan, Zheng, Yishai Hadas

Is there anything more to be done before this can get merged? I would
relly like to target this to the next merge window. I already have some
more changes which depend on this.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-30 16:28                           ` Michal Hocko
@ 2017-01-30 16:45                             ` Daniel Borkmann
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-30 16:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/30/2017 05:28 PM, Michal Hocko wrote:
> On Mon 30-01-17 17:15:08, Daniel Borkmann wrote:
>> On 01/30/2017 08:56 AM, Michal Hocko wrote:
>>> On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
>>>> On 01/27/2017 11:05 AM, Michal Hocko wrote:
>>>>> On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
>>> [...]
>>>>>> So to answer your second email with the bpf and netfilter hunks, why
>>>>>> not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
>>>>>> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
>>>>>> is not harmful though has only /partial/ effect right now and that full
>>>>>> support needs to be implemented in future. That would still be better
>>>>>> that not having it, imo, and the FIXME would make expectations clear
>>>>>> to anyone reading that code.
>>>>>
>>>>> Well, we can do that, I just would like to prevent from this (ab)use
>>>>> if there is no _real_ and _sensible_ usecase for it. Having a real bug
>>>>
>>>> Understandable.
>>>>
>>>>> report or a fallback mechanism you are mentioning above would justify
>>>>> the (ab)use IMHO. But that abuse would be documented properly and have a
>>>>> real reason to exist. That sounds like a better approach to me.
>>>>>
>>>>> But if you absolutely _insist_ I can change that.
>>>>
>>>> Yeah, please do (with a big FIXME comment as mentioned), this originally
>>>> came from a real bug report. Anyway, feel free to add my Acked-by then.
>>>
>>> Thanks! I will repost the whole series today.
>>
>> Looks like I got only Cc'ed on the cover letter of your v3 from today
>> (should have been v4 actually?).
>
> Yes
>
>> Anyway, I looked up the last patch
>> on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?
>
> I misread your response. I thought you were OK with the FIXME
> explanation.
>
>> At least that was what was discussed above (insisting on __GFP_NORETRY
>> plus FIXME comment) for providing my Acked-by then. Can you still fix
>> that up in a final respin?
>
> I will probably just drop that last patch instead. I am not convinced
> that we should bend the new API over and let people mimic that
> throughout the code. I have just seen too many examples of this pattern
> already.
>
> I would also like to prevent the next rebase, unless there any issues
> with some patches of course.

Ok, I'm fine with that as well.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-30 16:45                             ` Daniel Borkmann
  0 siblings, 0 replies; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-30 16:45 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/30/2017 05:28 PM, Michal Hocko wrote:
> On Mon 30-01-17 17:15:08, Daniel Borkmann wrote:
>> On 01/30/2017 08:56 AM, Michal Hocko wrote:
>>> On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
>>>> On 01/27/2017 11:05 AM, Michal Hocko wrote:
>>>>> On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
>>> [...]
>>>>>> So to answer your second email with the bpf and netfilter hunks, why
>>>>>> not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
>>>>>> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
>>>>>> is not harmful though has only /partial/ effect right now and that full
>>>>>> support needs to be implemented in future. That would still be better
>>>>>> that not having it, imo, and the FIXME would make expectations clear
>>>>>> to anyone reading that code.
>>>>>
>>>>> Well, we can do that, I just would like to prevent from this (ab)use
>>>>> if there is no _real_ and _sensible_ usecase for it. Having a real bug
>>>>
>>>> Understandable.
>>>>
>>>>> report or a fallback mechanism you are mentioning above would justify
>>>>> the (ab)use IMHO. But that abuse would be documented properly and have a
>>>>> real reason to exist. That sounds like a better approach to me.
>>>>>
>>>>> But if you absolutely _insist_ I can change that.
>>>>
>>>> Yeah, please do (with a big FIXME comment as mentioned), this originally
>>>> came from a real bug report. Anyway, feel free to add my Acked-by then.
>>>
>>> Thanks! I will repost the whole series today.
>>
>> Looks like I got only Cc'ed on the cover letter of your v3 from today
>> (should have been v4 actually?).
>
> Yes
>
>> Anyway, I looked up the last patch
>> on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?
>
> I misread your response. I thought you were OK with the FIXME
> explanation.
>
>> At least that was what was discussed above (insisting on __GFP_NORETRY
>> plus FIXME comment) for providing my Acked-by then. Can you still fix
>> that up in a final respin?
>
> I will probably just drop that last patch instead. I am not convinced
> that we should bend the new API over and let people mimic that
> throughout the code. I have just seen too many examples of this pattern
> already.
>
> I would also like to prevent the next rebase, unless there any issues
> with some patches of course.

Ok, I'm fine with that as well.

Thanks,
Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-30 16:15                         ` Daniel Borkmann
@ 2017-01-30 16:28                           ` Michal Hocko
  2017-01-30 16:45                             ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30 16:28 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Mon 30-01-17 17:15:08, Daniel Borkmann wrote:
> On 01/30/2017 08:56 AM, Michal Hocko wrote:
> > On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> > > On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > > > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> > [...]
> > > > > So to answer your second email with the bpf and netfilter hunks, why
> > > > > not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> > > > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > > > is not harmful though has only /partial/ effect right now and that full
> > > > > support needs to be implemented in future. That would still be better
> > > > > that not having it, imo, and the FIXME would make expectations clear
> > > > > to anyone reading that code.
> > > > 
> > > > Well, we can do that, I just would like to prevent from this (ab)use
> > > > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> > > 
> > > Understandable.
> > > 
> > > > report or a fallback mechanism you are mentioning above would justify
> > > > the (ab)use IMHO. But that abuse would be documented properly and have a
> > > > real reason to exist. That sounds like a better approach to me.
> > > > 
> > > > But if you absolutely _insist_ I can change that.
> > > 
> > > Yeah, please do (with a big FIXME comment as mentioned), this originally
> > > came from a real bug report. Anyway, feel free to add my Acked-by then.
> > 
> > Thanks! I will repost the whole series today.
> 
> Looks like I got only Cc'ed on the cover letter of your v3 from today
> (should have been v4 actually?).

Yes

> Anyway, I looked up the last patch
> on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?

I misread your response. I thought you were OK with the FIXME
explanation.

> At least that was what was discussed above (insisting on __GFP_NORETRY
> plus FIXME comment) for providing my Acked-by then. Can you still fix
> that up in a final respin?

I will probably just drop that last patch instead. I am not convinced
that we should bend the new API over and let people mimic that
throughout the code. I have just seen too many examples of this pattern
already.

I would also like to prevent the next rebase, unless there any issues
with some patches of course.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-30 16:28                           ` Michal Hocko
  2017-01-30 16:45                             ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30 16:28 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Mon 30-01-17 17:15:08, Daniel Borkmann wrote:
> On 01/30/2017 08:56 AM, Michal Hocko wrote:
> > On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> > > On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > > > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> > [...]
> > > > > So to answer your second email with the bpf and netfilter hunks, why
> > > > > not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> > > > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > > > is not harmful though has only /partial/ effect right now and that full
> > > > > support needs to be implemented in future. That would still be better
> > > > > that not having it, imo, and the FIXME would make expectations clear
> > > > > to anyone reading that code.
> > > > 
> > > > Well, we can do that, I just would like to prevent from this (ab)use
> > > > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> > > 
> > > Understandable.
> > > 
> > > > report or a fallback mechanism you are mentioning above would justify
> > > > the (ab)use IMHO. But that abuse would be documented properly and have a
> > > > real reason to exist. That sounds like a better approach to me.
> > > > 
> > > > But if you absolutely _insist_ I can change that.
> > > 
> > > Yeah, please do (with a big FIXME comment as mentioned), this originally
> > > came from a real bug report. Anyway, feel free to add my Acked-by then.
> > 
> > Thanks! I will repost the whole series today.
> 
> Looks like I got only Cc'ed on the cover letter of your v3 from today
> (should have been v4 actually?).

Yes

> Anyway, I looked up the last patch
> on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?

I misread your response. I thought you were OK with the FIXME
explanation.

> At least that was what was discussed above (insisting on __GFP_NORETRY
> plus FIXME comment) for providing my Acked-by then. Can you still fix
> that up in a final respin?

I will probably just drop that last patch instead. I am not convinced
that we should bend the new API over and let people mimic that
throughout the code. I have just seen too many examples of this pattern
already.

I would also like to prevent the next rebase, unless there any issues
with some patches of course.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-30  7:56                       ` Michal Hocko
@ 2017-01-30 16:15                         ` Daniel Borkmann
  2017-01-30 16:28                           ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-30 16:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/30/2017 08:56 AM, Michal Hocko wrote:
> On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
>> On 01/27/2017 11:05 AM, Michal Hocko wrote:
>>> On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> [...]
>>>> So to answer your second email with the bpf and netfilter hunks, why
>>>> not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
>>>> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
>>>> is not harmful though has only /partial/ effect right now and that full
>>>> support needs to be implemented in future. That would still be better
>>>> that not having it, imo, and the FIXME would make expectations clear
>>>> to anyone reading that code.
>>>
>>> Well, we can do that, I just would like to prevent from this (ab)use
>>> if there is no _real_ and _sensible_ usecase for it. Having a real bug
>>
>> Understandable.
>>
>>> report or a fallback mechanism you are mentioning above would justify
>>> the (ab)use IMHO. But that abuse would be documented properly and have a
>>> real reason to exist. That sounds like a better approach to me.
>>>
>>> But if you absolutely _insist_ I can change that.
>>
>> Yeah, please do (with a big FIXME comment as mentioned), this originally
>> came from a real bug report. Anyway, feel free to add my Acked-by then.
>
> Thanks! I will repost the whole series today.

Looks like I got only Cc'ed on the cover letter of your v3 from today
(should have been v4 actually?). Anyway, I looked up the last patch
on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?
At least that was what was discussed above (insisting on __GFP_NORETRY
plus FIXME comment) for providing my Acked-by then. Can you still fix
that up in a final respin?

Thanks again,
Daniel

   [1] https://lkml.org/lkml/2017/1/30/129

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-30 16:15                         ` Daniel Borkmann
  2017-01-30 16:28                           ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-30 16:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/30/2017 08:56 AM, Michal Hocko wrote:
> On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
>> On 01/27/2017 11:05 AM, Michal Hocko wrote:
>>> On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> [...]
>>>> So to answer your second email with the bpf and netfilter hunks, why
>>>> not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
>>>> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
>>>> is not harmful though has only /partial/ effect right now and that full
>>>> support needs to be implemented in future. That would still be better
>>>> that not having it, imo, and the FIXME would make expectations clear
>>>> to anyone reading that code.
>>>
>>> Well, we can do that, I just would like to prevent from this (ab)use
>>> if there is no _real_ and _sensible_ usecase for it. Having a real bug
>>
>> Understandable.
>>
>>> report or a fallback mechanism you are mentioning above would justify
>>> the (ab)use IMHO. But that abuse would be documented properly and have a
>>> real reason to exist. That sounds like a better approach to me.
>>>
>>> But if you absolutely _insist_ I can change that.
>>
>> Yeah, please do (with a big FIXME comment as mentioned), this originally
>> came from a real bug report. Anyway, feel free to add my Acked-by then.
>
> Thanks! I will repost the whole series today.

Looks like I got only Cc'ed on the cover letter of your v3 from today
(should have been v4 actually?). Anyway, I looked up the last patch
on lkml [1] and it seems you forgot the __GFP_NORETRY we talked about?
At least that was what was discussed above (insisting on __GFP_NORETRY
plus FIXME comment) for providing my Acked-by then. Can you still fix
that up in a final respin?

Thanks again,
Daniel

   [1] https://lkml.org/lkml/2017/1/30/129

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-27 20:12                     ` Daniel Borkmann
@ 2017-01-30  7:56                       ` Michal Hocko
  2017-01-30 16:15                         ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  7:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
[...]
> > > So to answer your second email with the bpf and netfilter hunks, why
> > > not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > is not harmful though has only /partial/ effect right now and that full
> > > support needs to be implemented in future. That would still be better
> > > that not having it, imo, and the FIXME would make expectations clear
> > > to anyone reading that code.
> > 
> > Well, we can do that, I just would like to prevent from this (ab)use
> > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> 
> Understandable.
> 
> > report or a fallback mechanism you are mentioning above would justify
> > the (ab)use IMHO. But that abuse would be documented properly and have a
> > real reason to exist. That sounds like a better approach to me.
> > 
> > But if you absolutely _insist_ I can change that.
> 
> Yeah, please do (with a big FIXME comment as mentioned), this originally
> came from a real bug report. Anyway, feel free to add my Acked-by then.

Thanks! I will repost the whole series today.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-30  7:56                       ` Michal Hocko
  2017-01-30 16:15                         ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-30  7:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Fri 27-01-17 21:12:26, Daniel Borkmann wrote:
> On 01/27/2017 11:05 AM, Michal Hocko wrote:
> > On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
[...]
> > > So to answer your second email with the bpf and netfilter hunks, why
> > > not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> > > big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> > > is not harmful though has only /partial/ effect right now and that full
> > > support needs to be implemented in future. That would still be better
> > > that not having it, imo, and the FIXME would make expectations clear
> > > to anyone reading that code.
> > 
> > Well, we can do that, I just would like to prevent from this (ab)use
> > if there is no _real_ and _sensible_ usecase for it. Having a real bug
> 
> Understandable.
> 
> > report or a fallback mechanism you are mentioning above would justify
> > the (ab)use IMHO. But that abuse would be documented properly and have a
> > real reason to exist. That sounds like a better approach to me.
> > 
> > But if you absolutely _insist_ I can change that.
> 
> Yeah, please do (with a big FIXME comment as mentioned), this originally
> came from a real bug report. Anyway, feel free to add my Acked-by then.

Thanks! I will repost the whole series today.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-27 10:05                   ` Michal Hocko
@ 2017-01-27 20:12                     ` Daniel Borkmann
  2017-01-30  7:56                       ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-27 20:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/27/2017 11:05 AM, Michal Hocko wrote:
> On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
>> On 01/26/2017 02:40 PM, Michal Hocko wrote:
> [...]
>>> But realistically, how big is this problem really? Is it really worth
>>> it? You said this is an admin only interface and admin can kill the
>>> machine by OOM and other means already.
>>>
>>> Moreover and I should probably mention it explicitly, your d407bd25a204b
>>> reduced the likelyhood of oom for other reason. kmalloc used GPF_USER
>>> previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
>>> could indeed hit the OOM e.g. due to memory fragmentation. It would be
>>> much harder to hit the OOM killer from vmalloc which doesn't issue
>>> higher order allocation requests. Or have you ever seen the OOM killer
>>> pointing to the vmalloc fallback path?
>>
>> The case I was concerned about was from vmalloc() path, not kmalloc().
>> That was where the stack trace indicating OOM pointed to. As an example,
>> there could be really large allocation requests for maps where the map
>> has pre-allocated memory for its elements. Thus, if we get to the point
>> where we need to kill others due to shortage of mem for satisfying this,
>> I'd much much rather prefer to just not let vmalloc() work really hard
>> and fail early on instead.
>
> I see, but as already mentioned, chances are that by the time you get
> close to the OOM somebody else will hit the OOM before the vmalloc path
> manages to free the allocated memory.
>
>> In my (crafted) test case, I was connected
>> via ssh and it each time reliably killed my connection, which is really
>> suboptimal.
>>
>> F.e., I could also imagine a buggy or miscalculated map definition for
>> a prog that is provisioned to multiple places, which then accidentally
>> triggers this. Or if large on purpose, but we crossed the line, it
>> could be handled more gracefully, f.e. I could imagine an option to
>> falling back to a non-pre-allocated map flavor from the application
>> loading the program. Trade-off for sure, but still allowing it to
>> operate up to a certain extend. Granted, if vmalloc() succeeded without
>> trying hard and we then OOM elsewhere, too bad, but we don't have much
>> control over that one anyway, only about our own request. Reason I
>> asked above was whether having __GFP_NORETRY in would be fatal
>> somewhere down the path, but seems not as you say.
>>
>> So to answer your second email with the bpf and netfilter hunks, why
>> not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
>> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
>> is not harmful though has only /partial/ effect right now and that full
>> support needs to be implemented in future. That would still be better
>> that not having it, imo, and the FIXME would make expectations clear
>> to anyone reading that code.
>
> Well, we can do that, I just would like to prevent from this (ab)use
> if there is no _real_ and _sensible_ usecase for it. Having a real bug

Understandable.

> report or a fallback mechanism you are mentioning above would justify
> the (ab)use IMHO. But that abuse would be documented properly and have a
> real reason to exist. That sounds like a better approach to me.
>
> But if you absolutely _insist_ I can change that.

Yeah, please do (with a big FIXME comment as mentioned), this originally
came from a real bug report. Anyway, feel free to add my Acked-by then.

Thanks again,
Daniel

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-27 20:12                     ` Daniel Borkmann
  2017-01-30  7:56                       ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-27 20:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/27/2017 11:05 AM, Michal Hocko wrote:
> On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
>> On 01/26/2017 02:40 PM, Michal Hocko wrote:
> [...]
>>> But realistically, how big is this problem really? Is it really worth
>>> it? You said this is an admin only interface and admin can kill the
>>> machine by OOM and other means already.
>>>
>>> Moreover and I should probably mention it explicitly, your d407bd25a204b
>>> reduced the likelyhood of oom for other reason. kmalloc used GPF_USER
>>> previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
>>> could indeed hit the OOM e.g. due to memory fragmentation. It would be
>>> much harder to hit the OOM killer from vmalloc which doesn't issue
>>> higher order allocation requests. Or have you ever seen the OOM killer
>>> pointing to the vmalloc fallback path?
>>
>> The case I was concerned about was from vmalloc() path, not kmalloc().
>> That was where the stack trace indicating OOM pointed to. As an example,
>> there could be really large allocation requests for maps where the map
>> has pre-allocated memory for its elements. Thus, if we get to the point
>> where we need to kill others due to shortage of mem for satisfying this,
>> I'd much much rather prefer to just not let vmalloc() work really hard
>> and fail early on instead.
>
> I see, but as already mentioned, chances are that by the time you get
> close to the OOM somebody else will hit the OOM before the vmalloc path
> manages to free the allocated memory.
>
>> In my (crafted) test case, I was connected
>> via ssh and it each time reliably killed my connection, which is really
>> suboptimal.
>>
>> F.e., I could also imagine a buggy or miscalculated map definition for
>> a prog that is provisioned to multiple places, which then accidentally
>> triggers this. Or if large on purpose, but we crossed the line, it
>> could be handled more gracefully, f.e. I could imagine an option to
>> falling back to a non-pre-allocated map flavor from the application
>> loading the program. Trade-off for sure, but still allowing it to
>> operate up to a certain extend. Granted, if vmalloc() succeeded without
>> trying hard and we then OOM elsewhere, too bad, but we don't have much
>> control over that one anyway, only about our own request. Reason I
>> asked above was whether having __GFP_NORETRY in would be fatal
>> somewhere down the path, but seems not as you say.
>>
>> So to answer your second email with the bpf and netfilter hunks, why
>> not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
>> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
>> is not harmful though has only /partial/ effect right now and that full
>> support needs to be implemented in future. That would still be better
>> that not having it, imo, and the FIXME would make expectations clear
>> to anyone reading that code.
>
> Well, we can do that, I just would like to prevent from this (ab)use
> if there is no _real_ and _sensible_ usecase for it. Having a real bug

Understandable.

> report or a fallback mechanism you are mentioning above would justify
> the (ab)use IMHO. But that abuse would be documented properly and have a
> real reason to exist. That sounds like a better approach to me.
>
> But if you absolutely _insist_ I can change that.

Yeah, please do (with a big FIXME comment as mentioned), this originally
came from a real bug report. Anyway, feel free to add my Acked-by then.

Thanks again,
Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 20:34                 ` Daniel Borkmann
@ 2017-01-27 10:05                   ` Michal Hocko
  2017-01-27 20:12                     ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-27 10:05 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> On 01/26/2017 02:40 PM, Michal Hocko wrote:
[...]
> > But realistically, how big is this problem really? Is it really worth
> > it? You said this is an admin only interface and admin can kill the
> > machine by OOM and other means already.
> > 
> > Moreover and I should probably mention it explicitly, your d407bd25a204b
> > reduced the likelyhood of oom for other reason. kmalloc used GPF_USER
> > previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> > could indeed hit the OOM e.g. due to memory fragmentation. It would be
> > much harder to hit the OOM killer from vmalloc which doesn't issue
> > higher order allocation requests. Or have you ever seen the OOM killer
> > pointing to the vmalloc fallback path?
> 
> The case I was concerned about was from vmalloc() path, not kmalloc().
> That was where the stack trace indicating OOM pointed to. As an example,
> there could be really large allocation requests for maps where the map
> has pre-allocated memory for its elements. Thus, if we get to the point
> where we need to kill others due to shortage of mem for satisfying this,
> I'd much much rather prefer to just not let vmalloc() work really hard
> and fail early on instead. 

I see, but as already mentioned, chances are that by the time you get
close to the OOM somebody else will hit the OOM before the vmalloc path
manages to free the allocated memory.

> In my (crafted) test case, I was connected
> via ssh and it each time reliably killed my connection, which is really
> suboptimal.
> 
> F.e., I could also imagine a buggy or miscalculated map definition for
> a prog that is provisioned to multiple places, which then accidentally
> triggers this. Or if large on purpose, but we crossed the line, it
> could be handled more gracefully, f.e. I could imagine an option to
> falling back to a non-pre-allocated map flavor from the application
> loading the program. Trade-off for sure, but still allowing it to
> operate up to a certain extend. Granted, if vmalloc() succeeded without
> trying hard and we then OOM elsewhere, too bad, but we don't have much
> control over that one anyway, only about our own request. Reason I
> asked above was whether having __GFP_NORETRY in would be fatal
> somewhere down the path, but seems not as you say.
> 
> So to answer your second email with the bpf and netfilter hunks, why
> not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> is not harmful though has only /partial/ effect right now and that full
> support needs to be implemented in future. That would still be better
> that not having it, imo, and the FIXME would make expectations clear
> to anyone reading that code.

Well, we can do that, I just would like to prevent from this (ab)use
if there is no _real_ and _sensible_ usecase for it. Having a real bug
report or a fallback mechanism you are mentioning above would justify
the (ab)use IMHO. But that abuse would be documented properly and have a
real reason to exist. That sounds like a better approach to me.

But if you absolutely _insist_ I can change that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-27 10:05                   ` Michal Hocko
  2017-01-27 20:12                     ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-27 10:05 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 21:34:04, Daniel Borkmann wrote:
> On 01/26/2017 02:40 PM, Michal Hocko wrote:
[...]
> > But realistically, how big is this problem really? Is it really worth
> > it? You said this is an admin only interface and admin can kill the
> > machine by OOM and other means already.
> > 
> > Moreover and I should probably mention it explicitly, your d407bd25a204b
> > reduced the likelyhood of oom for other reason. kmalloc used GPF_USER
> > previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> > could indeed hit the OOM e.g. due to memory fragmentation. It would be
> > much harder to hit the OOM killer from vmalloc which doesn't issue
> > higher order allocation requests. Or have you ever seen the OOM killer
> > pointing to the vmalloc fallback path?
> 
> The case I was concerned about was from vmalloc() path, not kmalloc().
> That was where the stack trace indicating OOM pointed to. As an example,
> there could be really large allocation requests for maps where the map
> has pre-allocated memory for its elements. Thus, if we get to the point
> where we need to kill others due to shortage of mem for satisfying this,
> I'd much much rather prefer to just not let vmalloc() work really hard
> and fail early on instead. 

I see, but as already mentioned, chances are that by the time you get
close to the OOM somebody else will hit the OOM before the vmalloc path
manages to free the allocated memory.

> In my (crafted) test case, I was connected
> via ssh and it each time reliably killed my connection, which is really
> suboptimal.
> 
> F.e., I could also imagine a buggy or miscalculated map definition for
> a prog that is provisioned to multiple places, which then accidentally
> triggers this. Or if large on purpose, but we crossed the line, it
> could be handled more gracefully, f.e. I could imagine an option to
> falling back to a non-pre-allocated map flavor from the application
> loading the program. Trade-off for sure, but still allowing it to
> operate up to a certain extend. Granted, if vmalloc() succeeded without
> trying hard and we then OOM elsewhere, too bad, but we don't have much
> control over that one anyway, only about our own request. Reason I
> asked above was whether having __GFP_NORETRY in would be fatal
> somewhere down the path, but seems not as you say.
> 
> So to answer your second email with the bpf and netfilter hunks, why
> not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
> big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
> is not harmful though has only /partial/ effect right now and that full
> support needs to be implemented in future. That would still be better
> that not having it, imo, and the FIXME would make expectations clear
> to anyone reading that code.

Well, we can do that, I just would like to prevent from this (ab)use
if there is no _real_ and _sensible_ usecase for it. Having a real bug
report or a fallback mechanism you are mentioning above would justify
the (ab)use IMHO. But that abuse would be documented properly and have a
real reason to exist. That sounds like a better approach to me.

But if you absolutely _insist_ I can change that.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 13:40               ` Michal Hocko
  2017-01-26 14:13                 ` Michal Hocko
@ 2017-01-26 20:34                 ` Daniel Borkmann
  2017-01-27 10:05                   ` Michal Hocko
  1 sibling, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26 20:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 02:40 PM, Michal Hocko wrote:
> On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
>> On 01/26/2017 12:58 PM, Michal Hocko wrote:
>>> On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
>>>> On 01/26/2017 11:08 AM, Michal Hocko wrote:
>>> [...]
>>>>> If you disagree I can drop the bpf part of course...
>>>>
>>>> If we could consolidate these spots with kvmalloc() eventually, I'm
>>>> all for it. But even if __GFP_NORETRY is not covered down to all
>>>> possible paths, it kind of does have an effect already of saying
>>>> 'don't try too hard', so would it be harmful to still keep that for
>>>> now? If it's not, I'd personally prefer to just leave it as is until
>>>> there's some form of support by kvmalloc() and friends.
>>>
>>> Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
>>> disallowed. It is not _supported_ which means that if it doesn't work as
>>> you expect you are on your own. Which is actually the situation right
>>> now as well. But I still think that this is just not right thing to do.
>>> Even though it might happen to work in some cases it gives a false
>>> impression of a solution. So I would rather go with
>>
>> Hmm. 'On my own' means, we could potentially BUG somewhere down the
>> vmalloc implementation, etc, presumably? So it might in-fact be
>> harmful to pass that, right?
>
> No it would mean that it might eventually hit the behavior which you are
> trying to avoid - in other words it may invoke OOM killer even though
> __GFP_NORETRY means giving up before any system wide disruptive actions
> a re taken.

Ok, thanks for clarifying, more on that further below.

>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 8697f43cf93c..a6dc4d596f14 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>>>
>>>    void *bpf_map_area_alloc(size_t size)
>>>    {
>>> +	/*
>>> +	 * FIXME: we would really like to not trigger the OOM killer and rather
>>> +	 * fail instead. This is not supported right now. Please nag MM people
>>> +	 * if these OOM start bothering people.
>>> +	 */
>>
>> Ok, I know this is out of scope for this series, but since i) this
>> is _not_ the _only_ spot right now which has such a construct and ii)
>> I am already kind of nagging a bit ;), my question would be, what
>> would it take to start supporting it?
>
> propagate gfp mask all the way down from vmalloc to all places which
> might allocate down the path and especially page table allocation
> function are PITA because they are really deep. This is a lot of work...
>
> But realistically, how big is this problem really? Is it really worth
> it? You said this is an admin only interface and admin can kill the
> machine by OOM and other means already.
>
> Moreover and I should probably mention it explicitly, your d407bd25a204b
> reduced the likelyhood of oom for other reason. kmalloc used GPF_USER
> previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> could indeed hit the OOM e.g. due to memory fragmentation. It would be
> much harder to hit the OOM killer from vmalloc which doesn't issue
> higher order allocation requests. Or have you ever seen the OOM killer
> pointing to the vmalloc fallback path?

The case I was concerned about was from vmalloc() path, not kmalloc().
That was where the stack trace indicating OOM pointed to. As an example,
there could be really large allocation requests for maps where the map
has pre-allocated memory for its elements. Thus, if we get to the point
where we need to kill others due to shortage of mem for satisfying this,
I'd much much rather prefer to just not let vmalloc() work really hard
and fail early on instead. In my (crafted) test case, I was connected
via ssh and it each time reliably killed my connection, which is really
suboptimal.

F.e., I could also imagine a buggy or miscalculated map definition for
a prog that is provisioned to multiple places, which then accidentally
triggers this. Or if large on purpose, but we crossed the line, it
could be handled more gracefully, f.e. I could imagine an option to
falling back to a non-pre-allocated map flavor from the application
loading the program. Trade-off for sure, but still allowing it to
operate up to a certain extend. Granted, if vmalloc() succeeded without
trying hard and we then OOM elsewhere, too bad, but we don't have much
control over that one anyway, only about our own request. Reason I
asked above was whether having __GFP_NORETRY in would be fatal
somewhere down the path, but seems not as you say.

So to answer your second email with the bpf and netfilter hunks, why
not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
is not harmful though has only /partial/ effect right now and that full
support needs to be implemented in future. That would still be better
that not having it, imo, and the FIXME would make expectations clear
to anyone reading that code.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 20:34                 ` Daniel Borkmann
  2017-01-27 10:05                   ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26 20:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 02:40 PM, Michal Hocko wrote:
> On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
>> On 01/26/2017 12:58 PM, Michal Hocko wrote:
>>> On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
>>>> On 01/26/2017 11:08 AM, Michal Hocko wrote:
>>> [...]
>>>>> If you disagree I can drop the bpf part of course...
>>>>
>>>> If we could consolidate these spots with kvmalloc() eventually, I'm
>>>> all for it. But even if __GFP_NORETRY is not covered down to all
>>>> possible paths, it kind of does have an effect already of saying
>>>> 'don't try too hard', so would it be harmful to still keep that for
>>>> now? If it's not, I'd personally prefer to just leave it as is until
>>>> there's some form of support by kvmalloc() and friends.
>>>
>>> Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
>>> disallowed. It is not _supported_ which means that if it doesn't work as
>>> you expect you are on your own. Which is actually the situation right
>>> now as well. But I still think that this is just not right thing to do.
>>> Even though it might happen to work in some cases it gives a false
>>> impression of a solution. So I would rather go with
>>
>> Hmm. 'On my own' means, we could potentially BUG somewhere down the
>> vmalloc implementation, etc, presumably? So it might in-fact be
>> harmful to pass that, right?
>
> No it would mean that it might eventually hit the behavior which you are
> trying to avoid - in other words it may invoke OOM killer even though
> __GFP_NORETRY means giving up before any system wide disruptive actions
> a re taken.

Ok, thanks for clarifying, more on that further below.

>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 8697f43cf93c..a6dc4d596f14 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>>>
>>>    void *bpf_map_area_alloc(size_t size)
>>>    {
>>> +	/*
>>> +	 * FIXME: we would really like to not trigger the OOM killer and rather
>>> +	 * fail instead. This is not supported right now. Please nag MM people
>>> +	 * if these OOM start bothering people.
>>> +	 */
>>
>> Ok, I know this is out of scope for this series, but since i) this
>> is _not_ the _only_ spot right now which has such a construct and ii)
>> I am already kind of nagging a bit ;), my question would be, what
>> would it take to start supporting it?
>
> propagate gfp mask all the way down from vmalloc to all places which
> might allocate down the path and especially page table allocation
> function are PITA because they are really deep. This is a lot of work...
>
> But realistically, how big is this problem really? Is it really worth
> it? You said this is an admin only interface and admin can kill the
> machine by OOM and other means already.
>
> Moreover and I should probably mention it explicitly, your d407bd25a204b
> reduced the likelyhood of oom for other reason. kmalloc used GPF_USER
> previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
> could indeed hit the OOM e.g. due to memory fragmentation. It would be
> much harder to hit the OOM killer from vmalloc which doesn't issue
> higher order allocation requests. Or have you ever seen the OOM killer
> pointing to the vmalloc fallback path?

The case I was concerned about was from vmalloc() path, not kmalloc().
That was where the stack trace indicating OOM pointed to. As an example,
there could be really large allocation requests for maps where the map
has pre-allocated memory for its elements. Thus, if we get to the point
where we need to kill others due to shortage of mem for satisfying this,
I'd much much rather prefer to just not let vmalloc() work really hard
and fail early on instead. In my (crafted) test case, I was connected
via ssh and it each time reliably killed my connection, which is really
suboptimal.

F.e., I could also imagine a buggy or miscalculated map definition for
a prog that is provisioned to multiple places, which then accidentally
triggers this. Or if large on purpose, but we crossed the line, it
could be handled more gracefully, f.e. I could imagine an option to
falling back to a non-pre-allocated map flavor from the application
loading the program. Trade-off for sure, but still allowing it to
operate up to a certain extend. Granted, if vmalloc() succeeded without
trying hard and we then OOM elsewhere, too bad, but we don't have much
control over that one anyway, only about our own request. Reason I
asked above was whether having __GFP_NORETRY in would be fatal
somewhere down the path, but seems not as you say.

So to answer your second email with the bpf and netfilter hunks, why
not replacing them with kvmalloc() and __GFP_NORETRY flag and add that
big fat FIXME comment above there, saying explicitly that __GFP_NORETRY
is not harmful though has only /partial/ effect right now and that full
support needs to be implemented in future. That would still be better
that not having it, imo, and the FIXME would make expectations clear
to anyone reading that code.

Thanks,
Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 13:40               ` Michal Hocko
@ 2017-01-26 14:13                 ` Michal Hocko
  2017-01-26 20:34                 ` Daniel Borkmann
  1 sibling, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 14:13 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 14:40:04, Michal Hocko wrote:
> On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> > On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > > [...]
> > > > > If you disagree I can drop the bpf part of course...
> > > > 
> > > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > > possible paths, it kind of does have an effect already of saying
> > > > 'don't try too hard', so would it be harmful to still keep that for
> > > > now? If it's not, I'd personally prefer to just leave it as is until
> > > > there's some form of support by kvmalloc() and friends.
> > > 
> > > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > > disallowed. It is not _supported_ which means that if it doesn't work as
> > > you expect you are on your own. Which is actually the situation right
> > > now as well. But I still think that this is just not right thing to do.
> > > Even though it might happen to work in some cases it gives a false
> > > impression of a solution. So I would rather go with
> > 
> > Hmm. 'On my own' means, we could potentially BUG somewhere down the
> > vmalloc implementation, etc, presumably? So it might in-fact be
> > harmful to pass that, right?
> 
> No it would mean that it might eventually hit the behavior which you are
> trying to avoid - in other words it may invoke OOM killer even though
> __GFP_NORETRY means giving up before any system wide disruptive actions
> a re taken.

I will separate both bpf and netfilter hunks into its own patch with the
clarification. Does the following look better?
---
>From ab6b2d724228e4abcc69c44f5ab1ce91009aa91d Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 26 Jan 2017 14:59:21 +0100
Subject: [PATCH] net, bpf: use kvzalloc helper

both bpf_map_area_alloc and xt_alloc_table_info try really hard to
play nicely with large memory requests which can be triggered from
the userspace (by an admin). See 5bad87348c70 ("netfilter: x_tables:
avoid warn and OOM killer on vmalloc call") resp. d407bd25a204 ("bpf:
don't trigger OOM killer under pressure with map alloc").

The current allocation pattern strongly resembles kvmalloc helper except
for one thing __GFP_NORETRY is not used for the vmalloc fallback. The
main reason why kvmalloc doesn't really support __GFP_NORETRY is
because vmalloc doesn't support this flag properly and it is far from
straightforward to make it understand it because there are some hard
coded GFP_KERNEL allocation deep in the call chains. This patch simply
replaces the open coded variants with kvmalloc and puts a note to
push on MM people to support __GFP_NORETRY in kvmalloc it this turns out
to be really needed along with OOM report pointing at vmalloc.

If there is an immediate need and no full support yet then
	kvmalloc(size, gfp | __GFP_NORETRY)
will work as good as __vmalloc(gfp | __GFP_NORETRY) - in other words it
might trigger the OOM in some cases.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/bpf/syscall.c     | 19 +++++--------------
 net/netfilter/x_tables.c | 16 ++++++----------
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19b6129eab23..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,21 +53,12 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
 	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-	void *area;
-
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
-		if (area != NULL)
-			return area;
-	}
-
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-			 PAGE_KERNEL);
+	return kvzalloc(size, GFP_USER);
 }
 
 void bpf_map_area_free(void *area)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d529989f5791..ba8ba633da72 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -995,16 +995,12 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
 	if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
 		return NULL;
 
-	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
-		info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
-	if (!info) {
-		info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
-				     __GFP_NORETRY | __GFP_HIGHMEM,
-				 PAGE_KERNEL);
-		if (!info)
-			return NULL;
-	}
-	memset(info, 0, sizeof(*info));
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
+	 */
+	info = kvzalloc(sz, GFP_KERNEL);
 	info->size = size;
 	return info;
 }
-- 
2.11.0


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 14:13                 ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 14:13 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 14:40:04, Michal Hocko wrote:
> On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> > On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > > [...]
> > > > > If you disagree I can drop the bpf part of course...
> > > > 
> > > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > > possible paths, it kind of does have an effect already of saying
> > > > 'don't try too hard', so would it be harmful to still keep that for
> > > > now? If it's not, I'd personally prefer to just leave it as is until
> > > > there's some form of support by kvmalloc() and friends.
> > > 
> > > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > > disallowed. It is not _supported_ which means that if it doesn't work as
> > > you expect you are on your own. Which is actually the situation right
> > > now as well. But I still think that this is just not right thing to do.
> > > Even though it might happen to work in some cases it gives a false
> > > impression of a solution. So I would rather go with
> > 
> > Hmm. 'On my own' means, we could potentially BUG somewhere down the
> > vmalloc implementation, etc, presumably? So it might in-fact be
> > harmful to pass that, right?
> 
> No it would mean that it might eventually hit the behavior which you are
> trying to avoid - in other words it may invoke OOM killer even though
> __GFP_NORETRY means giving up before any system wide disruptive actions
> a re taken.

I will separate both bpf and netfilter hunks into its own patch with the
clarification. Does the following look better?
---
>From ab6b2d724228e4abcc69c44f5ab1ce91009aa91d Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 26 Jan 2017 14:59:21 +0100
Subject: [PATCH] net, bpf: use kvzalloc helper

both bpf_map_area_alloc and xt_alloc_table_info try really hard to
play nicely with large memory requests which can be triggered from
the userspace (by an admin). See 5bad87348c70 ("netfilter: x_tables:
avoid warn and OOM killer on vmalloc call") resp. d407bd25a204 ("bpf:
don't trigger OOM killer under pressure with map alloc").

The current allocation pattern strongly resembles kvmalloc helper except
for one thing __GFP_NORETRY is not used for the vmalloc fallback. The
main reason why kvmalloc doesn't really support __GFP_NORETRY is
because vmalloc doesn't support this flag properly and it is far from
straightforward to make it understand it because there are some hard
coded GFP_KERNEL allocation deep in the call chains. This patch simply
replaces the open coded variants with kvmalloc and puts a note to
push on MM people to support __GFP_NORETRY in kvmalloc it this turns out
to be really needed along with OOM report pointing at vmalloc.

If there is an immediate need and no full support yet then
	kvmalloc(size, gfp | __GFP_NORETRY)
will work as good as __vmalloc(gfp | __GFP_NORETRY) - in other words it
might trigger the OOM in some cases.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/bpf/syscall.c     | 19 +++++--------------
 net/netfilter/x_tables.c | 16 ++++++----------
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19b6129eab23..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,21 +53,12 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
 	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-	void *area;
-
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
-		if (area != NULL)
-			return area;
-	}
-
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-			 PAGE_KERNEL);
+	return kvzalloc(size, GFP_USER);
 }
 
 void bpf_map_area_free(void *area)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d529989f5791..ba8ba633da72 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -995,16 +995,12 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
 	if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
 		return NULL;
 
-	if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
-		info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
-	if (!info) {
-		info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN |
-				     __GFP_NORETRY | __GFP_HIGHMEM,
-				 PAGE_KERNEL);
-		if (!info)
-			return NULL;
-	}
-	memset(info, 0, sizeof(*info));
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
+	 */
+	info = kvzalloc(sz, GFP_KERNEL);
 	info->size = size;
 	return info;
 }
-- 
2.11.0


-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 14:13                 ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 14:13 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 14:40:04, Michal Hocko wrote:
> On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> > On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > > [...]
> > > > > If you disagree I can drop the bpf part of course...
> > > > 
> > > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > > possible paths, it kind of does have an effect already of saying
> > > > 'don't try too hard', so would it be harmful to still keep that for
> > > > now? If it's not, I'd personally prefer to just leave it as is until
> > > > there's some form of support by kvmalloc() and friends.
> > > 
> > > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > > disallowed. It is not _supported_ which means that if it doesn't work as
> > > you expect you are on your own. Which is actually the situation right
> > > now as well. But I still think that this is just not right thing to do.
> > > Even though it might happen to work in some cases it gives a false
> > > impression of a solution. So I would rather go with
> > 
> > Hmm. 'On my own' means, we could potentially BUG somewhere down the
> > vmalloc implementation, etc, presumably? So it might in-fact be
> > harmful to pass that, right?
> 
> No it would mean that it might eventually hit the behavior which you are
> trying to avoid - in other words it may invoke OOM killer even though
> __GFP_NORETRY means giving up before any system wide disruptive actions
> a re taken.

I will separate both bpf and netfilter hunks into its own patch with the
clarification. Does the following look better?
---

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 13:10             ` Daniel Borkmann
@ 2017-01-26 13:40               ` Michal Hocko
  2017-01-26 14:13                 ` Michal Hocko
  2017-01-26 20:34                 ` Daniel Borkmann
  0 siblings, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 13:40 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > [...]
> > > > If you disagree I can drop the bpf part of course...
> > > 
> > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > possible paths, it kind of does have an effect already of saying
> > > 'don't try too hard', so would it be harmful to still keep that for
> > > now? If it's not, I'd personally prefer to just leave it as is until
> > > there's some form of support by kvmalloc() and friends.
> > 
> > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > disallowed. It is not _supported_ which means that if it doesn't work as
> > you expect you are on your own. Which is actually the situation right
> > now as well. But I still think that this is just not right thing to do.
> > Even though it might happen to work in some cases it gives a false
> > impression of a solution. So I would rather go with
> 
> Hmm. 'On my own' means, we could potentially BUG somewhere down the
> vmalloc implementation, etc, presumably? So it might in-fact be
> harmful to pass that, right?

No it would mean that it might eventually hit the behavior which you are
trying to avoid - in other words it may invoke OOM killer even though
__GFP_NORETRY means giving up before any system wide disruptive actions
a re taken.

> 
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 8697f43cf93c..a6dc4d596f14 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > 
> >   void *bpf_map_area_alloc(size_t size)
> >   {
> > +	/*
> > +	 * FIXME: we would really like to not trigger the OOM killer and rather
> > +	 * fail instead. This is not supported right now. Please nag MM people
> > +	 * if these OOM start bothering people.
> > +	 */
> 
> Ok, I know this is out of scope for this series, but since i) this
> is _not_ the _only_ spot right now which has such a construct and ii)
> I am already kind of nagging a bit ;), my question would be, what
> would it take to start supporting it?

propagate gfp mask all the way down from vmalloc to all places which
might allocate down the path and especially page table allocation
function are PITA because they are really deep. This is a lot of work...

But realistically, how big is this problem really? Is it really worth
it? You said this is an admin only interface and admin can kill the
machine by OOM and other means already.

Moreover and I should probably mention it explicitly, your d407bd25a204b
reduced the likelyhood of oom for other reason. kmalloc used GPF_USER
previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
could indeed hit the OOM e.g. due to memory fragmentation. It would be
much harder to hit the OOM killer from vmalloc which doesn't issue
higher order allocation requests. Or have you ever seen the OOM killer
pointing to the vmalloc fallback path?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 13:40               ` Michal Hocko
  2017-01-26 14:13                 ` Michal Hocko
  2017-01-26 20:34                 ` Daniel Borkmann
  0 siblings, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 13:40 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 14:10:06, Daniel Borkmann wrote:
> On 01/26/2017 12:58 PM, Michal Hocko wrote:
> > On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> > > On 01/26/2017 11:08 AM, Michal Hocko wrote:
> > [...]
> > > > If you disagree I can drop the bpf part of course...
> > > 
> > > If we could consolidate these spots with kvmalloc() eventually, I'm
> > > all for it. But even if __GFP_NORETRY is not covered down to all
> > > possible paths, it kind of does have an effect already of saying
> > > 'don't try too hard', so would it be harmful to still keep that for
> > > now? If it's not, I'd personally prefer to just leave it as is until
> > > there's some form of support by kvmalloc() and friends.
> > 
> > Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> > disallowed. It is not _supported_ which means that if it doesn't work as
> > you expect you are on your own. Which is actually the situation right
> > now as well. But I still think that this is just not right thing to do.
> > Even though it might happen to work in some cases it gives a false
> > impression of a solution. So I would rather go with
> 
> Hmm. 'On my own' means, we could potentially BUG somewhere down the
> vmalloc implementation, etc, presumably? So it might in-fact be
> harmful to pass that, right?

No it would mean that it might eventually hit the behavior which you are
trying to avoid - in other words it may invoke OOM killer even though
__GFP_NORETRY means giving up before any system wide disruptive actions
a re taken.

> 
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 8697f43cf93c..a6dc4d596f14 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > 
> >   void *bpf_map_area_alloc(size_t size)
> >   {
> > +	/*
> > +	 * FIXME: we would really like to not trigger the OOM killer and rather
> > +	 * fail instead. This is not supported right now. Please nag MM people
> > +	 * if these OOM start bothering people.
> > +	 */
> 
> Ok, I know this is out of scope for this series, but since i) this
> is _not_ the _only_ spot right now which has such a construct and ii)
> I am already kind of nagging a bit ;), my question would be, what
> would it take to start supporting it?

propagate gfp mask all the way down from vmalloc to all places which
might allocate down the path and especially page table allocation
function are PITA because they are really deep. This is a lot of work...

But realistically, how big is this problem really? Is it really worth
it? You said this is an admin only interface and admin can kill the
machine by OOM and other means already.

Moreover and I should probably mention it explicitly, your d407bd25a204b
reduced the likelyhood of oom for other reason. kmalloc used GPF_USER
previously and with order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER this
could indeed hit the OOM e.g. due to memory fragmentation. It would be
much harder to hit the OOM killer from vmalloc which doesn't issue
higher order allocation requests. Or have you ever seen the OOM killer
pointing to the vmalloc fallback path?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 11:58           ` Michal Hocko
@ 2017-01-26 13:10             ` Daniel Borkmann
  2017-01-26 13:40               ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26 13:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 12:58 PM, Michal Hocko wrote:
> On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
>> On 01/26/2017 11:08 AM, Michal Hocko wrote:
> [...]
>>> If you disagree I can drop the bpf part of course...
>>
>> If we could consolidate these spots with kvmalloc() eventually, I'm
>> all for it. But even if __GFP_NORETRY is not covered down to all
>> possible paths, it kind of does have an effect already of saying
>> 'don't try too hard', so would it be harmful to still keep that for
>> now? If it's not, I'd personally prefer to just leave it as is until
>> there's some form of support by kvmalloc() and friends.
>
> Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> disallowed. It is not _supported_ which means that if it doesn't work as
> you expect you are on your own. Which is actually the situation right
> now as well. But I still think that this is just not right thing to do.
> Even though it might happen to work in some cases it gives a false
> impression of a solution. So I would rather go with

Hmm. 'On my own' means, we could potentially BUG somewhere down the
vmalloc implementation, etc, presumably? So it might in-fact be
harmful to pass that, right?

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8697f43cf93c..a6dc4d596f14 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>
>   void *bpf_map_area_alloc(size_t size)
>   {
> +	/*
> +	 * FIXME: we would really like to not trigger the OOM killer and rather
> +	 * fail instead. This is not supported right now. Please nag MM people
> +	 * if these OOM start bothering people.
> +	 */

Ok, I know this is out of scope for this series, but since i) this
is _not_ the _only_ spot right now which has such a construct and ii)
I am already kind of nagging a bit ;), my question would be, what
would it take to start supporting it?

>   	return kvzalloc(size, GFP_USER);
>   }

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 13:10             ` Daniel Borkmann
  2017-01-26 13:40               ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26 13:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 12:58 PM, Michal Hocko wrote:
> On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
>> On 01/26/2017 11:08 AM, Michal Hocko wrote:
> [...]
>>> If you disagree I can drop the bpf part of course...
>>
>> If we could consolidate these spots with kvmalloc() eventually, I'm
>> all for it. But even if __GFP_NORETRY is not covered down to all
>> possible paths, it kind of does have an effect already of saying
>> 'don't try too hard', so would it be harmful to still keep that for
>> now? If it's not, I'd personally prefer to just leave it as is until
>> there's some form of support by kvmalloc() and friends.
>
> Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
> disallowed. It is not _supported_ which means that if it doesn't work as
> you expect you are on your own. Which is actually the situation right
> now as well. But I still think that this is just not right thing to do.
> Even though it might happen to work in some cases it gives a false
> impression of a solution. So I would rather go with

Hmm. 'On my own' means, we could potentially BUG somewhere down the
vmalloc implementation, etc, presumably? So it might in-fact be
harmful to pass that, right?

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8697f43cf93c..a6dc4d596f14 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>
>   void *bpf_map_area_alloc(size_t size)
>   {
> +	/*
> +	 * FIXME: we would really like to not trigger the OOM killer and rather
> +	 * fail instead. This is not supported right now. Please nag MM people
> +	 * if these OOM start bothering people.
> +	 */

Ok, I know this is out of scope for this series, but since i) this
is _not_ the _only_ spot right now which has such a construct and ii)
I am already kind of nagging a bit ;), my question would be, what
would it take to start supporting it?

>   	return kvzalloc(size, GFP_USER);
>   }

Thanks,
Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 12:14           ` Joe Perches
@ 2017-01-26 12:27             ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 12:27 UTC (permalink / raw)
  To: Joe Perches
  Cc: Daniel Borkmann, Alexei Starovoitov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, Johannes Weiner, linux-mm, LKML,
	netdev, marcelo.leitner

On Thu 26-01-17 04:14:37, Joe Perches wrote:
> On Thu, 2017-01-26 at 11:32 +0100, Michal Hocko wrote:
> > So I have folded the following to the patch 1. It is in line with
> > kvmalloc and hopefully at least tell more than the current code.
> []
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> []
> > @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >   *	Allocate enough pages to cover @size from the page level
> >   *	allocator with @gfp_mask flags.  Map them into contiguous
> >   *	kernel virtual space, using a pagetable protection of @prot.
> > + *
> > + *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> > + *	and __GFP_NOFAIL are not supported
> 
> Maybe add a BUILD_BUG or a WARN_ON_ONCE to catch new occurrences?

I would really like to not touch vmalloc in this series.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 12:27             ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 12:27 UTC (permalink / raw)
  To: Joe Perches
  Cc: Daniel Borkmann, Alexei Starovoitov, Andrew Morton,
	Vlastimil Babka, Mel Gorman, Johannes Weiner, linux-mm, LKML,
	netdev, marcelo.leitner

On Thu 26-01-17 04:14:37, Joe Perches wrote:
> On Thu, 2017-01-26 at 11:32 +0100, Michal Hocko wrote:
> > So I have folded the following to the patch 1. It is in line with
> > kvmalloc and hopefully at least tell more than the current code.
> []
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> []
> > @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >   *	Allocate enough pages to cover @size from the page level
> >   *	allocator with @gfp_mask flags.  Map them into contiguous
> >   *	kernel virtual space, using a pagetable protection of @prot.
> > + *
> > + *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> > + *	and __GFP_NOFAIL are not supported
> 
> Maybe add a BUILD_BUG or a WARN_ON_ONCE to catch new occurrences?

I would really like to not touch vmalloc in this series.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 10:32         ` Michal Hocko
  2017-01-26 11:04           ` Daniel Borkmann
@ 2017-01-26 12:14           ` Joe Perches
  2017-01-26 12:27             ` Michal Hocko
  1 sibling, 1 reply; 111+ messages in thread
From: Joe Perches @ 2017-01-26 12:14 UTC (permalink / raw)
  To: Michal Hocko, Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu, 2017-01-26 at 11:32 +0100, Michal Hocko wrote:
> So I have folded the following to the patch 1. It is in line with
> kvmalloc and hopefully at least tell more than the current code.
[]
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
[]
> @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>   *	Allocate enough pages to cover @size from the page level
>   *	allocator with @gfp_mask flags.  Map them into contiguous
>   *	kernel virtual space, using a pagetable protection of @prot.
> + *
> + *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> + *	and __GFP_NOFAIL are not supported

Maybe add a BUILD_BUG or a WARN_ON_ONCE to catch new occurrences?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 12:14           ` Joe Perches
  2017-01-26 12:27             ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Joe Perches @ 2017-01-26 12:14 UTC (permalink / raw)
  To: Michal Hocko, Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu, 2017-01-26 at 11:32 +0100, Michal Hocko wrote:
> So I have folded the following to the patch 1. It is in line with
> kvmalloc and hopefully at least tell more than the current code.
[]
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
[]
> @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>   *	Allocate enough pages to cover @size from the page level
>   *	allocator with @gfp_mask flags.  Map them into contiguous
>   *	kernel virtual space, using a pagetable protection of @prot.
> + *
> + *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> + *	and __GFP_NOFAIL are not supported

Maybe add a BUILD_BUG or a WARN_ON_ONCE to catch new occurrences?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 11:33         ` Daniel Borkmann
@ 2017-01-26 11:58           ` Michal Hocko
  2017-01-26 13:10             ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 11:58 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> On 01/26/2017 11:08 AM, Michal Hocko wrote:
[...]
> > If you disagree I can drop the bpf part of course...
> 
> If we could consolidate these spots with kvmalloc() eventually, I'm
> all for it. But even if __GFP_NORETRY is not covered down to all
> possible paths, it kind of does have an effect already of saying
> 'don't try too hard', so would it be harmful to still keep that for
> now? If it's not, I'd personally prefer to just leave it as is until
> there's some form of support by kvmalloc() and friends.

Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
disallowed. It is not _supported_ which means that if it doesn't work as
you expect you are on your own. Which is actually the situation right
now as well. But I still think that this is just not right thing to do.
Even though it might happen to work in some cases it gives a false
impression of a solution. So I would rather go with
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8697f43cf93c..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
+	 */
 	return kvzalloc(size, GFP_USER);
 }
 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 11:58           ` Michal Hocko
  2017-01-26 13:10             ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 11:58 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 12:33:55, Daniel Borkmann wrote:
> On 01/26/2017 11:08 AM, Michal Hocko wrote:
[...]
> > If you disagree I can drop the bpf part of course...
> 
> If we could consolidate these spots with kvmalloc() eventually, I'm
> all for it. But even if __GFP_NORETRY is not covered down to all
> possible paths, it kind of does have an effect already of saying
> 'don't try too hard', so would it be harmful to still keep that for
> now? If it's not, I'd personally prefer to just leave it as is until
> there's some form of support by kvmalloc() and friends.

Well, you can use kvmalloc(size, GFP_KERNEL|__GFP_NORETRY). It is not
disallowed. It is not _supported_ which means that if it doesn't work as
you expect you are on your own. Which is actually the situation right
now as well. But I still think that this is just not right thing to do.
Even though it might happen to work in some cases it gives a false
impression of a solution. So I would rather go with
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8697f43cf93c..a6dc4d596f14 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,6 +53,11 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
+	/*
+	 * FIXME: we would really like to not trigger the OOM killer and rather
+	 * fail instead. This is not supported right now. Please nag MM people
+	 * if these OOM start bothering people.
+	 */
 	return kvzalloc(size, GFP_USER);
 }
 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 11:04           ` Daniel Borkmann
@ 2017-01-26 11:49             ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 11:49 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 12:04:13, Daniel Borkmann wrote:
> On 01/26/2017 11:32 AM, Michal Hocko wrote:
> > On Thu 26-01-17 11:08:02, Michal Hocko wrote:
> > > On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > > > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> > > [...]
> > > > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > > > trigger OOM in these situations I've experienced.
> > > > > 
> > > > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > > > think it would. It can still trigger the OOM killer becauset the flags
> > > > > are no propagated all the way down to all allocations requests (e.g.
> > > > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > > > vmalloc.
> > > > 
> > > > Ok, good to know, is that somewhere clearly documented (like for the
> > > > case with kmalloc())?
> > > 
> > > I am afraid that we really suck on this front. I will add something.
> > 
> > So I have folded the following to the patch 1. It is in line with
> > kvmalloc and hopefully at least tell more than the current code.
> > ---
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index d89034a393f2..6c1aa2c68887 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >    *	Allocate enough pages to cover @size from the page level
> >    *	allocator with @gfp_mask flags.  Map them into contiguous
> >    *	kernel virtual space, using a pagetable protection of @prot.
> > + *
> > + *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> > + *	and __GFP_NOFAIL are not supported
> 
> We could probably also mention that __GFP_ZERO in @gfp_mask is
> supported, though.

There are others which would be supported so I would rather stay with
explicit unsupported.

> 
> > + *	Any use of gfp flags outside of GFP_KERNEL should be consulted
> > + *	with mm people.
> 
> Just a question: should that read 'GFP_KERNEL | __GFP_HIGHMEM' as
> that is what vmalloc() resp. vzalloc() and others pass as flags?

yes, even though I think that specifying __GFP_HIGHMEM shouldn't be
really necessary. Are there any users who would really insist on vmalloc
pages in lowmem? Anyway this made me recheck kvmalloc_node
implementation and I am not adding this flags which would mean a
regression from the current state. Will fix it up.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 11:49             ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 11:49 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 12:04:13, Daniel Borkmann wrote:
> On 01/26/2017 11:32 AM, Michal Hocko wrote:
> > On Thu 26-01-17 11:08:02, Michal Hocko wrote:
> > > On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > > > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> > > [...]
> > > > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > > > trigger OOM in these situations I've experienced.
> > > > > 
> > > > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > > > think it would. It can still trigger the OOM killer becauset the flags
> > > > > are no propagated all the way down to all allocations requests (e.g.
> > > > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > > > vmalloc.
> > > > 
> > > > Ok, good to know, is that somewhere clearly documented (like for the
> > > > case with kmalloc())?
> > > 
> > > I am afraid that we really suck on this front. I will add something.
> > 
> > So I have folded the following to the patch 1. It is in line with
> > kvmalloc and hopefully at least tell more than the current code.
> > ---
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index d89034a393f2..6c1aa2c68887 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >    *	Allocate enough pages to cover @size from the page level
> >    *	allocator with @gfp_mask flags.  Map them into contiguous
> >    *	kernel virtual space, using a pagetable protection of @prot.
> > + *
> > + *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> > + *	and __GFP_NOFAIL are not supported
> 
> We could probably also mention that __GFP_ZERO in @gfp_mask is
> supported, though.

There are others which would be supported so I would rather stay with
explicit unsupported.

> 
> > + *	Any use of gfp flags outside of GFP_KERNEL should be consulted
> > + *	with mm people.
> 
> Just a question: should that read 'GFP_KERNEL | __GFP_HIGHMEM' as
> that is what vmalloc() resp. vzalloc() and others pass as flags?

yes, even though I think that specifying __GFP_HIGHMEM shouldn't be
really necessary. Are there any users who would really insist on vmalloc
pages in lowmem? Anyway this made me recheck kvmalloc_node
implementation and I am not adding this flags which would mean a
regression from the current state. Will fix it up.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 10:08       ` Michal Hocko
  2017-01-26 10:32         ` Michal Hocko
@ 2017-01-26 11:33         ` Daniel Borkmann
  2017-01-26 11:58           ` Michal Hocko
  1 sibling, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26 11:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 11:08 AM, Michal Hocko wrote:
> On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
>> On 01/26/2017 08:43 AM, Michal Hocko wrote:
>>> On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> [...]
>>>> I assume that kvzalloc() is still the same from [1], right? If so, then
>>>> it would unfortunately (partially) reintroduce the issue that was fixed.
>>>> If you look above at flags, they're also passed to __vmalloc() to not
>>>> trigger OOM in these situations I've experienced.
>>>
>>> Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
>>> think it would. It can still trigger the OOM killer becauset the flags
>>> are no propagated all the way down to all allocations requests (e.g.
>>> page tables). This is the same reason why GFP_NOFS is not supported in
>>> vmalloc.
>>
>> Ok, good to know, is that somewhere clearly documented (like for the
>> case with kmalloc())?
>
> I am afraid that we really suck on this front. I will add something.

Thanks for doing that, much appreciated!

>> If not, could we do that for non-mm folks, or
>> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
>> it obvious to users that a given flag combination is not supported all
>> the way down?
>
> I am not sure that triggering a warning that somebody has used
> __GFP_NOWARN is very helpful ;). I also do not think that covering all the
> supported flags is really feasible. Most of them will not have bad side
> effects. I have added the warning because this API is new and I wanted
> to catch new abusers. Old ones would have to die slowly.

Okay, makes sense then. Just the kdoc comment from your other
mail should help fine already.

>>>> This is effectively the
>>>> same requirement as in other networking areas f.e. that 5bad87348c70
>>>> ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
>>>> In your comment in kvzalloc() you eventually say that some of the above
>>>> modifiers are not supported. So there would be two options, i) just leave
>>>> out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
>>>> it later (along with similar code from 5bad87348c70), or ii) implement
>>>> support for these modifiers as well to your original set. I guess it's not
>>>> too urgent, so we could also proceed with i) if that is easier for you to
>>>> proceed (I don't mind either way).
>>>
>>> Could you clarify why the oom killer in vmalloc matters actually?
>>
>> For both mentioned commits, (privileged) user space can potentially
>> create large allocation requests, where we thus switch to vmalloc()
>> flavor eventually and then OOM starts killing processes to try to
>> satisfy the allocation request. This is bad, because we want the
>> request to just fail instead as it's non-critical and f.e. not kill
>> ssh connection et al. Failing is totally fine in this case, whereas
>> triggering OOM is not.
>
> I see your intention but does it really make any real difference?
> Consider you would back off right before you would have OOMed. Any
> parallel request would just hit the OOM for you. You are (almost) never
> doing an allocation in an isolation.
>
>> In my testing, __GFP_NORETRY did satisfy this
>> just fine, but as you say it seems it's not enough.
>
> Yeah, ptes have been most probably popullated already.
>
>> Given there are
>> multiple places like these in the kernel, could we instead add an
>> option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?
>
> As said above I do not really think that suppressing the OOM killer
> makes any difference because it might be just somebody else doing that
> for you. Also the OOM killer is the MM internal implementation "detail"
> users shouldn't really care. I agree that callers should have a way to
> say they do not want to try really hard and that is not that simple
> for vmalloc unfortunatelly. The main problem here is that gfp mask
> propagation is not that easy to fix without a lot of code churn as some
> of those hardcoded allocation requests are deep in call chains.

I see, that's unfortunate. I understand that there are requests
in parallel and that we might end up with OOM eventually if we're
unlucky, but having some way to tell vmalloc to just not try as
hard as usual would be nice.

> I know this sucks and it would be great to support __GFP_NORETRY to
> [k]vmalloc and maybe we will get there eventually. But for the mean time
> I really think that using kvmalloc wherever possible is much better than
> open coded variants whith expectations which do not hold sometimes.

I totally agree with you that having kvmalloc() as helper is awesome
and probably long overdue as well. :)

> If you disagree I can drop the bpf part of course...

If we could consolidate these spots with kvmalloc() eventually, I'm
all for it. But even if __GFP_NORETRY is not covered down to all
possible paths, it kind of does have an effect already of saying
'don't try too hard', so would it be harmful to still keep that for
now? If it's not, I'd personally prefer to just leave it as is until
there's some form of support by kvmalloc() and friends.

Thanks for your input, Michal!

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 11:33         ` Daniel Borkmann
  2017-01-26 11:58           ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26 11:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 11:08 AM, Michal Hocko wrote:
> On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
>> On 01/26/2017 08:43 AM, Michal Hocko wrote:
>>> On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> [...]
>>>> I assume that kvzalloc() is still the same from [1], right? If so, then
>>>> it would unfortunately (partially) reintroduce the issue that was fixed.
>>>> If you look above at flags, they're also passed to __vmalloc() to not
>>>> trigger OOM in these situations I've experienced.
>>>
>>> Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
>>> think it would. It can still trigger the OOM killer becauset the flags
>>> are no propagated all the way down to all allocations requests (e.g.
>>> page tables). This is the same reason why GFP_NOFS is not supported in
>>> vmalloc.
>>
>> Ok, good to know, is that somewhere clearly documented (like for the
>> case with kmalloc())?
>
> I am afraid that we really suck on this front. I will add something.

Thanks for doing that, much appreciated!

>> If not, could we do that for non-mm folks, or
>> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
>> it obvious to users that a given flag combination is not supported all
>> the way down?
>
> I am not sure that triggering a warning that somebody has used
> __GFP_NOWARN is very helpful ;). I also do not think that covering all the
> supported flags is really feasible. Most of them will not have bad side
> effects. I have added the warning because this API is new and I wanted
> to catch new abusers. Old ones would have to die slowly.

Okay, makes sense then. Just the kdoc comment from your other
mail should help fine already.

>>>> This is effectively the
>>>> same requirement as in other networking areas f.e. that 5bad87348c70
>>>> ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
>>>> In your comment in kvzalloc() you eventually say that some of the above
>>>> modifiers are not supported. So there would be two options, i) just leave
>>>> out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
>>>> it later (along with similar code from 5bad87348c70), or ii) implement
>>>> support for these modifiers as well to your original set. I guess it's not
>>>> too urgent, so we could also proceed with i) if that is easier for you to
>>>> proceed (I don't mind either way).
>>>
>>> Could you clarify why the oom killer in vmalloc matters actually?
>>
>> For both mentioned commits, (privileged) user space can potentially
>> create large allocation requests, where we thus switch to vmalloc()
>> flavor eventually and then OOM starts killing processes to try to
>> satisfy the allocation request. This is bad, because we want the
>> request to just fail instead as it's non-critical and f.e. not kill
>> ssh connection et al. Failing is totally fine in this case, whereas
>> triggering OOM is not.
>
> I see your intention but does it really make any real difference?
> Consider you would back off right before you would have OOMed. Any
> parallel request would just hit the OOM for you. You are (almost) never
> doing an allocation in an isolation.
>
>> In my testing, __GFP_NORETRY did satisfy this
>> just fine, but as you say it seems it's not enough.
>
> Yeah, ptes have been most probably popullated already.
>
>> Given there are
>> multiple places like these in the kernel, could we instead add an
>> option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?
>
> As said above I do not really think that suppressing the OOM killer
> makes any difference because it might be just somebody else doing that
> for you. Also the OOM killer is the MM internal implementation "detail"
> users shouldn't really care. I agree that callers should have a way to
> say they do not want to try really hard and that is not that simple
> for vmalloc unfortunatelly. The main problem here is that gfp mask
> propagation is not that easy to fix without a lot of code churn as some
> of those hardcoded allocation requests are deep in call chains.

I see, that's unfortunate. I understand that there are requests
in parallel and that we might end up with OOM eventually if we're
unlucky, but having some way to tell vmalloc to just not try as
hard as usual would be nice.

> I know this sucks and it would be great to support __GFP_NORETRY to
> [k]vmalloc and maybe we will get there eventually. But for the mean time
> I really think that using kvmalloc wherever possible is much better than
> open coded variants whith expectations which do not hold sometimes.

I totally agree with you that having kvmalloc() as helper is awesome
and probably long overdue as well. :)

> If you disagree I can drop the bpf part of course...

If we could consolidate these spots with kvmalloc() eventually, I'm
all for it. But even if __GFP_NORETRY is not covered down to all
possible paths, it kind of does have an effect already of saying
'don't try too hard', so would it be harmful to still keep that for
now? If it's not, I'd personally prefer to just leave it as is until
there's some form of support by kvmalloc() and friends.

Thanks for your input, Michal!

Cheers,
Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 10:32         ` Michal Hocko
@ 2017-01-26 11:04           ` Daniel Borkmann
  2017-01-26 11:49             ` Michal Hocko
  2017-01-26 12:14           ` Joe Perches
  1 sibling, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26 11:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 11:32 AM, Michal Hocko wrote:
> On Thu 26-01-17 11:08:02, Michal Hocko wrote:
>> On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
>>> On 01/26/2017 08:43 AM, Michal Hocko wrote:
>>>> On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
>> [...]
>>>>> I assume that kvzalloc() is still the same from [1], right? If so, then
>>>>> it would unfortunately (partially) reintroduce the issue that was fixed.
>>>>> If you look above at flags, they're also passed to __vmalloc() to not
>>>>> trigger OOM in these situations I've experienced.
>>>>
>>>> Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
>>>> think it would. It can still trigger the OOM killer becauset the flags
>>>> are no propagated all the way down to all allocations requests (e.g.
>>>> page tables). This is the same reason why GFP_NOFS is not supported in
>>>> vmalloc.
>>>
>>> Ok, good to know, is that somewhere clearly documented (like for the
>>> case with kmalloc())?
>>
>> I am afraid that we really suck on this front. I will add something.
>
> So I have folded the following to the patch 1. It is in line with
> kvmalloc and hopefully at least tell more than the current code.
> ---
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d89034a393f2..6c1aa2c68887 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>    *	Allocate enough pages to cover @size from the page level
>    *	allocator with @gfp_mask flags.  Map them into contiguous
>    *	kernel virtual space, using a pagetable protection of @prot.
> + *
> + *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> + *	and __GFP_NOFAIL are not supported

We could probably also mention that __GFP_ZERO in @gfp_mask is
supported, though.

> + *	Any use of gfp flags outside of GFP_KERNEL should be consulted
> + *	with mm people.

Just a question: should that read 'GFP_KERNEL | __GFP_HIGHMEM' as
that is what vmalloc() resp. vzalloc() and others pass as flags?

> + *
>    */

Sounds good otherwise, thanks Michal!

>   static void *__vmalloc_node(unsigned long size, unsigned long align,
>   			    gfp_t gfp_mask, pgprot_t prot,

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 11:04           ` Daniel Borkmann
  2017-01-26 11:49             ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26 11:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 11:32 AM, Michal Hocko wrote:
> On Thu 26-01-17 11:08:02, Michal Hocko wrote:
>> On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
>>> On 01/26/2017 08:43 AM, Michal Hocko wrote:
>>>> On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
>> [...]
>>>>> I assume that kvzalloc() is still the same from [1], right? If so, then
>>>>> it would unfortunately (partially) reintroduce the issue that was fixed.
>>>>> If you look above at flags, they're also passed to __vmalloc() to not
>>>>> trigger OOM in these situations I've experienced.
>>>>
>>>> Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
>>>> think it would. It can still trigger the OOM killer becauset the flags
>>>> are no propagated all the way down to all allocations requests (e.g.
>>>> page tables). This is the same reason why GFP_NOFS is not supported in
>>>> vmalloc.
>>>
>>> Ok, good to know, is that somewhere clearly documented (like for the
>>> case with kmalloc())?
>>
>> I am afraid that we really suck on this front. I will add something.
>
> So I have folded the following to the patch 1. It is in line with
> kvmalloc and hopefully at least tell more than the current code.
> ---
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d89034a393f2..6c1aa2c68887 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>    *	Allocate enough pages to cover @size from the page level
>    *	allocator with @gfp_mask flags.  Map them into contiguous
>    *	kernel virtual space, using a pagetable protection of @prot.
> + *
> + *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
> + *	and __GFP_NOFAIL are not supported

We could probably also mention that __GFP_ZERO in @gfp_mask is
supported, though.

> + *	Any use of gfp flags outside of GFP_KERNEL should be consulted
> + *	with mm people.

Just a question: should that read 'GFP_KERNEL | __GFP_HIGHMEM' as
that is what vmalloc() resp. vzalloc() and others pass as flags?

> + *
>    */

Sounds good otherwise, thanks Michal!

>   static void *__vmalloc_node(unsigned long size, unsigned long align,
>   			    gfp_t gfp_mask, pgprot_t prot,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26 10:08       ` Michal Hocko
@ 2017-01-26 10:32         ` Michal Hocko
  2017-01-26 11:04           ` Daniel Borkmann
  2017-01-26 12:14           ` Joe Perches
  2017-01-26 11:33         ` Daniel Borkmann
  1 sibling, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 10:32 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 11:08:02, Michal Hocko wrote:
> On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> [...]
> > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > trigger OOM in these situations I've experienced.
> > > 
> > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > think it would. It can still trigger the OOM killer becauset the flags
> > > are no propagated all the way down to all allocations requests (e.g.
> > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > vmalloc.
> > 
> > Ok, good to know, is that somewhere clearly documented (like for the
> > case with kmalloc())?
> 
> I am afraid that we really suck on this front. I will add something.

So I have folded the following to the patch 1. It is in line with
kvmalloc and hopefully at least tell more than the current code.
---
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d89034a393f2..6c1aa2c68887 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *	Allocate enough pages to cover @size from the page level
  *	allocator with @gfp_mask flags.  Map them into contiguous
  *	kernel virtual space, using a pagetable protection of @prot.
+ *
+ *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ *	and __GFP_NOFAIL are not supported
+ *
+ *	Any use of gfp flags outside of GFP_KERNEL should be consulted
+ *	with mm people.
+ *
  */
 static void *__vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, pgprot_t prot,
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 10:32         ` Michal Hocko
  2017-01-26 11:04           ` Daniel Borkmann
  2017-01-26 12:14           ` Joe Perches
  0 siblings, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 10:32 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 11:08:02, Michal Hocko wrote:
> On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> > On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> [...]
> > > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > > If you look above at flags, they're also passed to __vmalloc() to not
> > > > trigger OOM in these situations I've experienced.
> > > 
> > > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > > think it would. It can still trigger the OOM killer becauset the flags
> > > are no propagated all the way down to all allocations requests (e.g.
> > > page tables). This is the same reason why GFP_NOFS is not supported in
> > > vmalloc.
> > 
> > Ok, good to know, is that somewhere clearly documented (like for the
> > case with kmalloc())?
> 
> I am afraid that we really suck on this front. I will add something.

So I have folded the following to the patch 1. It is in line with
kvmalloc and hopefully at least tell more than the current code.
---
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d89034a393f2..6c1aa2c68887 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1741,6 +1741,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *	Allocate enough pages to cover @size from the page level
  *	allocator with @gfp_mask flags.  Map them into contiguous
  *	kernel virtual space, using a pagetable protection of @prot.
+ *
+ *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ *	and __GFP_NOFAIL are not supported
+ *
+ *	Any use of gfp flags outside of GFP_KERNEL should be consulted
+ *	with mm people.
+ *
  */
 static void *__vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, pgprot_t prot,
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26  9:36     ` Daniel Borkmann
  2017-01-26  9:48       ` David Laight
@ 2017-01-26 10:08       ` Michal Hocko
  2017-01-26 10:32         ` Michal Hocko
  2017-01-26 11:33         ` Daniel Borkmann
  1 sibling, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 10:08 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
[...]
> > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > If you look above at flags, they're also passed to __vmalloc() to not
> > > trigger OOM in these situations I've experienced.
> > 
> > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > think it would. It can still trigger the OOM killer becauset the flags
> > are no propagated all the way down to all allocations requests (e.g.
> > page tables). This is the same reason why GFP_NOFS is not supported in
> > vmalloc.
> 
> Ok, good to know, is that somewhere clearly documented (like for the
> case with kmalloc())?

I am afraid that we really suck on this front. I will add something.

> If not, could we do that for non-mm folks, or
> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
> it obvious to users that a given flag combination is not supported all
> the way down?

I am not sure that triggering a warning that somebody has used
__GFP_NOWARN is very helpful ;). I also do not think that covering all the
supported flags is really feasible. Most of them will not have bad side
effects. I have added the warning because this API is new and I wanted
to catch new abusers. Old ones would have to die slowly.

> > > This is effectively the
> > > same requirement as in other networking areas f.e. that 5bad87348c70
> > > ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> > > In your comment in kvzalloc() you eventually say that some of the above
> > > modifiers are not supported. So there would be two options, i) just leave
> > > out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> > > it later (along with similar code from 5bad87348c70), or ii) implement
> > > support for these modifiers as well to your original set. I guess it's not
> > > too urgent, so we could also proceed with i) if that is easier for you to
> > > proceed (I don't mind either way).
> > 
> > Could you clarify why the oom killer in vmalloc matters actually?
> 
> For both mentioned commits, (privileged) user space can potentially
> create large allocation requests, where we thus switch to vmalloc()
> flavor eventually and then OOM starts killing processes to try to
> satisfy the allocation request. This is bad, because we want the
> request to just fail instead as it's non-critical and f.e. not kill
> ssh connection et al. Failing is totally fine in this case, whereas
> triggering OOM is not.

I see your intention but does it really make any real difference?
Consider you would back off right before you would have OOMed. Any
parallel request would just hit the OOM for you. You are (almost) never
doing an allocation in an isolation.

> In my testing, __GFP_NORETRY did satisfy this
> just fine, but as you say it seems it's not enough.

Yeah, ptes have been most probably popullated already.

> Given there are
> multiple places like these in the kernel, could we instead add an
> option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?

As said above I do not really think that suppressing the OOM killer
makes any difference because it might be just somebody else doing that
for you. Also the OOM killer is the MM internal implementation "detail"
users shouldn't really care. I agree that callers should have a way to
say they do not want to try really hard and that is not that simple
for vmalloc unfortunatelly. The main problem here is that gfp mask
propagation is not that easy to fix without a lot of code churn as some
of those hardcoded allocation requests are deep in call chains.

I know this sucks and it would be great to support __GFP_NORETRY to
[k]vmalloc and maybe we will get there eventually. But for the mean time
I really think that using kvmalloc wherever possible is much better than
open coded variants whith expectations which do not hold sometimes.

If you disagree I can drop the bpf part of course...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26 10:08       ` Michal Hocko
  2017-01-26 10:32         ` Michal Hocko
  2017-01-26 11:33         ` Daniel Borkmann
  0 siblings, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-26 10:08 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On Thu 26-01-17 10:36:49, Daniel Borkmann wrote:
> On 01/26/2017 08:43 AM, Michal Hocko wrote:
> > On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
[...]
> > > I assume that kvzalloc() is still the same from [1], right? If so, then
> > > it would unfortunately (partially) reintroduce the issue that was fixed.
> > > If you look above at flags, they're also passed to __vmalloc() to not
> > > trigger OOM in these situations I've experienced.
> > 
> > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > think it would. It can still trigger the OOM killer becauset the flags
> > are no propagated all the way down to all allocations requests (e.g.
> > page tables). This is the same reason why GFP_NOFS is not supported in
> > vmalloc.
> 
> Ok, good to know, is that somewhere clearly documented (like for the
> case with kmalloc())?

I am afraid that we really suck on this front. I will add something.

> If not, could we do that for non-mm folks, or
> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
> it obvious to users that a given flag combination is not supported all
> the way down?

I am not sure that triggering a warning that somebody has used
__GFP_NOWARN is very helpful ;). I also do not think that covering all the
supported flags is really feasible. Most of them will not have bad side
effects. I have added the warning because this API is new and I wanted
to catch new abusers. Old ones would have to die slowly.

> > > This is effectively the
> > > same requirement as in other networking areas f.e. that 5bad87348c70
> > > ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> > > In your comment in kvzalloc() you eventually say that some of the above
> > > modifiers are not supported. So there would be two options, i) just leave
> > > out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> > > it later (along with similar code from 5bad87348c70), or ii) implement
> > > support for these modifiers as well to your original set. I guess it's not
> > > too urgent, so we could also proceed with i) if that is easier for you to
> > > proceed (I don't mind either way).
> > 
> > Could you clarify why the oom killer in vmalloc matters actually?
> 
> For both mentioned commits, (privileged) user space can potentially
> create large allocation requests, where we thus switch to vmalloc()
> flavor eventually and then OOM starts killing processes to try to
> satisfy the allocation request. This is bad, because we want the
> request to just fail instead as it's non-critical and f.e. not kill
> ssh connection et al. Failing is totally fine in this case, whereas
> triggering OOM is not.

I see your intention but does it really make any real difference?
Consider you would back off right before you would have OOMed. Any
parallel request would just hit the OOM for you. You are (almost) never
doing an allocation in an isolation.

> In my testing, __GFP_NORETRY did satisfy this
> just fine, but as you say it seems it's not enough.

Yeah, ptes have been most probably popullated already.

> Given there are
> multiple places like these in the kernel, could we instead add an
> option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?

As said above I do not really think that suppressing the OOM killer
makes any difference because it might be just somebody else doing that
for you. Also the OOM killer is the MM internal implementation "detail"
users shouldn't really care. I agree that callers should have a way to
say they do not want to try really hard and that is not that simple
for vmalloc unfortunatelly. The main problem here is that gfp mask
propagation is not that easy to fix without a lot of code churn as some
of those hardcoded allocation requests are deep in call chains.

I know this sucks and it would be great to support __GFP_NORETRY to
[k]vmalloc and maybe we will get there eventually. But for the mean time
I really think that using kvmalloc wherever possible is much better than
open coded variants whith expectations which do not hold sometimes.

If you disagree I can drop the bpf part of course...
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* RE: [PATCH 0/6 v3] kvmalloc
  2017-01-26  9:36     ` Daniel Borkmann
@ 2017-01-26  9:48       ` David Laight
  2017-01-26 10:08       ` Michal Hocko
  1 sibling, 0 replies; 111+ messages in thread
From: David Laight @ 2017-01-26  9:48 UTC (permalink / raw)
  To: 'Daniel Borkmann', Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

From: Daniel Borkmann
> Sent: 26 January 2017 09:37
...
> >> I assume that kvzalloc() is still the same from [1], right? If so, then
> >> it would unfortunately (partially) reintroduce the issue that was fixed.
> >> If you look above at flags, they're also passed to __vmalloc() to not
> >> trigger OOM in these situations I've experienced.
> >
> > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > think it would. It can still trigger the OOM killer becauset the flags
> > are no propagated all the way down to all allocations requests (e.g.
> > page tables). This is the same reason why GFP_NOFS is not supported in
> > vmalloc.
> 
> Ok, good to know, is that somewhere clearly documented (like for the
> case with kmalloc())? If not, could we do that for non-mm folks, or
> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
> it obvious to users that a given flag combination is not supported all
> the way down?

ISTM that requests for the relatively small memory blocks needed for page
tables aren't really likely to invoke the OOM killer when it isn't already
being invoked by other actions. So that isn't really a problem.

More of a problem is that requests that you really don't mind failing
can use the last 'reasonably available' memory.
This will cause the next allocate to fail when it would be better for
the earlier one to fail instead.

	David

^ permalink raw reply	[flat|nested] 111+ messages in thread

* RE: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26  9:48       ` David Laight
  0 siblings, 0 replies; 111+ messages in thread
From: David Laight @ 2017-01-26  9:48 UTC (permalink / raw)
  To: 'Daniel Borkmann', Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

From: Daniel Borkmann
> Sent: 26 January 2017 09:37
...
> >> I assume that kvzalloc() is still the same from [1], right? If so, then
> >> it would unfortunately (partially) reintroduce the issue that was fixed.
> >> If you look above at flags, they're also passed to __vmalloc() to not
> >> trigger OOM in these situations I've experienced.
> >
> > Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> > think it would. It can still trigger the OOM killer becauset the flags
> > are no propagated all the way down to all allocations requests (e.g.
> > page tables). This is the same reason why GFP_NOFS is not supported in
> > vmalloc.
> 
> Ok, good to know, is that somewhere clearly documented (like for the
> case with kmalloc())? If not, could we do that for non-mm folks, or
> at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
> it obvious to users that a given flag combination is not supported all
> the way down?

ISTM that requests for the relatively small memory blocks needed for page
tables aren't really likely to invoke the OOM killer when it isn't already
being invoked by other actions. So that isn't really a problem.

More of a problem is that requests that you really don't mind failing
can use the last 'reasonably available' memory.
This will cause the next allocate to fail when it would be better for
the earlier one to fail instead.

	David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-26  7:43   ` Michal Hocko
@ 2017-01-26  9:36     ` Daniel Borkmann
  2017-01-26  9:48       ` David Laight
  2017-01-26 10:08       ` Michal Hocko
  0 siblings, 2 replies; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26  9:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 08:43 AM, Michal Hocko wrote:
> On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
>> On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
>>> On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>> On Wed 25-01-17 14:10:06, Michal Hocko wrote:
>>>>> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
>> [...]
>>>>>>> Are there any more comments? I would really appreciate to hear from
>>>>>>> networking folks before I resubmit the series.
>>>>>>
>>>>>> while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
>>>>>> which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
>>>>>> See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
>>>>>> it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
>>>>>> So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
>>>>>
>>>>> OK, will do. Thanks for the heads up.
>>>>
>>>> Just for the record, I will fold the following into the patch 1
>>>> ---
>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>> index 19b6129eab23..8697f43cf93c 100644
>>>> --- a/kernel/bpf/syscall.c
>>>> +++ b/kernel/bpf/syscall.c
>>>> @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>>>>
>>>>    void *bpf_map_area_alloc(size_t size)
>>>>    {
>>>> -       /* We definitely need __GFP_NORETRY, so OOM killer doesn't
>>>> -        * trigger under memory pressure as we really just want to
>>>> -        * fail instead.
>>>> -        */
>>>> -       const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
>>>> -       void *area;
>>>> -
>>>> -       if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>>>> -               area = kmalloc(size, GFP_USER | flags);
>>>> -               if (area != NULL)
>>>> -                       return area;
>>>> -       }
>>>> -
>>>> -       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
>>>> -                        PAGE_KERNEL);
>>>> +       return kvzalloc(size, GFP_USER);
>>>>    }
>>>>
>>>>    void bpf_map_area_free(void *area)
>>>
>>> Looks fine by me.
>>> Daniel, thoughts?
>>
>> I assume that kvzalloc() is still the same from [1], right? If so, then
>> it would unfortunately (partially) reintroduce the issue that was fixed.
>> If you look above at flags, they're also passed to __vmalloc() to not
>> trigger OOM in these situations I've experienced.
>
> Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> think it would. It can still trigger the OOM killer becauset the flags
> are no propagated all the way down to all allocations requests (e.g.
> page tables). This is the same reason why GFP_NOFS is not supported in
> vmalloc.

Ok, good to know, is that somewhere clearly documented (like for the
case with kmalloc())? If not, could we do that for non-mm folks, or
at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
it obvious to users that a given flag combination is not supported all
the way down?

>> This is effectively the
>> same requirement as in other networking areas f.e. that 5bad87348c70
>> ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
>> In your comment in kvzalloc() you eventually say that some of the above
>> modifiers are not supported. So there would be two options, i) just leave
>> out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
>> it later (along with similar code from 5bad87348c70), or ii) implement
>> support for these modifiers as well to your original set. I guess it's not
>> too urgent, so we could also proceed with i) if that is easier for you to
>> proceed (I don't mind either way).
>
> Could you clarify why the oom killer in vmalloc matters actually?

For both mentioned commits, (privileged) user space can potentially
create large allocation requests, where we thus switch to vmalloc()
flavor eventually and then OOM starts killing processes to try to
satisfy the allocation request. This is bad, because we want the
request to just fail instead as it's non-critical and f.e. not kill
ssh connection et al. Failing is totally fine in this case, whereas
triggering OOM is not. In my testing, __GFP_NORETRY did satisfy this
just fine, but as you say it seems it's not enough. Given there are
multiple places like these in the kernel, could we instead add an
option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26  9:36     ` Daniel Borkmann
  2017-01-26  9:48       ` David Laight
  2017-01-26 10:08       ` Michal Hocko
  0 siblings, 2 replies; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-26  9:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev, marcelo.leitner

On 01/26/2017 08:43 AM, Michal Hocko wrote:
> On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
>> On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
>>> On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>> On Wed 25-01-17 14:10:06, Michal Hocko wrote:
>>>>> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
>> [...]
>>>>>>> Are there any more comments? I would really appreciate to hear from
>>>>>>> networking folks before I resubmit the series.
>>>>>>
>>>>>> while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
>>>>>> which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
>>>>>> See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
>>>>>> it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
>>>>>> So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
>>>>>
>>>>> OK, will do. Thanks for the heads up.
>>>>
>>>> Just for the record, I will fold the following into the patch 1
>>>> ---
>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>> index 19b6129eab23..8697f43cf93c 100644
>>>> --- a/kernel/bpf/syscall.c
>>>> +++ b/kernel/bpf/syscall.c
>>>> @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>>>>
>>>>    void *bpf_map_area_alloc(size_t size)
>>>>    {
>>>> -       /* We definitely need __GFP_NORETRY, so OOM killer doesn't
>>>> -        * trigger under memory pressure as we really just want to
>>>> -        * fail instead.
>>>> -        */
>>>> -       const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
>>>> -       void *area;
>>>> -
>>>> -       if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>>>> -               area = kmalloc(size, GFP_USER | flags);
>>>> -               if (area != NULL)
>>>> -                       return area;
>>>> -       }
>>>> -
>>>> -       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
>>>> -                        PAGE_KERNEL);
>>>> +       return kvzalloc(size, GFP_USER);
>>>>    }
>>>>
>>>>    void bpf_map_area_free(void *area)
>>>
>>> Looks fine by me.
>>> Daniel, thoughts?
>>
>> I assume that kvzalloc() is still the same from [1], right? If so, then
>> it would unfortunately (partially) reintroduce the issue that was fixed.
>> If you look above at flags, they're also passed to __vmalloc() to not
>> trigger OOM in these situations I've experienced.
>
> Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
> think it would. It can still trigger the OOM killer becauset the flags
> are no propagated all the way down to all allocations requests (e.g.
> page tables). This is the same reason why GFP_NOFS is not supported in
> vmalloc.

Ok, good to know, is that somewhere clearly documented (like for the
case with kmalloc())? If not, could we do that for non-mm folks, or
at least add a similar WARN_ON_ONCE() as you did for kvmalloc() to make
it obvious to users that a given flag combination is not supported all
the way down?

>> This is effectively the
>> same requirement as in other networking areas f.e. that 5bad87348c70
>> ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
>> In your comment in kvzalloc() you eventually say that some of the above
>> modifiers are not supported. So there would be two options, i) just leave
>> out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
>> it later (along with similar code from 5bad87348c70), or ii) implement
>> support for these modifiers as well to your original set. I guess it's not
>> too urgent, so we could also proceed with i) if that is easier for you to
>> proceed (I don't mind either way).
>
> Could you clarify why the oom killer in vmalloc matters actually?

For both mentioned commits, (privileged) user space can potentially
create large allocation requests, where we thus switch to vmalloc()
flavor eventually and then OOM starts killing processes to try to
satisfy the allocation request. This is bad, because we want the
request to just fail instead as it's non-critical and f.e. not kill
ssh connection et al. Failing is totally fine in this case, whereas
triggering OOM is not. In my testing, __GFP_NORETRY did satisfy this
just fine, but as you say it seems it's not enough. Given there are
multiple places like these in the kernel, could we instead add an
option such as __GFP_NOOOM, or just make __GFP_NORETRY supported?

Thanks,
Daniel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-25 20:16 ` Daniel Borkmann
@ 2017-01-26  7:43   ` Michal Hocko
  2017-01-26  9:36     ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-26  7:43 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev

On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> > On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> > > > On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> [...]
> > > > > > Are there any more comments? I would really appreciate to hear from
> > > > > > networking folks before I resubmit the series.
> > > > > 
> > > > > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> > > > > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> > > > > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> > > > > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> > > > > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> > > > 
> > > > OK, will do. Thanks for the heads up.
> > > 
> > > Just for the record, I will fold the following into the patch 1
> > > ---
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 19b6129eab23..8697f43cf93c 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > > 
> > >   void *bpf_map_area_alloc(size_t size)
> > >   {
> > > -       /* We definitely need __GFP_NORETRY, so OOM killer doesn't
> > > -        * trigger under memory pressure as we really just want to
> > > -        * fail instead.
> > > -        */
> > > -       const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> > > -       void *area;
> > > -
> > > -       if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > > -               area = kmalloc(size, GFP_USER | flags);
> > > -               if (area != NULL)
> > > -                       return area;
> > > -       }
> > > -
> > > -       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> > > -                        PAGE_KERNEL);
> > > +       return kvzalloc(size, GFP_USER);
> > >   }
> > > 
> > >   void bpf_map_area_free(void *area)
> > 
> > Looks fine by me.
> > Daniel, thoughts?
> 
> I assume that kvzalloc() is still the same from [1], right? If so, then
> it would unfortunately (partially) reintroduce the issue that was fixed.
> If you look above at flags, they're also passed to __vmalloc() to not
> trigger OOM in these situations I've experienced.

Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
think it would. It can still trigger the OOM killer becauset the flags
are no propagated all the way down to all allocations requests (e.g.
page tables). This is the same reason why GFP_NOFS is not supported in
vmalloc.

> This is effectively the
> same requirement as in other networking areas f.e. that 5bad87348c70
> ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> In your comment in kvzalloc() you eventually say that some of the above
> modifiers are not supported. So there would be two options, i) just leave
> out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> it later (along with similar code from 5bad87348c70), or ii) implement
> support for these modifiers as well to your original set. I guess it's not
> too urgent, so we could also proceed with i) if that is easier for you to
> proceed (I don't mind either way).

Could you clarify why the oom killer in vmalloc matters actually?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-26  7:43   ` Michal Hocko
  2017-01-26  9:36     ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-26  7:43 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Andrew Morton, Vlastimil Babka, Mel Gorman,
	Johannes Weiner, linux-mm, LKML, netdev

On Wed 25-01-17 21:16:42, Daniel Borkmann wrote:
> On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> > On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> > > > On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> [...]
> > > > > > Are there any more comments? I would really appreciate to hear from
> > > > > > networking folks before I resubmit the series.
> > > > > 
> > > > > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> > > > > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> > > > > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> > > > > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> > > > > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> > > > 
> > > > OK, will do. Thanks for the heads up.
> > > 
> > > Just for the record, I will fold the following into the patch 1
> > > ---
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index 19b6129eab23..8697f43cf93c 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
> > > 
> > >   void *bpf_map_area_alloc(size_t size)
> > >   {
> > > -       /* We definitely need __GFP_NORETRY, so OOM killer doesn't
> > > -        * trigger under memory pressure as we really just want to
> > > -        * fail instead.
> > > -        */
> > > -       const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> > > -       void *area;
> > > -
> > > -       if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> > > -               area = kmalloc(size, GFP_USER | flags);
> > > -               if (area != NULL)
> > > -                       return area;
> > > -       }
> > > -
> > > -       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> > > -                        PAGE_KERNEL);
> > > +       return kvzalloc(size, GFP_USER);
> > >   }
> > > 
> > >   void bpf_map_area_free(void *area)
> > 
> > Looks fine by me.
> > Daniel, thoughts?
> 
> I assume that kvzalloc() is still the same from [1], right? If so, then
> it would unfortunately (partially) reintroduce the issue that was fixed.
> If you look above at flags, they're also passed to __vmalloc() to not
> trigger OOM in these situations I've experienced.

Pushing __GFP_NORETRY to __vmalloc doesn't have the effect you might
think it would. It can still trigger the OOM killer becauset the flags
are no propagated all the way down to all allocations requests (e.g.
page tables). This is the same reason why GFP_NOFS is not supported in
vmalloc.

> This is effectively the
> same requirement as in other networking areas f.e. that 5bad87348c70
> ("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
> In your comment in kvzalloc() you eventually say that some of the above
> modifiers are not supported. So there would be two options, i) just leave
> out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
> it later (along with similar code from 5bad87348c70), or ii) implement
> support for these modifiers as well to your original set. I guess it's not
> too urgent, so we could also proceed with i) if that is easier for you to
> proceed (I don't mind either way).

Could you clarify why the oom killer in vmalloc matters actually?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-25 18:14 Alexei Starovoitov
@ 2017-01-25 20:16 ` Daniel Borkmann
  2017-01-26  7:43   ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-25 20:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	linux-mm, LKML, netdev

On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> On Wed 25-01-17 14:10:06, Michal Hocko wrote:
>>> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
[...]
>>>>> Are there any more comments? I would really appreciate to hear from
>>>>> networking folks before I resubmit the series.
>>>>
>>>> while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
>>>> which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
>>>> See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
>>>> it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
>>>> So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
>>>
>>> OK, will do. Thanks for the heads up.
>>
>> Just for the record, I will fold the following into the patch 1
>> ---
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 19b6129eab23..8697f43cf93c 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>>
>>   void *bpf_map_area_alloc(size_t size)
>>   {
>> -       /* We definitely need __GFP_NORETRY, so OOM killer doesn't
>> -        * trigger under memory pressure as we really just want to
>> -        * fail instead.
>> -        */
>> -       const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
>> -       void *area;
>> -
>> -       if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>> -               area = kmalloc(size, GFP_USER | flags);
>> -               if (area != NULL)
>> -                       return area;
>> -       }
>> -
>> -       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
>> -                        PAGE_KERNEL);
>> +       return kvzalloc(size, GFP_USER);
>>   }
>>
>>   void bpf_map_area_free(void *area)
>
> Looks fine by me.
> Daniel, thoughts?

I assume that kvzalloc() is still the same from [1], right? If so, then
it would unfortunately (partially) reintroduce the issue that was fixed.
If you look above at flags, they're also passed to __vmalloc() to not
trigger OOM in these situations I've experienced. This is effectively the
same requirement as in other networking areas f.e. that 5bad87348c70
("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
In your comment in kvzalloc() you eventually say that some of the above
modifiers are not supported. So there would be two options, i) just leave
out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
it later (along with similar code from 5bad87348c70), or ii) implement
support for these modifiers as well to your original set. I guess it's not
too urgent, so we could also proceed with i) if that is easier for you to
proceed (I don't mind either way).

Thanks a lot,
Daniel

   [1] https://lkml.org/lkml/2017/1/12/442

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-25 20:16 ` Daniel Borkmann
  2017-01-26  7:43   ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Daniel Borkmann @ 2017-01-25 20:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	linux-mm, LKML, netdev

On 01/25/2017 07:14 PM, Alexei Starovoitov wrote:
> On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> On Wed 25-01-17 14:10:06, Michal Hocko wrote:
>>> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
[...]
>>>>> Are there any more comments? I would really appreciate to hear from
>>>>> networking folks before I resubmit the series.
>>>>
>>>> while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
>>>> which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
>>>> See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
>>>> it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
>>>> So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
>>>
>>> OK, will do. Thanks for the heads up.
>>
>> Just for the record, I will fold the following into the patch 1
>> ---
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 19b6129eab23..8697f43cf93c 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>>
>>   void *bpf_map_area_alloc(size_t size)
>>   {
>> -       /* We definitely need __GFP_NORETRY, so OOM killer doesn't
>> -        * trigger under memory pressure as we really just want to
>> -        * fail instead.
>> -        */
>> -       const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
>> -       void *area;
>> -
>> -       if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>> -               area = kmalloc(size, GFP_USER | flags);
>> -               if (area != NULL)
>> -                       return area;
>> -       }
>> -
>> -       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
>> -                        PAGE_KERNEL);
>> +       return kvzalloc(size, GFP_USER);
>>   }
>>
>>   void bpf_map_area_free(void *area)
>
> Looks fine by me.
> Daniel, thoughts?

I assume that kvzalloc() is still the same from [1], right? If so, then
it would unfortunately (partially) reintroduce the issue that was fixed.
If you look above at flags, they're also passed to __vmalloc() to not
trigger OOM in these situations I've experienced. This is effectively the
same requirement as in other networking areas f.e. that 5bad87348c70
("netfilter: x_tables: avoid warn and OOM killer on vmalloc call") has.
In your comment in kvzalloc() you eventually say that some of the above
modifiers are not supported. So there would be two options, i) just leave
out the kvzalloc() chunk for BPF area to avoid the merge conflict and tackle
it later (along with similar code from 5bad87348c70), or ii) implement
support for these modifiers as well to your original set. I guess it's not
too urgent, so we could also proceed with i) if that is easier for you to
proceed (I don't mind either way).

Thanks a lot,
Daniel

   [1] https://lkml.org/lkml/2017/1/12/442

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-25 18:14 Alexei Starovoitov
  2017-01-25 20:16 ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Alexei Starovoitov @ 2017-01-25 18:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	linux-mm, LKML, Daniel Borkmann, netdev

On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 25-01-17 14:10:06, Michal Hocko wrote:
>> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
>> > On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
>> > > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
>> > > > Hi,
>> > > > this has been previously posted as a single patch [1] but later on more
>> > > > built on top. It turned out that there are users who would like to have
>> > > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
>> > > > requests. Doing the same for smaller requests would require to redefine
>> > > > __GFP_REPEAT semantic in the page allocator which is out of scope of
>> > > > this series.
>> > > >
>> > > > There are many open coded kmalloc with vmalloc fallback instances in
>> > > > the tree.  Most of them are not careful enough or simply do not care
>> > > > about the underlying semantic of the kmalloc/page allocator which means
>> > > > that a) some vmalloc fallbacks are basically unreachable because the
>> > > > kmalloc part will keep retrying until it succeeds b) the page allocator
>> > > > can invoke a really disruptive steps like the OOM killer to move forward
>> > > > which doesn't sound appropriate when we consider that the vmalloc
>> > > > fallback is available.
>> > > >
>> > > > As it can be seen implementing kvmalloc requires quite an intimate
>> > > > knowledge if the page allocator and the memory reclaim internals which
>> > > > strongly suggests that a helper should be implemented in the memory
>> > > > subsystem proper.
>> > > >
>> > > > Most callers I could find have been converted to use the helper instead.
>> > > > This is patch 5. There are some more relying on __GFP_REPEAT in the
>> > > > networking stack which I have converted as well but considering we do
>> > > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
>> > > > have marked it RFC.
>> > >
>> > > Are there any more comments? I would really appreciate to hear from
>> > > networking folks before I resubmit the series.
>> >
>> > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
>> > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
>> > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
>> > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
>> > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
>>
>> OK, will do. Thanks for the heads up.
>
> Just for the record, I will fold the following into the patch 1
> ---
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 19b6129eab23..8697f43cf93c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>
>  void *bpf_map_area_alloc(size_t size)
>  {
> -       /* We definitely need __GFP_NORETRY, so OOM killer doesn't
> -        * trigger under memory pressure as we really just want to
> -        * fail instead.
> -        */
> -       const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> -       void *area;
> -
> -       if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> -               area = kmalloc(size, GFP_USER | flags);
> -               if (area != NULL)
> -                       return area;
> -       }
> -
> -       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> -                        PAGE_KERNEL);
> +       return kvzalloc(size, GFP_USER);
>  }
>
>  void bpf_map_area_free(void *area)

Looks fine by me.
Daniel, thoughts?

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-25 18:14 Alexei Starovoitov
  2017-01-25 20:16 ` Daniel Borkmann
  0 siblings, 1 reply; 111+ messages in thread
From: Alexei Starovoitov @ 2017-01-25 18:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	linux-mm, LKML, Daniel Borkmann, netdev

On Wed, Jan 25, 2017 at 5:21 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 25-01-17 14:10:06, Michal Hocko wrote:
>> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
>> > On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
>> > > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
>> > > > Hi,
>> > > > this has been previously posted as a single patch [1] but later on more
>> > > > built on top. It turned out that there are users who would like to have
>> > > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
>> > > > requests. Doing the same for smaller requests would require to redefine
>> > > > __GFP_REPEAT semantic in the page allocator which is out of scope of
>> > > > this series.
>> > > >
>> > > > There are many open coded kmalloc with vmalloc fallback instances in
>> > > > the tree.  Most of them are not careful enough or simply do not care
>> > > > about the underlying semantic of the kmalloc/page allocator which means
>> > > > that a) some vmalloc fallbacks are basically unreachable because the
>> > > > kmalloc part will keep retrying until it succeeds b) the page allocator
>> > > > can invoke a really disruptive steps like the OOM killer to move forward
>> > > > which doesn't sound appropriate when we consider that the vmalloc
>> > > > fallback is available.
>> > > >
>> > > > As it can be seen implementing kvmalloc requires quite an intimate
>> > > > knowledge if the page allocator and the memory reclaim internals which
>> > > > strongly suggests that a helper should be implemented in the memory
>> > > > subsystem proper.
>> > > >
>> > > > Most callers I could find have been converted to use the helper instead.
>> > > > This is patch 5. There are some more relying on __GFP_REPEAT in the
>> > > > networking stack which I have converted as well but considering we do
>> > > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
>> > > > have marked it RFC.
>> > >
>> > > Are there any more comments? I would really appreciate to hear from
>> > > networking folks before I resubmit the series.
>> >
>> > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
>> > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
>> > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
>> > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
>> > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
>>
>> OK, will do. Thanks for the heads up.
>
> Just for the record, I will fold the following into the patch 1
> ---
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 19b6129eab23..8697f43cf93c 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
>
>  void *bpf_map_area_alloc(size_t size)
>  {
> -       /* We definitely need __GFP_NORETRY, so OOM killer doesn't
> -        * trigger under memory pressure as we really just want to
> -        * fail instead.
> -        */
> -       const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
> -       void *area;
> -
> -       if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> -               area = kmalloc(size, GFP_USER | flags);
> -               if (area != NULL)
> -                       return area;
> -       }
> -
> -       return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
> -                        PAGE_KERNEL);
> +       return kvzalloc(size, GFP_USER);
>  }
>
>  void bpf_map_area_free(void *area)

Looks fine by me.
Daniel, thoughts?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-25 13:10     ` Michal Hocko
@ 2017-01-25 13:21       ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-25 13:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Eric Dumazet,
	Hariprasad S, Heiko Carstens, Herbert Xu, Ilya Dryomov,
	Kees Cook, Kent Overstreet, Martin Schwidefsky,
	Michael S. Tsirkin, Mike Snitzer, Oleg Drokin, Paolo Bonzini,
	Rafael J. Wysocki, Santosh Raspatur, Tariq Toukan,
	Theodore Ts'o, Tom Herbert, Tony Luck, Yan, Zheng,
	Yishai Hadas, Daniel Borkmann

On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> > On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> > > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > > > Hi,
> > > > this has been previously posted as a single patch [1] but later on more
> > > > built on top. It turned out that there are users who would like to have
> > > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
> > > > requests. Doing the same for smaller requests would require to redefine
> > > > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > > > this series.
> > > > 
> > > > There are many open coded kmalloc with vmalloc fallback instances in
> > > > the tree.  Most of them are not careful enough or simply do not care
> > > > about the underlying semantic of the kmalloc/page allocator which means
> > > > that a) some vmalloc fallbacks are basically unreachable because the
> > > > kmalloc part will keep retrying until it succeeds b) the page allocator
> > > > can invoke a really disruptive steps like the OOM killer to move forward
> > > > which doesn't sound appropriate when we consider that the vmalloc
> > > > fallback is available.
> > > > 
> > > > As it can be seen implementing kvmalloc requires quite an intimate
> > > > knowledge if the page allocator and the memory reclaim internals which
> > > > strongly suggests that a helper should be implemented in the memory
> > > > subsystem proper.
> > > > 
> > > > Most callers I could find have been converted to use the helper instead.
> > > > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > > > networking stack which I have converted as well but considering we do
> > > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > > > have marked it RFC.
> > > 
> > > Are there any more comments? I would really appreciate to hear from
> > > networking folks before I resubmit the series.
> > 
> > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> 
> OK, will do. Thanks for the heads up.

Just for the record, I will fold the following into the patch 1
---
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19b6129eab23..8697f43cf93c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
-	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-	void *area;
-
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
-		if (area != NULL)
-			return area;
-	}
-
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-			 PAGE_KERNEL);
+	return kvzalloc(size, GFP_USER);
 }
 
 void bpf_map_area_free(void *area)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-25 13:21       ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-25 13:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Eric Dumazet,
	Hariprasad S, Heiko Carstens, Herbert Xu, Ilya Dryomov,
	Kees Cook, Kent Overstreet, Martin Schwidefsky,
	Michael S. Tsirkin, Mike Snitzer, Oleg Drokin, Paolo Bonzini,
	Rafael J. Wysocki, Santosh Raspatur, Tariq Toukan,
	Theodore Ts'o, Tom Herbert, Tony Luck, Yan, Zheng,
	Yishai Hadas, Daniel Borkmann

On Wed 25-01-17 14:10:06, Michal Hocko wrote:
> On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> > On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> > > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > > > Hi,
> > > > this has been previously posted as a single patch [1] but later on more
> > > > built on top. It turned out that there are users who would like to have
> > > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
> > > > requests. Doing the same for smaller requests would require to redefine
> > > > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > > > this series.
> > > > 
> > > > There are many open coded kmalloc with vmalloc fallback instances in
> > > > the tree.  Most of them are not careful enough or simply do not care
> > > > about the underlying semantic of the kmalloc/page allocator which means
> > > > that a) some vmalloc fallbacks are basically unreachable because the
> > > > kmalloc part will keep retrying until it succeeds b) the page allocator
> > > > can invoke a really disruptive steps like the OOM killer to move forward
> > > > which doesn't sound appropriate when we consider that the vmalloc
> > > > fallback is available.
> > > > 
> > > > As it can be seen implementing kvmalloc requires quite an intimate
> > > > knowledge if the page allocator and the memory reclaim internals which
> > > > strongly suggests that a helper should be implemented in the memory
> > > > subsystem proper.
> > > > 
> > > > Most callers I could find have been converted to use the helper instead.
> > > > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > > > networking stack which I have converted as well but considering we do
> > > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > > > have marked it RFC.
> > > 
> > > Are there any more comments? I would really appreciate to hear from
> > > networking folks before I resubmit the series.
> > 
> > while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> > which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> > See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> > it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> > So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
> 
> OK, will do. Thanks for the heads up.

Just for the record, I will fold the following into the patch 1
---
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 19b6129eab23..8697f43cf93c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -53,21 +53,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl)
 
 void *bpf_map_area_alloc(size_t size)
 {
-	/* We definitely need __GFP_NORETRY, so OOM killer doesn't
-	 * trigger under memory pressure as we really just want to
-	 * fail instead.
-	 */
-	const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
-	void *area;
-
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		area = kmalloc(size, GFP_USER | flags);
-		if (area != NULL)
-			return area;
-	}
-
-	return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
-			 PAGE_KERNEL);
+	return kvzalloc(size, GFP_USER);
 }
 
 void bpf_map_area_free(void *area)

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-24 16:00   ` Eric Dumazet
@ 2017-01-25 13:10     ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-25 13:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Hariprasad S,
	Heiko Carstens, Herbert Xu, Ilya Dryomov, Kees Cook,
	Kent Overstreet, Martin Schwidefsky, Michael S. Tsirkin,
	Mike Snitzer, Oleg Drokin, Paolo Bonzini, Rafael J. Wysocki,
	Santosh Raspatur, Tariq Toukan, Theodore Ts'o, Tom Herbert,
	Tony Luck, Yan, Zheng, Yishai Hadas

On Tue 24-01-17 08:00:26, Eric Dumazet wrote:
> On Tue, 2017-01-24 at 16:17 +0100, Michal Hocko wrote:
> > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> 
> > Are there any more comments? I would really appreciate to hear from
> > networking folks before I resubmit the series.
> 
> I do not see any issues right now.
> 
> I am happy to see this thing finally coming, after years of
> resistance ;)

OK, so I will repost the series and ask Andrew for inclusion
after it passes my compile test battery after the rebase.
 
Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-25 13:10     ` Michal Hocko
  0 siblings, 0 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-25 13:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Hariprasad S,
	Heiko Carstens, Herbert Xu, Ilya Dryomov, Kees Cook,
	Kent Overstreet, Martin Schwidefsky, Michael S. Tsirkin,
	Mike Snitzer, Oleg Drokin, Paolo Bonzini, Rafael J. Wysocki,
	Santosh Raspatur, Tariq Toukan, Theodore Ts'o, Tom Herbert,
	Tony Luck, Yan, Zheng, Yishai Hadas

On Tue 24-01-17 08:00:26, Eric Dumazet wrote:
> On Tue, 2017-01-24 at 16:17 +0100, Michal Hocko wrote:
> > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> 
> > Are there any more comments? I would really appreciate to hear from
> > networking folks before I resubmit the series.
> 
> I do not see any issues right now.
> 
> I am happy to see this thing finally coming, after years of
> resistance ;)

OK, so I will repost the series and ask Andrew for inclusion
after it passes my compile test battery after the rebase.
 
Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-24 19:17   ` Alexei Starovoitov
@ 2017-01-25 13:10     ` Michal Hocko
  2017-01-25 13:21       ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-25 13:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Eric Dumazet,
	Hariprasad S, Heiko Carstens, Herbert Xu, Ilya Dryomov,
	Kees Cook, Kent Overstreet, Martin Schwidefsky,
	Michael S. Tsirkin, Mike Snitzer, Oleg Drokin, Paolo Bonzini,
	Rafael J. Wysocki, Santosh Raspatur, Tariq Toukan,
	Theodore Ts'o, Tom Herbert, Tony Luck, Yan, Zheng,
	Yishai Hadas, Daniel Borkmann

On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > > Hi,
> > > this has been previously posted as a single patch [1] but later on more
> > > built on top. It turned out that there are users who would like to have
> > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
> > > requests. Doing the same for smaller requests would require to redefine
> > > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > > this series.
> > > 
> > > There are many open coded kmalloc with vmalloc fallback instances in
> > > the tree.  Most of them are not careful enough or simply do not care
> > > about the underlying semantic of the kmalloc/page allocator which means
> > > that a) some vmalloc fallbacks are basically unreachable because the
> > > kmalloc part will keep retrying until it succeeds b) the page allocator
> > > can invoke a really disruptive steps like the OOM killer to move forward
> > > which doesn't sound appropriate when we consider that the vmalloc
> > > fallback is available.
> > > 
> > > As it can be seen implementing kvmalloc requires quite an intimate
> > > knowledge if the page allocator and the memory reclaim internals which
> > > strongly suggests that a helper should be implemented in the memory
> > > subsystem proper.
> > > 
> > > Most callers I could find have been converted to use the helper instead.
> > > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > > networking stack which I have converted as well but considering we do
> > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > > have marked it RFC.
> > 
> > Are there any more comments? I would really appreciate to hear from
> > networking folks before I resubmit the series.
> 
> while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> So please rebase and switch bpf_map_area_alloc() to use kvmalloc().

OK, will do. Thanks for the heads up.


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-25 13:10     ` Michal Hocko
  2017-01-25 13:21       ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-25 13:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Eric Dumazet,
	Hariprasad S, Heiko Carstens, Herbert Xu, Ilya Dryomov,
	Kees Cook, Kent Overstreet, Martin Schwidefsky,
	Michael S. Tsirkin, Mike Snitzer, Oleg Drokin, Paolo Bonzini,
	Rafael J. Wysocki, Santosh Raspatur, Tariq Toukan,
	Theodore Ts'o, Tom Herbert, Tony Luck, Yan, Zheng,
	Yishai Hadas, Daniel Borkmann

On Tue 24-01-17 11:17:21, Alexei Starovoitov wrote:
> On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> > On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > > Hi,
> > > this has been previously posted as a single patch [1] but later on more
> > > built on top. It turned out that there are users who would like to have
> > > __GFP_REPEAT semantic. This is currently implemented for costly >64B
> > > requests. Doing the same for smaller requests would require to redefine
> > > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > > this series.
> > > 
> > > There are many open coded kmalloc with vmalloc fallback instances in
> > > the tree.  Most of them are not careful enough or simply do not care
> > > about the underlying semantic of the kmalloc/page allocator which means
> > > that a) some vmalloc fallbacks are basically unreachable because the
> > > kmalloc part will keep retrying until it succeeds b) the page allocator
> > > can invoke a really disruptive steps like the OOM killer to move forward
> > > which doesn't sound appropriate when we consider that the vmalloc
> > > fallback is available.
> > > 
> > > As it can be seen implementing kvmalloc requires quite an intimate
> > > knowledge if the page allocator and the memory reclaim internals which
> > > strongly suggests that a helper should be implemented in the memory
> > > subsystem proper.
> > > 
> > > Most callers I could find have been converted to use the helper instead.
> > > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > > networking stack which I have converted as well but considering we do
> > > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > > have marked it RFC.
> > 
> > Are there any more comments? I would really appreciate to hear from
> > networking folks before I resubmit the series.
> 
> while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
> which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
> See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
> it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
> So please rebase and switch bpf_map_area_alloc() to use kvmalloc().

OK, will do. Thanks for the heads up.


-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-24 15:17 ` Michal Hocko
  2017-01-24 16:00   ` Eric Dumazet
@ 2017-01-24 19:17   ` Alexei Starovoitov
  2017-01-25 13:10     ` Michal Hocko
  1 sibling, 1 reply; 111+ messages in thread
From: Alexei Starovoitov @ 2017-01-24 19:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Eric Dumazet,
	Hariprasad S, Heiko Carstens, Herbert Xu, Ilya Dryomov,
	Kees Cook, Kent Overstreet, Martin Schwidefsky,
	Michael S. Tsirkin, Mike Snitzer, Oleg Drokin, Paolo Bonzini,
	Rafael J. Wysocki, Santosh Raspatur, Tariq Toukan,
	Theodore Ts'o, Tom Herbert, Tony Luck, Yan, Zheng,
	Yishai Hadas, Daniel Borkmann

On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > Hi,
> > this has been previously posted as a single patch [1] but later on more
> > built on top. It turned out that there are users who would like to have
> > __GFP_REPEAT semantic. This is currently implemented for costly >64B
> > requests. Doing the same for smaller requests would require to redefine
> > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > this series.
> > 
> > There are many open coded kmalloc with vmalloc fallback instances in
> > the tree.  Most of them are not careful enough or simply do not care
> > about the underlying semantic of the kmalloc/page allocator which means
> > that a) some vmalloc fallbacks are basically unreachable because the
> > kmalloc part will keep retrying until it succeeds b) the page allocator
> > can invoke a really disruptive steps like the OOM killer to move forward
> > which doesn't sound appropriate when we consider that the vmalloc
> > fallback is available.
> > 
> > As it can be seen implementing kvmalloc requires quite an intimate
> > knowledge if the page allocator and the memory reclaim internals which
> > strongly suggests that a helper should be implemented in the memory
> > subsystem proper.
> > 
> > Most callers I could find have been converted to use the helper instead.
> > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > networking stack which I have converted as well but considering we do
> > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > have marked it RFC.
> 
> Are there any more comments? I would really appreciate to hear from
> networking folks before I resubmit the series.

while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
Thanks

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-24 19:17   ` Alexei Starovoitov
  2017-01-25 13:10     ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Alexei Starovoitov @ 2017-01-24 19:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Eric Dumazet,
	Hariprasad S, Heiko Carstens, Herbert Xu, Ilya Dryomov,
	Kees Cook, Kent Overstreet, Martin Schwidefsky,
	Michael S. Tsirkin, Mike Snitzer, Oleg Drokin, Paolo Bonzini,
	Rafael J. Wysocki, Santosh Raspatur, Tariq Toukan,
	Theodore Ts'o, Tom Herbert, Tony Luck, Yan, Zheng,
	Yishai Hadas, Daniel Borkmann

On Tue, Jan 24, 2017 at 04:17:52PM +0100, Michal Hocko wrote:
> On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> > Hi,
> > this has been previously posted as a single patch [1] but later on more
> > built on top. It turned out that there are users who would like to have
> > __GFP_REPEAT semantic. This is currently implemented for costly >64B
> > requests. Doing the same for smaller requests would require to redefine
> > __GFP_REPEAT semantic in the page allocator which is out of scope of
> > this series.
> > 
> > There are many open coded kmalloc with vmalloc fallback instances in
> > the tree.  Most of them are not careful enough or simply do not care
> > about the underlying semantic of the kmalloc/page allocator which means
> > that a) some vmalloc fallbacks are basically unreachable because the
> > kmalloc part will keep retrying until it succeeds b) the page allocator
> > can invoke a really disruptive steps like the OOM killer to move forward
> > which doesn't sound appropriate when we consider that the vmalloc
> > fallback is available.
> > 
> > As it can be seen implementing kvmalloc requires quite an intimate
> > knowledge if the page allocator and the memory reclaim internals which
> > strongly suggests that a helper should be implemented in the memory
> > subsystem proper.
> > 
> > Most callers I could find have been converted to use the helper instead.
> > This is patch 5. There are some more relying on __GFP_REPEAT in the
> > networking stack which I have converted as well but considering we do
> > not have a support for __GFP_REPEAT for requests smaller than 64kB I
> > have marked it RFC.
> 
> Are there any more comments? I would really appreciate to hear from
> networking folks before I resubmit the series.

while this patchset was baking the bpf side switched to use bpf_map_area_alloc()
which fixes the issue with missing __GFP_NORETRY that we had to fix quickly.
See commit d407bd25a204 ("bpf: don't trigger OOM killer under pressure with map alloc")
it covers all kmalloc/vmalloc pairs instead of just one place as in this set.
So please rebase and switch bpf_map_area_alloc() to use kvmalloc().
Thanks

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-24 15:17 ` Michal Hocko
@ 2017-01-24 16:00   ` Eric Dumazet
  2017-01-25 13:10     ` Michal Hocko
  2017-01-24 19:17   ` Alexei Starovoitov
  1 sibling, 1 reply; 111+ messages in thread
From: Eric Dumazet @ 2017-01-24 16:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Hariprasad S,
	Heiko Carstens, Herbert Xu, Ilya Dryomov, Kees Cook,
	Kent Overstreet, Martin Schwidefsky, Michael S. Tsirkin,
	Mike Snitzer, Oleg Drokin, Paolo Bonzini, Rafael J. Wysocki,
	Santosh Raspatur, Tariq Toukan, Theodore Ts'o, Tom Herbert,
	Tony Luck, Yan, Zheng, Yishai Hadas

On Tue, 2017-01-24 at 16:17 +0100, Michal Hocko wrote:
> On Thu 12-01-17 16:37:11, Michal Hocko wrote:

> Are there any more comments? I would really appreciate to hear from
> networking folks before I resubmit the series.

I do not see any issues right now.

I am happy to see this thing finally coming, after years of
resistance ;)

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-24 16:00   ` Eric Dumazet
  2017-01-25 13:10     ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Eric Dumazet @ 2017-01-24 16:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, David Rientjes, Mel Gorman,
	Johannes Weiner, Al Viro, linux-mm, LKML, Alexei Starovoitov,
	Anatoly Stepanov, Andreas Dilger, Andreas Dilger,
	Anton Vorontsov, Ben Skeggs, Boris Ostrovsky, Colin Cross,
	Dan Williams, David Sterba, Eric Dumazet, Hariprasad S,
	Heiko Carstens, Herbert Xu, Ilya Dryomov, Kees Cook,
	Kent Overstreet, Martin Schwidefsky, Michael S. Tsirkin,
	Mike Snitzer, Oleg Drokin, Paolo Bonzini, Rafael J. Wysocki,
	Santosh Raspatur, Tariq Toukan, Theodore Ts'o, Tom Herbert,
	Tony Luck, Yan, Zheng, Yishai Hadas

On Tue, 2017-01-24 at 16:17 +0100, Michal Hocko wrote:
> On Thu 12-01-17 16:37:11, Michal Hocko wrote:

> Are there any more comments? I would really appreciate to hear from
> networking folks before I resubmit the series.

I do not see any issues right now.

I am happy to see this thing finally coming, after years of
resistance ;)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
  2017-01-12 15:37 Michal Hocko
@ 2017-01-24 15:17 ` Michal Hocko
  2017-01-24 16:00   ` Eric Dumazet
  2017-01-24 19:17   ` Alexei Starovoitov
  0 siblings, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-24 15:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Anatoly Stepanov,
	Andreas Dilger, Andreas Dilger, Anton Vorontsov, Ben Skeggs,
	Boris Ostrovsky, Colin Cross, Dan Williams, David Sterba,
	Eric Dumazet, Eric Dumazet, Hariprasad S, Heiko Carstens,
	Herbert Xu, Ilya Dryomov, Kees Cook, Kent Overstreet,
	Martin Schwidefsky, Michael S. Tsirkin, Mike Snitzer,
	Oleg Drokin, Paolo Bonzini, Rafael J. Wysocki, Santosh Raspatur,
	Tariq Toukan, Theodore Ts'o, Tom Herbert, Tony Luck, Yan,
	Zheng, Yishai Hadas

On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> Hi,
> this has been previously posted as a single patch [1] but later on more
> built on top. It turned out that there are users who would like to have
> __GFP_REPEAT semantic. This is currently implemented for costly >64B
> requests. Doing the same for smaller requests would require to redefine
> __GFP_REPEAT semantic in the page allocator which is out of scope of
> this series.
> 
> There are many open coded kmalloc with vmalloc fallback instances in
> the tree.  Most of them are not careful enough or simply do not care
> about the underlying semantic of the kmalloc/page allocator which means
> that a) some vmalloc fallbacks are basically unreachable because the
> kmalloc part will keep retrying until it succeeds b) the page allocator
> can invoke a really disruptive steps like the OOM killer to move forward
> which doesn't sound appropriate when we consider that the vmalloc
> fallback is available.
> 
> As it can be seen implementing kvmalloc requires quite an intimate
> knowledge if the page allocator and the memory reclaim internals which
> strongly suggests that a helper should be implemented in the memory
> subsystem proper.
> 
> Most callers I could find have been converted to use the helper instead.
> This is patch 5. There are some more relying on __GFP_REPEAT in the
> networking stack which I have converted as well but considering we do
> not have a support for __GFP_REPEAT for requests smaller than 64kB I
> have marked it RFC.

Are there any more comments? I would really appreciate to hear from
networking folks before I resubmit the series.

Thanks!

> [1] http://lkml.kernel.org/r/20170102133700.1734-1-mhocko@kernel.org
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 111+ messages in thread

* Re: [PATCH 0/6 v3] kvmalloc
@ 2017-01-24 15:17 ` Michal Hocko
  2017-01-24 16:00   ` Eric Dumazet
  2017-01-24 19:17   ` Alexei Starovoitov
  0 siblings, 2 replies; 111+ messages in thread
From: Michal Hocko @ 2017-01-24 15:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Anatoly Stepanov,
	Andreas Dilger, Andreas Dilger, Anton Vorontsov, Ben Skeggs,
	Boris Ostrovsky, Colin Cross, Dan Williams, David Sterba,
	Eric Dumazet, Eric Dumazet, Hariprasad S, Heiko Carstens,
	Herbert Xu, Ilya Dryomov, Kees Cook, Kent Overstreet,
	Martin Schwidefsky, Michael S. Tsirkin, Mike Snitzer,
	Oleg Drokin, Paolo Bonzini, Rafael J. Wysocki, Santosh Raspatur,
	Tariq Toukan, Theodore Ts'o, Tom Herbert, Tony Luck, Yan,
	Zheng, Yishai Hadas

On Thu 12-01-17 16:37:11, Michal Hocko wrote:
> Hi,
> this has been previously posted as a single patch [1] but later on more
> built on top. It turned out that there are users who would like to have
> __GFP_REPEAT semantic. This is currently implemented for costly >64B
> requests. Doing the same for smaller requests would require to redefine
> __GFP_REPEAT semantic in the page allocator which is out of scope of
> this series.
> 
> There are many open coded kmalloc with vmalloc fallback instances in
> the tree.  Most of them are not careful enough or simply do not care
> about the underlying semantic of the kmalloc/page allocator which means
> that a) some vmalloc fallbacks are basically unreachable because the
> kmalloc part will keep retrying until it succeeds b) the page allocator
> can invoke a really disruptive steps like the OOM killer to move forward
> which doesn't sound appropriate when we consider that the vmalloc
> fallback is available.
> 
> As it can be seen implementing kvmalloc requires quite an intimate
> knowledge if the page allocator and the memory reclaim internals which
> strongly suggests that a helper should be implemented in the memory
> subsystem proper.
> 
> Most callers I could find have been converted to use the helper instead.
> This is patch 5. There are some more relying on __GFP_REPEAT in the
> networking stack which I have converted as well but considering we do
> not have a support for __GFP_REPEAT for requests smaller than 64kB I
> have marked it RFC.

Are there any more comments? I would really appreciate to hear from
networking folks before I resubmit the series.

Thanks!

> [1] http://lkml.kernel.org/r/20170102133700.1734-1-mhocko@kernel.org
> 

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 0/6 v3] kvmalloc
@ 2017-01-12 15:37 Michal Hocko
  2017-01-24 15:17 ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-12 15:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Anatoly Stepanov,
	Andreas Dilger, Andreas Dilger, Anton Vorontsov, Ben Skeggs,
	Boris Ostrovsky, Colin Cross, Dan Williams, David Sterba,
	Eric Dumazet, Eric Dumazet, Hariprasad S, Heiko Carstens,
	Herbert Xu, Ilya Dryomov, Kees Cook, Kent Overstreet,
	Martin Schwidefsky, Michael S. Tsirkin, Michal Hocko,
	Mike Snitzer, Oleg Drokin, Paolo Bonzini, Rafael J. Wysocki,
	Santosh Raspatur, Tariq Toukan, Theodore Ts'o, Tom Herbert,
	Tony Luck, Yan, Zheng, Yishai Hadas

Hi,
this has been previously posted as a single patch [1] but later on more
built on top. It turned out that there are users who would like to have
__GFP_REPEAT semantic. This is currently implemented for costly >64B
requests. Doing the same for smaller requests would require to redefine
__GFP_REPEAT semantic in the page allocator which is out of scope of
this series.

There are many open coded kmalloc with vmalloc fallback instances in
the tree.  Most of them are not careful enough or simply do not care
about the underlying semantic of the kmalloc/page allocator which means
that a) some vmalloc fallbacks are basically unreachable because the
kmalloc part will keep retrying until it succeeds b) the page allocator
can invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers I could find have been converted to use the helper instead.
This is patch 5. There are some more relying on __GFP_REPEAT in the
networking stack which I have converted as well but considering we do
not have a support for __GFP_REPEAT for requests smaller than 64kB I
have marked it RFC.

[1] http://lkml.kernel.org/r/20170102133700.1734-1-mhocko@kernel.org

^ permalink raw reply	[flat|nested] 111+ messages in thread

* [PATCH 0/6 v3] kvmalloc
@ 2017-01-12 15:37 Michal Hocko
  2017-01-24 15:17 ` Michal Hocko
  0 siblings, 1 reply; 111+ messages in thread
From: Michal Hocko @ 2017-01-12 15:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, David Rientjes, Mel Gorman, Johannes Weiner,
	Al Viro, linux-mm, LKML, Alexei Starovoitov, Anatoly Stepanov,
	Andreas Dilger, Andreas Dilger, Anton Vorontsov, Ben Skeggs,
	Boris Ostrovsky, Colin Cross, Dan Williams, David Sterba,
	Eric Dumazet, Eric Dumazet, Hariprasad S, Heiko Carstens,
	Herbert Xu, Ilya Dryomov, Kees Cook, Kent Overstreet,
	Martin Schwidefsky, Michael S. Tsirkin, Michal Hocko,
	Mike Snitzer, Oleg Drokin, Paolo Bonzini, Rafael J. Wysocki,
	Santosh Raspatur, Tariq Toukan, Theodore Ts'o, Tom Herbert,
	Tony Luck, Yan, Zheng, Yishai Hadas

Hi,
this has been previously posted as a single patch [1] but later on more
built on top. It turned out that there are users who would like to have
__GFP_REPEAT semantic. This is currently implemented for costly >64B
requests. Doing the same for smaller requests would require to redefine
__GFP_REPEAT semantic in the page allocator which is out of scope of
this series.

There are many open coded kmalloc with vmalloc fallback instances in
the tree.  Most of them are not careful enough or simply do not care
about the underlying semantic of the kmalloc/page allocator which means
that a) some vmalloc fallbacks are basically unreachable because the
kmalloc part will keep retrying until it succeeds b) the page allocator
can invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers I could find have been converted to use the helper instead.
This is patch 5. There are some more relying on __GFP_REPEAT in the
networking stack which I have converted as well but considering we do
not have a support for __GFP_REPEAT for requests smaller than 64kB I
have marked it RFC.

[1] http://lkml.kernel.org/r/20170102133700.1734-1-mhocko@kernel.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 111+ messages in thread

end of thread, other threads:[~2017-02-05 10:23 UTC | newest]

Thread overview: 111+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-30  9:49 [PATCH 0/6 v3] kvmalloc Michal Hocko
2017-01-30  9:49 ` [PATCH 1/9] mm: introduce kv[mz]alloc helpers Michal Hocko
2017-01-30  9:49 ` [PATCH 2/9] mm: support __GFP_REPEAT in kvmalloc_node for >32kB Michal Hocko
2017-01-30  9:49 ` [PATCH 3/9] rhashtable: simplify a strange allocation pattern Michal Hocko
2017-01-30 14:04   ` Vlastimil Babka
2017-01-30  9:49 ` [PATCH 4/9] ila: " Michal Hocko
2017-01-30 15:21   ` Vlastimil Babka
2017-01-30  9:49 ` [PATCH 5/9] treewide: use kv[mz]alloc* rather than opencoded variants Michal Hocko
2017-01-30 10:21   ` Leon Romanovsky
2017-01-30 16:14   ` Vlastimil Babka
2017-01-30 19:24   ` Kees Cook
2017-01-30  9:49 ` [PATCH 6/9] net: use kvmalloc with __GFP_REPEAT rather than open coded variant Michal Hocko
2017-01-30 16:36   ` Vlastimil Babka
2017-01-30  9:49 ` [PATCH 7/9] md: use kvmalloc rather than opencoded variant Michal Hocko
2017-01-30 16:42   ` Vlastimil Babka
2017-02-01 17:29   ` Mikulas Patocka
2017-02-01 17:58     ` Michal Hocko
2017-01-30  9:49 ` [PATCH 8/9] bcache: use kvmalloc Michal Hocko
2017-01-30 16:47   ` Vlastimil Babka
2017-01-30 17:25     ` Michal Hocko
2017-01-30  9:49 ` [PATCH 9/9] net, bpf: use kvzalloc helper Michal Hocko
2017-01-30 17:20   ` Michal Hocko
2017-02-05 10:23 ` [PATCH 0/6 v3] kvmalloc Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2017-01-25 18:14 Alexei Starovoitov
2017-01-25 20:16 ` Daniel Borkmann
2017-01-26  7:43   ` Michal Hocko
2017-01-26  9:36     ` Daniel Borkmann
2017-01-26  9:48       ` David Laight
2017-01-26 10:08       ` Michal Hocko
2017-01-26 10:32         ` Michal Hocko
2017-01-26 11:04           ` Daniel Borkmann
2017-01-26 11:49             ` Michal Hocko
2017-01-26 12:14           ` Joe Perches
2017-01-26 12:27             ` Michal Hocko
2017-01-26 11:33         ` Daniel Borkmann
2017-01-26 11:58           ` Michal Hocko
2017-01-26 13:10             ` Daniel Borkmann
2017-01-26 13:40               ` Michal Hocko
2017-01-26 14:13                 ` Michal Hocko
2017-01-26 20:34                 ` Daniel Borkmann
2017-01-27 10:05                   ` Michal Hocko
2017-01-27 20:12                     ` Daniel Borkmann
2017-01-30  7:56                       ` Michal Hocko
2017-01-30 16:15                         ` Daniel Borkmann
2017-01-30 16:28                           ` Michal Hocko
2017-01-30 16:45                             ` Daniel Borkmann
2017-01-12 15:37 Michal Hocko
2017-01-24 15:17 ` Michal Hocko
2017-01-24 16:00   ` Eric Dumazet
2017-01-25 13:10     ` Michal Hocko
2017-01-24 19:17   ` Alexei Starovoitov
2017-01-25 13:10     ` Michal Hocko
2017-01-25 13:21       ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.