[PATCH v2 0/2] mm: Allow kmalloc() allocations below ARCH_KMALLOC_MINALIGN

* [PATCH v2 0/2] mm: Allow kmalloc() allocations below ARCH_KMALLOC_MINALIGN
@ 2022-10-25 20:52 ` Catalin Marinas
  0 siblings, 0 replies; 74+ messages in thread
From: Catalin Marinas @ 2022-10-25 20:52 UTC (permalink / raw)
  To: Linus Torvalds, Arnd Bergmann
  Cc: Will Deacon, Marc Zyngier, Greg Kroah-Hartman, Andrew Morton,
	Herbert Xu, Ard Biesheuvel, Christoph Hellwig, Isaac Manjarres,
	Saravana Kannan, linux-mm, linux-arm-kernel

Hi,

I called this 'v2' but it's a different incarnation of the previous
attempt to reduce ARCH_KMALLOC_MINALIGN for arm64 (and please treat it
as an RFC):

https://lore.kernel.org/all/20220405135758.774016-1-catalin.marinas@arm.com/

The long discussion in the thread above seemed to converge towards
bouncing of the DMA buffers that came from small kmalloc() caches.
However, there are a few problems with such approach:

- We still need ARCH_KMALLOC_MINALIGN to match the smaller kmalloc()
  size but this breaks crypto without changing the structure alignments
  to the larger ARCH_DMA_MINALIGN (not agreed upon yet).

- Not all systems, especially mobile devices, have a swiotlb buffer
  (most boot with 'noswiotlb'). Even of we allow a small bounce buffer,
  it's hard to come up with a size that would please the mobile vendors.

- An alternative to the default bounce buffer is to use another
  kmalloc() for the bouncing, though it adds an overhead, potential
  allocation failures and DMA unmapping cannot be easily tracked back to
  the bounce kmalloc'ed buffer (no simple is_swiotlb_buffer()).

- IOMMU needs bouncing as well when cache maintenance is required. This
  seems to get pretty complicated in the presence of scatterlists
  (though probably doable with enough dedication but it would add an
  overhead with the additional scatterlist scanning).

- Custom dma_ops - not too many but they'd also need the bouncing logic.

- Undecided on whether we should only bounce small sizes or any
  unaligned size. The latter would end up bouncing large-ish buffers
  (network) but it has the advantage that we don't need to change the
  crypto code if the ARCH_KMALLOC_MINALIGN becomes 8 (well, it will
  bounce all the DMA until the code is changed).

So, with this patchset, I've given up on trying to sort out the bouncing
and went for an explicit opt-in to smaller kmalloc() sizes via a newly
introduced __GFP_PACKED flag which would return objects aligned to
KMALLOC_PACKED_ALIGN (8 bytes with slub, 32 with slab). The second patch
sprinkles the kernel with several gfp flag updates where some hot
allocations seem to be on my hardware (the commit log has instructions
on how to identify them).

On a Cavium TX2, after boot, I get:

kmalloc-128        24608  24608    128   32    1 : tunables    0    0    0 : slabdata    769    769      0
kmalloc-96          9408   9660     96   42    1 : tunables    0    0    0 : slabdata    230    230      0
kmalloc-64          9594   9728     64   64    1 : tunables    0    0    0 : slabdata    152    152      0
kmalloc-32         14052  14464     32  128    1 : tunables    0    0    0 : slabdata    113    113      0
kmalloc-16         26368  26368     16  256    1 : tunables    0    0    0 : slabdata    103    103      0
kmalloc-8          36352  36352      8  512    1 : tunables    0    0    0 : slabdata     71     71      0

It's still not ideal and I suspect there may be some early allocations
even prior to ftrace identifying them but there is a significant
improvement without breaking any of the existing cases.

To me, the ideal approach would be __dma annotations on pointers aimed
at DMA and using kernel tools like sparse to identify them. A
dma_kmalloc() would return such pointers. However, things get muddier
with scatterlists since functions like sg_set_page() don't have any such
pointer information (can we mark the offset as __dma instead?).

Comments/suggestions are welcome but, as per above, I'd like to stay
away from bouncing if possible.

Thanks.

Catalin Marinas (2):
  mm: slab: Introduce __GFP_PACKED for smaller kmalloc() alignments
  treewide: Add the __GFP_PACKED flag to several non-DMA kmalloc()
    allocations

 drivers/usb/core/message.c           |  3 ++-
 fs/binfmt_elf.c                      |  6 ++++--
 fs/dcache.c                          |  3 ++-
 fs/ext4/dir.c                        |  4 ++--
 fs/ext4/extents.c                    |  4 ++--
 fs/file.c                            |  2 +-
 fs/kernfs/file.c                     |  8 ++++----
 fs/nfs/dir.c                         |  7 ++++---
 fs/nfs/inode.c                       |  2 +-
 fs/nfs/nfs4state.c                   |  2 +-
 fs/nfs/write.c                       |  3 ++-
 fs/notify/inotify/inotify_fsnotify.c |  3 ++-
 fs/proc/self.c                       |  2 +-
 fs/seq_file.c                        |  5 +++--
 include/acpi/platform/aclinuxex.h    |  6 ++++--
 include/linux/gfp_types.h            | 10 ++++++++--
 include/linux/slab.h                 | 22 ++++++++++++++++++----
 kernel/trace/ftrace.c                |  2 +-
 kernel/trace/tracing_map.c           |  2 +-
 lib/kasprintf.c                      |  2 +-
 lib/kobject.c                        |  4 ++--
 lib/mpi/mpiutil.c                    |  2 +-
 mm/list_lru.c                        |  6 ++++--
 mm/memcontrol.c                      |  4 ++--
 mm/slab_common.c                     |  3 ++-
 mm/util.c                            |  6 +++---
 mm/vmalloc.c                         |  6 ++++--
 net/sunrpc/auth_unix.c               |  2 +-
 net/sunrpc/xdr.c                     |  2 +-
 security/apparmor/lsm.c              |  2 +-
 security/security.c                  |  4 ++--
 security/tomoyo/realpath.c           |  2 +-
 32 files changed, 88 insertions(+), 53 deletions(-)

^ permalink raw reply	[flat|nested] 74+ messages in thread