* [PATCH RESEND v2 0/5] Avoid requesting page from DMA zone when no managed pages
From: Baoquan He @ 2021-12-07  3:07 UTC
  To: linux-kernel
  Cc: linux-mm, akpm, hch, robin.murphy, cl, penberg, rientjes,
	iamjoonsoo.kim, vbabka, m.szyprowski, John.p.donnelly, kexec,
	Baoquan He

***Problem observed:
On x86_64, when a crash is triggered and the system enters the kdump
kernel, a page allocation failure is always seen.

 ---------------------------------
 DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
 swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
 CPU: 0 PID: 1 Comm: swapper/0 
 Call Trace:
  dump_stack+0x7f/0xa1
  warn_alloc.cold+0x72/0xd6
  ......
  __alloc_pages+0x24d/0x2c0
  ......
  dma_atomic_pool_init+0xdb/0x176
  do_one_initcall+0x67/0x320
  ? rcu_read_lock_sched_held+0x3f/0x80
  kernel_init_freeable+0x290/0x2dc
  ? rest_init+0x24f/0x24f
  kernel_init+0xa/0x111
  ret_from_fork+0x22/0x30
 Mem-Info:
 ------------------------------------

***Root cause:
The current kernel assumes that the DMA zone must have managed pages
and tries to request pages from it whenever CONFIG_ZONE_DMA is enabled.
However, this is not always true. E.g. in the kdump kernel on x86_64,
only the low 1M is present and it is locked down at a very early stage
of boot, so this low 1M is never added to the buddy allocator to become
managed pages of the DMA zone. In this exceptional case, any page
requested from the DMA zone will always cause a page allocation failure.

***Investigation:
This failure has happened since the commits below were merged into
Linus's tree.
  1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
  23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
  f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
  7c321eb2b843 x86/kdump: Remove the backup region handling
  6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified

Before them, on x86_64, the low 640K area was reused by the kdump kernel.
So the content of the low 640K area was copied into a backup region for
dumping before jumping into the kdump kernel. Then, except for the
firmware reserved regions in [0, 640K], the remaining area was added to
the buddy allocator to become available managed pages of the DMA zone.

However, after the above commits were applied, in the kdump kernel on
x86_64 the low 1M is reserved by memblock but never released to the
buddy allocator. So any later page allocation requested from the DMA
zone will fail.
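
The before/after difference can be illustrated with a small user-space
model. This is only a sketch, not kernel code: struct zone_model,
release_to_buddy(), zone_alloc_pages() and the page counts are
hypothetical names and values picked for illustration, and 4 KiB pages
are assumed.

```c
#include <stdbool.h>

#define PAGE_SIZE_MODEL 4096UL	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the two zone counters that matter here. */
struct zone_model {
	unsigned long present_pages;	/* pages physically present in the zone */
	unsigned long managed_pages;	/* pages handed to the buddy allocator */
};

/*
 * Model of boot-time freeing: every present page that is not reserved
 * (by memblock in the real kernel) becomes a managed page.
 */
static void release_to_buddy(struct zone_model *z, unsigned long reserved_pages)
{
	if (reserved_pages > z->present_pages)
		reserved_pages = z->present_pages;
	z->managed_pages = z->present_pages - reserved_pages;
}

/* Model of an allocation restricted to this zone (like GFP_DMA). */
static bool zone_alloc_pages(struct zone_model *z, unsigned int order)
{
	unsigned long nr = 1UL << order;

	if (z->managed_pages < nr)
		return false;	/* nothing to hand out: allocation failure */
	z->managed_pages -= nr;
	return true;
}
```

With the whole low 1M kept reserved, managed_pages stays 0 in this model
and an order-5 request fails, which matches the failure in the log above.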

This low 1M lockdown is needed because AMD SME encrypts memory, making
the old backup region mechanism impossible when switching into the kdump
kernel. An Intel engineer also mentioned that their TDX (Trust Domain
Extensions), which is under development in the kernel, needs to lock
down the low 1M too. So we can't simply revert the above commits to fix
the page allocation failure from the DMA zone, as someone suggested.

***Solution:
Currently, only the DMA atomic pool and dma-kmalloc initialize and
request page allocation with GFP_DMA during bootup. So only initialize
them when the DMA zone has available managed pages, otherwise just skip
the initialization. Judging from testing and from reading the code,
skipping it causes no problems. In the kdump kernel on x86_64, the page
allocation failure disappears.
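
The check-then-skip idea can be sketched in user space as below. This is
a simplified model under stated assumptions, not the code from the
patches: struct node_model, has_managed_dma_model() and
init_dma_pool_model() are hypothetical names, and the real kernel walks
its own node and zone structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's zone layout. */
enum zone_type_model {
	ZONE_DMA_MODEL,
	ZONE_DMA32_MODEL,
	ZONE_NORMAL_MODEL,
	MAX_ZONES_MODEL
};

struct zone_model {
	unsigned long managed_pages;
};

struct node_model {
	struct zone_model zones[MAX_ZONES_MODEL];
};

/* Return true if any node has at least one managed page in its DMA zone. */
static bool has_managed_dma_model(const struct node_model *nodes, size_t nr_nodes)
{
	for (size_t i = 0; i < nr_nodes; i++) {
		if (nodes[i].zones[ZONE_DMA_MODEL].managed_pages > 0)
			return true;
	}
	return false;
}

/*
 * Model of a boot-time initializer that needs GFP_DMA pages.
 * Returns the number of pools created, 0 when the step is skipped.
 */
static int init_dma_pool_model(const struct node_model *nodes, size_t nr_nodes)
{
	if (!has_managed_dma_model(nodes, nr_nodes))
		return 0;	/* skip instead of attempting a doomed allocation */
	return 1;		/* real code would allocate from the DMA zone here */
}
```

The point of the design is that the check happens before any allocation
is attempted, so a kdump kernel with an empty DMA zone skips the step
without triggering the allocation failure warning.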

***Further thinking:
On x86_64, [0, 16M] always goes into ZONE_DMA and (16M, 4G] into
ZONE_DMA32 by default. The DMA zone covering the low 16M exists to take
care of antique ISA devices. In fact, a 64-bit system rarely needs
ZONE_DMA (the low 16M) to support the almost extinct ISA devices.
However, some components treat DMA as a generic concept. E.g. for
kmalloc-dma, the slab allocator initializes it for any later DMA related
buffer allocation, not only for ISA DMA.

On arm64, even though both CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 are
enabled, ZONE_DMA covers the low 4G area and ZONE_DMA32 is left empty.
Only on specific platforms (e.g. 30-bit on Raspberry Pi 4) does ZONE_DMA
cover the first 1G area, with ZONE_DMA32 covering the rest of the 32-bit
addressable memory.

I am wondering if we can also make the sizes of ZONE_DMA and ZONE_DMA32
dynamically adjustable, just as arm64 does. On x86_64, we could make
ZONE_DMA cover the 32-bit addressable memory and leave ZONE_DMA32 empty
by default. Once ISA_DMA_API is enabled, we would go back to making
ZONE_DMA cover the low 16M area and ZONE_DMA32 cover the rest of the
32-bit addressable memory. (I am not familiar with ISA_DMA_API; does it
require 24-bit addressable memory when enabled?)
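
To make the idea concrete, here is a rough user-space sketch of how the
zone boundaries could be derived. It is only an illustration of the
proposal, not existing kernel code: struct zone_limits,
limits_from_dma_bits() and x86_64_proposed_limits() are hypothetical
names, and mapping ISA_DMA_API to a 24-bit limit is exactly the
assumption questioned above.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_16M	(UINT64_C(16) << 20)
#define SZ_4G	(UINT64_C(4) << 30)

/* Hypothetical result type: exclusive upper bounds of the two low zones. */
struct zone_limits {
	uint64_t dma_end;	/* ZONE_DMA covers [0, dma_end) */
	uint64_t dma32_end;	/* ZONE_DMA32 covers [dma_end, dma32_end) */
};

/*
 * arm64-like rule as described above: ZONE_DMA spans dma_bits of
 * address space, capped at 32 bits; ZONE_DMA32 gets what is left
 * below 4G and is empty when dma_end == dma32_end.
 */
static struct zone_limits limits_from_dma_bits(unsigned int dma_bits)
{
	struct zone_limits l;

	if (dma_bits > 32)
		dma_bits = 32;
	l.dma_end = UINT64_C(1) << dma_bits;
	l.dma32_end = SZ_4G;
	return l;
}

/*
 * Proposed x86_64 behaviour (assumption: ISA DMA needs 24-bit addresses):
 * 16M/4G split with ISA_DMA_API, one 4G ZONE_DMA without it.
 */
static struct zone_limits x86_64_proposed_limits(bool isa_dma_api)
{
	return limits_from_dma_bits(isa_dma_api ? 24 : 32);
}
```

In this model, limits_from_dma_bits(30) gives the Raspberry Pi 4 style
split at 1G, while x86_64_proposed_limits(false) leaves ZONE_DMA32 empty.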

Change history:

v2 post:
https://lore.kernel.org/all/20210810094835.13402-1-bhe@redhat.com/T/#u

v1 post:
https://lore.kernel.org/all/20210624052010.5676-1-bhe@redhat.com/T/#u

v2->v2 RESEND:
 John pinged to push the repost of this patchset. So fix one typo in the
 subject of patch 3/5; fix a build error caused by mixed declarations in
 patch 5/5. Both of them were found by John during his testing.

v1->v2:
 Change to check if a managed DMA zone exists. If the DMA zone has managed
 pages, go further to request pages from the DMA zone to initialize.
 Otherwise, just skip initializing the things which need pages from the
 DMA zone.

Baoquan He (5):
  docs: kernel-parameters: Update to reflect the current default size of
    atomic pool
  dma-pool: allow user to disable atomic pool
  mm_zone: add function to check if managed dma zone exists
  dma/pool: create dma atomic pool only if dma zone has managed pages
  mm/slub: do not create dma-kmalloc if no managed pages in DMA zone

 .../admin-guide/kernel-parameters.txt         |  5 ++++-
 include/linux/mmzone.h                        | 21 +++++++++++++++++++
 kernel/dma/pool.c                             | 11 ++++++----
 mm/page_alloc.c                               | 11 ++++++++++
 mm/slab_common.c                              |  9 ++++++++
 5 files changed, 52 insertions(+), 5 deletions(-)

-- 
2.17.2



