Re: [PATCH] printk: Add best-effort printk() buffering.

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: sergey.senozhatsky.work@gmail.com
Cc: sergey.senozhatsky@gmail.com, linux-kernel@vger.kernel.org,
	mingo@redhat.com, peterz@infradead.org, pmladek@suse.com,
	rostedt@goodmis.org
Subject: Re: [PATCH] printk: Add best-effort printk() buffering.
Date: Wed, 10 May 2017 20:27:32 +0900	[thread overview]
Message-ID: <201705102027.BHC13517.SFLQFOFtJMOVHO@I-love.SAKURA.ne.jp> (raw)
In-Reply-To: <20170510062121.GB1966@jagdpanzerIV.localdomain>

Sergey Senozhatsky wrote:
> On (05/09/17 20:41), Tetsuo Handa wrote:
> [..]
> > > what I meant was -- "can we sleep under printk_buffered_begin() or not".
> > > printk-safe disables local IRQs. so what I propose is something like this
> > > 
> > > 	printk-safe-enter    //disable local IRQs, use per-CPU buffer
> > > 	backtrace
> > > 	printk-safe-exit     //flush per-CPU buffer, enable local IRQs
> > > 
> > > except that 'printk-safe-enter/exit' will have new names here, say
> > > printk-buffered-begin/end, and, probably, handle flush differently.
> > 
> > OK. Then, answer is that we are allowed to sleep after get_printk_buffer()
> > if get_printk_buffer() is called from schedulable context because different
> > printk_buffer will be assigned by get_printk_buffer() if get_printk_buffer()
> > is called from non-schedulable context.
> > 
> > > 
> > > 
> > > > > hm, 16 is rather random, it's too much for UP and probably not enough for
> > > > > a 240 CPUs system. for the time being there are 3 buffered-printk users
> > > > > (as far as I can see), but who knows how more will be added in the future.
> > > > > each CPU can have overlapping printks from process, IRQ and NMI contexts.
> > > > > for NMI we use printk-nmi buffers, so it's out of the list; but, in general,
> > > > > *it seems* that we better depend on the number of CPUs the system has.
> > > > > which, once again, returns us back to printk-safe...
> > > > > 
> > > > > thoughts?
> > > > 
> > > > I can make 16 a CONFIG_ option.
> > > 
> > > but still, why use additional N buffers, when we already have per-CPU
> > > buffers? what am I missing?
> > 
> > Per-CPU buffers need to disable preemption by disabling local hard
> > IRQ / soft IRQ. But printk_buffers need not to disable preemption.
> 
> yes. ok. seems that I can't explain what I want.
> 
> my point is:
> 
> printk-buffered does not disable preemption and we can sleep under
> printk-buffered-begin. fine. but why would you want to sleep there anyway?
> you just want to print a backtrace and be done with it. and backtracing
> does not sleep, afaiu, or it least it should not, because it must be
> possible to dump_stack() from atomic context. so why have
> 
> 	printk-buffered keeps preemption and irqs enable and uses one
> 	of aux buffers (if any).
> 
> instead of
> 
> 	printk-buffered starts an atomic section - it disables preemption
> 	and local irqs, because it uses per-CPU buffer (which is always,
> 	and already, there).
> 
> ?

Backtrace is just an example chunk where multiple lines are expected to be
printed together. In practice, backtraces are surrounded by preceding lines
and trailing lines. For example, an OOM killer message looks like

[   88.584836] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[   88.590706] a.out cpuset=/ mems_allowed=0
[   88.594524] CPU: 1 PID: 2981 Comm: a.out Not tainted 4.11.0-rc7-next-20170421+ #588
[   88.599445] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[   88.604543] Call Trace:
[   88.606779]  ? dump_stack+0x5c/0x7d
[   88.608655]  ? dump_header+0x97/0x233
[   88.610083]  ? oom_kill_process+0x223/0x3e0
[   88.611511]  ? has_capability_noaudit+0x17/0x20
[   88.613046]  ? oom_badness+0xeb/0x160
[   88.614359]  ? out_of_memory+0x10b/0x490
[   88.615592]  ? __alloc_pages_slowpath+0x701/0x8e2
[   88.616947]  ? __alloc_pages_nodemask+0x1ed/0x210
[   88.618267]  ? alloc_pages_vma+0x9f/0x220
[   88.619475]  ? __handle_mm_fault+0xc22/0x11e0
[   88.620793]  ? handle_mm_fault+0xc5/0x220
[   88.622222]  ? __do_page_fault+0x21e/0x4b0
[   88.623903]  ? do_page_fault+0x2b/0x70
[   88.625235]  ? page_fault+0x28/0x30
[   88.626392] Mem-Info:
[   88.627314] active_anon:395525 inactive_anon:2096 isolated_anon:0
[   88.627314]  active_file:60 inactive_file:56 isolated_file:0
[   88.627314]  unevictable:0 dirty:7 writeback:0 unstable:0
[   88.627314]  slab_reclaimable:2211 slab_unreclaimable:6459
[   88.627314]  mapped:411 shmem:2162 pagetables:2418 bounce:0
[   88.627314]  free:13040 free_pcp:54 free_cma:0
[   88.637107] Node 0 active_anon:1582100kB inactive_anon:8384kB active_file:240kB inactive_file:224kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:1644kB dirty:28kB writeback:0kB shmem:8648kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 1349632kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   88.644027] Node 0 DMA free:7176kB min:408kB low:508kB high:608kB active_anon:8696kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:40kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   88.651384] lowmem_reserve[]: 0 1696 1696 1696
[   88.653015] Node 0 DMA32 free:44984kB min:44644kB low:55804kB high:66964kB active_anon:1573396kB inactive_anon:8384kB active_file:1268kB inactive_file:968kB unevictable:0kB writepending:28kB present:2080640kB managed:1756232kB mlocked:0kB slab_reclaimable:8844kB slab_unreclaimable:25804kB kernel_stack:5856kB pagetables:9632kB bounce:0kB free_pcp:248kB local_pcp:40kB free_cma:0kB
[   88.661952] lowmem_reserve[]: 0 0 0 0
[   88.663388] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (M) 2*1024kB (UM) 0*2048kB 1*4096kB (M) = 7140kB
[   88.666629] Node 0 DMA32: 1041*4kB (UME) 616*8kB (UE) 157*16kB (UE) 95*32kB (UE) 59*64kB (UME) 61*128kB (ME) 27*256kB (UME) 15*512kB (ME) 4*1024kB (UME) 0*2048kB 0*4096kB = 44916kB
[   88.671430] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   88.674182] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   88.676448] 2299 total pagecache pages
[   88.678121] 0 pages in swap cache
[   88.679472] Swap cache stats: add 0, delete 0, find 0/0
[   88.681159] Free swap  = 0kB
[   88.682698] Total swap = 0kB
[   88.683996] 524157 pages RAM
[   88.685385] 0 pages HighMem/MovableOnly
[   88.686970] 81123 pages reserved
[   88.688221] 0 pages cma reserved
[   88.689831] 0 pages hwpoisoned
[   88.691238] Out of memory: Kill process 2982 (a.out) score 999 or sacrifice child
[   88.693403] Killed process 2982 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB

where showing memory related information involves full of pr_cont() usage.

If sysctl_oom_dump_tasks is set to 1, hundreds to a few thousands of lines
could be added by dump_tasks() and per-CPU buffers might not be large enough
(e.g. CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13 is only 8KB, isn't it?).

Surrounding oom_kill_process() with printk_safe_enter()/printk_safe_exit() is not
enough, and there is no need to disable hard IRQ during oom_kill_process().

I don't want to worry about current thread migrating to different CPUs during
oom_kill_process() and messages stored during printk_safe_enter()/printk_safe_exit() are
written to console devices out of order. Can the printk kernel thread guarantee ordering?

Yes, if an API for waiting for per-CPU printk buffers be written to console devices
exists, we could scatter around that API so that per-CPU buffers will not overflow.

> 
> [..]
> > > hm, from a schedulable context you can do *something* like
> > > 
> > > 	console_lock()
> > > 	printk()
> > > 	...
> > > 	printk()
> > > 	console_unlock()
> > > 
> > > 
> > > you won't be able to console_lock() until all pending messages are
> > > flushed. since you are in a schedulable context, you can sleep on
> > > console_sem in console_lock(). well, just saying.
> > 
> > console_lock()/console_unlock() pair is different from what I want.
> > 
> > console_lock()/console_unlock() pair blocks as long as somebody else
> > is printk()ing. What I want is an API for
> > 
> >   current thread waits for N bytes to be written to console devices
> >   if current thread stored N bytes using printk(), but allow using some
> >   timeout and killable because waiting unconditionally forever is not good
> >   (e.g. current thread is expected to bail out soon if OOM-killed during
> >   waiting for N bytes to be written to console devices)
> 
> I assume you are talking here about a completely new API, not related to
> the patch in question (because your patch does not do this). right?

Right.