Re: [RFC][PATCH v2 1/2] printk: Make printk() completely async

From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
To: sergey.senozhatsky.work@gmail.com
Cc: akpm@linux-foundation.org, jack@suse.com, pmladek@suse.com,
	tj@kernel.org, linux-kernel@vger.kernel.org,
	sergey.senozhatsky@gmail.com, jack@suse.cz
Subject: Re: [RFC][PATCH v2 1/2] printk: Make printk() completely async
Date: Sun, 6 Mar 2016 20:06:03 +0900	[thread overview]
Message-ID: <201603062006.IJD17667.OOQFLtMVHOFSJF@I-love.SAKURA.ne.jp> (raw)
In-Reply-To: <20160306093530.GA26055@swordfish>

Sergey Senozhatsky wrote:
> On (03/06/16 16:18), Tetsuo Handa wrote:
> > Moreover, I don't like use of a workqueue even if printk_wq was allocated
> > with WQ_MEM_RECLAIM bit. As you can see in the discussion of the OOM reaper,
> > the OOM reaper chose a dedicated kernel thread rather than a workqueue
> > ( http://lkml.kernel.org/r/1454505240-23446-2-git-send-email-mhocko@kernel.org ).
> >
> > Blocking actual printing until ongoing workqueue item calls schedule_timeout_*()
> > is not nice (see commit 373ccbe59270 and 564e81a57f97).
> 
> do you mean a new worker allocation delay and a MAYDAY timer delay?
> 

I don't know what MAYDAY is. I'm talking about a situation where printing_work
work item is not processed (i.e. printing_work_func() is not called) until
current work item calls schedule_timeout_*().

We had a problem that since vmstat_work work item was using system_wq,
vmstat_work work item was not processed (i.e. vmstat_update() was not called) if
kworker was looping inside memory allocator without calling schedule_timeout_*()
due to disk_events_workfn() doing GFP_NOIO allocation).

----------
[  271.579276] MemAlloc: kworker/0:56(7399) gfp=0x2400000 order=0 delay=129294
[  271.581237]  ffff88007c78fa08 ffff8800778f8c80 ffff88007c790000 ffff8800778f8c80
[  271.583329]  0000000002400000 0000000000000000 ffff8800778f8c80 ffff88007c78fa20
[  271.585391]  ffffffff8162aa9d 0000000000000001 ffff88007c78fa30 ffffffff8162aac7
[  271.587463] Call Trace:
[  271.588512]  [<ffffffff8162aa9d>] preempt_schedule_common+0x18/0x2b
[  271.590243]  [<ffffffff8162aac7>] _cond_resched+0x17/0x20
[  271.591830]  [<ffffffff8111fafe>] __alloc_pages_nodemask+0x64e/0xcc0
[  271.593561]  [<ffffffff8116a3b2>] ? __kmalloc+0x22/0x190
[  271.595119]  [<ffffffff81160ce7>] alloc_pages_current+0x87/0x110
[  271.596778]  [<ffffffff812e95f4>] bio_copy_kern+0xc4/0x180
[  271.598342]  [<ffffffff810a6a00>] ? wait_woken+0x80/0x80
[  271.599878]  [<ffffffff812f25f0>] blk_rq_map_kern+0x70/0x130
[  271.601481]  [<ffffffff812ece35>] ? blk_get_request+0x75/0xe0
[  271.603100]  [<ffffffff814433fd>] scsi_execute+0x12d/0x160
[  271.604657]  [<ffffffff81443524>] scsi_execute_req_flags+0x84/0xf0
[  271.606339]  [<ffffffffa01db742>] sr_check_events+0xb2/0x2a0 [sr_mod]
[  271.608141]  [<ffffffff8109cbfc>] ? set_next_entity+0x6c/0x6a0
[  271.609830]  [<ffffffffa01cf163>] cdrom_check_events+0x13/0x30 [cdrom]
[  271.611610]  [<ffffffffa01dbb85>] sr_block_check_events+0x25/0x30 [sr_mod]
[  271.613429]  [<ffffffff812fc7eb>] disk_check_events+0x5b/0x150
[  271.615065]  [<ffffffff812fc8f1>] disk_events_workfn+0x11/0x20
[  271.616699]  [<ffffffff810827c5>] process_one_work+0x135/0x310
[  271.618321]  [<ffffffff81082abb>] worker_thread+0x11b/0x4a0
[  271.620018]  [<ffffffff810829a0>] ? process_one_work+0x310/0x310
[  271.622022]  [<ffffffff81087e53>] kthread+0xd3/0xf0
[  271.623533]  [<ffffffff81087d80>] ? kthread_create_on_node+0x1a0/0x1a0
[  271.625487]  [<ffffffff8162f09f>] ret_from_fork+0x3f/0x70
[  271.627175]  [<ffffffff81087d80>] ? kthread_create_on_node+0x1a0/0x1a0
----------

Even if you use printk_wq with WQ_MEM_RECLAIM for printing_work work item,
printing_work_func() will not be called until current work item calls
schedule_timeout_*(). That will be an undesirable random delay. If you use
a dedicated kernel thread rather than a dedicated workqueue with WQ_MEM_RECLAIM,
we can avoid this random delay.

> > > +static void printing_work_func(struct work_struct *work)
> > > +{
> > > +	console_lock();
> > > +	console_unlock();
> > > +}
> > 
> > Is this safe? If somebody invokes the OOM killer between console_lock()
> > and console_unlock(), won't this cause OOM killer messages not printed?
> 
> if you mean something like
> 
> 	console_lock();
> 	for (...) {
> 		do_foo() {
> 			...
> 				pr_err(" ... foo message ...\n");
> 			...
> 		}
> 	}
> 	console_unlock();
> 
> then yes, nothing will be printed until that process executes console_unlock(),
> because it's console_unlock() that pushes the messages to console drivers.

Yes, I meant a sequence like

  console_lock();
  ptr = kmalloc(GFP_KERNEL);
  kfree(ptr);
  console_unlock();

and kmalloc() prints OOM killer messages rather than failing that allocation.
Are we sure that there is no such usage?