All of lore.kernel.org
 help / color / mirror / Atom feed
* printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
@ 2017-07-06 10:28 Tetsuo Handa
  2017-07-07  2:39 ` Sergey Senozhatsky
  0 siblings, 1 reply; 17+ messages in thread
From: Tetsuo Handa @ 2017-07-06 10:28 UTC (permalink / raw)
  To: sergey.senozhatsky.work, sergey.senozhatsky, pmladek, mhocko
  Cc: jack, dri-devel, linux-mm, rostedt, andi, pavel

Hello.

I noticed that printing to consoles can stop forever because console drivers
can wait for memory allocation when memory allocation cannot make progress.

I was testing almost OOM situation and reproduced a situation where allocation
requests fall into infinite too_many_isolated() trap in shrink_inactive_list().

Since nothing will be printed to consoles when we hit that situation, I tried to
confirm that I actually reproduced that situation using SysRq-t. However, SysRq-t
printed nothing for some reason. Therefore, I used SysRq-c at uptime = 910.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170706.txt.xz :
------------------------------------------------------------
[  779.494100] Mem-Info:
[  779.495493] active_anon:846948 inactive_anon:5416 isolated_anon:0
[  779.495493]  active_file:33 inactive_file:13 isolated_file:47
[  779.495493]  unevictable:0 dirty:11 writeback:0 unstable:0
[  779.495493]  slab_reclaimable:0 slab_unreclaimable:11
[  779.495493]  mapped:3580 shmem:5581 pagetables:7491 bounce:0
[  779.495493]  free:21354 free_pcp:87 free_cma:0
[  779.510664] Node 0 active_anon:3387792kB inactive_anon:21664kB active_file:132kB inactive_file:52kB unevictable:0kB isolated(anon):0kB isolated(file):188kB mapped:14320kB dirty:44kB writeback:0kB shmem:22324kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3067904kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  779.522778] Node 0 DMA free:14836kB min:284kB low:352kB high:420kB active_anon:1024kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:8kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  779.533972] lowmem_reserve[]: 0 2700 3642 3642
[  779.536321] Node 0 DMA32 free:53656kB min:49888kB low:62360kB high:74832kB active_anon:2710360kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2765492kB mlocked:0kB kernel_stack:0kB pagetables:12kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  779.548127] lowmem_reserve[]: 0 0 942 942
[  779.550326] Node 0 Normal free:16924kB min:17404kB low:21752kB high:26100kB active_anon:676260kB inactive_anon:21664kB active_file:340kB inactive_file:244kB unevictable:0kB writepending:44kB present:1048576kB managed:964816kB mlocked:0kB kernel_stack:13600kB pagetables:29944kB bounce:0kB free_pcp:348kB local_pcp:0kB free_cma:0kB
[  779.563405] lowmem_reserve[]: 0 0 0 0
[  779.565537] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 2*1024kB (UM) 0*2048kB 3*4096kB (M) = 14836kB
[  779.571291] Node 0 DMA32: 6*4kB (UM) 8*8kB (UM) 8*16kB (UM) 20*32kB (U) 29*64kB (UM) 32*128kB (UM) 13*256kB (UM) 9*512kB (UM) 10*1024kB (UM) 0*2048kB 7*4096kB (UME) = 53656kB
[  779.578672] Node 0 Normal: 734*4kB (UE) 478*8kB (UME) 266*16kB (UE) 163*32kB (UE) 7*64kB (U) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17064kB
[  779.585618] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  779.589721] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  779.593765] 5660 total pagecache pages
[  779.596059] 0 pages in swap cache
[  779.598144] Swap cache stats: add 0, delete 0, find 0/0
[  779.600947] Free swap  = 0kB
[  779.602837] Total swap = 0kB
[  779.604716] 1048445 pages RAM
[  779.606691] 0 pages HighMem/MovableOnly
[  779.608974] 111892 pages reserved
[  779.611075] 0 pages cma reserved
[  779.613118] 0 pages hwpoisoned
[  779.615117] Out of memory: Kill process 3655 (c.out) score 999 or sacrifice child
[  779.618790] Killed process 3655 (c.out) total-vm:4168kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
[  779.625401] oom_reaper: reaped process 3655 (c.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
** 1838 printk messages dropped ** [  912.886103]  __do_fault+0x19/0xf0
[  912.886108]  __handle_mm_fault+0xb0b/0x1030
[  912.886114]  handle_mm_fault+0xf4/0x220
[  912.886117]  __do_page_fault+0x25b/0x4a0
[  912.886120]  do_page_fault+0x30/0x80
[  912.886124]  ? do_syscall_64+0xfd/0x140
[  912.886127]  page_fault+0x28/0x30
(...snipped...)
[  912.892027] kworker/3:3     D12120  3217      2 0x00000080
[  912.892041] Workqueue: events console_callback
[  912.892042] Call Trace:
[  912.892047]  __schedule+0x23f/0x5d0
[  912.892051]  schedule+0x31/0x80
[  912.892056]  schedule_preempt_disabled+0x9/0x10
[  912.892061]  __mutex_lock.isra.2+0x2ac/0x4d0
[  912.892068]  __mutex_lock_slowpath+0xe/0x10
[  912.892072]  ? __mutex_lock_slowpath+0xe/0x10
[  912.892077]  mutex_lock+0x2a/0x30
[  912.892105]  vmw_fb_pan_display+0x35/0x90 [vmwgfx]
[  912.892114]  fb_pan_display+0xca/0x160
[  912.892118]  bit_update_start+0x1b/0x40
[  912.892123]  fbcon_switch+0x4a6/0x630
[  912.892128]  redraw_screen+0x15a/0x240
[  912.892132]  ? update_attr.isra.3+0x90/0x90
[  912.892139]  complete_change_console+0x3d/0xd0
[  912.892143]  change_console+0x57/0x90
[  912.892147]  console_callback+0x116/0x190
[  912.892153]  process_one_work+0x1f5/0x390
[  912.892156]  worker_thread+0x46/0x410
[  912.892161]  ? __schedule+0x247/0x5d0
[  912.892165]  kthread+0xff/0x140
[  912.892170]  ? process_one_work+0x390/0x390
[  912.892174]  ? kthread_create_on_node+0x60/0x60
[  912.892178]  ? do_syscall_64+0x13a/0x140
[  912.892181]  ret_from_fork+0x25/0x30
(...snipped...)
[  912.934633] kworker/0:0     D12824  4263      2 0x00000080
[  912.934643] Workqueue: events vmw_fb_dirty_flush [vmwgfx]
[  912.934643] Call Trace:
[  912.934645]  __schedule+0x23f/0x5d0
[  912.934646]  schedule+0x31/0x80
[  912.934647]  schedule_timeout+0x189/0x290
[  912.934649]  ? del_timer_sync+0x40/0x40
[  912.934650]  io_schedule_timeout+0x19/0x40
[  912.934651]  ? io_schedule_timeout+0x19/0x40
[  912.934653]  congestion_wait+0x7d/0xd0
[  912.934654]  ? wait_woken+0x80/0x80
[  912.934654]  shrink_inactive_list+0x3e3/0x4d0
[  912.934656]  shrink_node_memcg+0x360/0x780
[  912.934657]  ? list_lru_count_one+0x65/0x70
[  912.934658]  shrink_node+0xdc/0x310
[  912.934658]  ? shrink_node+0xdc/0x310
[  912.934659]  do_try_to_free_pages+0xea/0x370
[  912.934660]  try_to_free_pages+0xc3/0x100
[  912.934661]  __alloc_pages_slowpath+0x441/0xd50
[  912.934663]  ? ___slab_alloc+0x1b6/0x590
[  912.934664]  __alloc_pages_nodemask+0x20c/0x250
[  912.934665]  alloc_pages_current+0x65/0xd0
[  912.934666]  new_slab+0x472/0x600
[  912.934668]  ___slab_alloc+0x41b/0x590
[  912.934685]  ? drm_modeset_lock_all+0x1b/0xa0 [drm]
[  912.934691]  ? drm_modeset_lock_all+0x1b/0xa0 [drm]
[  912.934692]  __slab_alloc+0x1b/0x30
[  912.934693]  ? __slab_alloc+0x1b/0x30
[  912.934694]  kmem_cache_alloc+0x16b/0x1c0
[  912.934699]  drm_modeset_lock_all+0x1b/0xa0 [drm]
[  912.934702]  vmw_framebuffer_dmabuf_dirty+0x47/0x1d0 [vmwgfx]
[  912.934706]  vmw_fb_dirty_flush+0x229/0x270 [vmwgfx]
[  912.934708]  process_one_work+0x1f5/0x390
[  912.934709]  worker_thread+0x46/0x410
[  912.934710]  ? __schedule+0x247/0x5d0
[  912.934711]  kthread+0xff/0x140
[  912.934712]  ? process_one_work+0x390/0x390
[  912.934713]  ? kthread_create_on_node+0x60/0x60
[  912.934714]  ret_from_fork+0x25/0x30
(...snipped...)
[  912.934765] Showing busy workqueues and worker pools:
[  912.934785] workqueue events: flags=0x0
[  912.934798]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[  912.934801]     in-flight: 3217:console_callback{79553} console_callback{79553}
[  912.934806]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[  912.934807]     in-flight: 4263:vmw_fb_dirty_flush [vmwgfx]{130989} vmw_fb_dirty_flush [vmwgfx]{130989}
[  912.934849] workqueue events_freezable_power_: flags=0x84
[  912.934865]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  912.934866]     in-flight: 4264:disk_events_workfn{131975}
[  912.934906] workqueue writeback: flags=0x4e
[  912.934907]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[  912.934909]     in-flight: 357:wb_workfn{130154} wb_workfn{130154}
[  912.935347] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=5 idle: 33 3214 237
[  912.935351] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 29 95 3215
[  912.935364] pool 128: cpus=0-63 flags=0x4 nice=0 hung=130s workers=3 idle: 358 356
[  925.093875] sysrq: SysRq : Trigger a crash
[  925.093903] BUG: unable to handle kernel NULL pointer dereference at           (null)
------------------------------------------------------------

Pressing SysRq-c caused all locks to be released (doesn't it ?), and console
resumed printing pending output. The {value} after each work is delay in jiffies
since that work was queued (as suggested at 
http://lkml.kernel.org/r/201706291957.JGH39511.tQMOFSLOFJVHOF@I-love.SAKURA.ne.jp ).

Since disk_events_workfn was stalling for 131975 jiffies (CONFIG_HZ=1000)
as of uptime = 912, disk_events_workfn started stalling at uptime = 781
(which was 2 seconds after the OOM killer/reaper reclaimed some memory
for the last time).

Since vmw_fb_dirty_flush was stalling for 130989 jiffies,
vmw_fb_dirty_flush started stalling at uptime = 782. And
drm_modeset_lock_all() from vmw_fb_dirty_flush work started
GFP_KERNEL memory allocation

----------
void drm_modeset_lock_all(struct drm_device *dev)
{
        struct drm_mode_config *config = &dev->mode_config;
        struct drm_modeset_acquire_ctx *ctx;
        int ret;

        ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
        if (WARN_ON(!ctx))
                return;
(...snipped...)
}
----------

with par->bo_mutex mutex held at vmw_fb_dirty_flush().

----------
static void vmw_fb_dirty_flush(struct work_struct *work)
{
        struct vmw_fb_par *par = container_of(work, struct vmw_fb_par,
                                              local_work.work);
        struct vmw_private *vmw_priv = par->vmw_priv;
        struct fb_info *info = vmw_priv->fb_info;
        unsigned long irq_flags;
        s32 dst_x1, dst_x2, dst_y1, dst_y2, w, h;
        u32 cpp, max_x, max_y;
        struct drm_clip_rect clip;
        struct drm_framebuffer *cur_fb;
        u8 *src_ptr, *dst_ptr;

        if (vmw_priv->suspended)
                return;

        mutex_lock(&par->bo_mutex);
(...snipped...)
                WARN_ON_ONCE(par->set_fb->funcs->dirty(cur_fb, NULL, 0, 0,
                                                       &clip, 1));
(...snipped...)
        mutex_unlock(&par->bo_mutex);
}
----------

Since console_callback was stalling for 79553 jiffies, console_callback
started stalling at uptime = 833. (Thus, maybe there were OOM killer
activities a few more times, but they if any are hidden into
"** 1838 printk messages dropped **" message.)

vmw_fb_pan_display() from fb_pan_display() from console_callback work was
waiting for par->bo_mutex mutex held at vmw_fb_dirty_flush() which cannot
make progress due to infinite too_many_isolated() trap.

----------
static int vmw_fb_pan_display(struct fb_var_screeninfo *var,
                              struct fb_info *info)
{
        struct vmw_fb_par *par = info->par;

        if ((var->xoffset + var->xres) > var->xres_virtual ||
            (var->yoffset + var->yres) > var->yres_virtual) {
                DRM_ERROR("Requested panning can not fit in framebuffer\n");
                return -EINVAL;
        }

        mutex_lock(&par->bo_mutex);
        par->fb_x = var->xoffset;
        par->fb_y = var->yoffset;
        if (par->set_fb)
                vmw_fb_dirty_mark(par, par->fb_x, par->fb_y, par->set_fb->width,
                                  par->set_fb->height);
        mutex_unlock(&par->bo_mutex);

        return 0;
}
----------

As a result, console was not able to print SysRq-t output.

So, how should we avoid this problem? Of course, this problem would not
have occurred from the beginning if we did not fall into infinite
too_many_isolated() trap in shrink_inactive_list() (as proposed at 
http://lkml.kernel.org/r/20170307133057.26182-1-mhocko@kernel.org ).
But should fbcon, drm, tty and so on stop using __GFP_DIRECT_RECLAIM
memory allocations because consoles should be as responsive as printk() ?
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-06 10:28 printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations? Tetsuo Handa
@ 2017-07-07  2:39 ` Sergey Senozhatsky
  2017-07-07  8:41     ` Daniel Vetter
                     ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Sergey Senozhatsky @ 2017-07-07  2:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sergey.senozhatsky.work, sergey.senozhatsky, pmladek, mhocko,
	pavel, rostedt, andi, jack, dri-devel, linux-mm, Daniel Vetter

Hello,

(Cc Daniel)

On (07/06/17 19:28), Tetsuo Handa wrote:
> (...snipped...)
> [  912.892027] kworker/3:3     D12120  3217      2 0x00000080
> [  912.892041] Workqueue: events console_callback
> [  912.892042] Call Trace:
> [  912.892047]  __schedule+0x23f/0x5d0
> [  912.892051]  schedule+0x31/0x80
> [  912.892056]  schedule_preempt_disabled+0x9/0x10
> [  912.892061]  __mutex_lock.isra.2+0x2ac/0x4d0
> [  912.892068]  __mutex_lock_slowpath+0xe/0x10
> [  912.892072]  ? __mutex_lock_slowpath+0xe/0x10
> [  912.892077]  mutex_lock+0x2a/0x30
> [  912.892105]  vmw_fb_pan_display+0x35/0x90 [vmwgfx]
> [  912.892114]  fb_pan_display+0xca/0x160
> [  912.892118]  bit_update_start+0x1b/0x40
> [  912.892123]  fbcon_switch+0x4a6/0x630
> [  912.892128]  redraw_screen+0x15a/0x240
> [  912.892132]  ? update_attr.isra.3+0x90/0x90
> [  912.892139]  complete_change_console+0x3d/0xd0
> [  912.892143]  change_console+0x57/0x90
> [  912.892147]  console_callback+0x116/0x190
> [  912.892153]  process_one_work+0x1f5/0x390
> [  912.892156]  worker_thread+0x46/0x410
> [  912.892161]  ? __schedule+0x247/0x5d0
> [  912.892165]  kthread+0xff/0x140
> [  912.892170]  ? process_one_work+0x390/0x390
> [  912.892174]  ? kthread_create_on_node+0x60/0x60
> [  912.892178]  ? do_syscall_64+0x13a/0x140
> [  912.892181]  ret_from_fork+0x25/0x30
> (...snipped...)
> [  912.934633] kworker/0:0     D12824  4263      2 0x00000080
> [  912.934643] Workqueue: events vmw_fb_dirty_flush [vmwgfx]
> [  912.934643] Call Trace:
> [  912.934645]  __schedule+0x23f/0x5d0
> [  912.934646]  schedule+0x31/0x80
> [  912.934647]  schedule_timeout+0x189/0x290
> [  912.934649]  ? del_timer_sync+0x40/0x40
> [  912.934650]  io_schedule_timeout+0x19/0x40
> [  912.934651]  ? io_schedule_timeout+0x19/0x40
> [  912.934653]  congestion_wait+0x7d/0xd0
> [  912.934654]  ? wait_woken+0x80/0x80
> [  912.934654]  shrink_inactive_list+0x3e3/0x4d0
> [  912.934656]  shrink_node_memcg+0x360/0x780
> [  912.934657]  ? list_lru_count_one+0x65/0x70
> [  912.934658]  shrink_node+0xdc/0x310
> [  912.934658]  ? shrink_node+0xdc/0x310
> [  912.934659]  do_try_to_free_pages+0xea/0x370
> [  912.934660]  try_to_free_pages+0xc3/0x100
> [  912.934661]  __alloc_pages_slowpath+0x441/0xd50
> [  912.934663]  ? ___slab_alloc+0x1b6/0x590
> [  912.934664]  __alloc_pages_nodemask+0x20c/0x250
> [  912.934665]  alloc_pages_current+0x65/0xd0
> [  912.934666]  new_slab+0x472/0x600
> [  912.934668]  ___slab_alloc+0x41b/0x590
> [  912.934685]  ? drm_modeset_lock_all+0x1b/0xa0 [drm]
> [  912.934691]  ? drm_modeset_lock_all+0x1b/0xa0 [drm]
> [  912.934692]  __slab_alloc+0x1b/0x30
> [  912.934693]  ? __slab_alloc+0x1b/0x30
> [  912.934694]  kmem_cache_alloc+0x16b/0x1c0
> [  912.934699]  drm_modeset_lock_all+0x1b/0xa0 [drm]
> [  912.934702]  vmw_framebuffer_dmabuf_dirty+0x47/0x1d0 [vmwgfx]
> [  912.934706]  vmw_fb_dirty_flush+0x229/0x270 [vmwgfx]
> [  912.934708]  process_one_work+0x1f5/0x390
> [  912.934709]  worker_thread+0x46/0x410
> [  912.934710]  ? __schedule+0x247/0x5d0
> [  912.934711]  kthread+0xff/0x140
> [  912.934712]  ? process_one_work+0x390/0x390
> [  912.934713]  ? kthread_create_on_node+0x60/0x60
> [  912.934714]  ret_from_fork+0x25/0x30
> 
> Pressing SysRq-c caused all locks to be released (doesn't it ?), and console

hm, I think what happened is a bit different thing. sysrq-c didn't
unlock any of the locks. I suspect that ->bo_mutex is never taken
on the direct path vprintk_emit()->console_unlock()->call_console_drivers(),
otherwise it would have made vprintk_emit() from atomic context impossible.
so ->bo_mutex does not directly affect printk. it affects it indirectly.
the root cause, however, I think, is actually console semaphore and
console_lock() in change_console(). printk() depends on it a lot, so do
drm/tty/etc. as long as the console semaphore is locked, printk can only
add new messages to the logbuf. and this is what happened here, under
console_sem we scheduled on ->bo_mutex, which was locked because of memory
allocation on another CPU, yes. you see lost messages in your report
because part of printk that is responsible for storing new messages was
working just fine; it's the output to consoles that was blocked by
console_sem -> bo_mutex chain.

the reason why sysrq-c helped was because, sysrq-c did

	panic_on_oops = 1
	panic()

and panic() called console_flush_on_panic(), which completely ignored the
state of console semaphore and just flushed all the pending logbuf
messages.

	console_trylock();
	console_unlock();

so, I believe, console_semaphore remained locked just like it was before
sysrq-c, and ->bo_mutex most likely remained locked as well. it's just we
ignored the state of console_sem and this let us to print the messages
(which also proves that ->bo_mutex is not taken by
console_unlock()->call_console_drivers()).

[..]
> Since vmw_fb_dirty_flush was stalling for 130989 jiffies,
> vmw_fb_dirty_flush started stalling at uptime = 782. And
> drm_modeset_lock_all() from vmw_fb_dirty_flush work started
> GFP_KERNEL memory allocation
> 
> ----------
> void drm_modeset_lock_all(struct drm_device *dev)
> {
>         struct drm_mode_config *config = &dev->mode_config;
>         struct drm_modeset_acquire_ctx *ctx;
>         int ret;
> 
>         ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>         if (WARN_ON(!ctx))
>                 return;

hm, this allocation, per se, looks ok to me. can't really blame it.
what you had is a combination of factors

	CPU0			CPU1				CPU2
								console_callback()
								 console_lock()
								 ^^^^^^^^^^^^^
	vprintk_emit()		mutex_lock(&par->bo_mutex)
				 kzalloc(GFP_KERNEL)
	 console_trylock()	  kmem_cache_alloc()		  mutex_lock(&par->bo_mutex)
	 ^^^^^^^^^^^^^^^^	   io_schedule_timeout

// but I haven't seen the logs that you have provided, yet.

[..]
> As a result, console was not able to print SysRq-t output.
> 
> So, how should we avoid this problem?

from the top of my head -- console_sem must be replaced with something
better. but that's a task for years.

hm...

> But should fbcon, drm, tty and so on stop using __GFP_DIRECT_RECLAIM
> memory allocations because consoles should be as responsive as printk() ?

may be, may be not. like I said, the allocation in question does not
participate in console output. it's rather hard to imagine how we would
enforce a !__GFP_DIRECT_RECLAIM requirement here. it's console semaphore
to blame, I think.

if we could unlock console for some of operations done under ->bo_mutex,
then may be we could make some printing progress, at least. but I have
zero knowledge of that part of the kernel.

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-07  2:39 ` Sergey Senozhatsky
@ 2017-07-07  8:41     ` Daniel Vetter
  2017-07-08 13:30   ` Tetsuo Handa
  2017-07-10 13:33   ` Michal Hocko
  2 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-07  8:41 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Tetsuo Handa, Sergey Senozhatsky, pmladek, Michal Hocko,
	Pavel Machek, Steven Rostedt, Andreas Mohr, Jan Kara, dri-devel,
	Linux MM

On Fri, Jul 7, 2017 at 4:39 AM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
> Hello,
>
> (Cc Daniel)

Oh the fun :-/

Imo it's complete misdesign of the console/printk stuff that it ends
up calling down into drm/kms drivers. There's no way this will ever
reliably work, and GFP_KERNEL is the least of your worries (there's an
endless amounts of locks we need to grab even for fairly simple
stuff). What we're doing now is trying to bail out when called from
any kind of suspicious context. That means much reduced chances
critical output shows up, but ime even when we managed to display
something it all died later on and only made sure the first oops
scrolled off the screen.

Here's our bag of tricks:
- We dropped the panic notifier.
- All drm/kms callbacks for fbdev/fbcon have an oops_in_progress
check, and bail out.
- Lots of work items for things because fbdev/fbcon assume you can
call gfx drivers from atomic context, which just isn't true anymore.

That's it for context, no idea what to do here, but no longer allowing
GFP_KERNEL in drm/kms drivers is definitely not an option. Like Sergey
said, untangling the console semaphore might be what we need - if we
can somehow get the console printing out from under the console lock,
that would help a lot. At least for fbcon on top of drm drivers,
simpler consoles probably don't want that.

A quick hack might be a PF_HOLDING_CONSOLE thread flag, and if that's
set we push all the fbdev operations into a worker thread which runs
entirely asynchronously. A real bad hack would be to do a
console_try_lock, but the console semaphore is highly contended, so
might result in too many false positives for practical use.

Cheers, Daniel

> On (07/06/17 19:28), Tetsuo Handa wrote:
>> (...snipped...)
>> [  912.892027] kworker/3:3     D12120  3217      2 0x00000080
>> [  912.892041] Workqueue: events console_callback
>> [  912.892042] Call Trace:
>> [  912.892047]  __schedule+0x23f/0x5d0
>> [  912.892051]  schedule+0x31/0x80
>> [  912.892056]  schedule_preempt_disabled+0x9/0x10
>> [  912.892061]  __mutex_lock.isra.2+0x2ac/0x4d0
>> [  912.892068]  __mutex_lock_slowpath+0xe/0x10
>> [  912.892072]  ? __mutex_lock_slowpath+0xe/0x10
>> [  912.892077]  mutex_lock+0x2a/0x30
>> [  912.892105]  vmw_fb_pan_display+0x35/0x90 [vmwgfx]
>> [  912.892114]  fb_pan_display+0xca/0x160
>> [  912.892118]  bit_update_start+0x1b/0x40
>> [  912.892123]  fbcon_switch+0x4a6/0x630
>> [  912.892128]  redraw_screen+0x15a/0x240
>> [  912.892132]  ? update_attr.isra.3+0x90/0x90
>> [  912.892139]  complete_change_console+0x3d/0xd0
>> [  912.892143]  change_console+0x57/0x90
>> [  912.892147]  console_callback+0x116/0x190
>> [  912.892153]  process_one_work+0x1f5/0x390
>> [  912.892156]  worker_thread+0x46/0x410
>> [  912.892161]  ? __schedule+0x247/0x5d0
>> [  912.892165]  kthread+0xff/0x140
>> [  912.892170]  ? process_one_work+0x390/0x390
>> [  912.892174]  ? kthread_create_on_node+0x60/0x60
>> [  912.892178]  ? do_syscall_64+0x13a/0x140
>> [  912.892181]  ret_from_fork+0x25/0x30
>> (...snipped...)
>> [  912.934633] kworker/0:0     D12824  4263      2 0x00000080
>> [  912.934643] Workqueue: events vmw_fb_dirty_flush [vmwgfx]
>> [  912.934643] Call Trace:
>> [  912.934645]  __schedule+0x23f/0x5d0
>> [  912.934646]  schedule+0x31/0x80
>> [  912.934647]  schedule_timeout+0x189/0x290
>> [  912.934649]  ? del_timer_sync+0x40/0x40
>> [  912.934650]  io_schedule_timeout+0x19/0x40
>> [  912.934651]  ? io_schedule_timeout+0x19/0x40
>> [  912.934653]  congestion_wait+0x7d/0xd0
>> [  912.934654]  ? wait_woken+0x80/0x80
>> [  912.934654]  shrink_inactive_list+0x3e3/0x4d0
>> [  912.934656]  shrink_node_memcg+0x360/0x780
>> [  912.934657]  ? list_lru_count_one+0x65/0x70
>> [  912.934658]  shrink_node+0xdc/0x310
>> [  912.934658]  ? shrink_node+0xdc/0x310
>> [  912.934659]  do_try_to_free_pages+0xea/0x370
>> [  912.934660]  try_to_free_pages+0xc3/0x100
>> [  912.934661]  __alloc_pages_slowpath+0x441/0xd50
>> [  912.934663]  ? ___slab_alloc+0x1b6/0x590
>> [  912.934664]  __alloc_pages_nodemask+0x20c/0x250
>> [  912.934665]  alloc_pages_current+0x65/0xd0
>> [  912.934666]  new_slab+0x472/0x600
>> [  912.934668]  ___slab_alloc+0x41b/0x590
>> [  912.934685]  ? drm_modeset_lock_all+0x1b/0xa0 [drm]
>> [  912.934691]  ? drm_modeset_lock_all+0x1b/0xa0 [drm]
>> [  912.934692]  __slab_alloc+0x1b/0x30
>> [  912.934693]  ? __slab_alloc+0x1b/0x30
>> [  912.934694]  kmem_cache_alloc+0x16b/0x1c0
>> [  912.934699]  drm_modeset_lock_all+0x1b/0xa0 [drm]
>> [  912.934702]  vmw_framebuffer_dmabuf_dirty+0x47/0x1d0 [vmwgfx]
>> [  912.934706]  vmw_fb_dirty_flush+0x229/0x270 [vmwgfx]
>> [  912.934708]  process_one_work+0x1f5/0x390
>> [  912.934709]  worker_thread+0x46/0x410
>> [  912.934710]  ? __schedule+0x247/0x5d0
>> [  912.934711]  kthread+0xff/0x140
>> [  912.934712]  ? process_one_work+0x390/0x390
>> [  912.934713]  ? kthread_create_on_node+0x60/0x60
>> [  912.934714]  ret_from_fork+0x25/0x30
>>
>> Pressing SysRq-c caused all locks to be released (doesn't it ?), and console
>
> hm, I think what happened is a bit different thing. sysrq-c didn't
> unlock any of the locks. I suspect that ->bo_mutex is never taken
> on the direct path vprintk_emit()->console_unlock()->call_console_drivers(),
> otherwise it would have made vprintk_emit() from atomic context impossible.
> so ->bo_mutex does not directly affect printk. it affects it indirectly.
> the root cause, however, I think, is actually console semaphore and
> console_lock() in change_console(). printk() depends on it a lot, so do
> drm/tty/etc. as long as the console semaphore is locked, printk can only
> add new messages to the logbuf. and this is what happened here, under
> console_sem we scheduled on ->bo_mutex, which was locked because of memory
> allocation on another CPU, yes. you see lost messages in your report
> because part of printk that is responsible for storing new messages was
> working just fine; it's the output to consoles that was blocked by
> console_sem -> bo_mutex chain.
>
> the reason why sysrq-c helped was because, sysrq-c did
>
>         panic_on_oops = 1
>         panic()
>
> and panic() called console_flush_on_panic(), which completely ignored the
> state of console semaphore and just flushed all the pending logbuf
> messages.
>
>         console_trylock();
>         console_unlock();
>
> so, I believe, console_semaphore remained locked just like it was before
> sysrq-c, and ->bo_mutex most likely remained locked as well. it's just we
> ignored the state of console_sem and this let us to print the messages
> (which also proves that ->bo_mutex is not taken by
> console_unlock()->call_console_drivers()).
>
> [..]
>> Since vmw_fb_dirty_flush was stalling for 130989 jiffies,
>> vmw_fb_dirty_flush started stalling at uptime = 782. And
>> drm_modeset_lock_all() from vmw_fb_dirty_flush work started
>> GFP_KERNEL memory allocation
>>
>> ----------
>> void drm_modeset_lock_all(struct drm_device *dev)
>> {
>>         struct drm_mode_config *config = &dev->mode_config;
>>         struct drm_modeset_acquire_ctx *ctx;
>>         int ret;
>>
>>         ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>>         if (WARN_ON(!ctx))
>>                 return;
>
> hm, this allocation, per se, looks ok to me. can't really blame it.
> what you had is a combination of factors
>
>         CPU0                    CPU1                            CPU2
>                                                                 console_callback()
>                                                                  console_lock()
>                                                                  ^^^^^^^^^^^^^
>         vprintk_emit()          mutex_lock(&par->bo_mutex)
>                                  kzalloc(GFP_KERNEL)
>          console_trylock()        kmem_cache_alloc()              mutex_lock(&par->bo_mutex)
>          ^^^^^^^^^^^^^^^^          io_schedule_timeout
>
> // but I haven't seen the logs that you have provided, yet.
>
> [..]
>> As a result, console was not able to print SysRq-t output.
>>
>> So, how should we avoid this problem?
>
> from the top of my head -- console_sem must be replaced with something
> better. but that's a task for years.
>
> hm...
>
>> But should fbcon, drm, tty and so on stop using __GFP_DIRECT_RECLAIM
>> memory allocations because consoles should be as responsive as printk() ?
>
> may be, may be not. like I said, the allocation in question does not
> participate in console output. it's rather hard to imagine how we would
> enforce a !__GFP_DIRECT_RECLAIM requirement here. it's console semaphore
> to blame, I think.
>
> if we could unlock console for some of operations done under ->bo_mutex,
> then may be we could make some printing progress, at least. but I have
> zero knowledge of that part of the kernel.
>
>         -ss



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
@ 2017-07-07  8:41     ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-07  8:41 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: pmladek, Jan Kara, Linux MM, Tetsuo Handa, dri-devel,
	Steven Rostedt, Michal Hocko, Sergey Senozhatsky, Andreas Mohr,
	Pavel Machek

On Fri, Jul 7, 2017 at 4:39 AM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
> Hello,
>
> (Cc Daniel)

Oh the fun :-/

Imo it's complete misdesign of the console/printk stuff that it ends
up calling down into drm/kms drivers. There's no way this will ever
reliably work, and GFP_KERNEL is the least of your worries (there's an
endless amounts of locks we need to grab even for fairly simple
stuff). What we're doing now is trying to bail out when called from
any kind of suspicious context. That means much reduced chances
critical output shows up, but ime even when we managed to display
something it all died later on and only made sure the first oops
scrolled off the screen.

Here's our bag of tricks:
- We dropped the panic notifier.
- All drm/kms callbacks for fbdev/fbcon have an oops_in_progress
check, and bail out.
- Lots of work items for things because fbdev/fbcon assume you can
call gfx drivers from atomic context, which just isn't true anymore.

That's it for context, no idea what to do here, but no longer allowing
GFP_KERNEL in drm/kms drivers is definitely not an option. Like Sergey
said, untangling the console semaphore might be what we need - if we
can somehow get the console printing out from under the console lock,
that would help a lot. At least for fbcon on top of drm drivers,
simpler consoles probably don't want that.

A quick hack might be a PF_HOLDING_CONSOLE thread flag, and if that's
set we push all the fbdev operations into a worker thread which runs
entirely asynchronously. A real bad hack would be to do a
console_try_lock, but the console semaphore is highly contended, so
might result in too many false positives for practical use.

Cheers, Daniel

> On (07/06/17 19:28), Tetsuo Handa wrote:
>> (...snipped...)
>> [  912.892027] kworker/3:3     D12120  3217      2 0x00000080
>> [  912.892041] Workqueue: events console_callback
>> [  912.892042] Call Trace:
>> [  912.892047]  __schedule+0x23f/0x5d0
>> [  912.892051]  schedule+0x31/0x80
>> [  912.892056]  schedule_preempt_disabled+0x9/0x10
>> [  912.892061]  __mutex_lock.isra.2+0x2ac/0x4d0
>> [  912.892068]  __mutex_lock_slowpath+0xe/0x10
>> [  912.892072]  ? __mutex_lock_slowpath+0xe/0x10
>> [  912.892077]  mutex_lock+0x2a/0x30
>> [  912.892105]  vmw_fb_pan_display+0x35/0x90 [vmwgfx]
>> [  912.892114]  fb_pan_display+0xca/0x160
>> [  912.892118]  bit_update_start+0x1b/0x40
>> [  912.892123]  fbcon_switch+0x4a6/0x630
>> [  912.892128]  redraw_screen+0x15a/0x240
>> [  912.892132]  ? update_attr.isra.3+0x90/0x90
>> [  912.892139]  complete_change_console+0x3d/0xd0
>> [  912.892143]  change_console+0x57/0x90
>> [  912.892147]  console_callback+0x116/0x190
>> [  912.892153]  process_one_work+0x1f5/0x390
>> [  912.892156]  worker_thread+0x46/0x410
>> [  912.892161]  ? __schedule+0x247/0x5d0
>> [  912.892165]  kthread+0xff/0x140
>> [  912.892170]  ? process_one_work+0x390/0x390
>> [  912.892174]  ? kthread_create_on_node+0x60/0x60
>> [  912.892178]  ? do_syscall_64+0x13a/0x140
>> [  912.892181]  ret_from_fork+0x25/0x30
>> (...snipped...)
>> [  912.934633] kworker/0:0     D12824  4263      2 0x00000080
>> [  912.934643] Workqueue: events vmw_fb_dirty_flush [vmwgfx]
>> [  912.934643] Call Trace:
>> [  912.934645]  __schedule+0x23f/0x5d0
>> [  912.934646]  schedule+0x31/0x80
>> [  912.934647]  schedule_timeout+0x189/0x290
>> [  912.934649]  ? del_timer_sync+0x40/0x40
>> [  912.934650]  io_schedule_timeout+0x19/0x40
>> [  912.934651]  ? io_schedule_timeout+0x19/0x40
>> [  912.934653]  congestion_wait+0x7d/0xd0
>> [  912.934654]  ? wait_woken+0x80/0x80
>> [  912.934654]  shrink_inactive_list+0x3e3/0x4d0
>> [  912.934656]  shrink_node_memcg+0x360/0x780
>> [  912.934657]  ? list_lru_count_one+0x65/0x70
>> [  912.934658]  shrink_node+0xdc/0x310
>> [  912.934658]  ? shrink_node+0xdc/0x310
>> [  912.934659]  do_try_to_free_pages+0xea/0x370
>> [  912.934660]  try_to_free_pages+0xc3/0x100
>> [  912.934661]  __alloc_pages_slowpath+0x441/0xd50
>> [  912.934663]  ? ___slab_alloc+0x1b6/0x590
>> [  912.934664]  __alloc_pages_nodemask+0x20c/0x250
>> [  912.934665]  alloc_pages_current+0x65/0xd0
>> [  912.934666]  new_slab+0x472/0x600
>> [  912.934668]  ___slab_alloc+0x41b/0x590
>> [  912.934685]  ? drm_modeset_lock_all+0x1b/0xa0 [drm]
>> [  912.934691]  ? drm_modeset_lock_all+0x1b/0xa0 [drm]
>> [  912.934692]  __slab_alloc+0x1b/0x30
>> [  912.934693]  ? __slab_alloc+0x1b/0x30
>> [  912.934694]  kmem_cache_alloc+0x16b/0x1c0
>> [  912.934699]  drm_modeset_lock_all+0x1b/0xa0 [drm]
>> [  912.934702]  vmw_framebuffer_dmabuf_dirty+0x47/0x1d0 [vmwgfx]
>> [  912.934706]  vmw_fb_dirty_flush+0x229/0x270 [vmwgfx]
>> [  912.934708]  process_one_work+0x1f5/0x390
>> [  912.934709]  worker_thread+0x46/0x410
>> [  912.934710]  ? __schedule+0x247/0x5d0
>> [  912.934711]  kthread+0xff/0x140
>> [  912.934712]  ? process_one_work+0x390/0x390
>> [  912.934713]  ? kthread_create_on_node+0x60/0x60
>> [  912.934714]  ret_from_fork+0x25/0x30
>>
>> Pressing SysRq-c caused all locks to be released (doesn't it ?), and console
>
> hm, I think what happened is a bit different thing. sysrq-c didn't
> unlock any of the locks. I suspect that ->bo_mutex is never taken
> on the direct path vprintk_emit()->console_unlock()->call_console_drivers(),
> otherwise it would have made vprintk_emit() from atomic context impossible.
> so ->bo_mutex does not directly affect printk. it affects it indirectly.
> the root cause, however, I think, is actually console semaphore and
> console_lock() in change_console(). printk() depends on it a lot, so do
> drm/tty/etc. as long as the console semaphore is locked, printk can only
> add new messages to the logbuf. and this is what happened here, under
> console_sem we scheduled on ->bo_mutex, which was locked because of memory
> allocation on another CPU, yes. you see lost messages in your report
> because part of printk that is responsible for storing new messages was
> working just fine; it's the output to consoles that was blocked by
> console_sem -> bo_mutex chain.
>
> the reason why sysrq-c helped was because, sysrq-c did
>
>         panic_on_oops = 1
>         panic()
>
> and panic() called console_flush_on_panic(), which completely ignored the
> state of console semaphore and just flushed all the pending logbuf
> messages.
>
>         console_trylock();
>         console_unlock();
>
> so, I believe, console_semaphore remained locked just like it was before
> sysrq-c, and ->bo_mutex most likely remained locked as well. it's just we
> ignored the state of console_sem and this let us to print the messages
> (which also proves that ->bo_mutex is not taken by
> console_unlock()->call_console_drivers()).
>
> [..]
>> Since vmw_fb_dirty_flush was stalling for 130989 jiffies,
>> vmw_fb_dirty_flush started stalling at uptime = 782. And
>> drm_modeset_lock_all() from vmw_fb_dirty_flush work started
>> GFP_KERNEL memory allocation
>>
>> ----------
>> void drm_modeset_lock_all(struct drm_device *dev)
>> {
>>         struct drm_mode_config *config = &dev->mode_config;
>>         struct drm_modeset_acquire_ctx *ctx;
>>         int ret;
>>
>>         ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
>>         if (WARN_ON(!ctx))
>>                 return;
>
> hm, this allocation, per se, looks ok to me. can't really blame it.
> what you had is a combination of factors
>
>         CPU0                    CPU1                            CPU2
>                                                                 console_callback()
>                                                                  console_lock()
>                                                                  ^^^^^^^^^^^^^
>         vprintk_emit()          mutex_lock(&par->bo_mutex)
>                                  kzalloc(GFP_KERNEL)
>          console_trylock()        kmem_cache_alloc()              mutex_lock(&par->bo_mutex)
>          ^^^^^^^^^^^^^^^^          io_schedule_timeout
>
> // but I haven't seen the logs that you have provided, yet.
>
> [..]
>> As a result, console was not able to print SysRq-t output.
>>
>> So, how should we avoid this problem?
>
> from the top of my head -- console_sem must be replaced with something
> better. but that's a task for years.
>
> hm...
>
>> But should fbcon, drm, tty and so on stop using __GFP_DIRECT_RECLAIM
>> memory allocations because consoles should be as responsive as printk() ?
>
> may be, may be not. like I said, the allocation in question does not
> participate in console output. it's rather hard to imagine how we would
> enforce a !__GFP_DIRECT_RECLAIM requirement here. it's console semaphore
> to blame, I think.
>
> if we could unlock console for some of operations done under ->bo_mutex,
> then may be we could make some printing progress, at least. but I have
> zero knowledge of that part of the kernel.
>
>         -ss



-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-07  2:39 ` Sergey Senozhatsky
  2017-07-07  8:41     ` Daniel Vetter
@ 2017-07-08 13:30   ` Tetsuo Handa
  2017-07-10  5:07     ` Sergey Senozhatsky
  2017-07-10 12:59     ` Petr Mladek
  2017-07-10 13:33   ` Michal Hocko
  2 siblings, 2 replies; 17+ messages in thread
From: Tetsuo Handa @ 2017-07-08 13:30 UTC (permalink / raw)
  To: sergey.senozhatsky.work
  Cc: sergey.senozhatsky, pmladek, mhocko, pavel, rostedt, andi, jack,
	dri-devel, linux-mm, daniel.vetter

Sergey Senozhatsky wrote:
> On (07/06/17 19:28), Tetsuo Handa wrote:
> > Pressing SysRq-c caused all locks to be released (doesn't it ?), and console
> 
> hm, I think what happened is a bit different thing. sysrq-c didn't
> unlock any of the locks. I suspect that ->bo_mutex is never taken
> on the direct path vprintk_emit()->console_unlock()->call_console_drivers(),
> otherwise it would have made vprintk_emit() from atomic context impossible.
> so ->bo_mutex does not directly affect printk. it affects it indirectly.
> the root cause, however, I think, is actually console semaphore and
> console_lock() in change_console(). printk() depends on it a lot, so do
> drm/tty/etc. as long as the console semaphore is locked, printk can only
> add new messages to the logbuf. and this is what happened here, under
> console_sem we scheduled on ->bo_mutex, which was locked because of memory
> allocation on another CPU, yes. you see lost messages in your report
> because part of printk that is responsible for storing new messages was
> working just fine; it's the output to consoles that was blocked by
> console_sem -> bo_mutex chain.
> 
> the reason why sysrq-c helped was because, sysrq-c did
> 
> 	panic_on_oops = 1
> 	panic()
> 
> and panic() called console_flush_on_panic(), which completely ignored the
> state of console semaphore and just flushed all the pending logbuf
> messages.
> 
> 	console_trylock();
> 	console_unlock();
> 
> so, I believe, console_semaphore remained locked just like it was before
> sysrq-c, and ->bo_mutex most likely remained locked as well. it's just we
> ignored the state of console_sem and this let us to print the messages
> (which also proves that ->bo_mutex is not taken by
> console_unlock()->call_console_drivers()).

Indeed, you are right. No need to unlock all locks, and should not change
the state of locks for capturing vmcore for analysis.

Today I was testing below OOM program in order to demonstrate how wasting
CPU time via direct reclaim / compaction negatively affects when oom_lock is
already held by somebody else, I again hit this problem when the system is
under OOM situation (i.e. oom_lock is already held).

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
	struct sched_param sp = { };
	cpu_set_t cpu = { { 1 } };
	static int pipe_fd[2] = { EOF, EOF };
	unsigned long size = 65536;
	char *buf = malloc(size);
	unsigned long i = 0;
	signal(SIGCLD, SIG_IGN);
	sched_setaffinity(0, sizeof(cpu), &cpu);
	prctl(PR_SET_NAME, (unsigned long) "normal-priority", 0, 0, 0);
	pipe(pipe_fd);
	for (i = 0; i < 10; i++)
		if (fork() == 0) {
			if (i)
				close(pipe_fd[1]);
			/* Wait until first child gets SIGKILL. */
			read(pipe_fd[0], &i, 1);
			/* Join the direct reclaim storm. */
			for (i = 0; i < size; i += 4096)
				buf[i] = 0;
			_exit(0);
		}
	for (size = 512; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sched_setscheduler(0, SCHED_IDLE, &sp);
	prctl(PR_SET_NAME, (unsigned long) "idle-priority", 0, 0, 0);
	close(pipe_fd[1]);
	for (i = 0; i < size; i += 4096)
		buf[i] = 0; /* Will cause OOM due to overcommit */
	kill(-1, SIGKILL);
	return 0; /* Not reached. */
}
----------

The only difference against linux-next-20170707 is that I temporarily disabled
stall warning (using below change), for warn_alloc() trivially causes printk()
lockup inside the OOM killer with oom_lock held.

----------
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3899,7 +3899,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		goto nopage;
 
 	/* Make sure we know about allocations which stall for too long */
-	if (time_after(jiffies, alloc_start + stall_timeout)) {
+	if (0 && time_after(jiffies, alloc_start + stall_timeout)) {
 		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
 			"page allocation stalls for %ums, order:%u",
 			jiffies_to_msecs(jiffies-alloc_start), order);
----------

Excuse me, I forgot to save bootup messages into serial.log file.
----------
[  324.038782] Out of memory: Kill process 2786 (idle-priority) score 942 or sacrifice child
[  324.041336] Killed process 2786 (idle-priority) total-vm:4264140kB, anon-rss:3508120kB, file-rss:4kB, shmem-rss:0kB
[  324.571323] oom_reaper: reaped process 2786 (idle-priority), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  350.140740] idle-priority invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[  350.144370] idle-priority cpuset=/ mems_allowed=0
[  350.146357] CPU: 0 PID: 2798 Comm: idle-priority Not tainted 4.12.0-next-20170707+ #620
[  350.147344] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  350.147344] Call Trace:
[  350.147344]  dump_stack+0x67/0x9e
[  350.147344]  dump_header+0x9d/0x3fa
[  350.147344]  ? trace_hardirqs_on+0xd/0x10
[  350.147344]  oom_kill_process+0x226/0x650
[  350.147344]  out_of_memory+0x136/0x560
[  350.147344]  ? out_of_memory+0x206/0x560
[  350.147344]  __alloc_pages_nodemask+0xcd2/0xe50
[  350.147344]  alloc_pages_vma+0x76/0x1a0
[  350.147344]  __handle_mm_fault+0xdff/0x1180
[  350.147344]  handle_mm_fault+0x186/0x360
[  350.147344]  ? handle_mm_fault+0x44/0x360
[  350.147344]  __do_page_fault+0x1da/0x510
[  350.147344]  do_page_fault+0x21/0x70
[  350.147344]  page_fault+0x22/0x30
[  350.147344] RIP: 0033:0x4008b8
[  350.147344] RSP: 002b:00007ffe9feca070 EFLAGS: 00010206
[  350.147344] RAX: 00000000d61d9000 RBX: 0000000100000000 RCX: 00007fb31dca3bd0
[  350.147344] RDX: 0000000000000000 RSI: 0000000000400ae0 RDI: 0000000000000004
[  350.147344] RBP: 00007fb11ddaf010 R08: 0000000000000000 R09: 0000000000021000
[  350.147344] R10: 00007ffe9fec9df0 R11: 0000000000000246 R12: 0000000000000006
[  350.147344] R13: 00007fb11ddaf010 R14: 0000000000000000 R15: 0000000000000000
[  350.199327] Mem-Info:
[  350.201499] active_anon:878748 inactive_anon:1643 isolated_anon:0
[  350.201499]  active_file:0 inactive_file:0 isolated_file:0
[  350.201499]  unevictable:0 dirty:0 writeback:0 unstable:0
[  350.201499]  slab_reclaimable:0 slab_unreclaimable:0
[  350.201499]  mapped:1124 shmem:2152 pagetables:1990 bounce:0
[  350.201499]  free:21375 free_pcp:0 free_cma:0
[  350.212232] Node 0 active_anon:3514992kB inactive_anon:6572kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4496kB dirty:0kB writeback:0kB shmem:8608kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3328000kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[  350.219383] Node 0 DMA free:14780kB min:288kB low:360kB high:432kB active_anon:1092kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  350.226577] lowmem_reserve[]: 0 2688 3624 3624
[  350.228274] Node 0 DMA32 free:53428kB min:49908kB low:62384kB high:74860kB active_anon:2698464kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2752964kB mlocked:0kB kernel_stack:0kB pagetables:8kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  350.236222] lowmem_reserve[]: 0 0 936 936
[  350.238267] Node 0 Normal free:17292kB min:17384kB low:21728kB high:26072kB active_anon:815436kB inactive_anon:6572kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:958868kB mlocked:0kB kernel_stack:2752kB pagetables:7952kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  350.246639] lowmem_reserve[]: 0 0 0 0
[  350.248437] Node 0 DMA: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 2*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (M) = 14780kB
[  350.253174] Node 0 DMA32: 1*4kB (M) 4*8kB (U) 5*16kB (U) 6*32kB (UM) 2*64kB (UM) 4*128kB (UM) 7*256kB (M) 5*512kB (UM) 3*1024kB (M) 2*2048kB (UM) 10*4096kB (UM) = 53428kB
[  350.258555] Node 0 Normal: 47*4kB (UMH) 70*8kB (UMH) 100*16kB (UMEH) 107*32kB (EH) 68*64kB (UMEH) 24*128kB (UEH) 6*256kB (EH) 1*512kB (U) 2*1024kB (M) 0*2048kB 0*4096kB = 17292kB
[  350.263669] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  350.266375] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  350.268956] 2152 total pagecache pages
[  350.273000] 0 pages in swap cache
[  350.274823] Swap cache stats: add 0, delete 0, find 0/0
[  350.276871] Free swap  = 0kB
[  350.278372] Total swap = 0kB
[  350.279833] 1048445 pages RAM
[  350.281552] 0 pages HighMem/MovableOnly
[  350.284522] 116511 pages reserved
[  350.286255] 0 pages hwpoisoned
[  350.288789] Out of memory: Kill process 2798 (idle-priority) score 942 or sacrifice child
[  350.291615] Killed process 2799 (normal-priority) total-vm:4360kB, anon-rss:92kB, file-rss:0kB, shmem-rss:0kB
[  360.523274] idle-priority invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
[  378.871399] idle-priority cpuset=/ mems_allowed=0
[  378.873536] CPU: 0 PID: 2798 Comm: idle-priority Not tainted 4.12.0-next-20170707+ #620
[  378.874519] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  378.874519] Call Trace:
[  378.874519]  dump_stack+0x67/0x9e
[  378.874519]  dump_header+0x9d/0x3fa
[  378.874519]  ? trace_hardirqs_on+0xd/0x10
[  378.874519]  oom_kill_process+0x226/0x650
[  378.874519]  out_of_memory+0x136/0x560
[  378.874519]  ? out_of_memory+0x206/0x560
[  378.874519]  __alloc_pages_nodemask+0xcd2/0xe50
[  378.874519]  alloc_pages_vma+0x76/0x1a0
[  378.874519]  __handle_mm_fault+0xdff/0x1180
[  378.874519]  handle_mm_fault+0x186/0x360
[  378.874519]  ? handle_mm_fault+0x44/0x360
[  378.874519]  __do_page_fault+0x1da/0x510
[  378.874519]  do_page_fault+0x21/0x70
[  378.874519]  page_fault+0x22/0x30
[  378.874519] RIP: 0033:0x4008b8
[  378.874519] RSP: 002b:00007ffe9feca070 EFLAGS: 00010206
[  378.874519] RAX: 00000000d61d9000 RBX: 0000000100000000 RCX: 00007fb31dca3bd0
[  378.874519] RDX: 0000000000000000 RSI: 0000000000400ae0 RDI: 0000000000000004
[  378.874519] RBP: 00007fb11ddaf010 R08: 0000000000000000 R09: 0000000000021000
[  378.874519] R10: 00007ffe9fec9df0 R11: 0000000000000246 R12: 0000000000000006
[  378.874519] R13: 00007fb11ddaf010 R14: 0000000000000000 R15: 0000000000000000
[  378.923708] Mem-Info:
[  454.043526] sysrq: SysRq : Show Memory
[  454.043529] Mem-Info:
[  454.043533] active_anon:878741 inactive_anon:1643 isolated_anon:0
[  454.043533]  active_file:0 inactive_file:0 isolated_file:0
[  454.043533]  unevictable:0 dirty:0 writeback:0 unstable:0
[  454.043533]  slab_reclaimable:0 slab_unreclaimable:0
[  454.043533]  mapped:1124 shmem:2152 pagetables:1982 bounce:0
[  454.043533]  free:21348 free_pcp:53 free_cma:0
[  454.043535] Node 0 active_anon:3514964kB inactive_anon:6572kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4496kB dirty:0kB writeback:0kB shmem:8608kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 3328000kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[  454.043536] Node 0 DMA free:14780kB min:288kB low:360kB high:432kB active_anon:1092kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  454.043539] lowmem_reserve[]: 0 2688 3624 3624
[  454.043543] Node 0 DMA32 free:53428kB min:49908kB low:62384kB high:74860kB active_anon:2698464kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129216kB managed:2752964kB mlocked:0kB kernel_stack:0kB pagetables:8kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  454.043546] lowmem_reserve[]: 0 0 936 936
[  454.043550] Node 0 Normal free:17184kB min:17384kB low:21728kB high:26072kB active_anon:815408kB inactive_anon:6572kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1048576kB managed:958868kB mlocked:0kB kernel_stack:2736kB pagetables:7920kB bounce:0kB free_pcp:212kB local_pcp:0kB free_cma:0kB
[  454.043553] lowmem_reserve[]: 0 0 0 0
[  454.043556] Node 0 DMA: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 2*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (M) = 14780kB
[  454.043591] Node 0 DMA32: 1*4kB (M) 4*8kB (U) 5*16kB (U) 6*32kB (UM) 2*64kB (UM) 4*128kB (UM) 7*256kB (M) 5*512kB (UM) 3*1024kB (M) 2*2048kB (UM) 10*4096kB (UM) = 53428kB
[  454.043610] Node 0 Normal: 18*4kB (MH) 69*8kB (UMH) 101*16kB (UMEH) 107*32kB (EH) 68*64kB (UMEH) 24*128kB (UEH) 6*256kB (EH) 1*512kB (U) 2*1024kB (M) 0*2048kB 0*4096kB = 17184kB
[  454.043638] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  454.043639] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  454.043640] 2152 total pagecache pages
[  454.043651] 0 pages in swap cache
[  454.043652] Swap cache stats: add 0, delete 0, find 0/0
[  454.043653] Free swap  = 0kB
[  454.043654] Total swap = 0kB
[  454.043658] 1048445 pages RAM
[  454.043659] 0 pages HighMem/MovableOnly
[  454.043660] 116511 pages reserved
[  454.043661] 0 pages hwpoisoned
[  462.683506] sysrq: SysRq : HELP : loglevel(0-9) reboot(b) crash(c) show-all-locks(d) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) show-blocked-tasks(w) 
[  469.931550] sysrq: SysRq : Trigger a crash
[  469.931556] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  469.931563] IP: sysrq_handle_crash+0x42/0x80
[  469.931564] PGD 1377d1067 
[  469.931565] P4D 1377d1067 
[  469.931565] PUD 0 
[  469.931566] 
[  469.931567] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[  469.931569] Modules linked in: pcspkr coretemp sg shpchp i2c_piix4 vmw_vmci sd_mod ata_generic pata_acpi serio_raw vmwgfx drm_kms_helper syscopyarea mptspi scsi_transport_spi sysfillrect mptscsih sysimgblt fb_sys_fops ttm ahci e1000 libahci drm ata_piix mptbase i2c_core libata ipv6
[  469.931591] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.12.0-next-20170707+ #620
[  469.931592] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  469.931593] task: ffff880138ef8040 task.stack: ffff880138ef4000
[  469.931595] RIP: 0010:sysrq_handle_crash+0x42/0x80
[  469.931596] RSP: 0000:ffff88013a203b50 EFLAGS: 00010086
[  469.931597] RAX: 0000000000000000 RBX: ffffffff81c793c0 RCX: 000000002e837f6c
[  469.931598] RDX: 0000000000000003 RSI: 00000000b36b48a8 RDI: ffff880138ef8040
[  469.931598] RBP: ffff88013a203b50 R08: ffff880138ef8930 R09: ffff880138ef88f8
[  469.931599] R10: 0000000000000000 R11: 00000000e8aa6fec R12: 0000000000000063
[  469.931599] R13: 0000000000000001 R14: 000000000000000a R15: ffff880138af2530
[  469.931600] FS:  0000000000000000(0000) GS:ffff88013a200000(0000) knlGS:0000000000000000
[  469.931601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  469.931651] CR2: 0000000000000000 CR3: 0000000136f3c000 CR4: 00000000001406e0
[  469.931672] Call Trace:
[  469.931674]  <IRQ>
[  469.931677]  __handle_sysrq+0x138/0x220
[  469.931679]  ? __sysrq_get_key_op+0x30/0x30
[  469.931682]  sysrq_filter+0x372/0x3b0
[  469.931686]  input_to_handler+0x52/0x100
[  469.931689]  input_pass_values.part.5+0x1bb/0x260
[  469.931691]  ? input_devices_seq_next+0x20/0x20
[  469.931693]  input_handle_event+0xcb/0x590
[  469.931696]  input_event+0x4f/0x70
[  469.931699]  atkbd_interrupt+0x5c0/0x6a0
[  469.931702]  serio_interrupt+0x41/0x80
[  469.931705]  i8042_interrupt+0x1da/0x3a0
[  469.931710]  __handle_irq_event_percpu+0x31/0xe0
[  469.931713]  handle_irq_event_percpu+0x2d/0x70
[  469.931715]  handle_irq_event+0x34/0x60
[  469.931718]  handle_edge_irq+0x99/0x120
[  469.931721]  handle_irq+0x5d/0x120
[  469.931724]  do_IRQ+0x59/0x110
[  469.931727]  common_interrupt+0x9a/0x9a
[  469.931730] RIP: 0010:__do_softirq+0x9b/0x220
[  469.931731] RSP: 0000:ffff88013a203f70 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff8b
[  469.931732] RAX: ffff880138ef8040 RBX: ffff880138ef8040 RCX: 0000000000000000
[  469.931733] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880138ef8040
[  469.931734] RBP: ffff88013a203fb8 R08: 0000000000000000 R09: 0000000000000000
[  469.931734] R10: 0000000000000001 R11: 00000000ffffffff R12: ffff880138ef8040
[  469.931735] R13: 0000000000000002 R14: 0000000000000002 R15: ffff880138ef8040
[  469.931744]  irq_exit+0xcf/0xf0
[  469.931747]  smp_apic_timer_interrupt+0x38/0x50
[  469.931749]  apic_timer_interrupt+0x9a/0xa0
[  469.931750] RIP: 0010:default_idle+0xb/0x10
[  469.931751] RSP: 0000:ffff880138ef7ea0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff10
[  469.931752] RAX: ffff880138ef8040 RBX: ffff880138ef8040 RCX: 0000000000000001
[  469.931753] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880138ef8040
[  469.931754] RBP: ffff880138ef7ea0 R08: 0000000000000000 R09: 0000000000000000
[  469.931755] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880138ef8040
[  469.931755] R13: ffff880138ef8040 R14: 0000000000000000 R15: 0000000000000000
[  469.931756]  </IRQ>
[  469.931764]  arch_cpu_idle+0xa/0x10
[  469.931766]  default_idle_call+0x1e/0x30
[  469.931768]  do_idle+0x158/0x1e0
[  469.931771]  cpu_startup_entry+0x6e/0x80
[  469.931773]  start_secondary+0x15f/0x190
[  469.931777]  secondary_startup_64+0x9f/0x9f
[  469.931782] Code: c7 c2 76 55 41 81 be 01 00 00 00 48 c7 c7 a0 0b c5 81 65 ff 0d 08 70 bf 7e e8 6b 60 ca ff c7 05 f1 37 ba 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 e8 3f 4b cc ff 84 c0 75 c1 48 c7 
[  469.931818] RIP: sysrq_handle_crash+0x42/0x80 RSP: ffff88013a203b50
[  469.931819] CR2: 0000000000000000
[  469.931831] ---[ end trace 3f8fe8ab749c5fb1 ]---
[  469.931832] Kernel panic - not syncing: Fatal exception in interrupt
[  469.931892] Kernel Offset: disabled
----------

I noticed that console output stopped at uptime = 379. At first, I thought
that it is just a random delay caused by priority problem. But I pressed
SysRq-m at uptime = 454 because nothing was printed to console for more than
one minute. I started to suspect that I hit this problem rather than a random
delay. Thus, I pressed SysRq-h at uptime = 462 in order to confirm it, and
pressed SysRq-c at uptime = 469 in order to flush console output.

We can find that the OOM killer was invoked for the last time at uptime = 360,
and "Node 0 Normal free:" was still below min: watermark at uptime = 454. This
means that the OOM killer was not able to send SIGKILL for at least 94 seconds
on a CONFIG_PREEMPT_VOLUNTARY=y kernel. But that is not what I want to mention here.

What I want to mention here is that messages which were sent to printk()
were not printed to not only /dev/tty0 but also /dev/ttyS0 (I'm passing
"console=ttyS0,115200n8 console=tty0" to kernel command line.) I don't care
if output to /dev/tty0 is delayed, but I expect that output to /dev/ttyS0
is not delayed, for I'm anayzing things using printk() output sent to serial
console (serial.log in my VMware configuration). Hitting this problem when we
cannot allocate memory results in failing to save printk() output. Oops, it
is sad.

> 
> [..]
> > Since vmw_fb_dirty_flush was stalling for 130989 jiffies,
> > vmw_fb_dirty_flush started stalling at uptime = 782. And
> > drm_modeset_lock_all() from vmw_fb_dirty_flush work started
> > GFP_KERNEL memory allocation
> > 
> > ----------
> > void drm_modeset_lock_all(struct drm_device *dev)
> > {
> >         struct drm_mode_config *config = &dev->mode_config;
> >         struct drm_modeset_acquire_ctx *ctx;
> >         int ret;
> > 
> >         ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> >         if (WARN_ON(!ctx))
> >                 return;
> 
> hm, this allocation, per se, looks ok to me. can't really blame it.
> what you had is a combination of factors
> 
> 	CPU0			CPU1				CPU2
> 								console_callback()
> 								 console_lock()
> 								 ^^^^^^^^^^^^^
> 	vprintk_emit()		mutex_lock(&par->bo_mutex)
> 				 kzalloc(GFP_KERNEL)
> 	 console_trylock()	  kmem_cache_alloc()		  mutex_lock(&par->bo_mutex)
> 	 ^^^^^^^^^^^^^^^^	   io_schedule_timeout
> 
> // but I haven't seen the logs that you have provided, yet.
> 
> [..]
> > As a result, console was not able to print SysRq-t output.
> > 
> > So, how should we avoid this problem?
> 
> from the top of my head -- console_sem must be replaced with something
> better. but that's a task for years.
> 
> hm...
> 
> > But should fbcon, drm, tty and so on stop using __GFP_DIRECT_RECLAIM
> > memory allocations because consoles should be as responsive as printk() ?
> 
> may be, may be not. like I said, the allocation in question does not
> participate in console output. it's rather hard to imagine how we would
> enforce a !__GFP_DIRECT_RECLAIM requirement here. it's console semaphore
> to blame, I think.
> 
> if we could unlock console for some of operations done under ->bo_mutex,
> then may be we could make some printing progress, at least. but I have
> zero knowledge of that part of the kernel.
> 
> 	-ss
> 

I thought that we can use GFP_ATOMIC | __GFP_NOWARN (or static allocation;
wasting 16 pages or so for things like "struct drm_modeset_acquire_ctx" won't
become a problem, will it?) for short term, and fix console_sem dependency for
long term. But according to Daniel's reply, GFP_ATOMIC | __GFP_NOWARN is not
acceptable? Hmm... should we consider addressing console_sem problem before
introducing printing kernel thread and offloading to that kernel thread?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-08 13:30   ` Tetsuo Handa
@ 2017-07-10  5:07     ` Sergey Senozhatsky
  2017-07-10 12:59     ` Petr Mladek
  1 sibling, 0 replies; 17+ messages in thread
From: Sergey Senozhatsky @ 2017-07-10  5:07 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sergey.senozhatsky.work, sergey.senozhatsky, pmladek, mhocko,
	pavel, rostedt, andi, jack, dri-devel, linux-mm, daniel.vetter

Hello,

On (07/08/17 22:30), Tetsuo Handa wrote:
> Hmm... should we consider addressing console_sem problem before
> introducing printing kernel thread and offloading to that kernel
> thread?

printk-kthread addresses a completely different set of problems.

console_sem is hard to fix quickly, because it involves rework of
tty/fbcon/drm/etc/etc/etc sub-systems; printk is the easiest part
here...

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-08 13:30   ` Tetsuo Handa
  2017-07-10  5:07     ` Sergey Senozhatsky
@ 2017-07-10 12:59     ` Petr Mladek
  2017-07-10 18:07         ` Daniel Vetter
  1 sibling, 1 reply; 17+ messages in thread
From: Petr Mladek @ 2017-07-10 12:59 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: sergey.senozhatsky.work, sergey.senozhatsky, mhocko, pavel,
	rostedt, andi, jack, dri-devel, linux-mm, daniel.vetter

On Sat 2017-07-08 22:30:47, Tetsuo Handa wrote:
> What I want to mention here is that messages which were sent to printk()
> were not printed to not only /dev/tty0 but also /dev/ttyS0 (I'm passing
> "console=ttyS0,115200n8 console=tty0" to kernel command line.) I don't care
> if output to /dev/tty0 is delayed, but I expect that output to /dev/ttyS0
> is not delayed, for I'm anayzing things using printk() output sent to serial
> console (serial.log in my VMware configuration). Hitting this problem when we
> cannot allocate memory results in failing to save printk() output. Oops, it
> is sad.

Would it be acceptable to remove "console=tty0" parameter and push
the messages only to the serial console?

Also there is the patchset from Peter Zijlstra that allows to
use early console all the time, see
https://lkml.kernel.org/r/20161018170830.405990950@infradead.org


The current code flushes each line to all enabled consoles one
by one. If there is a deadlock in one console, everything
gets blocked.

We are trying to make printk() more robust. But it is much more
complicated than we anticipated. Many changes open another can
of worms. It seems to be a job for years.


> Hmm... should we consider addressing console_sem problem before
> introducing printing kernel thread and offloading to that kernel thread?

As Sergey said, the console rework seems to be much bigger task
than introducing the kthread.

Also if we would want to handle each console separately (as a
fallback) it would be helpful to have separate kthread for each
enabled console or for the less reliable consoles at least.

Best Regards,
Petr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-07  2:39 ` Sergey Senozhatsky
  2017-07-07  8:41     ` Daniel Vetter
  2017-07-08 13:30   ` Tetsuo Handa
@ 2017-07-10 13:33   ` Michal Hocko
  2 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2017-07-10 13:33 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Tetsuo Handa, sergey.senozhatsky, pmladek, pavel, rostedt, andi,
	jack, dri-devel, linux-mm, Daniel Vetter

On Fri 07-07-17 11:39:18, Sergey Senozhatsky wrote:
[...]
> > void drm_modeset_lock_all(struct drm_device *dev)
> > {
> >         struct drm_mode_config *config = &dev->mode_config;
> >         struct drm_modeset_acquire_ctx *ctx;
> >         int ret;
> > 
> >         ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
> >         if (WARN_ON(!ctx))
> >                 return;
> 
> hm, this allocation, per se, looks ok to me. can't really blame it.
> what you had is a combination of factors
> 
> 	CPU0			CPU1				CPU2
> 								console_callback()
> 								 console_lock()
> 								 ^^^^^^^^^^^^^
> 	vprintk_emit()		mutex_lock(&par->bo_mutex)
> 				 kzalloc(GFP_KERNEL)
> 	 console_trylock()	  kmem_cache_alloc()		  mutex_lock(&par->bo_mutex)
> 	 ^^^^^^^^^^^^^^^^	   io_schedule_timeout
> 
> // but I haven't seen the logs that you have provided, yet.
> 
> [..]
> > As a result, console was not able to print SysRq-t output.
> > 
> > So, how should we avoid this problem?
> 
> from the top of my head -- console_sem must be replaced with something
> better.

Yeah, absolutely. The current mess just allows basically arbitrary lock
depencies which are not deadlocks because the printk part is careful but
essentially we are deadlocked wrt. functionality.

> but that's a task for years.
> 
> hm...
> 
> > But should fbcon, drm, tty and so on stop using __GFP_DIRECT_RECLAIM
> > memory allocations because consoles should be as responsive as printk() ?
> 
> may be, may be not. like I said, the allocation in question does not
> participate in console output. it's rather hard to imagine how we would
> enforce a !__GFP_DIRECT_RECLAIM requirement here. it's console semaphore
> to blame, I think.

Agreed! Looking at the problem just from the page allocator perspective
is simply wrong. That is where you see your immediate problem because
that is what you are testing I would bet my hat you can find other
interesting scenarios if you try too hard...

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-10 12:59     ` Petr Mladek
@ 2017-07-10 18:07         ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-10 18:07 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Tetsuo Handa, Sergey Senozhatsky, Sergey Senozhatsky,
	Michal Hocko, Pavel Machek, Steven Rostedt, Andreas Mohr,
	Jan Kara, dri-devel, Linux MM

On Mon, Jul 10, 2017 at 2:59 PM, Petr Mladek <pmladek@suse.com> wrote:
> On Sat 2017-07-08 22:30:47, Tetsuo Handa wrote:
>> What I want to mention here is that messages which were sent to printk()
>> were not printed to not only /dev/tty0 but also /dev/ttyS0 (I'm passing
>> "console=ttyS0,115200n8 console=tty0" to kernel command line.) I don't care
>> if output to /dev/tty0 is delayed, but I expect that output to /dev/ttyS0
>> is not delayed, for I'm anayzing things using printk() output sent to serial
>> console (serial.log in my VMware configuration). Hitting this problem when we
>> cannot allocate memory results in failing to save printk() output. Oops, it
>> is sad.
>
> Would it be acceptable to remove "console=tty0" parameter and push
> the messages only to the serial console?
>
> Also there is the patchset from Peter Zijlstra that allows to
> use early console all the time, see
> https://lkml.kernel.org/r/20161018170830.405990950@infradead.org
>
>
> The current code flushes each line to all enabled consoles one
> by one. If there is a deadlock in one console, everything
> gets blocked.
>
> We are trying to make printk() more robust. But it is much more
> complicated than we anticipated. Many changes open another can
> of worms. It seems to be a job for years.
>
>
>> Hmm... should we consider addressing console_sem problem before
>> introducing printing kernel thread and offloading to that kernel thread?
>
> As Sergey said, the console rework seems to be much bigger task
> than introducing the kthread.
>
> Also if we would want to handle each console separately (as a
> fallback) it would be helpful to have separate kthread for each
> enabled console or for the less reliable consoles at least.

Since the console-loggin-in-kthread comes up routinely, and equally
often people say "but I dont want to make my serial console delayed":
Should we make kthread-based printk a per-console opt-in? fbcon and
other horror shows with deep nesting of entire subsystems and their
locking hierarchy would do that. Truly simple console drivers like
serial or maybe logging to some firmware/platform service for recovery
after rebooting would not.

Of course we'd also need one kthread per console, and we'd need to
have at least some per-console locking (plus an overall console lock
on top for both registering/unregistering consoles and all the legacy
users like fbdev that need much more work to untangle). We could even
restrict the per-console locking (i.e. those which can go ahead while
someone else is holding the main or other console_locks) just for
those console drivers which do not use a kthread, to cut down the
audit burden to something manageable.

Just my 2 cents, thrown in from the sideline.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
@ 2017-07-10 18:07         ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-10 18:07 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Jan Kara, Sergey Senozhatsky, Linux MM, Tetsuo Handa, dri-devel,
	Steven Rostedt, Michal Hocko, Sergey Senozhatsky, Andreas Mohr,
	Pavel Machek

On Mon, Jul 10, 2017 at 2:59 PM, Petr Mladek <pmladek@suse.com> wrote:
> On Sat 2017-07-08 22:30:47, Tetsuo Handa wrote:
>> What I want to mention here is that messages which were sent to printk()
>> were not printed to not only /dev/tty0 but also /dev/ttyS0 (I'm passing
>> "console=ttyS0,115200n8 console=tty0" to kernel command line.) I don't care
>> if output to /dev/tty0 is delayed, but I expect that output to /dev/ttyS0
>> is not delayed, for I'm anayzing things using printk() output sent to serial
>> console (serial.log in my VMware configuration). Hitting this problem when we
>> cannot allocate memory results in failing to save printk() output. Oops, it
>> is sad.
>
> Would it be acceptable to remove "console=tty0" parameter and push
> the messages only to the serial console?
>
> Also there is the patchset from Peter Zijlstra that allows to
> use early console all the time, see
> https://lkml.kernel.org/r/20161018170830.405990950@infradead.org
>
>
> The current code flushes each line to all enabled consoles one
> by one. If there is a deadlock in one console, everything
> gets blocked.
>
> We are trying to make printk() more robust. But it is much more
> complicated than we anticipated. Many changes open another can
> of worms. It seems to be a job for years.
>
>
>> Hmm... should we consider addressing console_sem problem before
>> introducing printing kernel thread and offloading to that kernel thread?
>
> As Sergey said, the console rework seems to be much bigger task
> than introducing the kthread.
>
> Also if we would want to handle each console separately (as a
> fallback) it would be helpful to have separate kthread for each
> enabled console or for the less reliable consoles at least.

Since the console-loggin-in-kthread comes up routinely, and equally
often people say "but I dont want to make my serial console delayed":
Should we make kthread-based printk a per-console opt-in? fbcon and
other horror shows with deep nesting of entire subsystems and their
locking hierarchy would do that. Truly simple console drivers like
serial or maybe logging to some firmware/platform service for recovery
after rebooting would not.

Of course we'd also need one kthread per console, and we'd need to
have at least some per-console locking (plus an overall console lock
on top for both registering/unregistering consoles and all the legacy
users like fbdev that need much more work to untangle). We could even
restrict the per-console locking (i.e. those which can go ahead while
someone else is holding the main or other console_locks) just for
those console drivers which do not use a kthread, to cut down the
audit burden to something manageable.

Just my 2 cents, thrown in from the sideline.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-10 18:07         ` Daniel Vetter
  (?)
@ 2017-07-11  2:31         ` Sergey Senozhatsky
  2017-07-11  4:57           ` Sergey Senozhatsky
  -1 siblings, 1 reply; 17+ messages in thread
From: Sergey Senozhatsky @ 2017-07-11  2:31 UTC (permalink / raw)
  To: Daniel Vetter, Petr Mladek
  Cc: Tetsuo Handa, Sergey Senozhatsky, Sergey Senozhatsky,
	Michal Hocko, Pavel Machek, Steven Rostedt, Andreas Mohr,
	Jan Kara, dri-devel, Linux MM

Hello,

On (07/10/17 20:07), Daniel Vetter wrote:
[..]
> > Would it be acceptable to remove "console=tty0" parameter and push
> > the messages only to the serial console?
> >
> > Also there is the patchset from Peter Zijlstra that allows to
> > use early console all the time, see
> > https://lkml.kernel.org/r/20161018170830.405990950@infradead.org
> >
> >
> > The current code flushes each line to all enabled consoles one
> > by one. If there is a deadlock in one console, everything
> > gets blocked.
> >
> > We are trying to make printk() more robust. But it is much more
> > complicated than we anticipated. Many changes open another can
> > of worms. It seems to be a job for years.
> >
> >
> >> Hmm... should we consider addressing console_sem problem before
> >> introducing printing kernel thread and offloading to that kernel thread?
> >
> > As Sergey said, the console rework seems to be much bigger task
> > than introducing the kthread.
> >
> > Also if we would want to handle each console separately (as a
> > fallback) it would be helpful to have separate kthread for each
> > enabled console or for the less reliable consoles at least.
> 
> Since the console-loggin-in-kthread comes up routinely, and equally
> often people say "but I dont want to make my serial console delayed":
> Should we make kthread-based printk a per-console opt-in? fbcon and
> other horror shows with deep nesting of entire subsystems and their
> locking hierarchy would do that. Truly simple console drivers like
> serial or maybe logging to some firmware/platform service for recovery
> after rebooting would not.
> 
> Of course we'd also need one kthread per console, and we'd need to
> have at least some per-console locking (plus an overall console lock
> on top for both registering/unregistering consoles and all the legacy
> users like fbdev that need much more work to untangle). We could even
> restrict the per-console locking (i.e. those which can go ahead while
> someone else is holding the main or other console_locks) just for
> those console drivers which do not use a kthread, to cut down the
> audit burden to something manageable.
> 
> Just my 2 cents, thrown in from the sideline.

(replying to both Petr and Daniel)

interesting direction, gents.

and this is what I thought about over the weekend; it's very sketchy and
I didn't spend too much time on it. (I'm on a sick leave now, sorry).

it's quite close to what you guys have mentioned above.

a) keep console_sem only to protect console drivers list modification
b) add a semaphore/mutex to struct console
c) move global console_seq/etc to struct console
e) use a single kthread for printing, but do console_unlock() multi passes,
   printing unseen logbuf messages on per-console basis


so console_lock()/console_unlock() will simply protect console drivers
list from concurrent manipulation; it will not prevent us from printing.
now, there are places where console_lock() serves a special purpose - it
makes sure that no new lines are printed to the console while we scroll
it/flip it/etc. IOW while we do "some things" to a particular console.
the problem here, is that this also blocks printing to all of the registered
console drivers, not just the one we are touching now. therefore, what I was
thinking about is to disable/enable that particular console in all of the
places where we really want to stop printing to this console for a bit.

IOW, something like



	console_lock()
	:	down(console_sem);

	console_disable(con)
	:	lock(con->lock);
	:	con->flags &= ~CON_ENABLED;
	:	unlock(con->lock)

	console_unlock()
	:	for_each_console(con)
	:		while (con->console_seq != log_next_seq) {
	:			msg_print_text();
	:			con->console_seq++;
	:		
	:			call_console_drivers()
	:			:	if (con->flags & CON_ENABLED)
	:			:		con->write()
	:		}
	:	up(console_sem);


	// do "some things" to this console. it's disabled, so no
	// ->write() callback would be called in the meantime

	console_lock()
	:	down(console_sem);

	console_enable(con)
	:	lock(con->lock);
	:	con->flags |= CON_ENABLED;
	:	unlock(con->lock)


	// so now we enabled that console again. it's ->console_seq is
	// probably behind the rest of consoles, so console_unlock()
	// will ->write() all the unseen message to this console.

	console_unlock()
	:	for_each_console(con)
	:		while (con->console_seq != log_next_seq) {
	:			msg_print_text();
	:			con->console_seq++;
	:		
	:			call_console_drivers()
	:			:	if (con->flags & CON_ENABLED)
	:			:		con->write()
	:		}
	:	up(console_sem);


so this does change the behavior. may be even a lot. consoles now will not
look the same, in some cases: some consoles can be ahead, some can be behind
(as long as CON_ENABLED bit is cleared for the "do some things" part).

and this requires a number of changes in fb/tty/etc code. not just
shuffling of console_lock()/console_unlock() calls, but also
console_disable()/console_enable() calls... and we need to pass struct
console to console_disable()/console_enable()...


another thing is, ideally, only !CON_ENABLED consoles will now see
"dropped messages". if some particular console is !CON_ENABLED for long
time, then well, just like it happens now, logbuf may run out of space
and we will drop potentially unseen messages. but with this change, we
will drop messages only on !CON_ENABLED consoles. if there are CON_ENABLED
console(-s), we will print logbuf messages to those consoles. so may be
we have more chances saving the kernel logs now.

just a sketch...

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-11  2:31         ` Sergey Senozhatsky
@ 2017-07-11  4:57           ` Sergey Senozhatsky
  2017-07-11  7:50               ` Daniel Vetter
  0 siblings, 1 reply; 17+ messages in thread
From: Sergey Senozhatsky @ 2017-07-11  4:57 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Daniel Vetter, Petr Mladek, Tetsuo Handa, Sergey Senozhatsky,
	Michal Hocko, Pavel Machek, Steven Rostedt, Andreas Mohr,
	Jan Kara, dri-devel, Linux MM

On (07/11/17 11:31), Sergey Senozhatsky wrote:
[..]
> (replying to both Petr and Daniel)
> 
> interesting direction, gents.
> 
> and this is what I thought about over the weekend; it's very sketchy and
> I didn't spend too much time on it. (I'm on a sick leave now, sorry).
> 
> it's quite close to what you guys have mentioned above.
> 
> a) keep console_sem only to protect console drivers list modification
> b) add a semaphore/mutex to struct console
> c) move global console_seq/etc to struct console
> e) use a single kthread for printing, but do console_unlock() multi passes,
>    printing unseen logbuf messages on per-console basis
> 
> 
> so console_lock()/console_unlock() will simply protect console drivers
> list from concurrent manipulation; it will not prevent us from printing.
> now, there are places where console_lock() serves a special purpose - it
> makes sure that no new lines are printed to the console while we scroll
> it/flip it/etc. IOW while we do "some things" to a particular console.
> the problem here, is that this also blocks printing to all of the registered
> console drivers, not just the one we are touching now. therefore, what I was
> thinking about is to disable/enable that particular console in all of the
> places where we really want to stop printing to this console for a bit.
> 
> IOW, something like
> 
> 
> 
> 	console_lock()
> 	:	down(console_sem);
> 
> 	console_disable(con)
> 	:	lock(con->lock);
> 	:	con->flags &= ~CON_ENABLED;
> 	:	unlock(con->lock)
> 
> 	console_unlock()
> 	:	for_each_console(con)
> 	:		while (con->console_seq != log_next_seq) {
> 	:			msg_print_text();
> 	:			con->console_seq++;
> 	:		
> 	:			call_console_drivers()
> 	:			:	if (con->flags & CON_ENABLED)
> 	:			:		con->write()
> 	:		}
> 	:	up(console_sem);
> 
> 
> 	// do "some things" to this console. it's disabled, so no
> 	// ->write() callback would be called in the meantime
> 
> 	console_lock()
> 	:	down(console_sem);
> 
> 	console_enable(con)
> 	:	lock(con->lock);
> 	:	con->flags |= CON_ENABLED;
> 	:	unlock(con->lock)
> 
> 
> 	// so now we enabled that console again. it's ->console_seq is
> 	// probably behind the rest of consoles, so console_unlock()
> 	// will ->write() all the unseen message to this console.
> 
> 	console_unlock()
> 	:	for_each_console(con)
> 	:		while (con->console_seq != log_next_seq) {
> 	:			msg_print_text();
> 	:			con->console_seq++;
> 	:		
> 	:			call_console_drivers()
> 	:			:	if (con->flags & CON_ENABLED)
> 	:			:		con->write()
> 	:		}
> 	:	up(console_sem);
> 

ok, obviously stupid.

I meant to hold con->lock between console_disable() and console_enable().
so no other CPU can unregister it, etc. printk->console_unlock(), thus,
can either have a racy con->flags check (no con->lock taken) or try
something like down_trylock(&con->lock): if it fails, continue.

but need to look more.

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-11  4:57           ` Sergey Senozhatsky
@ 2017-07-11  7:50               ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-11  7:50 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Daniel Vetter, Petr Mladek, Tetsuo Handa, Sergey Senozhatsky,
	Michal Hocko, Pavel Machek, Steven Rostedt, Andreas Mohr,
	Jan Kara, dri-devel, Linux MM

On Tue, Jul 11, 2017 at 01:57:10PM +0900, Sergey Senozhatsky wrote:
> On (07/11/17 11:31), Sergey Senozhatsky wrote:
> [..]
> > (replying to both Petr and Daniel)
> > 
> > interesting direction, gents.
> > 
> > and this is what I thought about over the weekend; it's very sketchy and
> > I didn't spend too much time on it. (I'm on a sick leave now, sorry).
> > 
> > it's quite close to what you guys have mentioned above.
> > 
> > a) keep console_sem only to protect console drivers list modification
> > b) add a semaphore/mutex to struct console
> > c) move global console_seq/etc to struct console
> > e) use a single kthread for printing, but do console_unlock() multi passes,
> >    printing unseen logbuf messages on per-console basis
> > 
> > 
> > so console_lock()/console_unlock() will simply protect console drivers
> > list from concurrent manipulation; it will not prevent us from printing.
> > now, there are places where console_lock() serves a special purpose - it
> > makes sure that no new lines are printed to the console while we scroll
> > it/flip it/etc. IOW while we do "some things" to a particular console.
> > the problem here, is that this also blocks printing to all of the registered
> > console drivers, not just the one we are touching now. therefore, what I was
> > thinking about is to disable/enable that particular console in all of the
> > places where we really want to stop printing to this console for a bit.
> > 
> > IOW, something like
> > 
> > 
> > 
> > 	console_lock()
> > 	:	down(console_sem);
> > 
> > 	console_disable(con)
> > 	:	lock(con->lock);
> > 	:	con->flags &= ~CON_ENABLED;
> > 	:	unlock(con->lock)
> > 
> > 	console_unlock()
> > 	:	for_each_console(con)
> > 	:		while (con->console_seq != log_next_seq) {
> > 	:			msg_print_text();
> > 	:			con->console_seq++;
> > 	:		
> > 	:			call_console_drivers()
> > 	:			:	if (con->flags & CON_ENABLED)
> > 	:			:		con->write()
> > 	:		}
> > 	:	up(console_sem);
> > 
> > 
> > 	// do "some things" to this console. it's disabled, so no
> > 	// ->write() callback would be called in the meantime
> > 
> > 	console_lock()
> > 	:	down(console_sem);
> > 
> > 	console_enable(con)
> > 	:	lock(con->lock);
> > 	:	con->flags |= CON_ENABLED;
> > 	:	unlock(con->lock)
> > 
> > 
> > 	// so now we enabled that console again. it's ->console_seq is
> > 	// probably behind the rest of consoles, so console_unlock()
> > 	// will ->write() all the unseen message to this console.
> > 
> > 	console_unlock()
> > 	:	for_each_console(con)
> > 	:		while (con->console_seq != log_next_seq) {
> > 	:			msg_print_text();
> > 	:			con->console_seq++;
> > 	:		
> > 	:			call_console_drivers()
> > 	:			:	if (con->flags & CON_ENABLED)
> > 	:			:		con->write()
> > 	:		}
> > 	:	up(console_sem);
> > 
> 
> ok, obviously stupid.
> 
> I meant to hold con->lock between console_disable() and console_enable().
> so no other CPU can unregister it, etc. printk->console_unlock(), thus,
> can either have a racy con->flags check (no con->lock taken) or try
> something like down_trylock(&con->lock): if it fails, continue.

I don't think you need the CON_ENABLED flag, just holding the per-console
lock should be enough (I hope). Or what exactly is the idea behind this.
I'm also not sure whether dropping the main console_lock is a good idea.

What I had in mind for the printk look is to not even hold the main
console_lock, but only grab the individual console_locks (plus the printk
buffer spinlock ofc), so

	for_each_console(con)
		if (!mutex_trylock(con->mutex))
			continue;

		/* pseudo-code, whatever we need to check to make sure
		 * this console is real and fully registered. */
		if (!(con->flags & CON_ENABLED))
			continue;

		if (con_requires_kthread(con)) {
			wake_up(con->printk_wq);

			/* this is for consoles that grab massive amounts
			 * of locks, like fbcon. We could repurpose klogd
			 * for this perhaps. */

			continue;
		}

		/* do the actual printing business */
	}

Very rough pseudo-code draft without looking at the details. The things
we'd need to do to get there:

- Audit _all_ the places that use console_lock to protect global data
  structures. Excessively sprinkle lockdep_assert_held(&console_lock_map);
  over them to make sure we don't break stuff. We'll probably want to
  stuff that lockdep assert into for_each_console (and have a
  special/open-coded one for the printk loop).

- Add con->mutex. Make sure that lock properly serializes against
  the last step in register_console and the first step in
  unregister_console. CON_ENABLED seems like the critical flag.

- Wrap all call to console callbacks in a mutex_lock(con->mutex) critical
  sections.

- Sprinkle lockdep_assert_held(con->mutex) into all the console callbacks
  of a few common console backends (fbcon + serial should be enough), to
  make sure that part is solid.

- Do the above changes in the printk loop. It also needs to be extracted
  from console_unlock so that we can replace the

	if (console_try_lock())
		console_unlock();

  pattern for pushing out the printk buffer with maybe a new
  printk_flush() function, which does _not_ try to acquire console_lock.

- console_unlock() still needs to flush out the printk buffers, to make
  sure we haven't lost any lines. Or we'll rely on klogd to ensure
  everything gets printed, when the trylock path doesn't work out.

I think this would give us a reasonable locking design, allows us to not
stall on slow consoles when trying to dump emergency output to serial, and
it would decouple printk entirely from the huge console_lock mess. And as
long as we carefully audit for global stuff (everywhere, not just in
printk.c) in the first step I think it should be a safe transition.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
@ 2017-07-11  7:50               ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-11  7:50 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Jan Kara, Linux MM, Tetsuo Handa, Daniel Vetter,
	dri-devel, Steven Rostedt, Michal Hocko, Sergey Senozhatsky,
	Andreas Mohr, Pavel Machek

On Tue, Jul 11, 2017 at 01:57:10PM +0900, Sergey Senozhatsky wrote:
> On (07/11/17 11:31), Sergey Senozhatsky wrote:
> [..]
> > (replying to both Petr and Daniel)
> > 
> > interesting direction, gents.
> > 
> > and this is what I thought about over the weekend; it's very sketchy and
> > I didn't spend too much time on it. (I'm on a sick leave now, sorry).
> > 
> > it's quite close to what you guys have mentioned above.
> > 
> > a) keep console_sem only to protect console drivers list modification
> > b) add a semaphore/mutex to struct console
> > c) move global console_seq/etc to struct console
> > e) use a single kthread for printing, but do console_unlock() multi passes,
> >    printing unseen logbuf messages on per-console basis
> > 
> > 
> > so console_lock()/console_unlock() will simply protect console drivers
> > list from concurrent manipulation; it will not prevent us from printing.
> > now, there are places where console_lock() serves a special purpose - it
> > makes sure that no new lines are printed to the console while we scroll
> > it/flip it/etc. IOW while we do "some things" to a particular console.
> > the problem here, is that this also blocks printing to all of the registered
> > console drivers, not just the one we are touching now. therefore, what I was
> > thinking about is to disable/enable that particular console in all of the
> > places where we really want to stop printing to this console for a bit.
> > 
> > IOW, something like
> > 
> > 
> > 
> > 	console_lock()
> > 	:	down(console_sem);
> > 
> > 	console_disable(con)
> > 	:	lock(con->lock);
> > 	:	con->flags &= ~CON_ENABLED;
> > 	:	unlock(con->lock)
> > 
> > 	console_unlock()
> > 	:	for_each_console(con)
> > 	:		while (con->console_seq != log_next_seq) {
> > 	:			msg_print_text();
> > 	:			con->console_seq++;
> > 	:		
> > 	:			call_console_drivers()
> > 	:			:	if (con->flags & CON_ENABLED)
> > 	:			:		con->write()
> > 	:		}
> > 	:	up(console_sem);
> > 
> > 
> > 	// do "some things" to this console. it's disabled, so no
> > 	// ->write() callback would be called in the meantime
> > 
> > 	console_lock()
> > 	:	down(console_sem);
> > 
> > 	console_enable(con)
> > 	:	lock(con->lock);
> > 	:	con->flags |= CON_ENABLED;
> > 	:	unlock(con->lock)
> > 
> > 
> > 	// so now we enabled that console again. it's ->console_seq is
> > 	// probably behind the rest of consoles, so console_unlock()
> > 	// will ->write() all the unseen message to this console.
> > 
> > 	console_unlock()
> > 	:	for_each_console(con)
> > 	:		while (con->console_seq != log_next_seq) {
> > 	:			msg_print_text();
> > 	:			con->console_seq++;
> > 	:		
> > 	:			call_console_drivers()
> > 	:			:	if (con->flags & CON_ENABLED)
> > 	:			:		con->write()
> > 	:		}
> > 	:	up(console_sem);
> > 
> 
> ok, obviously stupid.
> 
> I meant to hold con->lock between console_disable() and console_enable().
> so no other CPU can unregister it, etc. printk->console_unlock(), thus,
> can either have a racy con->flags check (no con->lock taken) or try
> something like down_trylock(&con->lock): if it fails, continue.

I don't think you need the CON_ENABLED flag, just holding the per-console
lock should be enough (I hope). Or what exactly is the idea behind this.
I'm also not sure whether dropping the main console_lock is a good idea.

What I had in mind for the printk look is to not even hold the main
console_lock, but only grab the individual console_locks (plus the printk
buffer spinlock ofc), so

	for_each_console(con)
		if (!mutex_trylock(con->mutex))
			continue;

		/* pseudo-code, whatever we need to check to make sure
		 * this console is real and fully registered. */
		if (!(con->flags & CON_ENABLED))
			continue;

		if (con_requires_kthread(con)) {
			wake_up(con->printk_wq);

			/* this is for consoles that grab massive amounts
			 * of locks, like fbcon. We could repurpose klogd
			 * for this perhaps. */

			continue;
		}

		/* do the actual printing business */
	}

Very rough pseudo-code draft without looking at the details. The things
we'd need to do to get there:

- Audit _all_ the places that use console_lock to protect global data
  structures. Excessively sprinkle lockdep_assert_held(&console_lock_map);
  over them to make sure we don't break stuff. We'll probably want to
  stuff that lockdep assert into for_each_console (and have a
  special/open-coded one for the printk loop).

- Add con->mutex. Make sure that lock properly serializes against
  the last step in register_console and the first step in
  unregister_console. CON_ENABLED seems like the critical flag.

- Wrap all call to console callbacks in a mutex_lock(con->mutex) critical
  sections.

- Sprinkle lockdep_assert_held(con->mutex) into all the console callbacks
  of a few common console backends (fbcon + serial should be enough), to
  make sure that part is solid.

- Do the above changes in the printk loop. It also needs to be extracted
  from console_unlock so that we can replace the

	if (console_try_lock())
		console_unlock();

  pattern for pushing out the printk buffer with maybe a new
  printk_flush() function, which does _not_ try to acquire console_lock.

- console_unlock() still needs to flush out the printk buffers, to make
  sure we haven't lost any lines. Or we'll rely on klogd to ensure
  everything gets printed, when the trylock path doesn't work out.

I think this would give us a reasonable locking design, allows us to not
stall on slow consoles when trying to dump emergency output to serial, and
it would decouple printk entirely from the huge console_lock mess. And as
long as we carefully audit for global stuff (everywhere, not just in
printk.c) in the first step I think it should be a safe transition.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-11  7:50               ` Daniel Vetter
  (?)
@ 2017-07-11  9:48               ` Sergey Senozhatsky
  2017-07-11 11:36                   ` Daniel Vetter
  -1 siblings, 1 reply; 17+ messages in thread
From: Sergey Senozhatsky @ 2017-07-11  9:48 UTC (permalink / raw)
  To: Sergey Senozhatsky, Petr Mladek, Tetsuo Handa,
	Sergey Senozhatsky, Michal Hocko, Pavel Machek, Steven Rostedt,
	Andreas Mohr, Jan Kara, dri-devel, Linux MM

On (07/11/17 09:50), Daniel Vetter wrote:
[..]
> > ok, obviously stupid.
> > 
> > I meant to hold con->lock between console_disable() and console_enable().
> > so no other CPU can unregister it, etc. printk->console_unlock(), thus,
> > can either have a racy con->flags check (no con->lock taken) or try
> > something like down_trylock(&con->lock): if it fails, continue.
> 
> I don't think you need the CON_ENABLED flag, just holding the per-console
> lock should be enough (I hope). Or what exactly is the idea behind this.
> I'm also not sure whether dropping the main console_lock is a good idea.

CON_ENABLED thing is completely broken, yes. my apologies.
what I really wanted to think about was something as below

	vprintk_emit()
	 down_trylock(console_sem)
	  console_unlock()

console_unlock() runs under console_sem, but in order to ->write() it
does down_trylock(con->lock) on every console.

the functions that modify consoles do:

    down(console_sem)
    down(con->lock);
    up(console_sem);

    "do things to con"

    up(con->lock);

    down_trylock(console_sem) // if it fails then someone else will do the printing
     console_unlock();

so console_unlock() will "pass on" those "frozen" consoles. the next time
we are in console_unlock() again, we will notice that console has its "seen
message IDX" behind the current log idx and will flush the (if console
semaphore will be available).

so the loop is

	for_each_console (con) {
		if (!down_trylock(con->lock))
			continue;

		while (con->console_seq != log_next_seq) {
			msg_print_text();
			con->console_seq++;

			if (!(con->flags & CON_ENABLED))
				break;
			if (!con->write)
				break;
			if (!cpu_online(smp_processor_id()) &&
			    !(con->flags & CON_ANYTIME))
				break;
			if (con->flags & CON_EXTENDED)
				con->write(con, ext_text, ext_len);
			else
				con->write(con, text, len);
		}

		up(con->lock);
	}

	up(console_sem);

> What I had in mind for the printk look is to not even hold the main
> console_lock, but only grab the individual console_locks (plus the printk
> buffer spinlock ofc), so

may be console_sem won't be needed for printk at all. need to think
more. I think I just wanted to jump over all those "suspend all console
drivers" etc for hibernate and other cases that we might be missing
at the moment.


so, no big and heavy console manipulations should be performed under
console_sem. we take console_sem briefly, find the right console, take
its ->lock, unlock console_sem. from now on the we the console, nothing
else should be able to concurrently un-register it, etc. etc.

> 	for_each_console(con)
> 		if (!mutex_trylock(con->mutex))
> 			continue;
> 
> 		/* pseudo-code, whatever we need to check to make sure
> 		 * this console is real and fully registered. */
> 		if (!(con->flags & CON_ENABLED))
> 			continue;
> 
> 		if (con_requires_kthread(con)) {
> 			wake_up(con->printk_wq);
> 
> 			/* this is for consoles that grab massive amounts
> 			 * of locks, like fbcon. We could repurpose klogd
> 			 * for this perhaps. */
> 
> 			continue;
> 		}
> 
> 		/* do the actual printing business */
> 	}

hm... ok. very close, but not exactly what I was thinking about.
may be guys your ideas are better. per-console printing kthread,
hm. interesting.

> Very rough pseudo-code draft without looking at the details. The things
> we'd need to do to get there:
> 
> - Audit _all_ the places that use console_lock to protect global data
>   structures. Excessively sprinkle lockdep_assert_held(&console_lock_map);
>   over them to make sure we don't break stuff. We'll probably want to
>   stuff that lockdep assert into for_each_console (and have a
>   special/open-coded one for the printk loop).
> 
> - Add con->mutex. Make sure that lock properly serializes against
>   the last step in register_console and the first step in
>   unregister_console. CON_ENABLED seems like the critical flag.

a silly side-note,
we must be able to console_unlock() from IRQ. mutex_trylock() cannot
be used in atomic, because it might sleep, unlike semaphore.

> - Wrap all call to console callbacks in a mutex_lock(con->mutex) critical
>   sections.
> 
> - Sprinkle lockdep_assert_held(con->mutex) into all the console callbacks
>   of a few common console backends (fbcon + serial should be enough), to
>   make sure that part is solid.
> 
> - Do the above changes in the printk loop. It also needs to be extracted
>   from console_unlock so that we can replace the
> 
> 	if (console_try_lock())
> 		console_unlock();
> 
>   pattern for pushing out the printk buffer with maybe a new
>   printk_flush() function, which does _not_ try to acquire console_lock.
> 
> - console_unlock() still needs to flush out the printk buffers, to make
>   sure we haven't lost any lines. Or we'll rely on klogd to ensure
>   everything gets printed, when the trylock path doesn't work out.
> 
> I think this would give us a reasonable locking design, allows us to not
> stall on slow consoles when trying to dump emergency output to serial, and
> it would decouple printk entirely from the huge console_lock mess. And as
> long as we carefully audit for global stuff (everywhere, not just in
> printk.c) in the first step I think it should be a safe transition.

many thanks for the inputs! I'll think more about it.

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
  2017-07-11  9:48               ` Sergey Senozhatsky
@ 2017-07-11 11:36                   ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-11 11:36 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Tetsuo Handa, Sergey Senozhatsky, Michal Hocko,
	Pavel Machek, Steven Rostedt, Andreas Mohr, Jan Kara, dri-devel,
	Linux MM

On Tue, Jul 11, 2017 at 11:48 AM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
> On (07/11/17 09:50), Daniel Vetter wrote:
> [..]
>> > ok, obviously stupid.
>> >
>> > I meant to hold con->lock between console_disable() and console_enable().
>> > so no other CPU can unregister it, etc. printk->console_unlock(), thus,
>> > can either have a racy con->flags check (no con->lock taken) or try
>> > something like down_trylock(&con->lock): if it fails, continue.
>>
>> I don't think you need the CON_ENABLED flag, just holding the per-console
>> lock should be enough (I hope). Or what exactly is the idea behind this.
>> I'm also not sure whether dropping the main console_lock is a good idea.
>
> CON_ENABLED thing is completely broken, yes. my apologies.
> what I really wanted to think about was something as below
>
>         vprintk_emit()
>          down_trylock(console_sem)
>           console_unlock()
>
> console_unlock() runs under console_sem, but in order to ->write() it
> does down_trylock(con->lock) on every console.
>
> the functions that modify consoles do:
>
>     down(console_sem)
>     down(con->lock);
>     up(console_sem);
>
>     "do things to con"
>
>     up(con->lock);
>
>     down_trylock(console_sem) // if it fails then someone else will do the printing
>      console_unlock();
>
> so console_unlock() will "pass on" those "frozen" consoles. the next time
> we are in console_unlock() again, we will notice that console has its "seen
> message IDX" behind the current log idx and will flush the (if console
> semaphore will be available).

I think the problem with that is that console_sem is such a huge outer
lock that dropping it is currently not possible. At least not until
we've pushed it back down into register_console and friends, through
the entire fbcon layer. That's why I think adding console->lock as an
entirely inner lock is the only approach we can do right now. But that
means printk writing must be extremely careful, and stop relying on
console_sem completely. I think that is doable, but a bit of work
(mostly in annotating all the current stuff protected by console_sem).

> so the loop is
>
>         for_each_console (con) {
>                 if (!down_trylock(con->lock))
>                         continue;
>
>                 while (con->console_seq != log_next_seq) {
>                         msg_print_text();
>                         con->console_seq++;
>
>                         if (!(con->flags & CON_ENABLED))
>                                 break;
>                         if (!con->write)
>                                 break;
>                         if (!cpu_online(smp_processor_id()) &&
>                             !(con->flags & CON_ANYTIME))
>                                 break;

I think the above checks don't need to be done for every msg? Or do I
misunderstand something here? As long as we hold con->lock, things
shouldn't change.

>                         if (con->flags & CON_EXTENDED)
>                                 con->write(con, ext_text, ext_len);
>                         else
>                                 con->write(con, text, len);
>                 }
>
>                 up(con->lock);
>         }
>
>         up(console_sem);
>
>> What I had in mind for the printk look is to not even hold the main
>> console_lock, but only grab the individual console_locks (plus the printk
>> buffer spinlock ofc), so
>
> may be console_sem won't be needed for printk at all. need to think
> more. I think I just wanted to jump over all those "suspend all console
> drivers" etc for hibernate and other cases that we might be missing
> at the moment.
>
>
> so, no big and heavy console manipulations should be performed under
> console_sem. we take console_sem briefly, find the right console, take
> its ->lock, unlock console_sem. from now on the we the console, nothing
> else should be able to concurrently un-register it, etc. etc.

As-is (i.e. without rewriting fbcon/fbdev) you can't assume that
console_sem is only held briefly. That's why I think any printk
redesign needs to be entirely uncoupled from console_sem. And I think
that's doable (with enough care).

>>       for_each_console(con)
>>               if (!mutex_trylock(con->mutex))
>>                       continue;
>>
>>               /* pseudo-code, whatever we need to check to make sure
>>                * this console is real and fully registered. */
>>               if (!(con->flags & CON_ENABLED))
>>                       continue;
>>
>>               if (con_requires_kthread(con)) {
>>                       wake_up(con->printk_wq);
>>
>>                       /* this is for consoles that grab massive amounts
>>                        * of locks, like fbcon. We could repurpose klogd
>>                        * for this perhaps. */
>>
>>                       continue;
>>               }
>>
>>               /* do the actual printing business */
>>       }
>
> hm... ok. very close, but not exactly what I was thinking about.
> may be guys your ideas are better. per-console printing kthread,
> hm. interesting.

I don't think we want a per-console kthread for everything, because
that would delay serial console (and other very simple consoles that
don't drag in an entire locking tree). But if you have something like
fbcon in your console list there's a _very_ high chance that ->write
or ->unblank (not 100% on the exact callchains here) will grab some
random heavy-weight lock from a completely different subsystem and
either deadlock, or cause massive delays. Per-console kthread would
work around all these problems, as long as we make the critical path
of printk _never_ attempt to take the console_sem (not even with a
trylock imo, since then you'd get delays by retrying from klogd
eventually).

>> Very rough pseudo-code draft without looking at the details. The things
>> we'd need to do to get there:
>>
>> - Audit _all_ the places that use console_lock to protect global data
>>   structures. Excessively sprinkle lockdep_assert_held(&console_lock_map);
>>   over them to make sure we don't break stuff. We'll probably want to
>>   stuff that lockdep assert into for_each_console (and have a
>>   special/open-coded one for the printk loop).
>>
>> - Add con->mutex. Make sure that lock properly serializes against
>>   the last step in register_console and the first step in
>>   unregister_console. CON_ENABLED seems like the critical flag.
>
> a silly side-note,
> we must be able to console_unlock() from IRQ. mutex_trylock() cannot
> be used in atomic, because it might sleep, unlike semaphore.

TIL. mutex_trylock not working from atomic feels super-silly. Why is
that? That would mean all the locking for console->lock needs to be a
semaphore, and we also need to hand-roll the lockdep annotations.
Yuck.

We do need a lock that you can sleep under, because some console
drivers will have to do that while holding that lock (hence also
per-console kthread).

>> - Wrap all call to console callbacks in a mutex_lock(con->mutex) critical
>>   sections.
>>
>> - Sprinkle lockdep_assert_held(con->mutex) into all the console callbacks
>>   of a few common console backends (fbcon + serial should be enough), to
>>   make sure that part is solid.
>>
>> - Do the above changes in the printk loop. It also needs to be extracted
>>   from console_unlock so that we can replace the
>>
>>       if (console_try_lock())
>>               console_unlock();
>>
>>   pattern for pushing out the printk buffer with maybe a new
>>   printk_flush() function, which does _not_ try to acquire console_lock.
>>
>> - console_unlock() still needs to flush out the printk buffers, to make
>>   sure we haven't lost any lines. Or we'll rely on klogd to ensure
>>   everything gets printed, when the trylock path doesn't work out.
>>
>> I think this would give us a reasonable locking design, allows us to not
>> stall on slow consoles when trying to dump emergency output to serial, and
>> it would decouple printk entirely from the huge console_lock mess. And as
>> long as we carefully audit for global stuff (everywhere, not just in
>> printk.c) in the first step I think it should be a safe transition.
>
> many thanks for the inputs! I'll think more about it.

I'm super happy that finally someone is looking into fixing this chaos
for real. It has hurt us for a long time in gfx-land.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations?
@ 2017-07-11 11:36                   ` Daniel Vetter
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Vetter @ 2017-07-11 11:36 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Petr Mladek, Jan Kara, Linux MM, Tetsuo Handa, dri-devel,
	Steven Rostedt, Michal Hocko, Sergey Senozhatsky, Andreas Mohr,
	Pavel Machek

On Tue, Jul 11, 2017 at 11:48 AM, Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
> On (07/11/17 09:50), Daniel Vetter wrote:
> [..]
>> > ok, obviously stupid.
>> >
>> > I meant to hold con->lock between console_disable() and console_enable().
>> > so no other CPU can unregister it, etc. printk->console_unlock(), thus,
>> > can either have a racy con->flags check (no con->lock taken) or try
>> > something like down_trylock(&con->lock): if it fails, continue.
>>
>> I don't think you need the CON_ENABLED flag, just holding the per-console
>> lock should be enough (I hope). Or what exactly is the idea behind this.
>> I'm also not sure whether dropping the main console_lock is a good idea.
>
> CON_ENABLED thing is completely broken, yes. my apologies.
> what I really wanted to think about was something as below
>
>         vprintk_emit()
>          down_trylock(console_sem)
>           console_unlock()
>
> console_unlock() runs under console_sem, but in order to ->write() it
> does down_trylock(con->lock) on every console.
>
> the functions that modify consoles do:
>
>     down(console_sem)
>     down(con->lock);
>     up(console_sem);
>
>     "do things to con"
>
>     up(con->lock);
>
>     down_trylock(console_sem) // if it fails then someone else will do the printing
>      console_unlock();
>
> so console_unlock() will "pass on" those "frozen" consoles. the next time
> we are in console_unlock() again, we will notice that console has its "seen
> message IDX" behind the current log idx and will flush the (if console
> semaphore will be available).

I think the problem with that is that console_sem is such a huge outer
lock that dropping it is currently not possible. At least not until
we've pushed it back down into register_console and friends, through
the entire fbcon layer. That's why I think adding console->lock as an
entirely inner lock is the only approach we can do right now. But that
means printk writing must be extremely careful, and stop relying on
console_sem completely. I think that is doable, but a bit of work
(mostly in annotating all the current stuff protected by console_sem).

> so the loop is
>
>         for_each_console (con) {
>                 if (!down_trylock(con->lock))
>                         continue;
>
>                 while (con->console_seq != log_next_seq) {
>                         msg_print_text();
>                         con->console_seq++;
>
>                         if (!(con->flags & CON_ENABLED))
>                                 break;
>                         if (!con->write)
>                                 break;
>                         if (!cpu_online(smp_processor_id()) &&
>                             !(con->flags & CON_ANYTIME))
>                                 break;

I think the above checks don't need to be done for every msg? Or do I
misunderstand something here? As long as we hold con->lock, things
shouldn't change.

>                         if (con->flags & CON_EXTENDED)
>                                 con->write(con, ext_text, ext_len);
>                         else
>                                 con->write(con, text, len);
>                 }
>
>                 up(con->lock);
>         }
>
>         up(console_sem);
>
>> What I had in mind for the printk look is to not even hold the main
>> console_lock, but only grab the individual console_locks (plus the printk
>> buffer spinlock ofc), so
>
> may be console_sem won't be needed for printk at all. need to think
> more. I think I just wanted to jump over all those "suspend all console
> drivers" etc for hibernate and other cases that we might be missing
> at the moment.
>
>
> so, no big and heavy console manipulations should be performed under
> console_sem. we take console_sem briefly, find the right console, take
> its ->lock, unlock console_sem. from now on the we the console, nothing
> else should be able to concurrently un-register it, etc. etc.

As-is (i.e. without rewriting fbcon/fbdev) you can't assume that
console_sem is only held briefly. That's why I think any printk
redesign needs to be entirely uncoupled from console_sem. And I think
that's doable (with enough care).

>>       for_each_console(con)
>>               if (!mutex_trylock(con->mutex))
>>                       continue;
>>
>>               /* pseudo-code, whatever we need to check to make sure
>>                * this console is real and fully registered. */
>>               if (!(con->flags & CON_ENABLED))
>>                       continue;
>>
>>               if (con_requires_kthread(con)) {
>>                       wake_up(con->printk_wq);
>>
>>                       /* this is for consoles that grab massive amounts
>>                        * of locks, like fbcon. We could repurpose klogd
>>                        * for this perhaps. */
>>
>>                       continue;
>>               }
>>
>>               /* do the actual printing business */
>>       }
>
> hm... ok. very close, but not exactly what I was thinking about.
> may be guys your ideas are better. per-console printing kthread,
> hm. interesting.

I don't think we want a per-console kthread for everything, because
that would delay serial console (and other very simple consoles that
don't drag in an entire locking tree). But if you have something like
fbcon in your console list there's a _very_ high chance that ->write
or ->unblank (not 100% on the exact callchains here) will grab some
random heavy-weight lock from a completely different subsystem and
either deadlock, or cause massive delays. Per-console kthread would
work around all these problems, as long as we make the critical path
of printk _never_ attempt to take the console_sem (not even with a
trylock imo, since then you'd get delays by retrying from klogd
eventually).

>> Very rough pseudo-code draft without looking at the details. The things
>> we'd need to do to get there:
>>
>> - Audit _all_ the places that use console_lock to protect global data
>>   structures. Excessively sprinkle lockdep_assert_held(&console_lock_map);
>>   over them to make sure we don't break stuff. We'll probably want to
>>   stuff that lockdep assert into for_each_console (and have a
>>   special/open-coded one for the printk loop).
>>
>> - Add con->mutex. Make sure that lock properly serializes against
>>   the last step in register_console and the first step in
>>   unregister_console. CON_ENABLED seems like the critical flag.
>
> a silly side-note,
> we must be able to console_unlock() from IRQ. mutex_trylock() cannot
> be used in atomic, because it might sleep, unlike semaphore.

TIL. mutex_trylock not working from atomic feels super-silly. Why is
that? That would mean all the locking for console->lock needs to be a
semaphore, and we also need to hand-roll the lockdep annotations.
Yuck.

We do need a lock that you can sleep under, because some console
drivers will have to do that while holding that lock (hence also
per-console kthread).

>> - Wrap all call to console callbacks in a mutex_lock(con->mutex) critical
>>   sections.
>>
>> - Sprinkle lockdep_assert_held(con->mutex) into all the console callbacks
>>   of a few common console backends (fbcon + serial should be enough), to
>>   make sure that part is solid.
>>
>> - Do the above changes in the printk loop. It also needs to be extracted
>>   from console_unlock so that we can replace the
>>
>>       if (console_try_lock())
>>               console_unlock();
>>
>>   pattern for pushing out the printk buffer with maybe a new
>>   printk_flush() function, which does _not_ try to acquire console_lock.
>>
>> - console_unlock() still needs to flush out the printk buffers, to make
>>   sure we haven't lost any lines. Or we'll rely on klogd to ensure
>>   everything gets printed, when the trylock path doesn't work out.
>>
>> I think this would give us a reasonable locking design, allows us to not
>> stall on slow consoles when trying to dump emergency output to serial, and
>> it would decouple printk entirely from the huge console_lock mess. And as
>> long as we carefully audit for global stuff (everywhere, not just in
>> printk.c) in the first step I think it should be a safe transition.
>
> many thanks for the inputs! I'll think more about it.

I'm super happy that finally someone is looking into fixing this chaos
for real. It has hurt us for a long time in gfx-land.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-07-11 11:36 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-06 10:28 printk: Should console related code avoid __GFP_DIRECT_RECLAIM memory allocations? Tetsuo Handa
2017-07-07  2:39 ` Sergey Senozhatsky
2017-07-07  8:41   ` Daniel Vetter
2017-07-07  8:41     ` Daniel Vetter
2017-07-08 13:30   ` Tetsuo Handa
2017-07-10  5:07     ` Sergey Senozhatsky
2017-07-10 12:59     ` Petr Mladek
2017-07-10 18:07       ` Daniel Vetter
2017-07-10 18:07         ` Daniel Vetter
2017-07-11  2:31         ` Sergey Senozhatsky
2017-07-11  4:57           ` Sergey Senozhatsky
2017-07-11  7:50             ` Daniel Vetter
2017-07-11  7:50               ` Daniel Vetter
2017-07-11  9:48               ` Sergey Senozhatsky
2017-07-11 11:36                 ` Daniel Vetter
2017-07-11 11:36                   ` Daniel Vetter
2017-07-10 13:33   ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.