linux-kernel.vger.kernel.org archive mirror
* [REGRESSION] funny sched_domain build failure during resume
@ 2014-05-09 16:04 Tejun Heo
  2014-05-09 16:15 ` Peter Zijlstra
  2014-05-14 14:00 ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Tejun Heo @ 2014-05-09 16:04 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: linux-kernel, Johannes Weiner, Rafael J. Wysocki

Hello, guys.

So, after resuming from suspend, I found that my build jobs cannot migrate
away from the CPUs they started on and are thus only making use of a single
core.  It turns out the scheduler failed to build the sched domains due to
an order-3 allocation failure.

 systemd-sleep: page allocation failure: order:3, mode:0x104010
 CPU: 0 PID: 11648 Comm: systemd-sleep Not tainted 3.14.2-200.fc20.x86_64 #1
 Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
  0000000000000000 000000001bc36890 ffff88009c2d5958 ffffffff816eec92
  0000000000104010 ffff88009c2d59e8 ffffffff8117a32a 0000000000000000
  ffff88021efe6b00 0000000000000003 0000000000104010 ffff88009c2d59e8
 Call Trace:
  [<ffffffff816eec92>] dump_stack+0x45/0x56
  [<ffffffff8117a32a>] warn_alloc_failed+0xfa/0x170
  [<ffffffff8117e8f5>] __alloc_pages_nodemask+0x8e5/0xb00
  [<ffffffff811c0ce3>] alloc_pages_current+0xa3/0x170
  [<ffffffff811796a4>] __get_free_pages+0x14/0x50
  [<ffffffff8119823e>] kmalloc_order_trace+0x2e/0xa0
  [<ffffffff810c033f>] build_sched_domains+0x1ff/0xcc0
  [<ffffffff810c123e>] partition_sched_domains+0x35e/0x3d0
  [<ffffffff811168e7>] cpuset_update_active_cpus+0x17/0x40
  [<ffffffff810c130a>] cpuset_cpu_active+0x5a/0x70
  [<ffffffff816f9f4c>] notifier_call_chain+0x4c/0x70
  [<ffffffff810b2a1e>] __raw_notifier_call_chain+0xe/0x10
  [<ffffffff8108a413>] cpu_notify+0x23/0x50
  [<ffffffff8108a678>] _cpu_up+0x188/0x1a0
  [<ffffffff816e1783>] enable_nonboot_cpus+0x93/0xf0
  [<ffffffff810d9d45>] suspend_devices_and_enter+0x325/0x450
  [<ffffffff810d9fe8>] pm_suspend+0x178/0x260
  [<ffffffff810d8e79>] state_store+0x79/0xf0
  [<ffffffff81355bdf>] kobj_attr_store+0xf/0x20
  [<ffffffff81262c4d>] sysfs_kf_write+0x3d/0x50
  [<ffffffff81266b12>] kernfs_fop_write+0xd2/0x140
  [<ffffffff811e964a>] vfs_write+0xba/0x1e0
  [<ffffffff811ea0a5>] SyS_write+0x55/0xd0
  [<ffffffff816ff029>] system_call_fastpath+0x16/0x1b

The allocation is from alloc_rootdomain().

	struct root_domain *rd;

	rd = kmalloc(sizeof(*rd), GFP_KERNEL);

The thing is, the system has plenty of reclaimable memory and shouldn't
have any trouble satisfying a single GFP_KERNEL order-3 allocation.  The
problem is that this happens during resume, before the devices have been
woken up, so pm_restrict_gfp_mask() punches out GFP_IOFS from all
allocation masks.  That leaves the page allocator with just __GFP_WAIT to
work with and, with enough bad luck, it fails as expected.
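
For reference, the restriction works roughly like this (paraphrased from
memory of the 3.14 mm/page_alloc.c, so not a verbatim quote):

	/*
	 * The PM core masks every allocation with gfp_allowed_mask
	 * around suspend/resume; GFP_IOFS is __GFP_IO | __GFP_FS.
	 */
	void pm_restrict_gfp_mask(void)		/* before devices go to sleep */
	{
		saved_gfp_mask = gfp_allowed_mask;
		gfp_allowed_mask &= ~GFP_IOFS;
	}

	/* ... and in __alloc_pages_nodemask(): */
	gfp_mask &= gfp_allowed_mask;

and, if I read the suspend code right, pm_restore_gfp_mask() only puts
GFP_IOFS back after suspend_devices_and_enter() returns, i.e. after the
nonboot CPUs have already been brought back online.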

The problem has always been there but seems to have been exposed by
the addition of deadline scheduler support, which added cpudl to
root_domain and grew it by around 20k bytes on my setup, so an order-3
allocation is now needed during CPU online.
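
Back-of-the-envelope, assuming a distro config with CONFIG_NR_CPUS=1024
(I haven't double-checked the exact Fedora value, so take the numbers as
illustrative only):

	struct array_item { u64 dl; int cpu; };	/* 16 bytes with padding */

	cpudl.elements[NR_CPUS]:	1024 * 16 = 16384 bytes
	cpudl.cpu_to_idx[NR_CPUS]:	1024 *  4 =  4096 bytes
						    ~20k bytes total

which by itself blows past the 16k order-2 limit, so the whole root_domain
gets rounded up to an order-3 (32k) request straight from the page
allocator (hence the kmalloc_order_trace() frame in the trace above).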

It looks like the allocation is for a temp buffer and there are also
percpu allocations going on.  Maybe just allocate the buffers on boot
and keep them around?

Kudos to Johannes for helping decipher the mm debug messages.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-09 16:04 [REGRESSION] funny sched_domain build failure during resume Tejun Heo
@ 2014-05-09 16:15 ` Peter Zijlstra
  2014-05-14 14:00 ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-09 16:15 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 358 bytes --]

On Fri, May 09, 2014 at 12:04:55PM -0400, Tejun Heo wrote:
> It looks like the allocation is for a temp buffer and there are also
> percpu allocations going on.  Maybe just allocate the buffers on boot
> and keep them around?

It's not a scratch buffer, but we could certainly try and keep it around
over suspend/resume.

I'll try and prod at that on Monday.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-09 16:04 [REGRESSION] funny sched_domain build failure during resume Tejun Heo
  2014-05-09 16:15 ` Peter Zijlstra
@ 2014-05-14 14:00 ` Peter Zijlstra
  2014-05-14 14:05   ` Peter Zijlstra
                     ` (2 more replies)
  1 sibling, 3 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-14 14:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 6686 bytes --]

On Fri, May 09, 2014 at 12:04:55PM -0400, Tejun Heo wrote:
> Hello, guys.
> 
> So, after resuming from suspend, I found my build jobs can not migrate
> away from the CPU it started on and thus just making use of single
> core.  It turns out the scheduler failed to build sched domains due to
> order-3 allocation failure.
> 
>  systemd-sleep: page allocation failure: order:3, mode:0x104010
>  CPU: 0 PID: 11648 Comm: systemd-sleep Not tainted 3.14.2-200.fc20.x86_64 #1
>  Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
>   0000000000000000 000000001bc36890 ffff88009c2d5958 ffffffff816eec92
>   0000000000104010 ffff88009c2d59e8 ffffffff8117a32a 0000000000000000
>   ffff88021efe6b00 0000000000000003 0000000000104010 ffff88009c2d59e8
>  Call Trace:
>   [<ffffffff816eec92>] dump_stack+0x45/0x56
>   [<ffffffff8117a32a>] warn_alloc_failed+0xfa/0x170
>   [<ffffffff8117e8f5>] __alloc_pages_nodemask+0x8e5/0xb00
>   [<ffffffff811c0ce3>] alloc_pages_current+0xa3/0x170
>   [<ffffffff811796a4>] __get_free_pages+0x14/0x50
>   [<ffffffff8119823e>] kmalloc_order_trace+0x2e/0xa0
>   [<ffffffff810c033f>] build_sched_domains+0x1ff/0xcc0
>   [<ffffffff810c123e>] partition_sched_domains+0x35e/0x3d0
>   [<ffffffff811168e7>] cpuset_update_active_cpus+0x17/0x40
>   [<ffffffff810c130a>] cpuset_cpu_active+0x5a/0x70
>   [<ffffffff816f9f4c>] notifier_call_chain+0x4c/0x70
>   [<ffffffff810b2a1e>] __raw_notifier_call_chain+0xe/0x10
>   [<ffffffff8108a413>] cpu_notify+0x23/0x50
>   [<ffffffff8108a678>] _cpu_up+0x188/0x1a0
>   [<ffffffff816e1783>] enable_nonboot_cpus+0x93/0xf0
>   [<ffffffff810d9d45>] suspend_devices_and_enter+0x325/0x450
>   [<ffffffff810d9fe8>] pm_suspend+0x178/0x260
>   [<ffffffff810d8e79>] state_store+0x79/0xf0
>   [<ffffffff81355bdf>] kobj_attr_store+0xf/0x20
>   [<ffffffff81262c4d>] sysfs_kf_write+0x3d/0x50
>   [<ffffffff81266b12>] kernfs_fop_write+0xd2/0x140
>   [<ffffffff811e964a>] vfs_write+0xba/0x1e0
>   [<ffffffff811ea0a5>] SyS_write+0x55/0xd0
>   [<ffffffff816ff029>] system_call_fastpath+0x16/0x1b
> 
> The allocation is from alloc_rootdomain().
> 
> 	struct root_domain *rd;
> 
> 	rd = kmalloc(sizeof(*rd), GFP_KERNEL);
> 
> The thing is the system has plenty of reclaimable memory and shouldn't
> have any trouble satisfying one GFP_KERNEL order-3 allocation;
> however, the problem is that this is during resume and the devices
> haven't been woken up yet, so pm_restrict_gfp_mask() punches out
> GFP_IOFS from all allocation masks and the page allocator has just
> __GFP_WAIT to work with and, with enough bad luck, fails expectedly.
> 
> The problem has always been there but seems to have been exposed by
> the addition of deadline scheduler support, which added cpudl to
> root_domain making it larger by around 20k bytes on my setup, making
> an order-3 allocation necessary during CPU online.
> 
> It looks like the allocation is for a temp buffer and there are also
> percpu allocations going on.  Maybe just allocate the buffers on boot
> and keep them around?
> 
> Kudos to Johannes for helping deciphering mm debug messages.

Does something like the below help any? I noticed those things (cpudl
and cpupri) had [NR_CPUS] arrays, which is always 'fun'.

The below is a mostly no thought involved conversion of cpudl which
boots, I'll also do cpupri and then actually stare at the algorithms to
see if I didn't make any obvious fails.

Juri?

---
 kernel/sched/cpudeadline.c | 29 +++++++++++++++++++----------
 kernel/sched/cpudeadline.h |  6 +++---
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index ab001b5d5048..c34ab09a790b 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -13,6 +13,7 @@
 
 #include <linux/gfp.h>
 #include <linux/kernel.h>
+#include <linux/slab.h>
 #include "cpudeadline.h"
 
 static inline int parent(int i)
@@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
 
 static void cpudl_exchange(struct cpudl *cp, int a, int b)
 {
-	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
-
 	swap(cp->elements[a], cp->elements[b]);
-	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
 }
 
 static void cpudl_heapify(struct cpudl *cp, int idx)
@@ -140,7 +138,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 	WARN_ON(!cpu_present(cpu));
 
 	raw_spin_lock_irqsave(&cp->lock, flags);
-	old_idx = cp->cpu_to_idx[cpu];
+	old_idx = cp->elements[cpu].idx;
 	if (!is_valid) {
 		/* remove item */
 		if (old_idx == IDX_INVALID) {
@@ -155,8 +153,8 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
 		cp->elements[old_idx].cpu = new_cpu;
 		cp->size--;
-		cp->cpu_to_idx[new_cpu] = old_idx;
-		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		cp->elements[new_cpu].idx = old_idx;
+		cp->elements[cpu].idx = IDX_INVALID;
 		while (old_idx > 0 && dl_time_before(
 				cp->elements[parent(old_idx)].dl,
 				cp->elements[old_idx].dl)) {
@@ -173,7 +171,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
 		cp->size++;
 		cp->elements[cp->size - 1].dl = 0;
 		cp->elements[cp->size - 1].cpu = cpu;
-		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cp->elements[cpu].idx = cp->size - 1;
 		cpudl_change_key(cp, cp->size - 1, dl);
 		cpumask_clear_cpu(cpu, cp->free_cpus);
 	} else {
@@ -195,10 +193,21 @@ int cpudl_init(struct cpudl *cp)
 	memset(cp, 0, sizeof(*cp));
 	raw_spin_lock_init(&cp->lock);
 	cp->size = 0;
-	for (i = 0; i < NR_CPUS; i++)
-		cp->cpu_to_idx[i] = IDX_INVALID;
-	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+
+	cp->elements = kcalloc(num_possible_cpus(),
+			       sizeof(struct cpudl_item),
+			       GFP_KERNEL);
+	if (!cp->elements)
+		return -ENOMEM;
+
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
+		kfree(cp->elements);
 		return -ENOMEM;
+	}
+
+	for_each_possible_cpu(i)
+		cp->elements[i].idx = IDX_INVALID;
+
 	cpumask_setall(cp->free_cpus);
 
 	return 0;
diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
index a202789a412c..538c9796ad4a 100644
--- a/kernel/sched/cpudeadline.h
+++ b/kernel/sched/cpudeadline.h
@@ -5,17 +5,17 @@
 
 #define IDX_INVALID     -1
 
-struct array_item {
+struct cpudl_item {
 	u64 dl;
 	int cpu;
+	int idx;
 };
 
 struct cpudl {
 	raw_spinlock_t lock;
 	int size;
-	int cpu_to_idx[NR_CPUS];
-	struct array_item elements[NR_CPUS];
 	cpumask_var_t free_cpus;
+	struct cpudl_item *elements;
 };
 
 

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-14 14:00 ` Peter Zijlstra
@ 2014-05-14 14:05   ` Peter Zijlstra
  2014-05-14 17:02   ` Tejun Heo
  2014-05-15  8:40   ` Juri Lelli
  2 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-14 14:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1276 bytes --]

On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> The below is a mostly no thought involved conversion of cpudl which
                        ^^^^^^^^^^^^^^^^^^^

> boots, I'll also do cpupri and then actually stare at the algorithms to
> see if I didn't make any obvious fails.
> 
> Juri?
> 
> ---
> @@ -195,10 +193,21 @@ int cpudl_init(struct cpudl *cp)
>  	memset(cp, 0, sizeof(*cp));
>  	raw_spin_lock_init(&cp->lock);
>  	cp->size = 0;
> -	for (i = 0; i < NR_CPUS; i++)
> -		cp->cpu_to_idx[i] = IDX_INVALID;
> -	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
> +
> +	cp->elements = kcalloc(num_possible_cpus(),
> +			       sizeof(struct cpudl_item),
> +			       GFP_KERNEL);
> +	if (!cp->elements)
> +		return -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +		kfree(cp->elements);
>  		return -ENOMEM;
> +	}
> +
> +	for_each_possible_cpu(i)
> +		cp->elements[i].idx = IDX_INVALID;
> +
>  	cpumask_setall(cp->free_cpus);
>  
>  	return 0;

Also add:

---
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -220,4 +220,5 @@ int cpudl_init(struct cpudl *cp)
 void cpudl_cleanup(struct cpudl *cp)
 {
 	free_cpumask_var(cp->free_cpus);
+	kfree(cp->elements);
 }

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-14 14:00 ` Peter Zijlstra
  2014-05-14 14:05   ` Peter Zijlstra
@ 2014-05-14 17:02   ` Tejun Heo
  2014-05-14 17:10     ` Peter Zijlstra
  2014-05-15  8:40   ` Juri Lelli
  2 siblings, 1 reply; 17+ messages in thread
From: Tejun Heo @ 2014-05-14 17:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki,
	Juri Lelli

On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> Does something like the below help any? I noticed those things (cpudl
> and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
> 
> The below is a mostly no thought involved conversion of cpudl which
> boots, I'll also do cpupri and then actually stare at the algorithms to
> see if I didn't make any obvious fails.

Yeah, should avoid large allocation on reasonably sized machines and I
don't think 2k CPU machines suspend regularly.  Prolly good / safe
enough for -stable port?  It'd still be nice to avoid allocations if
possible during online tho, given that the operation happens while mm
is mostly crippled.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-14 17:02   ` Tejun Heo
@ 2014-05-14 17:10     ` Peter Zijlstra
  2014-05-14 22:36       ` Tejun Heo
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-14 17:10 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki,
	Juri Lelli

[-- Attachment #1: Type: text/plain, Size: 1406 bytes --]

On Wed, May 14, 2014 at 01:02:38PM -0400, Tejun Heo wrote:
> On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> > Does something like the below help any? I noticed those things (cpudl
> > and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
> > 
> > The below is a mostly no thought involved conversion of cpudl which
> > boots, I'll also do cpupri and then actually stare at the algorithms to
> > see if I didn't make any obvious fails.
> 
> Yeah, should avoid large allocation on reasonably sized machines and I
> don't think 2k CPU machines suspend regularly.  Prolly good / safe
> enough for -stable port?  

Yeah, it's certainly -stable material. Esp. if this cures the immediate
problem.

> It'd be still nice to avoid allocations if
> possible during online tho given that the operation happens while mm
> is mostly crippled.

Yeah, I started looking at that but that turned out to be slightly more
difficult than I had hoped (got lost in the suspend code). Also avoiding
large-order allocs is good practice regardless.

So probably the easiest way to not free/alloc the entire sched_domain
thing is just keeping it around in its entirety over suspend/resume, as
I think the promise of suspend/resume is that you return to the
status-quo.

But I'll stick it on the todo list after fixing this use-after-free
thing I've been trying to chase down.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-14 17:10     ` Peter Zijlstra
@ 2014-05-14 22:36       ` Tejun Heo
  2014-05-15  8:57         ` Peter Zijlstra
  2014-05-15 14:41         ` Johannes Weiner
  0 siblings, 2 replies; 17+ messages in thread
From: Tejun Heo @ 2014-05-14 22:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki,
	Juri Lelli, Bruce Allan

On Wed, May 14, 2014 at 07:10:51PM +0200, Peter Zijlstra wrote:
> On Wed, May 14, 2014 at 01:02:38PM -0400, Tejun Heo wrote:
> > On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> > > Does something like the below help any? I noticed those things (cpudl
> > > and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
> > > 
> > > The below is a mostly no thought involved conversion of cpudl which
> > > boots, I'll also do cpupri and then actually stare at the algorithms to
> > > see if I didn't make any obvious fails.
> > 
> > Yeah, should avoid large allocation on reasonably sized machines and I
> > don't think 2k CPU machines suspend regularly.  Prolly good / safe
> > enough for -stable port?  
> 
> Yeah, its certainly -stable material. Esp. if this cures the immediate
> problem.

The patches are URL encoded.  I tried to reproduce the problem but haven't
succeeded; still, I'm quite confident about the analysis, so verifying
that the high-order allocation goes away should be enough.

I instrumented the allocator and it looks like we also have other
sources of high-order allocations during resume, before the GFP_IOFS
restriction is lifted: some kthread creations (order-2, probably okay)
and, on my test setup, a series of order-3 allocations from e1000.

Cc'ing Bruce for e1000.  Is it necessary to free and re-allocate
buffers across suspend/resume?  The driver ends up allocating multiple
order-3 regions during resume, when mm doesn't have access to backing
devices and thus can't compact or reclaim, so it's not too unlikely
for those allocations to fail.

I wonder whether we need some generic solution to address the issue.
Unfortunately, I don't think it'd be possible to order device
initialization to bring up backing devices earlier.  We don't really
have that kind of knowledge easily accessible.  Also, I don't think
it's realistic to require drivers to avoid high-order allocations
during resume.

Maybe we should pre-reclaim and set aside some amount of memory to be
used during resume?  It's a mushy solution but could be good enough.
Rafael, Johannes, what do you guys think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-14 14:00 ` Peter Zijlstra
  2014-05-14 14:05   ` Peter Zijlstra
  2014-05-14 17:02   ` Tejun Heo
@ 2014-05-15  8:40   ` Juri Lelli
  2014-05-15  8:51     ` Peter Zijlstra
  2 siblings, 1 reply; 17+ messages in thread
From: Juri Lelli @ 2014-05-15  8:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

Hi,

On Wed, 14 May 2014 16:00:34 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, May 09, 2014 at 12:04:55PM -0400, Tejun Heo wrote:
> > Hello, guys.
> > 
> > So, after resuming from suspend, I found my build jobs can not migrate
> > away from the CPU it started on and thus just making use of single
> > core.  It turns out the scheduler failed to build sched domains due to
> > order-3 allocation failure.
> > 

[snip]

> 
> Does something like the below help any? I noticed those things (cpudl
> and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
>

Yeah, not nice :/.
 
> The below is a mostly no thought involved conversion of cpudl which
> boots, I'll also do cpupri and then actually stare at the algorithms to
> see if I didn't make any obvious fails.
> 
> Juri?
> 
> ---
>  kernel/sched/cpudeadline.c | 29 +++++++++++++++++++----------
>  kernel/sched/cpudeadline.h |  6 +++---
>  2 files changed, 22 insertions(+), 13 deletions(-)
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index ab001b5d5048..c34ab09a790b 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -13,6 +13,7 @@
>  
>  #include <linux/gfp.h>
>  #include <linux/kernel.h>
> +#include <linux/slab.h>
>  #include "cpudeadline.h"
>  
>  static inline int parent(int i)
> @@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
>  
>  static void cpudl_exchange(struct cpudl *cp, int a, int b)
>  {
> -	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> -
>  	swap(cp->elements[a], cp->elements[b]);
> -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
>  }
>  

I think there is a problem here. Your patch "embeds" the cpu_to_idx array
in the elements array, but here the swap operates differently on the two.
Let me try to clarify with a simple example.

On a 4CPU system suppose we have this situation:

                      CPU 1
                      DL 50
                    /       \         
                   /         \
                  X           X
                 /
                /
               X

-- orig

elements[dl/cpu] | 50/1  |  0/0  |  0/0  |  0/0  |
                       ^
                        \
                         \
                          \
cpu_to_idx[idx]  |  -1   |   0   |  -1   |   -1  |

-- yours

elements[dl/cpu] | 50/1  |  0/0  |  0/0  |  0/0  |
                       ^
                        \
                         \
                          \
elements[idx]    |  -1   |   0   |  -1   |   -1  |

So, up to this point all is fine.

But what happens if I call cpudl_set(&cp, 2, 55, 1)?

No surprise: we insert the new element and then try to bring it to the
root (as it has the max deadline). The new situation is:

                      CPU 1
                      DL 50
                    /       \         
                   /         \
           ^    CPU2          X
           |    DL 55
                 /
                /
               X

-- orig

elements[dl/cpu] | 50/1  |  55/2  |  0/0  |  0/0  |
                       ^       ^
                        \       \
                         \       \
                          \       \
cpu_to_idx[idx]  |  -1   |   0   |  1   |   -1  |

-- yours

elements[dl/cpu] | 50/1  |  55/2  |  0/0  |  0/0  |
                       ^       ^
                        \       \
                         \       \
                          \       \
elements[idx]    |  -1   |   0   |  1   |   -1  |

Now, we do cpudl_exchange(&cp, 1, 0).

In orig we have

static void cpudl_exchange(struct cpudl *cp, int a, int b)
{
        int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;

        swap(cp->elements[a], cp->elements[b]);
        swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
}

note that here a = 1, b = 0, cpu_a = 2 and cpu_b = 1.

While in yours

static void cpudl_exchange(struct cpudl *cp, int a, int b)
{
        swap(cp->elements[a], cp->elements[b]);
}

so we end up with

-- orig

elements[dl/cpu] | 55/2  |  50/1  |  0/0  |  0/0  |
                     ^       ^
                     |       |
                     +-------|------+
                             |      |
cpu_to_idx[idx]  |  -1   |   1   |  0   |   -1  |

-- yours

elements[dl/cpu] | 55/2  |  50/1  |  0/0  |  0/0  |
                     ^        ^
                     |        |
                     |        +------+
                     |               |
elements[idx]    |   0   |   -1   |  1   |   -1  |

and this breaks how the heap works. For example, what if I want to
update CPU1's deadline? In orig I correctly find it at position 1 of the
elements array, but with the patch CPU1's idx is IDX_INVALID.
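
If it helps, here is a tiny stand-alone user-space sketch of the above
(the struct mirrors your patch, everything else is made up purely to
illustrate the broken cpu -> idx mapping; it is not kernel code):

#include <stdio.h>

#define IDX_INVALID	-1

struct cpudl_item { unsigned long long dl; int cpu; int idx; };

int main(void)
{
	/*
	 * State right before the heapify-up swap in the example above:
	 * heap pos 0 holds CPU1/dl 50, heap pos 1 holds CPU2/dl 55.
	 */
	struct cpudl_item e[4] = {
		[0] = { .dl = 50, .cpu = 1, .idx = IDX_INVALID },
		[1] = { .dl = 55, .cpu = 2, .idx = 0 },	/* CPU1 -> pos 0 */
		[2] = { .dl =  0, .cpu = 0, .idx = 1 },	/* CPU2 -> pos 1 */
		[3] = { .dl =  0, .cpu = 0, .idx = IDX_INVALID },
	};
	struct cpudl_item tmp;

	/*
	 * cpudl_exchange(cp, 1, 0) as in the patch: a whole-struct swap,
	 * which drags the unrelated idx fields along for the ride.
	 */
	tmp = e[1];
	e[1] = e[0];
	e[0] = tmp;

	/* We want CPU1 -> pos 1 and CPU2 -> pos 0, but... */
	printf("CPU1 idx = %d (want 1), CPU2 idx = %d (want 0)\n",
	       e[1].idx, e[2].idx);	/* prints -1 and 1 */
	return 0;
}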

Sorry for this long reply, but I had to convince myself as well...

So, I think that having just one dynamic array is nicer and better, but
we still need to swap things separately. Something like (on top of
yours):

diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
index 60ad0af..10dc7ab 100644
--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -36,9 +36,31 @@ static inline int dl_time_before(u64 a, u64 b)
        return (s64)(a - b) < 0;
 }
 
+static inline void swap_element(struct cpudl *cp, int a, int b)
+{
+       int cpu_tmp = cp->elements[a].cpu;
+       u64 dl_tmp = cp->elements[a].dl;
+
+       cp->elements[a].cpu = cp->elements[b].cpu;
+       cp->elements[a].dl = cp->elements[b].dl;
+       cp->elements[b].cpu = cpu_tmp;
+       cp->elements[b].dl = dl_tmp;
+}
+
+static inline void swap_idx(struct cpudl *cp, int a, int b)
+{
+       int idx_tmp = cp->elements[a].idx;
+
+       cp->elements[a].idx = cp->elements[b].idx;
+       cp->elements[b].idx = idx_tmp;
+}
+
 static void cpudl_exchange(struct cpudl *cp, int a, int b)
 {
-       swap(cp->elements[a], cp->elements[b]);
+       int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
+
+       swap_element(cp, a, b);
+       swap_idx(cp, cpu_a, cpu_b);
 }
 
 static void cpudl_heapify(struct cpudl *cp, int idx)
---

But I don't know if this is too ugly and whether we'd better go with two
arrays (or a better solution, as this is the fastest thing I could come up
with :)).

In the meantime I'll test this more...

Thanks a lot,

- Juri

>  static void cpudl_heapify(struct cpudl *cp, int idx)
> @@ -140,7 +138,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
>  	WARN_ON(!cpu_present(cpu));
>  
>  	raw_spin_lock_irqsave(&cp->lock, flags);
> -	old_idx = cp->cpu_to_idx[cpu];
> +	old_idx = cp->elements[cpu].idx;
>  	if (!is_valid) {
>  		/* remove item */
>  		if (old_idx == IDX_INVALID) {
> @@ -155,8 +153,8 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
>  		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
>  		cp->elements[old_idx].cpu = new_cpu;
>  		cp->size--;
> -		cp->cpu_to_idx[new_cpu] = old_idx;
> -		cp->cpu_to_idx[cpu] = IDX_INVALID;
> +		cp->elements[new_cpu].idx = old_idx;
> +		cp->elements[cpu].idx = IDX_INVALID;
>  		while (old_idx > 0 && dl_time_before(
>  				cp->elements[parent(old_idx)].dl,
>  				cp->elements[old_idx].dl)) {
> @@ -173,7 +171,7 @@ void cpudl_set(struct cpudl *cp, int cpu, u64 dl, int is_valid)
>  		cp->size++;
>  		cp->elements[cp->size - 1].dl = 0;
>  		cp->elements[cp->size - 1].cpu = cpu;
> -		cp->cpu_to_idx[cpu] = cp->size - 1;
> +		cp->elements[cpu].idx = cp->size - 1;
>  		cpudl_change_key(cp, cp->size - 1, dl);
>  		cpumask_clear_cpu(cpu, cp->free_cpus);
>  	} else {
> @@ -195,10 +193,21 @@ int cpudl_init(struct cpudl *cp)
>  	memset(cp, 0, sizeof(*cp));
>  	raw_spin_lock_init(&cp->lock);
>  	cp->size = 0;
> -	for (i = 0; i < NR_CPUS; i++)
> -		cp->cpu_to_idx[i] = IDX_INVALID;
> -	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
> +
> +	cp->elements = kcalloc(num_possible_cpus(),
> +			       sizeof(struct cpudl_item),
> +			       GFP_KERNEL);
> +	if (!cp->elements)
> +		return -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +		kfree(cp->elements);
>  		return -ENOMEM;
> +	}
> +
> +	for_each_possible_cpu(i)
> +		cp->elements[i].idx = IDX_INVALID;
> +
>  	cpumask_setall(cp->free_cpus);
>  
>  	return 0;
> diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h
> index a202789a412c..538c9796ad4a 100644
> --- a/kernel/sched/cpudeadline.h
> +++ b/kernel/sched/cpudeadline.h
> @@ -5,17 +5,17 @@
>  
>  #define IDX_INVALID     -1
>  
> -struct array_item {
> +struct cpudl_item {
>  	u64 dl;
>  	int cpu;
> +	int idx;
>  };
>  
>  struct cpudl {
>  	raw_spinlock_t lock;
>  	int size;
> -	int cpu_to_idx[NR_CPUS];
> -	struct array_item elements[NR_CPUS];
>  	cpumask_var_t free_cpus;
> +	struct cpudl_item *elements;
>  };
>  
>  

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15  8:40   ` Juri Lelli
@ 2014-05-15  8:51     ` Peter Zijlstra
  2014-05-15 11:52       ` Juri Lelli
  2014-05-16 10:43       ` Peter Zijlstra
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-15  8:51 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 2626 bytes --]

On Thu, May 15, 2014 at 10:40:55AM +0200, Juri Lelli wrote:
> Hi,
> 
> > @@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
> >  
> >  static void cpudl_exchange(struct cpudl *cp, int a, int b)
> >  {
> > -	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> > -
> >  	swap(cp->elements[a], cp->elements[b]);
> > -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
> >  }
> >  
> 
> I think there is a problem here. Your patch "embeds" cpu_to_idx array
> in elements array, but here the swap operates differently on the two.

<snip>

> Sorry for this long reply, but I had to convince also myself...

Glad you explained it before I tried to untangle that code myself.

> So, I think that having just one dynamic array is nicer and better, but
> we still need to swap things separately. Something like (on top of
> yours):
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 60ad0af..10dc7ab 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -36,9 +36,31 @@ static inline int dl_time_before(u64 a, u64 b)
>         return (s64)(a - b) < 0;
>  }
>  
> +static inline void swap_element(struct cpudl *cp, int a, int b)
> +{
> +       int cpu_tmp = cp->elements[a].cpu;
> +       u64 dl_tmp = cp->elements[a].dl;
> +
> +       cp->elements[a].cpu = cp->elements[b].cpu;
> +       cp->elements[a].dl = cp->elements[b].dl;
> +       cp->elements[b].cpu = cpu_tmp;
> +       cp->elements[b].dl = dl_tmp;

You could've just written:

	swap(cp->elements[a].cpu, cp->elements[b].cpu);
	swap(cp->elements[a].dl , cp->elements[b].dl );

The swap macro does the tmp var for you.
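
(For reference, swap() lives in include/linux/kernel.h and is roughly:

	#define swap(a, b) \
		do { typeof(a) __tmp = (a); (a) = (b); (b) = __tmp; } while (0)

quoting from memory, so check the header for the exact definition.)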

> +}
> +
> +static inline void swap_idx(struct cpudl *cp, int a, int b)
> +{
> +       int idx_tmp = cp->elements[a].idx;
> +
> +       cp->elements[a].idx = cp->elements[b].idx;
> +       cp->elements[b].idx = idx_tmp;

	swap(cp->elements[a].idx, cp->elements[b].idx);

> +}
> +
>  static void cpudl_exchange(struct cpudl *cp, int a, int b)
>  {
> -       swap(cp->elements[a], cp->elements[b]);
> +       int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> +
> +       swap_element(cp, a, b);
> +       swap_idx(cp, cpu_a, cpu_b);

Or just skip the lot and put the 3 swap() stmts here.
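
Spelled out, cpudl_exchange() would then look roughly like this, which is
what the respun patch further down the thread ends up doing:

	static void cpudl_exchange(struct cpudl *cp, int a, int b)
	{
		int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;

		swap(cp->elements[a].cpu, cp->elements[b].cpu);
		swap(cp->elements[a].dl, cp->elements[b].dl);

		swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
	}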

>  }
>  
>  static void cpudl_heapify(struct cpudl *cp, int idx)
> ---
> 
> But, I don't know if this is too ugly and we better go with two arrays
> (or a better solution, as this is the fastest thing I could come up
> with :)).

Thanks for looking at it, and sorry for breaking it.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-14 22:36       ` Tejun Heo
@ 2014-05-15  8:57         ` Peter Zijlstra
  2014-05-15 14:41         ` Johannes Weiner
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-15  8:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki,
	Juri Lelli, Bruce Allan

[-- Attachment #1: Type: text/plain, Size: 323 bytes --]

On Wed, May 14, 2014 at 06:36:48PM -0400, Tejun Heo wrote:
> The patches are URL encoded. 

*sigh*.. evo used to do that, and now I think mutt does it because I've
set up the whole GPG thing.

All my scripts already automagically decode it so I hadn't noticed.

I can give you some awk that undoes it if you want?

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15  8:51     ` Peter Zijlstra
@ 2014-05-15 11:52       ` Juri Lelli
  2014-05-16 10:43       ` Peter Zijlstra
  1 sibling, 0 replies; 17+ messages in thread
From: Juri Lelli @ 2014-05-15 11:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

On Thu, 15 May 2014 10:51:56 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, May 15, 2014 at 10:40:55AM +0200, Juri Lelli wrote:
> > Hi,
> > 
> > > @@ -37,10 +38,7 @@ static inline int dl_time_before(u64 a, u64 b)
> > >  
> > >  static void cpudl_exchange(struct cpudl *cp, int a, int b)
> > >  {
> > > -	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> > > -
> > >  	swap(cp->elements[a], cp->elements[b]);
> > > -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
> > >  }
> > >  
> > 
> > I think there is a problem here. Your patch "embeds" cpu_to_idx array
> > in elements array, but here the swap operates differently on the two.
> 
> <snip>
> 
> > Sorry for this long reply, but I had to convince also myself...
> 
> Glad you explained it before I tried to untangle that code myself.
> 
> > So, I think that having just one dynamic array is nicer and better, but
> > we still need to swap things separately. Something like (on top of
> > yours):
> > 
> > diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> > index 60ad0af..10dc7ab 100644
> > --- a/kernel/sched/cpudeadline.c
> > +++ b/kernel/sched/cpudeadline.c
> > @@ -36,9 +36,31 @@ static inline int dl_time_before(u64 a, u64 b)
> >         return (s64)(a - b) < 0;
> >  }
> >  
> > +static inline void swap_element(struct cpudl *cp, int a, int b)
> > +{
> > +       int cpu_tmp = cp->elements[a].cpu;
> > +       u64 dl_tmp = cp->elements[a].dl;
> > +
> > +       cp->elements[a].cpu = cp->elements[b].cpu;
> > +       cp->elements[a].dl = cp->elements[b].dl;
> > +       cp->elements[b].cpu = cpu_tmp;
> > +       cp->elements[b].dl = dl_tmp;
> 
> You could've just written:
> 
> 	swap(cp->elements[a].cpu, cp->elements[b].cpu);
> 	swap(cp->elements[a].dl , cp->elements[b].dl );
> 
> The swap macro does the tmp var for you.
> 
> > +}
> > +
> > +static inline void swap_idx(struct cpudl *cp, int a, int b)
> > +{
> > +       int idx_tmp = cp->elements[a].idx;
> > +
> > +       cp->elements[a].idx = cp->elements[b].idx;
> > +       cp->elements[b].idx = idx_tmp;
> 
> 	swap(cp->elements[a].idx, cp->elements[b].idx);
> 
> > +}
> > +
> >  static void cpudl_exchange(struct cpudl *cp, int a, int b)
> >  {
> > -       swap(cp->elements[a], cp->elements[b]);
> > +       int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
> > +
> > +       swap_element(cp, a, b);
> > +       swap_idx(cp, cpu_a, cpu_b);
> 
> Or just skip the lot and put the 3 swap() stmts here.
> 

Ah, yes, sure!

Maybe we could also add a comment in cpudeadline.c explaining a bit how
the thing works. Something along the lines of cpupri.c:

/*
 *  kernel/sched/cpudeadline.c
 *
 *  Global CPU deadline management                                     
 *                               
 *  Author: Juri Lelli <juri.lelli@gmail.com>
 *
 *  This code tracks the deadline of each CPU (i.e., the deadline of
 *  current on the CPU) so that global migration decisions are easy
 *  to calculate. Each CPU can be in a state as follows:
 *
 *            (INVALID), WITH_DL_TASK(S)
 *
 *  The system maintains this state with a max-heap implemented as
 *  a simple array. To efficiently handle updates on intermediate nodes,
 *  the array can be thought of as split into two parts: one that contains
 *  the heap nodes, and one that keeps track of where each node resides in
 *  the first part. From kernel/sched/cpudeadline.h we conceptually
 *  have:
 *
 *  	struct cpudl_item {
 *  		u64 dl;
 * 		int cpu;
 *  	-------------------
 *  		int idx;
 *  	}
 *
 *  Moreover, we keep track of CPUs in the INVALID state with a cpumask
 *  (no need to use the array if some CPU is free).
 *
 *  Let's clarify this with a simple example. Suppose at a certain
 *  instant of time we have this situation (4CPUs box):
 *
 *                     CPU 1
 *                     DL 50
 *                   /       \         
 *                  /         \
 *               CPU 2       CPU 0
 *               DL 30       DL 40
 *              /
 *             /
 *           CPU 3
 *           DL 25
 *
 *  In this case the state is maintained as:
 *
 *  elements[dl/cpu] |  50/1  |  30/2  |  40/0  |  25/3  |
 *                       ^          ^     ^          ^
 *                       |          +-----|-+        |
 *                       +---------+      | |        |
 *                        +--------|------+ |        |
 *  elements[idx]    |    2   |    0   |    1   |    3   |
 *
 *  Operations on the heap (e.g., node updates) must thus handle
 *  the two parts of elements array separately, see cpudl_set()
 *  and cpudl_exchange() for details.
 *                 
 *  This program is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU General Public License
 *  as published by the Free Software Foundation; version 2
 *  of the License.
 */

Thanks,

- Juri

> >  }
> >  
> >  static void cpudl_heapify(struct cpudl *cp, int idx)
> > ---
> > 
> > But, I don't know if this is too ugly and we better go with two arrays
> > (or a better solution, as this is the fastest thing I could come up
> > with :)).
> 
> Thanks for looking at it, and sorry for breaking it.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-14 22:36       ` Tejun Heo
  2014-05-15  8:57         ` Peter Zijlstra
@ 2014-05-15 14:41         ` Johannes Weiner
  2014-05-15 15:12           ` Peter Zijlstra
  2014-05-16 11:47           ` Srivatsa S. Bhat
  1 sibling, 2 replies; 17+ messages in thread
From: Johannes Weiner @ 2014-05-15 14:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Rafael J. Wysocki,
	Juri Lelli, Bruce Allan

On Wed, May 14, 2014 at 06:36:48PM -0400, Tejun Heo wrote:
> On Wed, May 14, 2014 at 07:10:51PM +0200, Peter Zijlstra wrote:
> > On Wed, May 14, 2014 at 01:02:38PM -0400, Tejun Heo wrote:
> > > On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
> > > > Does something like the below help any? I noticed those things (cpudl
> > > > and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
> > > > 
> > > > The below is a mostly no thought involved conversion of cpudl which
> > > > boots, I'll also do cpupri and then actually stare at the algorithms to
> > > > see if I didn't make any obvious fails.
> > > 
> > > Yeah, should avoid large allocation on reasonably sized machines and I
> > > don't think 2k CPU machines suspend regularly.  Prolly good / safe
> > > enough for -stable port?  
> > 
> > Yeah, its certainly -stable material. Esp. if this cures the immediate
> > problem.
> 
> The patches are URL encoded.  Tried to reproduce the problem but
> haven't succeeded but I'm quite confident about the analysis, so
> verifying that the high order allocation goes away should be enough.
> 
> I instrumented the allocator and it looks like we also have other
> sources of high order allocation during resume before GFP_IOFS is
> cleared.  Some kthread creations (order 2, probably okay) and on my
> test setup a series of order 3 allocations from e1000.
> 
> Cc'ing Bruce for e1000.  Is it necessary to free and re-allocate
> buffers across suspend/resume?  The driver ends up allocating multiple
> order-3 regions during resume where mm doesn't have access to backing
> devices and thus can't compact or reclaim and it's not too unlikely
> for those allocations to fail.
> 
> I wonder whether we need some generic solution to address the issue.
> Unfortunately, I don't think it'd be possible to order device
> initialization to bring up backing devices earlier.  We don't really
> have that kind of knowledge easily accessible.  Also, I don't think
> it's realistic to require drivers to avoid high order allocations
> during resume.
> 
> Maybe we should pre-reclaim and set aside some amount of memory to be
> used during resume?  It's a mushy solution but could be good enough.
> Rafael, Johannes, what do you guys think?

Is it necessary that resume paths allocate at all?  Freeing at suspend
what you have to reallocate at resume is asking for trouble.  It's not
just higher-order allocations, either; even order-0 allocations are
less reliable without GFP_IOFS.  So I think this should be avoided as
much as possible.

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] mm: page_alloc: warn about higher-order allocations during
 suspend

Higher-order allocations require compaction to work reliably, and
compaction requires the ability to do IO.  Unfortunately, backing
storage infrastructure is disabled during suspend & resume, and so
higher-order allocations can not be supported during that time.

Drivers should refrain from freeing and allocating data structures
during suspend and resume entirely, and fall back to order-0 pages if
strictly necessary.

Add an extra line of warning to the allocation failure dump when a
higher-order allocation fails while backing storage is suspended.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba2933c9c0..3acc12c0e093 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2122,6 +2122,16 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
 		current->comm, order, gfp_mask);
 
+	/*
+	 * Compaction doesn't work while backing storage is suspended
+	 * in the resume path.  Drivers should refrain from managing
+	 * kernel objects during suspend/resume, and leave this task
+	 * to init/exit as much as possible.
+	 */
+	if (order && pm_suspended_storage())
+		pr_warn("Higher-order allocations during resume "
+			"are unsupported\n");
+
 	dump_stack();
 	if (!should_suppress_show_mem())
 		show_mem(filter);
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15 14:41         ` Johannes Weiner
@ 2014-05-15 15:12           ` Peter Zijlstra
  2014-05-16 11:47           ` Srivatsa S. Bhat
  1 sibling, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-15 15:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Rafael J. Wysocki,
	Juri Lelli, Bruce Allan

[-- Attachment #1: Type: text/plain, Size: 647 bytes --]

On Thu, May 15, 2014 at 10:41:14AM -0400, Johannes Weiner wrote:
> Is it necessary that resume paths allocate at all?  Freeing at suspend
> what you have to reallocate at resume is asking for trouble.  It's not
> just higher order allocations, either, even order-0 allocations are
> less reliable without GFP_IOFS.  So I think this should be avoided as
> much as possible.

Well, in my case it's because suspend does an effective hot-unplug of all
CPUs in the system except CPU0.

And if the cpu topology changes I need to allocate new data structures
and free the old ones.

The reverse is true on resume: it hot-plugs all CPUs again, same story.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15  8:51     ` Peter Zijlstra
  2014-05-15 11:52       ` Juri Lelli
@ 2014-05-16 10:43       ` Peter Zijlstra
  2014-05-16 11:01         ` Juri Lelli
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-16 10:43 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 3666 bytes --]


OK, I made that.

---

Subject: sched/cpudl: Replace NR_CPUS arrays
From: Peter Zijlstra <peterz@infradead.org>
Date: Wed May 14 16:13:56 CEST 2014

Tejun reported that his resume was failing due to order-3 allocations
from sched_domain building.

Replace the NR_CPUS arrays in there with a dynamically allocated
array.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 kernel/sched/cpudeadline.c |   33 ++++++++++++++++++++++++---------
 kernel/sched/cpudeadline.h |    6 +++---
 2 files changed, 27 insertions(+), 12 deletions(-)

--- a/kernel/sched/cpudeadline.c
+++ b/kernel/sched/cpudeadline.c
@@ -13,6 +13,7 @@
 
 #include <linux/gfp.h>
 #include <linux/kernel.h>
+#include <linux/slab.h>
 #include "cpudeadline.h"
 
 static inline int parent(int i)
@@ -39,8 +40,10 @@ static void cpudl_exchange(struct cpudl
 {
 	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
 
-	swap(cp->elements[a], cp->elements[b]);
-	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
+	swap(cp->elements[a].cpu, cp->elements[b].cpu);
+	swap(cp->elements[a].dl , cp->elements[b].dl );
+
+	swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
 }
 
 static void cpudl_heapify(struct cpudl *cp, int idx)
@@ -140,7 +143,7 @@ void cpudl_set(struct cpudl *cp, int cpu
 	WARN_ON(!cpu_present(cpu));
 
 	raw_spin_lock_irqsave(&cp->lock, flags);
-	old_idx = cp->cpu_to_idx[cpu];
+	old_idx = cp->elements[cpu].idx;
 	if (!is_valid) {
 		/* remove item */
 		if (old_idx == IDX_INVALID) {
@@ -155,8 +158,8 @@ void cpudl_set(struct cpudl *cp, int cpu
 		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
 		cp->elements[old_idx].cpu = new_cpu;
 		cp->size--;
-		cp->cpu_to_idx[new_cpu] = old_idx;
-		cp->cpu_to_idx[cpu] = IDX_INVALID;
+		cp->elements[new_cpu].idx = old_idx;
+		cp->elements[cpu].idx = IDX_INVALID;
 		while (old_idx > 0 && dl_time_before(
 				cp->elements[parent(old_idx)].dl,
 				cp->elements[old_idx].dl)) {
@@ -173,7 +176,7 @@ void cpudl_set(struct cpudl *cp, int cpu
 		cp->size++;
 		cp->elements[cp->size - 1].dl = 0;
 		cp->elements[cp->size - 1].cpu = cpu;
-		cp->cpu_to_idx[cpu] = cp->size - 1;
+		cp->elements[cpu].idx = cp->size - 1;
 		cpudl_change_key(cp, cp->size - 1, dl);
 		cpumask_clear_cpu(cpu, cp->free_cpus);
 	} else {
@@ -195,10 +198,21 @@ int cpudl_init(struct cpudl *cp)
 	memset(cp, 0, sizeof(*cp));
 	raw_spin_lock_init(&cp->lock);
 	cp->size = 0;
-	for (i = 0; i < NR_CPUS; i++)
-		cp->cpu_to_idx[i] = IDX_INVALID;
-	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
+
+	cp->elements = kcalloc(nr_cpu_ids,
+			       sizeof(struct cpudl_item),
+			       GFP_KERNEL);
+	if (!cp->elements)
+		return -ENOMEM;
+
+	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
+		kfree(cp->elements);
 		return -ENOMEM;
+	}
+
+	for_each_possible_cpu(i)
+		cp->elements[i].idx = IDX_INVALID;
+
 	cpumask_setall(cp->free_cpus);
 
 	return 0;
@@ -211,4 +225,5 @@ int cpudl_init(struct cpudl *cp)
 void cpudl_cleanup(struct cpudl *cp)
 {
 	free_cpumask_var(cp->free_cpus);
+	kfree(cp->elements);
 }
--- a/kernel/sched/cpudeadline.h
+++ b/kernel/sched/cpudeadline.h
@@ -5,17 +5,17 @@
 
 #define IDX_INVALID     -1
 
-struct array_item {
+struct cpudl_item {
 	u64 dl;
 	int cpu;
+	int idx;
 };
 
 struct cpudl {
 	raw_spinlock_t lock;
 	int size;
-	int cpu_to_idx[NR_CPUS];
-	struct array_item elements[NR_CPUS];
 	cpumask_var_t free_cpus;
+	struct cpudl_item *elements;
 };
 
 


[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-16 10:43       ` Peter Zijlstra
@ 2014-05-16 11:01         ` Juri Lelli
  2014-05-16 11:04           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Juri Lelli @ 2014-05-16 11:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

On Fri, 16 May 2014 12:43:36 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> 
> OK I made that..
> 

Are the comments I proposed to add overdoing it?

Apart from this,

Acked-by: Juri Lelli <juri.lelli@gmail.com>

Thanks!

- Juri

> ---
> 
> Subject: sched/cpudl: Replace NR_CPUS arrays
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Wed May 14 16:13:56 CEST 2014
> 
> Tejun reported that his resume was failing due to order-3 allocations
> from sched_domain building.
> 
> Replace the NR_CPUS arrays in there with a dynamically allocated
> array.
> 
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Juri Lelli <juri.lelli@gmail.com>
> Reported-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  kernel/sched/cpudeadline.c |   33 ++++++++++++++++++++++++---------
>  kernel/sched/cpudeadline.h |    6 +++---
>  2 files changed, 27 insertions(+), 12 deletions(-)
> 
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -13,6 +13,7 @@
>  
>  #include <linux/gfp.h>
>  #include <linux/kernel.h>
> +#include <linux/slab.h>
>  #include "cpudeadline.h"
>  
>  static inline int parent(int i)
> @@ -39,8 +40,10 @@ static void cpudl_exchange(struct cpudl
>  {
>  	int cpu_a = cp->elements[a].cpu, cpu_b = cp->elements[b].cpu;
>  
> -	swap(cp->elements[a], cp->elements[b]);
> -	swap(cp->cpu_to_idx[cpu_a], cp->cpu_to_idx[cpu_b]);
> +	swap(cp->elements[a].cpu, cp->elements[b].cpu);
> +	swap(cp->elements[a].dl , cp->elements[b].dl );
> +
> +	swap(cp->elements[cpu_a].idx, cp->elements[cpu_b].idx);
>  }
>  
>  static void cpudl_heapify(struct cpudl *cp, int idx)
> @@ -140,7 +143,7 @@ void cpudl_set(struct cpudl *cp, int cpu
>  	WARN_ON(!cpu_present(cpu));
>  
>  	raw_spin_lock_irqsave(&cp->lock, flags);
> -	old_idx = cp->cpu_to_idx[cpu];
> +	old_idx = cp->elements[cpu].idx;
>  	if (!is_valid) {
>  		/* remove item */
>  		if (old_idx == IDX_INVALID) {
> @@ -155,8 +158,8 @@ void cpudl_set(struct cpudl *cp, int cpu
>  		cp->elements[old_idx].dl = cp->elements[cp->size - 1].dl;
>  		cp->elements[old_idx].cpu = new_cpu;
>  		cp->size--;
> -		cp->cpu_to_idx[new_cpu] = old_idx;
> -		cp->cpu_to_idx[cpu] = IDX_INVALID;
> +		cp->elements[new_cpu].idx = old_idx;
> +		cp->elements[cpu].idx = IDX_INVALID;
>  		while (old_idx > 0 && dl_time_before(
>  				cp->elements[parent(old_idx)].dl,
>  				cp->elements[old_idx].dl)) {
> @@ -173,7 +176,7 @@ void cpudl_set(struct cpudl *cp, int cpu
>  		cp->size++;
>  		cp->elements[cp->size - 1].dl = 0;
>  		cp->elements[cp->size - 1].cpu = cpu;
> -		cp->cpu_to_idx[cpu] = cp->size - 1;
> +		cp->elements[cpu].idx = cp->size - 1;
>  		cpudl_change_key(cp, cp->size - 1, dl);
>  		cpumask_clear_cpu(cpu, cp->free_cpus);
>  	} else {
> @@ -195,10 +198,21 @@ int cpudl_init(struct cpudl *cp)
>  	memset(cp, 0, sizeof(*cp));
>  	raw_spin_lock_init(&cp->lock);
>  	cp->size = 0;
> -	for (i = 0; i < NR_CPUS; i++)
> -		cp->cpu_to_idx[i] = IDX_INVALID;
> -	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL))
> +
> +	cp->elements = kcalloc(nr_cpu_ids,
> +			       sizeof(struct cpudl_item),
> +			       GFP_KERNEL);
> +	if (!cp->elements)
> +		return -ENOMEM;
> +
> +	if (!alloc_cpumask_var(&cp->free_cpus, GFP_KERNEL)) {
> +		kfree(cp->elements);
>  		return -ENOMEM;
> +	}
> +
> +	for_each_possible_cpu(i)
> +		cp->elements[i].idx = IDX_INVALID;
> +
>  	cpumask_setall(cp->free_cpus);
>  
>  	return 0;
> @@ -211,4 +225,5 @@ int cpudl_init(struct cpudl *cp)
>  void cpudl_cleanup(struct cpudl *cp)
>  {
>  	free_cpumask_var(cp->free_cpus);
> +	kfree(cp->elements);
>  }
> --- a/kernel/sched/cpudeadline.h
> +++ b/kernel/sched/cpudeadline.h
> @@ -5,17 +5,17 @@
>  
>  #define IDX_INVALID     -1
>  
> -struct array_item {
> +struct cpudl_item {
>  	u64 dl;
>  	int cpu;
> +	int idx;
>  };
>  
>  struct cpudl {
>  	raw_spinlock_t lock;
>  	int size;
> -	int cpu_to_idx[NR_CPUS];
> -	struct array_item elements[NR_CPUS];
>  	cpumask_var_t free_cpus;
> +	struct cpudl_item *elements;
>  };
>  
>  
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-16 11:01         ` Juri Lelli
@ 2014-05-16 11:04           ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-16 11:04 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Tejun Heo, Ingo Molnar, linux-kernel, Johannes Weiner, Rafael J. Wysocki

[-- Attachment #1: Type: text/plain, Size: 274 bytes --]

On Fri, May 16, 2014 at 01:01:53PM +0200, Juri Lelli wrote:
> Are the comments I proposed to add overdoing?

Dunno, might be helpful, if you post them as a proper patch I'll press
'A'.

> Apart from this,
> 
> Acked-by: Juri Lelli <juri.lelli@gmail.com>

Thanks!

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [REGRESSION] funny sched_domain build failure during resume
  2014-05-15 14:41         ` Johannes Weiner
  2014-05-15 15:12           ` Peter Zijlstra
@ 2014-05-16 11:47           ` Srivatsa S. Bhat
  1 sibling, 0 replies; 17+ messages in thread
From: Srivatsa S. Bhat @ 2014-05-16 11:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Rafael J. Wysocki, Juri Lelli, Bruce Allan,
	Linux PM mailing list

On 05/15/2014 08:11 PM, Johannes Weiner wrote:
> On Wed, May 14, 2014 at 06:36:48PM -0400, Tejun Heo wrote:
>> On Wed, May 14, 2014 at 07:10:51PM +0200, Peter Zijlstra wrote:
>>> On Wed, May 14, 2014 at 01:02:38PM -0400, Tejun Heo wrote:
>>>> On Wed, May 14, 2014 at 04:00:34PM +0200, Peter Zijlstra wrote:
>>>>> Does something like the below help any? I noticed those things (cpudl
>>>>> and cpupri) had [NR_CPUS] arrays, which is always 'fun'.
>>>>>
>>>>> The below is a mostly no thought involved conversion of cpudl which
>>>>> boots, I'll also do cpupri and then actually stare at the algorithms to
>>>>> see if I didn't make any obvious fails.
>>>>
>>>> Yeah, should avoid large allocation on reasonably sized machines and I
>>>> don't think 2k CPU machines suspend regularly.  Prolly good / safe
>>>> enough for -stable port?  
>>>
>>> Yeah, its certainly -stable material. Esp. if this cures the immediate
>>> problem.
>>
>> The patches are URL encoded.  Tried to reproduce the problem but
>> haven't succeeded but I'm quite confident about the analysis, so
>> verifying that the high order allocation goes away should be enough.
>>
>> I instrumented the allocator and it looks like we also have other
>> sources of high order allocation during resume before GFP_IOFS is
>> cleared.  Some kthread creations (order 2, probably okay) and on my
>> test setup a series of order 3 allocations from e1000.
>>
>> Cc'ing Bruce for e1000.  Is it necessary to free and re-allocate
>> buffers across suspend/resume?  The driver ends up allocating multiple
>> order-3 regions during resume where mm doesn't have access to backing
>> devices and thus can't compact or reclaim and it's not too unlikely
>> for those allocations to fail.
>>
>> I wonder whether we need some generic solution to address the issue.
>> Unfortunately, I don't think it'd be possible to order device
>> initialization to bring up backing devices earlier.  We don't really
>> have that kind of knowledge easily accessible.  Also, I don't think
>> it's realistic to require drivers to avoid high order allocations
>> during resume.
>>
>> Maybe we should pre-reclaim and set aside some amount of memory to be
>> used during resume?  It's a mushy solution but could be good enough.
>> Rafael, Johannes, what do you guys think?
> 
> Is it necessary that resume paths allocate at all?  Freeing at suspend
> what you have to reallocate at resume is asking for trouble.  It's not
> just higher order allocations, either, even order-0 allocations are
> less reliable without GFP_IOFS.  So I think this should be avoided as
> much as possible.
>

From the discussion on this thread, what I understand is that certain
drivers and some subsystems like the scheduler free memory during suspend
and allocate it back during resume. But why does that pose a problem
to the MM subsystem? I mean, the memory freed and the memory requested
later are not substantially different either in terms of quantity or the
order of the allocation, right?

If that's the case, then what happened to the freed memory? Did the
page cache or some other caching mechanism absorb most of it so quickly
that we are forced to rely on reclaim to allocate memory during resume?
Isn't that somewhat suspicious?

I might be missing something obvious here, but given the fact that tasks
are frozen during suspend and not a whole lot of things (allocations etc)
happen in the suspend path, I would expect that whatever memory was freed
during suspend would naturally remain available during resume as well.

Thoughts?

Regards,
Srivatsa S. Bhat

> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] mm: page_alloc: warn about higher-order allocations during
>  suspend
> 
> Higher-order allocations require compaction to work reliably, and
> compaction requires the ability to do IO.  Unfortunately, backing
> storage infrastructure is disabled during suspend & resume, and so
> higher-order allocations can not be supported during that time.
> 
> Drivers should refrain from freeing and allocating data structures
> during suspend and resume entirely, and fall back to order-0 pages if
> strictly necessary.
> 
> Add an extra line of warning to the allocation failure dump when a
> higher-order allocation fails while backing storage is suspended.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5dba2933c9c0..3acc12c0e093 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2122,6 +2122,16 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  	pr_warn("%s: page allocation failure: order:%d, mode:0x%x\n",
>  		current->comm, order, gfp_mask);
> 
> +	/*
> +	 * Compaction doesn't work while backing storage is suspended
> +	 * in the resume path.  Drivers should refrain from managing
> +	 * kernel objects during suspend/resume, and leave this task
> +	 * to init/exit as much as possible.
> +	 */
> +	if (order && pm_suspended_storage())
> +		pr_warn("Higher-order allocations during resume "
> +			"are unsupported\n");
> +
>  	dump_stack();
>  	if (!should_suppress_show_mem())
>  		show_mem(filter);
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

Thread overview: 17+ messages
2014-05-09 16:04 [REGRESSION] funny sched_domain build failure during resume Tejun Heo
2014-05-09 16:15 ` Peter Zijlstra
2014-05-14 14:00 ` Peter Zijlstra
2014-05-14 14:05   ` Peter Zijlstra
2014-05-14 17:02   ` Tejun Heo
2014-05-14 17:10     ` Peter Zijlstra
2014-05-14 22:36       ` Tejun Heo
2014-05-15  8:57         ` Peter Zijlstra
2014-05-15 14:41         ` Johannes Weiner
2014-05-15 15:12           ` Peter Zijlstra
2014-05-16 11:47           ` Srivatsa S. Bhat
2014-05-15  8:40   ` Juri Lelli
2014-05-15  8:51     ` Peter Zijlstra
2014-05-15 11:52       ` Juri Lelli
2014-05-16 10:43       ` Peter Zijlstra
2014-05-16 11:01         ` Juri Lelli
2014-05-16 11:04           ` Peter Zijlstra
