* memory hot-add: the kernel can notify udev daemon before creating the sys file state?
@ 2014-05-23  9:46 DX Cui
  2014-05-23 12:27 ` DX Cui
  0 siblings, 1 reply; 6+ messages in thread
From: DX Cui @ 2014-05-23  9:46 UTC (permalink / raw)
  To: linux-mm

Hi all,
I'm debugging a strange memory hotplug issue on CentOS 6.5
(2.6.32-431.17.1.el6): when a chunk of memory is hot-added, the kernel
*occasionally* seems to send a MEMORY ADD event to the udev daemon before
it has actually created the sysfs file 'state'!
As a result, udev can't reliably bring new memory online with this udev rule:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"

Please see the end of this mail for the strace log of udevd, captured while
running udevd manually.

When udevd gets a MEMORY ADD event for /sys/devices/system/memory/memory23,
it tries to write "online" to /sys/devices/system/memory/memory23/state,
but the file hasn't been created by the kernel yet. When I then check the
file by hand with ls right away, it has been created, and I can manually
echo online into it to bring the memory block online correctly.

Please note: this bad behavior of the kernel is only occasional, which may
imply there is a race condition somewhere?
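
Because the missing file shows up almost immediately, a userspace helper
could paper over the window by retrying while the attribute does not exist
yet. What follows is only a minimal sketch in C, not something udev or the
kernel provides: write_attr_with_retry() and its max_tries/delay_ms
parameters are hypothetical names, and the sketch assumes the 'state' file
appears shortly after the event, as seen with ls above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() under strict -std= modes */
#endif

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration, not a udev API):
 * write `value` to a sysfs attribute, retrying while the file is missing.
 * Returns 0 on success, -1 on failure with errno set.
 */
int write_attr_with_retry(const char *path, const char *value,
                          int max_tries, long delay_ms)
{
    assert(path != NULL && value != NULL && max_tries > 0 && delay_ms >= 0);

    struct timespec delay = {
        .tv_sec = delay_ms / 1000,
        .tv_nsec = (delay_ms % 1000) * 1000000L,
    };
    size_t len = strlen(value);

    for (int attempt = 0; attempt < max_tries; attempt++) {
        /* No O_CREAT: only the kernel can create sysfs attributes. */
        int fd = open(path, O_WRONLY);
        if (fd >= 0) {
            ssize_t n = write(fd, value, len);
            int saved_errno = errno;
            close(fd);
            if (n == (ssize_t)len)
                return 0;
            /* Treat a short write as an I/O error. */
            errno = (n < 0) ? saved_errno : EIO;
            return -1;
        }
        if (errno != ENOENT)
            return -1;              /* some other error: give up at once */
        if (attempt + 1 < max_tries)
            nanosleep(&delay, NULL); /* attribute not there yet: wait, retry */
    }
    errno = ENOENT;
    return -1;
}
```

A caller might use something like
write_attr_with_retry("/sys/devices/system/memory/memory23/state", "online", 50, 10).
This only hides the symptom; the ordering inside the kernel is still wrong.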

BTW, it looks like the issue doesn't exist in 3.10+ kernels. Is this a known
issue that has already been fixed in newer kernels?

I'm trying to dig into the code and I hope I can get some suggestions here.
Thanks!

-- DX

The strace log is:
1427  1400822167.053704 socket(PF_NETLINK, SOCK_DGRAM|SOCK_CLOEXEC, 15) = 5
...
1427  1400822372.247210 recvmsg(5, {msg_name(12)={sa_family=AF_NETLINK,
pid=0, groups=00000001},
msg_iov(1)=[{"add@/devices/system/memory/memory23\0ACTION=add\0DEVPATH=/devices/system/memory/memory23\0SUBSYSTEM=memory\0SEQNUM=1358\0\0\0\0\0\0\0\0\0\0\0\0\\0\0\0\0\0\0\0"...,
8192}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS{pid=0, uid=0, gid=0}}, msg_flags=0}, 0) = 116
1427  1400822372.247298 lseek(10, 0, SEEK_CUR) = 1332
1427  1400822372.247320 write(10,
"N\5\0\0\0\0\0\0\37\0/devices/system/memory/memory23", 41) = 41
1427  1400822372.247352 gettimeofday({1400822372, 247358}, NULL) = 0
1427  1400822372.247387 writev(2, [{"udevd[1427]: seq 1358 queued, 'add'
'memory'\n", 45}], 1) = 45
1427  1400822372.247428 sendto(3, "<30>May 23 06:19:32 udevd[1427]: seq
1358 queued, 'add' 'memory'\n", 65, MSG_NOSIGNAL, NULL, 0) = 65
1427  1400822372.247511 sendmsg(5, {msg_name(12)={sa_family=AF_NETLINK,
pid=-4139, groups=00000000},
msg_iov(2)=[{"udev-147\0\0\0\0\0\0\0\0\312\376\35\352
\0[\0\312\n\234<\0\0\0\0", 32},
{"UDEV_LOG=7\0ACTION=add\0DEVPATH=/devices/system/memory/memory23\0SUBSYSTEM=memory\0SEQNUM=1358\0",
91}], msg_controllen=0, msg_flags=0}, 0 <unfinished ...>
1456  1400822372.247568 <... ppoll resumed> ) = 1 ([{fd=11,
revents=POLLIN}])
1427  1400822372.247578 <... sendmsg resumed> ) = 123
1456  1400822372.247595 recvmsg(11,  <unfinished ...>
1427  1400822372.247602 gettimeofday( <unfinished ...>
1456  1400822372.247615 <... recvmsg resumed>
{msg_name(12)={sa_family=AF_NETLINK, pid=1427, groups=00000000},
msg_iov(1)=[{"udev-147\0\0\0\0\0\0\0\0\312\376\35\352
\0[\0\312\n\234<\0\0\0\0UDEV_LOG=7\0ACTION=add\0DEVPATH=/devices/system/memory/memory23\0SUBSYSTEM=memory\0SEQNUM=1358\0\0\0\0\0\0\0\\0\0\0\0\0"...,
8192}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS{pid=1427, uid=0, gid=0}}, msg_flags=0}, 0) = 123
1427  1400822372.247658 <... gettimeofday resumed> {1400822372, 247610},
NULL) = 0
1456  1400822372.247688 gettimeofday( <unfinished ...>
1427  1400822372.247695 writev(2, [{"udevd[1427]: passed 123 bytes to
monitor 0x7f71f4c6a6c0\n", 56}], 1 <unfinished ...>
1456  1400822372.247715 <... gettimeofday resumed> {1400822372, 247706},
NULL) = 0
1427  1400822372.247722 <... writev resumed> ) = 56
1456  1400822372.247738 writev(2, [{"udevd-work[1456]: seq 1358 running\n",
35}], 1 <unfinished ...>
1427  1400822372.247750 sendto(3, "<30>May 23 06:19:32 udevd[1427]: passed
123 bytes to monitor 0x7f71f4c6a6c0\n", 76, MSG_NOSIGNAL, NULL, 0
<unfinished ...>
1456  1400822372.247785 <... writev resumed> ) = 35
1427  1400822372.247792 <... sendto resumed> ) = 76
1456  1400822372.247805 sendto(3, "<30>May 23 06:19:32 udevd-work[1456]:
seq 1358 running\n", 55, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
1427  1400822372.247816 poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN},
{fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 5, -1
<unfinished ...>
1456  1400822372.247847 <... sendto resumed> ) = 55
1456  1400822372.247861 alarm(180)      = 0
1456  1400822372.247892 gettimeofday({1400822372, 247898}, NULL) = 0
1456  1400822372.247917 writev(2, [{"udevd-work[1456]: ATTR
'/sys/devices/system/memory/memory23/state' writing 'online'
/etc/udev/rules.d/100-balloons.rules:1\n", 123}], 1) = 123
1456  1400822372.247946 sendto(3, "<30>May 23 06:19:32 udevd-work[1456]:
ATTR '/sys/devices/system/memory/memory23/state' writing 'online'
/etc/udev/rules.d/100-balloons.rules:1\n", 143, MSG_NOSIGNAL, NULL, 0) = 143
1456  1400822372.247992 open("/sys/devices/system/memory/memory23/state",
O_WRONLY|O_CREAT|O_TRUNC, 0666) = -1 ENOENT (No such file or directory)


* Re: memory hot-add: the kernel can notify udev daemon before creating the sys file state?
  2014-05-23  9:46 memory hot-add: the kernel can notify udev daemon before creating the sys file state? DX Cui
@ 2014-05-23 12:27 ` DX Cui
  2014-05-25 15:41   ` DX Cui
  2014-05-28 20:47   ` Dave Hansen
  0 siblings, 2 replies; 6+ messages in thread
From: DX Cui @ 2014-05-23 12:27 UTC (permalink / raw)
  To: linux-mm
  Cc: Matt Tolentino, Dave Hansen, Andrew Morton, Linus Torvalds,
	Nathan Fontenot, Greg Kroah-Hartman

On Fri, May 23, 2014 at 5:46 PM, DX Cui <rijcos@gmail.com> wrote:
> Hi all,
> I'm debugging a strange memory hotplug issue on CentOS 6.5(2.6.32-431.17.1.el6):
> when a chunk of memory is hot-added, it seems the kernel *occasionally* can send
> a MEMORY ADD event to the udev daemon before the kernel actually creates the
> sys file 'state'!
> As a result, udev can't reliably make new memory online by this udev rule:
> SUBSYSTEM=="memory", ACTION=="add", ATTR{state}="online"
>
> Please see the end of the mail for the strace log of udevd when I run udevd
> manually:
>
> When udevd gets a MEMORY ADD event for
> /sys/devices/system/memory/memory23, it tries to write "online" to
> /sys/devices/system/memory/memory23/state, but the file hasn't been created by
> the kernel yet. In this case, when I manually check the file at once with ls, it has
> been created, and I can manually echo online into it to make it online correctly.
>
> Please note: this bad behavior of the kernel is only occasional, which may imply
> there is a race condition somewhere?
>
> BTW, it looks like the issue doesn't exist in 3.10+ kernels. Is this a known
> issue that has already been fixed in newer kernels?

Hi all,
I think I found out the root cause: when memory hotplug was introduced in 2005:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3947be1969a9ce455ec30f60ef51efb10e4323d1
there was a race condition in:

+ static int add_memory_block(unsigned long node_id,
+ struct mem_section *section,
+ unsigned long state, int phys_device)
+{
...
+ ret = register_memory(mem, section, NULL);
+ if (!ret)
+        ret = mem_create_simple_file(mem, phys_index);
+ if (!ret)
+        ret = mem_create_simple_file(mem, state);

Here, add_memory_block() first invokes register_memory() ->
sysdev_register() -> sysdev_add() ->
kobject_uevent(&sysdev->kobj, KOBJ_ADD) to notify the udev daemon, and only
then invokes mem_create_simple_file(). If execution is preempted between
the two steps, udevd can handle the event before the 'state' file exists,
which is the issue I reported in the previous mail.

Luckily, a 2013 commit fixed this issue unintentionally:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=96b2c0fc8e74a615888e2bedfe55b439aa4695e1

It looks like the new "register_memory() --> ... -> device_add()" path
creates the sysfs files before notifying udev, which is the correct order.
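
The difference between the two orderings can be contrasted with a small toy
model in C. This is an illustration only, not kernel code: struct toy_block
and the toy_* functions are hypothetical, and the listener is assumed to
react to the event immediately, which is the worst case of the preemption
window described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (hypothetical): one memory block with a single 'state' file. */
struct toy_block {
    bool state_file_exists;   /* has the attribute been created yet? */
    bool listener_saw_state;  /* what the event listener found on 'add' */
};

/* Stands in for udevd handling the 'add' event the moment it is sent. */
static void toy_uevent_add(struct toy_block *b)
{
    b->listener_saw_state = b->state_file_exists;
}

/* Stands in for creating the 'state' attribute in sysfs. */
static void toy_create_state_file(struct toy_block *b)
{
    b->state_file_exists = true;
}

/* Old order: announce the block first, create the attribute afterwards. */
bool toy_register_old(struct toy_block *b)
{
    assert(b != NULL);
    toy_uevent_add(b);
    toy_create_state_file(b);
    return b->listener_saw_state;
}

/* New order: the attribute exists before the event is sent. */
bool toy_register_new(struct toy_block *b)
{
    assert(b != NULL);
    toy_create_state_file(b);
    toy_uevent_add(b);
    return b->listener_saw_state;
}
```

In this model toy_register_old() returns false, because the listener looked
before the file existed, while toy_register_new() returns true. On a real
system the old order fails only when the listener wins the race, which would
explain why the failure is occasional.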

It would be great if you can confirm my analysis. :-)

 -- DX

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: memory hot-add: the kernel can notify udev daemon before creating the sys file state?
  2014-05-23 12:27 ` DX Cui
@ 2014-05-25 15:41   ` DX Cui
  2014-05-28 20:32     ` Nathan Fontenot
  2014-05-28 20:47   ` Dave Hansen
  1 sibling, 1 reply; 6+ messages in thread
From: DX Cui @ 2014-05-25 15:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Matt Tolentino, Dave Hansen, Andrew Morton, Linus Torvalds,
	Nathan Fontenot, Greg Kroah-Hartman

On Fri, May 23, 2014 at 8:27 PM, DX Cui <rijcos@gmail.com> wrote:
> Hi all,
> I think I found out the root cause: when memory hotplug was introduced in 2005:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3947be1969a9ce455ec30f60ef51efb10e4323d1
> there was a race condition in:
>
> + static int add_memory_block(unsigned long node_id,
> + struct mem_section *section,
> + unsigned long state, int phys_device)
> +{
> ...
> + ret = register_memory(mem, section, NULL);
> + if (!ret)
> +        ret = mem_create_simple_file(mem, phys_index);
> + if (!ret)
> +        ret = mem_create_simple_file(mem, state);
>
> Here, first, add_memory_block() invokes register_memory() ->
> sysdev_register() -> sysdev_add()->
> kobject_uevent(&sysdev->kobj, KOBJ_ADD) to notify udev daemon, then
> invokes mem_create_simple_file(). If the current execution is preempted
> between the 2 steps, the issue I reported in the previous mail can happen.
>
> Luckily, a 2013 commit fixed this issue unintentionally:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=96b2c0fc8e74a615888e2bedfe55b439aa4695e1
>
> It looks like the new "register_memory() --> ... -> device_add()" path
> creates the sysfs files before notifying udev, which is the correct order.
>
> It would be great if you can confirm my analysis. :-)

Any comments?
I think we need to backport the patch
96b2c0fc8e74a615888e2bedfe55b439aa4695e1 to <=3.9 stable kernels.

-- DX


* Re: memory hot-add: the kernel can notify udev daemon before creating the sys file state?
  2014-05-25 15:41   ` DX Cui
@ 2014-05-28 20:32     ` Nathan Fontenot
  0 siblings, 0 replies; 6+ messages in thread
From: Nathan Fontenot @ 2014-05-28 20:32 UTC (permalink / raw)
  To: DX Cui, linux-mm
  Cc: Matt Tolentino, Dave Hansen, Andrew Morton, Linus Torvalds,
	Greg Kroah-Hartman

On 05/25/2014 10:41 AM, DX Cui wrote:
> On Fri, May 23, 2014 at 8:27 PM, DX Cui <rijcos@gmail.com> wrote:
>> Hi all,
>> I think I found out the root cause: when memory hotplug was introduced in 2005:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=3947be1969a9ce455ec30f60ef51efb10e4323d1
>> there was a race condition in:
>>
>> + static int add_memory_block(unsigned long node_id,
>> + struct mem_section *section,
>> + unsigned long state, int phys_device)
>> +{
>> ...
>> + ret = register_memory(mem, section, NULL);
>> + if (!ret)
>> +        ret = mem_create_simple_file(mem, phys_index);
>> + if (!ret)
>> +        ret = mem_create_simple_file(mem, state);
>>
>> Here, first, add_memory_block() invokes register_memory() ->
>> sysdev_register() -> sysdev_add()->
>> kobject_uevent(&sysdev->kobj, KOBJ_ADD) to notify udev daemon, then
>> invokes mem_create_simple_file(). If the current execution is preempted
>> between the 2 steps, the issue I reported in the previous mail can happen.
>>
>> Luckily, a 2013 commit fixed this issue unintentionally:
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=96b2c0fc8e74a615888e2bedfe55b439aa4695e1
>>
>> It looks like the new "register_memory() --> ... -> device_add()" path
>> creates the sysfs files before notifying udev, which is the correct order.
>>

Correct. That patch does fix this issue, though that was not the primary reason
for doing the patch. Always nice when a patch has unintended positive side effects.
 
>> It would be great if you can confirm my analysis. :-)
> 
> Any comments?
> I think we need to backport the patch
> 96b2c0fc8e74a615888e2bedfe55b439aa4695e1 to <=3.9 stable kernels.
> 

Although I have not seen any problems caused by this race, I agree that the
fix should be backported. Best to get rid of a known race condition before it
jumps up and bites us.

-Nathan


* Re: memory hot-add: the kernel can notify udev daemon before creating the sys file state?
  2014-05-23 12:27 ` DX Cui
  2014-05-25 15:41   ` DX Cui
@ 2014-05-28 20:47   ` Dave Hansen
  2014-05-29  8:48     ` DX Cui
  1 sibling, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2014-05-28 20:47 UTC (permalink / raw)
  To: DX Cui, linux-mm
  Cc: Matt Tolentino, Dave Hansen, Andrew Morton, Linus Torvalds,
	Nathan Fontenot, Greg Kroah-Hartman

On 05/23/2014 05:27 AM, DX Cui wrote:
> It looks like the new "register_memory() --> ... -> device_add()" path
> creates the sysfs files before notifying udev, which is the correct order.
> 
> It would be great if you can confirm my analysis. :-)

Your analysis looks correct to me.  Nathan's patch does, indeed look
like a quite acceptable fix.  How far back do those sysfs attribute
groups go, btw?


* Re: memory hot-add: the kernel can notify udev daemon before creating the sys file state?
  2014-05-28 20:47   ` Dave Hansen
@ 2014-05-29  8:48     ` DX Cui
  0 siblings, 0 replies; 6+ messages in thread
From: DX Cui @ 2014-05-29  8:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-mm, Matt Tolentino, Dave Hansen, Andrew Morton,
	Linus Torvalds, Nathan Fontenot, Greg Kroah-Hartman

On Thu, May 29, 2014 at 4:47 AM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 05/23/2014 05:27 AM, DX Cui wrote:
>> It looks like the new "register_memory() --> ... -> device_add()" path
>> creates the sysfs files before notifying udev, which is the correct order.
>>
>> It would be great if you can confirm my analysis. :-)
>
> Your analysis looks correct to me.  Nathan's patch does, indeed look
> like a quite acceptable fix.  How far back do those sysfs attribute
> groups go, btw?

I'm not familiar with it.
My gut feeling is: this may need non-trivial efforts -- probably several
extra patches need to be backported too.

BTW, this race condition can eventually cause a kernel panic when Linux
VMs with older kernels (<3.9.x), such as CentOS 6.5, run on Hyper-V with
both memory hot-add and the balloon driver in use:

https://bugzilla.redhat.com/show_bug.cgi?id=1102551
(there is a workaround patch for ***CentOS 6.5*** in the bug entry)

-- DX
