* 3.6rc6 slab corruption. @ 2012-09-18 14:35 Dave Jones 2012-09-18 18:38 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Dave Jones @ 2012-09-18 14:35 UTC (permalink / raw) To: Linux Kernel; +Cc: Linus Torvalds I was chasing a networking bug, and had trinity reduced to just making read & setsockopt calls, and let that run overnight. I woke up to 800mb of traces from a different bug.. The traces look mostly like this.. ============================================================================= BUG kmalloc-64 (Not tainted): Redzone overwritten ----------------------------------------------------------------------------- INFO: 0xffff88001f4b4970-0xffff88001f4b4977. First byte 0xbb instead of 0xcc INFO: Allocated in u32_array_read+0xd1/0x110 age=0 cpu=6 pid=32767 __slab_alloc+0x516/0x5a5 __kmalloc+0x213/0x2c0 u32_array_read+0xd1/0x110 vfs_read+0xac/0x180 sys_read+0x4d/0x90 system_call_fastpath+0x1a/0x1f INFO: Freed in u32_array_read+0x99/0x110 age=0 cpu=0 pid=32749 __slab_free+0x3f/0x3bf kfree+0x2d5/0x310 u32_array_read+0x99/0x110 vfs_read+0xac/0x180 sys_read+0x4d/0x90 system_call_fastpath+0x1a/0x1f INFO: Slab 0xffffea00007d2d00 objects=41 used=14 fp=0xffff88001f4b7410 flags=0x10000000004081 INFO: Object 0xffff88001f4b4930 @offset=2352 fp=0xffff88001f4b7410 Bytes b4 ffff88001f4b4920: 1b 20 1c 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a . ......ZZZZZZZZ Object ffff88001f4b4930: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object ffff88001f4b4940: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object ffff88001f4b4950: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object ffff88001f4b4960: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk. Redzone ffff88001f4b4970: bb bb bb bb bb bb bb bb ........ Padding ffff88001f4b4ab0: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ Pid: 32756, comm: trinity-child52 Not tainted 3.6.0-rc6+ #44 Call Trace: [<ffffffff811c10ad>] ? print_section+0x3d/0x40 [<ffffffff811c23fe>] print_trailer+0xfe/0x160 [<ffffffff811c2592>] check_bytes_and_report+0xe2/0x120 [<ffffffff81023b79>] ? native_sched_clock+0x19/0x80 [<ffffffff811c2b5b>] check_object+0x18b/0x250 [<ffffffff8169b7d7>] free_debug_processing+0xc0/0x1fd [<ffffffff812d2e29>] ? u32_array_read+0x99/0x110 [<ffffffff8169ba5c>] __slab_free+0x3f/0x3bf [<ffffffff81365a1c>] ? debug_check_no_obj_freed+0x16c/0x210 [<ffffffff810db04f>] ? lock_release_holdtime.part.26+0xf/0x180 [<ffffffff812d2e29>] ? u32_array_read+0x99/0x110 [<ffffffff811c5725>] kfree+0x2d5/0x310 [<ffffffff812d2e29>] u32_array_read+0x99/0x110 [<ffffffff811df88c>] vfs_read+0xac/0x180 [<ffffffff811df9ad>] sys_read+0x4d/0x90 [<ffffffff816aea2d>] system_call_fastpath+0x1a/0x1f FIX kmalloc-64: Restoring 0xffff88001f4b4970-0xffff88001f4b4977=0xcc ============================================================================= Which looks like we read some file (probably something in sysfs/procfs) that corrupted some internal state. Any ideas on what I could do to narrow this down ? The full traces are at http://www.codemonkey.org.uk/junk/slab-corrupt.txt They vary a little later, but it looks like it's probably all the same problem to me. Sometimes it flip-flops between "First byte 0xbb instead of 0xcc" and "First byte 0xcc instead of 0xbb" The one outlier being this weird message.. Sep 18 02:00:13 bitcrush kernel: [36617.487681] hrtimer: interrupt took 242337 ns Which is weird, but probably unrelated. Dave ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 14:35 3.6rc6 slab corruption Dave Jones @ 2012-09-18 18:38 ` Linus Torvalds 2012-09-18 18:53 ` Dave Jones 2012-09-18 19:23 ` Konrad Rzeszutek Wilk 0 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2012-09-18 18:38 UTC (permalink / raw) To: Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T, Konrad Rzeszutek Wilk Quoting the entire email, since I added Greg to the list of people (as the documented maintainer of debugfs) along with what I think are the guilty parties. Dave, is trinity perhaps doing read calls on the same file in parallel? Because it looks to me like debugfs is racy for that case. At least the file->private_data accesses are, for the case of that "u32_array" case. In fact it is racy in two different ways: - the whole "file->private_data" access is racy (and this is, I think, the bug you are hitting) If you have multiple readers on the same file, the whole if (file->private_data) { kfree(file->private_data); file->private_data = NULL; } file->private_data = format_array_alloc("%u", data->array, data->elements); thing is just a disaster waiting to happen. It should be easyish to fix by just adding a lock around those things. The other bug is slightly subtler and probably harder to hit (but also harder to fix): - the whole format_array_alloc() code is one buggy piece of sh*t, since afaik there is nothing that guarantees that the values cannot change. So the notion of "let's format the output once to know how big it is, and then a second time to actually print things into the array we just allocated based on the first time" is pure and utter garbage, afaik. Anyway, this is all fairly recent, and came in through the Xen tree. See commit 9fe2a7015393 ("debugfs: Add support to print u32 array in debugfs"). Guys? Linus On Tue, Sep 18, 2012 at 7:35 AM, Dave Jones <davej@redhat.com> wrote: > I was chasing a networking bug, and had trinity reduced to just making read & setsockopt calls, > and let that run overnight. I woke up to 800mb of traces from a different bug.. > > The traces look mostly like this.. > > > ============================================================================= > BUG kmalloc-64 (Not tainted): Redzone overwritten > ----------------------------------------------------------------------------- > > INFO: 0xffff88001f4b4970-0xffff88001f4b4977. First byte 0xbb instead of 0xcc > INFO: Allocated in u32_array_read+0xd1/0x110 age=0 cpu=6 pid=32767 > __slab_alloc+0x516/0x5a5 > __kmalloc+0x213/0x2c0 > u32_array_read+0xd1/0x110 > vfs_read+0xac/0x180 > sys_read+0x4d/0x90 > system_call_fastpath+0x1a/0x1f > INFO: Freed in u32_array_read+0x99/0x110 age=0 cpu=0 pid=32749 > __slab_free+0x3f/0x3bf > kfree+0x2d5/0x310 > u32_array_read+0x99/0x110 > vfs_read+0xac/0x180 > sys_read+0x4d/0x90 > system_call_fastpath+0x1a/0x1f > INFO: Slab 0xffffea00007d2d00 objects=41 used=14 fp=0xffff88001f4b7410 flags=0x10000000004081 > INFO: Object 0xffff88001f4b4930 @offset=2352 fp=0xffff88001f4b7410 > > Bytes b4 ffff88001f4b4920: 1b 20 1c 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a . ......ZZZZZZZZ > Object ffff88001f4b4930: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > Object ffff88001f4b4940: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > Object ffff88001f4b4950: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > Object ffff88001f4b4960: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk. > Redzone ffff88001f4b4970: bb bb bb bb bb bb bb bb ........ > Padding ffff88001f4b4ab0: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ > Pid: 32756, comm: trinity-child52 Not tainted 3.6.0-rc6+ #44 > Call Trace: > [<ffffffff811c10ad>] ? print_section+0x3d/0x40 > [<ffffffff811c23fe>] print_trailer+0xfe/0x160 > [<ffffffff811c2592>] check_bytes_and_report+0xe2/0x120 > [<ffffffff81023b79>] ? native_sched_clock+0x19/0x80 > [<ffffffff811c2b5b>] check_object+0x18b/0x250 > [<ffffffff8169b7d7>] free_debug_processing+0xc0/0x1fd > [<ffffffff812d2e29>] ? u32_array_read+0x99/0x110 > [<ffffffff8169ba5c>] __slab_free+0x3f/0x3bf > [<ffffffff81365a1c>] ? debug_check_no_obj_freed+0x16c/0x210 > [<ffffffff810db04f>] ? lock_release_holdtime.part.26+0xf/0x180 > [<ffffffff812d2e29>] ? u32_array_read+0x99/0x110 > [<ffffffff811c5725>] kfree+0x2d5/0x310 > [<ffffffff812d2e29>] u32_array_read+0x99/0x110 > [<ffffffff811df88c>] vfs_read+0xac/0x180 > [<ffffffff811df9ad>] sys_read+0x4d/0x90 > [<ffffffff816aea2d>] system_call_fastpath+0x1a/0x1f > FIX kmalloc-64: Restoring 0xffff88001f4b4970-0xffff88001f4b4977=0xcc > ============================================================================= > > Which looks like we read some file (probably something in sysfs/procfs) that corrupted some internal state. > Any ideas on what I could do to narrow this down ? > > The full traces are at http://www.codemonkey.org.uk/junk/slab-corrupt.txt > They vary a little later, but it looks like it's probably all the same problem to me. > Sometimes it flip-flops between "First byte 0xbb instead of 0xcc" and "First byte 0xcc instead of 0xbb" > > > The one outlier being this weird message.. > > Sep 18 02:00:13 bitcrush kernel: [36617.487681] hrtimer: interrupt took 242337 ns > > Which is weird, but probably unrelated. > > Dave > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 18:38 ` Linus Torvalds @ 2012-09-18 18:53 ` Dave Jones 2012-09-19 14:00 ` Raghavendra K T 2012-09-18 19:23 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 33+ messages in thread From: Dave Jones @ 2012-09-18 18:53 UTC (permalink / raw) To: Linus Torvalds Cc: Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T, Konrad Rzeszutek Wilk On Tue, Sep 18, 2012 at 11:38:44AM -0700, Linus Torvalds wrote: > Quoting the entire email, since I added Greg to the list of people (as > the documented maintainer of debugfs) along with what I think are the > guilty parties. > > Dave, is trinity perhaps doing read calls on the same file in parallel? Yeah, entirely possible. It opens hundreds of fd's on startup, and then a bunch of threads do whatever operations are specified on those fds. (in this case, just read/setsockopt, but probably the read's are all that is needed to trigger this) Dave ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 18:53 ` Dave Jones @ 2012-09-19 14:00 ` Raghavendra K T 2012-09-19 17:09 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Raghavendra K T @ 2012-09-19 14:00 UTC (permalink / raw) To: Dave Jones, Linus Torvalds, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Konrad Rzeszutek Wilk, Jeremy Fitzhardinge On 09/19/2012 12:23 AM, Dave Jones wrote: > On Tue, Sep 18, 2012 at 11:38:44AM -0700, Linus Torvalds wrote: > > Quoting the entire email, since I added Greg to the list of people (as > > the documented maintainer of debugfs) along with what I think are the > > guilty parties. > > > > Dave, is trinity perhaps doing read calls on the same file in parallel? > > Yeah, entirely possible. It opens hundreds of fd's on startup, and then > a bunch of threads do whatever operations are specified on those fds. > (in this case, just read/setsockopt, but probably the read's are all that > is needed to trigger this) > Create a 350 processes reading /sys/kernel/debug/kvm/spinlocks/histo_blocked file simultaneously in while loop for more than 3 hours on my box. Could not reproduce the problem. ( was it the right thing I did to reproduce.. not sure..). Wanted to check that because if it had hit the problem I can verify fix with same stuff. But in any case will try the fix what Konrad/Linus suggested, and come back. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 14:00 ` Raghavendra K T @ 2012-09-19 17:09 ` Linus Torvalds 2012-09-19 21:27 ` David Rientjes 0 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2012-09-19 17:09 UTC (permalink / raw) To: Raghavendra K T Cc: Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Konrad Rzeszutek Wilk, Jeremy Fitzhardinge On Wed, Sep 19, 2012 at 7:00 AM, Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote: > > Create a 350 processes reading /sys/kernel/debug/kvm/spinlocks/histo_blocked > file simultaneously in while loop for more than 3 hours on my box. You need to open the file a single time, and then after that sinelg open (either threaded or with fork()) do multiple concurrent copies something like for (;;) { char buf[1024]; lseek(fd, 0, SEEK_SET); read(fd, buf, sizeof(buf)); } or similar. But it's important that they all share the same struct file. It's also likely to make it easier to trigger the race if you have a kernel with preemption enabled. And you need to have SLAB debugging enabled to actually *see* the messages. Otherwise you'll have just (possibly silent) corruption or a memory leak. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 17:09 ` Linus Torvalds @ 2012-09-19 21:27 ` David Rientjes 2012-09-19 21:41 ` Dave Jones 0 siblings, 1 reply; 33+ messages in thread From: David Rientjes @ 2012-09-19 21:27 UTC (permalink / raw) To: Linus Torvalds Cc: Raghavendra K T, Dave Jones, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, linux-kernel On Wed, 19 Sep 2012, Linus Torvalds wrote: > > Create a 350 processes reading /sys/kernel/debug/kvm/spinlocks/histo_blocked > > file simultaneously in while loop for more than 3 hours on my box. > > You need to open the file a single time, and then after that sinelg > open (either threaded or with fork()) do multiple concurrent copies > something like > > for (;;) { > char buf[1024]; > lseek(fd, 0, SEEK_SET); These are non-seekable files, so this will always fail. That makes the race much more difficult to trigger: the read needs to call u32_array_read() with both threads finding *ppos == 0 and then race between the kfree() and resetting of file->private_data pointer. [ I'm surprised that Dave was able to trigger this so often that he has 800MB of log. ] Anyway, I instrumented the kernel to open the race by sleeping after checking *ppos == 0 and immediately after the kfree() and I could reproduce the issue but with a "Object already free" error rather than a redzoning error. I assumed this was because Dave didn't have a certain slub debug option enabled or redzoning was checked before double-free, but it turns out this should always be caught first. For some reason the freed object is not being found on the partial slab's freelist. > read(fd, buf, sizeof(buf)); > } > > or similar. But it's important that they all share the same struct file. > > It's also likely to make it easier to trigger the race if you have a > kernel with preemption enabled. > > And you need to have SLAB debugging enabled to actually *see* the > messages. Otherwise you'll have just (possibly silent) corruption or a > memory leak. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 21:27 ` David Rientjes @ 2012-09-19 21:41 ` Dave Jones 0 siblings, 0 replies; 33+ messages in thread From: Dave Jones @ 2012-09-19 21:41 UTC (permalink / raw) To: David Rientjes Cc: Linus Torvalds, Raghavendra K T, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Konrad Rzeszutek Wilk, Jeremy Fitzhardinge, linux-kernel On Wed, Sep 19, 2012 at 02:27:37PM -0700, David Rientjes wrote: > On Wed, 19 Sep 2012, Linus Torvalds wrote: > > > > Create a 350 processes reading /sys/kernel/debug/kvm/spinlocks/histo_blocked > > > file simultaneously in while loop for more than 3 hours on my box. > > > > You need to open the file a single time, and then after that sinelg > > open (either threaded or with fork()) do multiple concurrent copies > > something like > > > > for (;;) { > > char buf[1024]; > > lseek(fd, 0, SEEK_SET); > > These are non-seekable files, so this will always fail. That makes the > race much more difficult to trigger: the read needs to call > u32_array_read() with both threads finding *ppos == 0 and then race > between the kfree() and resetting of file->private_data pointer. > > [ I'm surprised that Dave was able to trigger this so often that he has > 800MB of log. ] Brainfart. 800Kb. Still, lots. Dave ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 18:38 ` Linus Torvalds 2012-09-18 18:53 ` Dave Jones @ 2012-09-18 19:23 ` Konrad Rzeszutek Wilk 2012-09-18 20:24 ` David Rientjes 2012-09-18 20:35 ` Linus Torvalds 1 sibling, 2 replies; 33+ messages in thread From: Konrad Rzeszutek Wilk @ 2012-09-18 19:23 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Tue, Sep 18, 2012 at 11:38:44AM -0700, Linus Torvalds wrote: > Quoting the entire email, since I added Greg to the list of people (as > the documented maintainer of debugfs) along with what I think are the > guilty parties. > > Dave, is trinity perhaps doing read calls on the same file in parallel? > > Because it looks to me like debugfs is racy for that case. At least > the file->private_data accesses are, for the case of that "u32_array" > case. > > In fact it is racy in two different ways: > > - the whole "file->private_data" access is racy (and this is, I > think, the bug you are hitting) > > If you have multiple readers on the same file, the whole > > if (file->private_data) { > kfree(file->private_data); > file->private_data = NULL; > } > > file->private_data = format_array_alloc("%u", data->array, > data->elements); > > thing is just a disaster waiting to happen. > > It should be easyish to fix by just adding a lock around those things. Like this: diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c index 2340f69..309b235 100644 --- a/fs/debugfs/file.c +++ b/fs/debugfs/file.c @@ -524,6 +524,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_blob); struct array_data { void *array; u32 elements; + struct mutex lock; }; static int u32_array_open(struct inode *inode, struct file *file) @@ -580,6 +581,7 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, struct array_data *data = inode->i_private; size_t size; + mutex_lock(&data->lock); if (*ppos == 0) { if (file->private_data) { kfree(file->private_data); @@ -594,6 +596,7 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, if (file->private_data) size = strlen(file->private_data); + mutex_unlock(&data->lock); return simple_read_from_buffer(buf, len, ppos, file->private_data, size); } @@ -643,6 +646,7 @@ struct dentry *debugfs_create_u32_array(const char *name, umode_t mode, data->array = array; data->elements = elements; + mutex_init(&data->lock); return debugfs_create_file(name, mode, parent, data, &u32_array_fops); } > > The other bug is slightly subtler and probably harder to hit (but also > harder to fix): > > - the whole format_array_alloc() code is one buggy piece of sh*t, > since afaik there is nothing that guarantees that the values cannot > change. So the notion of "let's format the output once to know how big > it is, and then a second time to actually print things into the array > we just allocated based on the first time" is pure and utter garbage, > afaik. Yikes. The fix could be to allocate a buffer large enough for the maximum that %u could take * array_size and not bother with the first pass. > > Anyway, this is all fairly recent, and came in through the Xen tree. > See commit 9fe2a7015393 ("debugfs: Add support to print u32 array in > debugfs"). > > Guys? Let me include Jeremy on this. He was the original author of it and perhaps he has some patch stashed away were he re-implemented it.. > > Linus > > On Tue, Sep 18, 2012 at 7:35 AM, Dave Jones <davej@redhat.com> wrote: > > I was chasing a networking bug, and had trinity reduced to just making read & setsockopt calls, > > and let that run overnight. I woke up to 800mb of traces from a different bug.. > > > > The traces look mostly like this.. > > > > > > ============================================================================= > > BUG kmalloc-64 (Not tainted): Redzone overwritten > > ----------------------------------------------------------------------------- > > > > INFO: 0xffff88001f4b4970-0xffff88001f4b4977. First byte 0xbb instead of 0xcc > > INFO: Allocated in u32_array_read+0xd1/0x110 age=0 cpu=6 pid=32767 > > __slab_alloc+0x516/0x5a5 > > __kmalloc+0x213/0x2c0 > > u32_array_read+0xd1/0x110 > > vfs_read+0xac/0x180 > > sys_read+0x4d/0x90 > > system_call_fastpath+0x1a/0x1f > > INFO: Freed in u32_array_read+0x99/0x110 age=0 cpu=0 pid=32749 > > __slab_free+0x3f/0x3bf > > kfree+0x2d5/0x310 > > u32_array_read+0x99/0x110 > > vfs_read+0xac/0x180 > > sys_read+0x4d/0x90 > > system_call_fastpath+0x1a/0x1f > > INFO: Slab 0xffffea00007d2d00 objects=41 used=14 fp=0xffff88001f4b7410 flags=0x10000000004081 > > INFO: Object 0xffff88001f4b4930 @offset=2352 fp=0xffff88001f4b7410 > > > > Bytes b4 ffff88001f4b4920: 1b 20 1c 00 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a . ......ZZZZZZZZ > > Object ffff88001f4b4930: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > > Object ffff88001f4b4940: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > > Object ffff88001f4b4950: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk > > Object ffff88001f4b4960: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkkkkkkkkkk. > > Redzone ffff88001f4b4970: bb bb bb bb bb bb bb bb ........ > > Padding ffff88001f4b4ab0: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ > > Pid: 32756, comm: trinity-child52 Not tainted 3.6.0-rc6+ #44 > > Call Trace: > > [<ffffffff811c10ad>] ? print_section+0x3d/0x40 > > [<ffffffff811c23fe>] print_trailer+0xfe/0x160 > > [<ffffffff811c2592>] check_bytes_and_report+0xe2/0x120 > > [<ffffffff81023b79>] ? native_sched_clock+0x19/0x80 > > [<ffffffff811c2b5b>] check_object+0x18b/0x250 > > [<ffffffff8169b7d7>] free_debug_processing+0xc0/0x1fd > > [<ffffffff812d2e29>] ? u32_array_read+0x99/0x110 > > [<ffffffff8169ba5c>] __slab_free+0x3f/0x3bf > > [<ffffffff81365a1c>] ? debug_check_no_obj_freed+0x16c/0x210 > > [<ffffffff810db04f>] ? lock_release_holdtime.part.26+0xf/0x180 > > [<ffffffff812d2e29>] ? u32_array_read+0x99/0x110 > > [<ffffffff811c5725>] kfree+0x2d5/0x310 > > [<ffffffff812d2e29>] u32_array_read+0x99/0x110 > > [<ffffffff811df88c>] vfs_read+0xac/0x180 > > [<ffffffff811df9ad>] sys_read+0x4d/0x90 > > [<ffffffff816aea2d>] system_call_fastpath+0x1a/0x1f > > FIX kmalloc-64: Restoring 0xffff88001f4b4970-0xffff88001f4b4977=0xcc > > ============================================================================= > > > > Which looks like we read some file (probably something in sysfs/procfs) that corrupted some internal state. > > Any ideas on what I could do to narrow this down ? > > > > The full traces are at http://www.codemonkey.org.uk/junk/slab-corrupt.txt > > They vary a little later, but it looks like it's probably all the same problem to me. > > Sometimes it flip-flops between "First byte 0xbb instead of 0xcc" and "First byte 0xcc instead of 0xbb" > > > > > > The one outlier being this weird message.. > > > > Sep 18 02:00:13 bitcrush kernel: [36617.487681] hrtimer: interrupt took 242337 ns > > > > Which is weird, but probably unrelated. > > > > Dave > > ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 19:23 ` Konrad Rzeszutek Wilk @ 2012-09-18 20:24 ` David Rientjes 2012-09-18 20:31 ` Konrad Rzeszutek Wilk ` (2 more replies) 2012-09-18 20:35 ` Linus Torvalds 1 sibling, 3 replies; 33+ messages in thread From: David Rientjes @ 2012-09-18 20:24 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Linus Torvalds, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Tue, 18 Sep 2012, Konrad Rzeszutek Wilk wrote: > diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c > index 2340f69..309b235 100644 > --- a/fs/debugfs/file.c > +++ b/fs/debugfs/file.c > @@ -524,6 +524,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_blob); > struct array_data { > void *array; > u32 elements; > + struct mutex lock; This should be a spinlock. > }; > > static int u32_array_open(struct inode *inode, struct file *file) > @@ -580,6 +581,7 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, > struct array_data *data = inode->i_private; > size_t size; > > + mutex_lock(&data->lock); > if (*ppos == 0) { > if (file->private_data) { > kfree(file->private_data); > @@ -594,6 +596,7 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, > if (file->private_data) > size = strlen(file->private_data); > > + mutex_unlock(&data->lock); > return simple_read_from_buffer(buf, len, ppos, > file->private_data, size); > } Your critical section isn't entirely covered since you're still accessing file->private_data in the call to simple_read_from_buffer(). What happens if a concurrent reader does file->private_data = NULL immediately after your unlock? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 20:24 ` David Rientjes @ 2012-09-18 20:31 ` Konrad Rzeszutek Wilk 2012-09-18 20:36 ` Linus Torvalds 2012-09-19 0:57 ` Raghavendra K T 2 siblings, 0 replies; 33+ messages in thread From: Konrad Rzeszutek Wilk @ 2012-09-18 20:31 UTC (permalink / raw) To: David Rientjes Cc: Linus Torvalds, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Tue, Sep 18, 2012 at 01:24:15PM -0700, David Rientjes wrote: > On Tue, 18 Sep 2012, Konrad Rzeszutek Wilk wrote: > > > diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c > > index 2340f69..309b235 100644 > > --- a/fs/debugfs/file.c > > +++ b/fs/debugfs/file.c > > @@ -524,6 +524,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_blob); > > struct array_data { > > void *array; > > u32 elements; > > + struct mutex lock; > > This should be a spinlock. I would get warnings about using a spinlock.. .. snip.. > > + mutex_unlock(&data->lock); > > return simple_read_from_buffer(buf, len, ppos, > > file->private_data, size); > > } > > Your critical section isn't entirely covered since you're still accessing > file->private_data in the call to simple_read_from_buffer(). What happens > if a concurrent reader does file->private_data = NULL immediately after > your unlock? Duh! diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c index 2340f69..c6d9088 100644 --- a/fs/debugfs/file.c +++ b/fs/debugfs/file.c @@ -524,6 +524,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_blob); struct array_data { void *array; u32 elements; + struct mutex lock; }; static int u32_array_open(struct inode *inode, struct file *file) @@ -580,6 +581,7 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, struct array_data *data = inode->i_private; size_t size; + mutex_lock(&data->lock); if (*ppos == 0) { if (file->private_data) { kfree(file->private_data); @@ -594,8 +596,10 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, if (file->private_data) size = strlen(file->private_data); - return simple_read_from_buffer(buf, len, ppos, + size = simple_read_from_buffer(buf, len, ppos, file->private_data, size); + mutex_unlock(&data->lock); + return size; } static int u32_array_release(struct inode *inode, struct file *file) @@ -643,6 +647,7 @@ struct dentry *debugfs_create_u32_array(const char *name, umode_t mode, data->array = array; data->elements = elements; + mutex_init(&data->lock); return debugfs_create_file(name, mode, parent, data, &u32_array_fops); } ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 20:24 ` David Rientjes 2012-09-18 20:31 ` Konrad Rzeszutek Wilk @ 2012-09-18 20:36 ` Linus Torvalds 2012-09-19 0:57 ` Raghavendra K T 2 siblings, 0 replies; 33+ messages in thread From: Linus Torvalds @ 2012-09-18 20:36 UTC (permalink / raw) To: David Rientjes Cc: Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Tue, Sep 18, 2012 at 1:24 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 18 Sep 2012, Konrad Rzeszutek Wilk wrote: > >> u32 elements; >> + struct mutex lock; > > This should be a spinlock. I don't think it can be a spinlock since it needs to protect the user-space accesses (which Konrad admittedly didn't do, but that's just a bug) Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 20:24 ` David Rientjes 2012-09-18 20:31 ` Konrad Rzeszutek Wilk 2012-09-18 20:36 ` Linus Torvalds @ 2012-09-19 0:57 ` Raghavendra K T 2 siblings, 0 replies; 33+ messages in thread From: Raghavendra K T @ 2012-09-19 0:57 UTC (permalink / raw) To: David Rientjes Cc: Konrad Rzeszutek Wilk, Linus Torvalds, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Jeremy Fitzhardinge On 09/19/2012 01:54 AM, David Rientjes wrote: > On Tue, 18 Sep 2012, Konrad Rzeszutek Wilk wrote: > >> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c >> index 2340f69..309b235 100644 >> --- a/fs/debugfs/file.c >> +++ b/fs/debugfs/file.c >> @@ -524,6 +524,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_blob); >> struct array_data { >> void *array; >> u32 elements; >> + struct mutex lock; > > This should be a spinlock. I remember we used debugfs because traceprintks used spinlock. The code was being accessed from paravirt spinlock. Sorry for joining late (Time Zone difference) CCing Jeremy > >> }; >> >> static int u32_array_open(struct inode *inode, struct file *file) >> @@ -580,6 +581,7 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, >> struct array_data *data = inode->i_private; >> size_t size; >> >> + mutex_lock(&data->lock); >> if (*ppos == 0) { >> if (file->private_data) { >> kfree(file->private_data); >> @@ -594,6 +596,7 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, >> if (file->private_data) >> size = strlen(file->private_data); >> >> + mutex_unlock(&data->lock); >> return simple_read_from_buffer(buf, len, ppos, >> file->private_data, size); >> } > > Your critical section isn't entirely covered since you're still accessing > file->private_data in the call to simple_read_from_buffer(). What happens > if a concurrent reader does file->private_data = NULL immediately after > your unlock? > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 19:23 ` Konrad Rzeszutek Wilk 2012-09-18 20:24 ` David Rientjes @ 2012-09-18 20:35 ` Linus Torvalds 2012-09-18 20:37 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2012-09-18 20:35 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Tue, Sep 18, 2012 at 12:23 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: >> >> It should be easyish to fix by just adding a lock around those things. > > Like this: Not quite. I suspect you need to protect the "read_from_buffer()" call too, since otherwise the buffer can be free'd by another thread while being copied to user space.. >> - the whole format_array_alloc() code is one buggy piece of sh*t, >> since afaik there is nothing that guarantees that the values cannot >> change. So the notion of "let's format the output once to know how big >> it is, and then a second time to actually print things into the array >> we just allocated based on the first time" is pure and utter garbage, >> afaik. > > Yikes. The fix could be to allocate a buffer large enough for the maximum > that %u could take * array_size and not bother with the first pass. That's the simple approach. An alternative approach would be to start with a reasonable guess as to the size, and use krealloc() if needed to expand. What are the realistic array sizes? If we're talking about arrays of just 16 words or something, just do the maximum (what, 12 bytes per 32-bit word? 11 bytes for the number, and then space/NUL after that). If we're talking about potentially much bigger arrays, maybe we need to use krealloc() to handle it without wasting tons of memory for the common case.. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 20:35 ` Linus Torvalds @ 2012-09-18 20:37 ` Konrad Rzeszutek Wilk 2012-09-18 20:49 ` Linus Torvalds 0 siblings, 1 reply; 33+ messages in thread From: Konrad Rzeszutek Wilk @ 2012-09-18 20:37 UTC (permalink / raw) To: Linus Torvalds Cc: Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T > >> It should be easyish to fix by just adding a lock around those things. > > > > Like this: > > Not quite. .. new patch in another email... .. snip.. > > Yikes. The fix could be to allocate a buffer large enough for the maximum > > that %u could take * array_size and not bother with the first pass. > > That's the simple approach. An alternative approach would be to start > with a reasonable guess as to the size, and use krealloc() if needed > to expand. > > What are the realistic array sizes? If we're talking about arrays of > just 16 words or something, just do the maximum (what, 12 bytes per > 32-bit word? 11 bytes for the number, and then space/NUL after that). 30 words. ~360 + 29 spaces + NULL = 390? > > If we're talking about potentially much bigger arrays, maybe we need > to use krealloc() to handle it without wasting tons of memory for the > common case.. > > Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 20:37 ` Konrad Rzeszutek Wilk @ 2012-09-18 20:49 ` Linus Torvalds 2012-09-19 1:16 ` Raghavendra K T 2012-09-19 19:16 ` Konrad Rzeszutek Wilk 0 siblings, 2 replies; 33+ messages in thread From: Linus Torvalds @ 2012-09-18 20:49 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Tue, Sep 18, 2012 at 1:37 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > > 30 words. ~360 + 29 spaces + NULL = 390? Just allocate the max then. That's tiny. And it's actually just 330: max ten characters for an unsigned 32-bit number. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 20:49 ` Linus Torvalds @ 2012-09-19 1:16 ` Raghavendra K T 2012-09-19 19:16 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 33+ messages in thread From: Raghavendra K T @ 2012-09-19 1:16 UTC (permalink / raw) To: Linus Torvalds Cc: Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Jeremy Fitzhardinge On 09/19/2012 02:19 AM, Linus Torvalds wrote: > On Tue, Sep 18, 2012 at 1:37 PM, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: >> >> 30 words. ~360 + 29 spaces + NULL = 390? > > Just allocate the max then. That's tiny. > > And it's actually just 330: max ten characters for an unsigned 32-bit number. > Okay, I think you meant 32*11 = 352 chars. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-18 20:49 ` Linus Torvalds 2012-09-19 1:16 ` Raghavendra K T @ 2012-09-19 19:16 ` Konrad Rzeszutek Wilk 2012-09-19 21:29 ` David Rientjes 2012-09-20 12:55 ` Raghavendra K T 1 sibling, 2 replies; 33+ messages in thread From: Konrad Rzeszutek Wilk @ 2012-09-19 19:16 UTC (permalink / raw) To: Linus Torvalds Cc: Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Tue, Sep 18, 2012 at 01:49:52PM -0700, Linus Torvalds wrote: > On Tue, Sep 18, 2012 at 1:37 PM, Konrad Rzeszutek Wilk > <konrad.wilk@oracle.com> wrote: > > > > 30 words. ~360 + 29 spaces + NULL = 390? > > Just allocate the max then. That's tiny. > > And it's actually just 330: max ten characters for an unsigned 32-bit number. Linus, Could you take a look at these two patches to see if I missed anything? Thank you. >From 0806b133b5b28081adf23d0d04a99636ed3b861b Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Date: Wed, 19 Sep 2012 11:23:01 -0400 Subject: [PATCH 1/2] debugfs: Add lock for u32_array_read Dave Jones spotted that the u32_array_read was doing something funny: ============================================================================= BUG kmalloc-64 (Not tainted): Redzone overwritten ----------------------------------------------------------------------------- INFO: 0xffff88001f4b4970-0xffff88001f4b4977. First byte 0xbb instead of 0xcc INFO: Allocated in u32_array_read+0xd1/0x110 age=0 cpu=6 pid=32767 __slab_alloc+0x516/0x5a5 __kmalloc+0x213/0x2c0 u32_array_read+0xd1/0x110 .. snip.. INFO: Freed in u32_array_read+0x99/0x110 age=0 cpu=0 pid=32749 __slab_free+0x3f/0x3bf kfree+0x2d5/0x310 u32_array_read+0x99/0x110 Linus tracked it down and found out that "debugfs is racy for that case [read calls in parallel on the debugfs]. At least the file->private_data accesses are, for the case of that "u32_array" case. In fact it is racy in ... the whole "file->private_data" access .. If you have multiple readers on the same file, the whole if (file->private_data) { kfree(file->private_data); file->private_data = NULL; } file->private_data = format_array_alloc("%u", data->array, data->elements); thing is just a disaster waiting to happen." He suggested putting a lock which this patch does. The consequence of this is that it will trigger more spinlock usage, as this particular debugfs is used to provide a histogram of spinlock contention. But memory corruption is a worst offender then that. Reported-by: Dave Jones <davej@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> --- fs/debugfs/file.c | 7 ++++++- 1 files changed, 6 insertions(+), 1 deletions(-) diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c index 2340f69..c6d9088 100644 --- a/fs/debugfs/file.c +++ b/fs/debugfs/file.c @@ -524,6 +524,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_blob); struct array_data { void *array; u32 elements; + struct mutex lock; }; static int u32_array_open(struct inode *inode, struct file *file) @@ -580,6 +581,7 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, struct array_data *data = inode->i_private; size_t size; + mutex_lock(&data->lock); if (*ppos == 0) { if (file->private_data) { kfree(file->private_data); @@ -594,8 +596,10 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, if (file->private_data) size = strlen(file->private_data); - return simple_read_from_buffer(buf, len, ppos, + size = simple_read_from_buffer(buf, len, ppos, file->private_data, size); + mutex_unlock(&data->lock); + return size; } static int u32_array_release(struct inode *inode, struct file *file) @@ -643,6 +647,7 @@ struct dentry *debugfs_create_u32_array(const char *name, umode_t mode, data->array = array; data->elements = elements; + mutex_init(&data->lock); return debugfs_create_file(name, mode, parent, data, &u32_array_fops); } -- 1.7.7.6 >From c3937cb7144a6ead80e6aabee89420645945a926 Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Date: Wed, 19 Sep 2012 14:04:22 -0400 Subject: [PATCH 2/2] debugfs: Pre-allocate u32 array to thwart a race. Linus spotted via code inspection "bug is slightly subtler and probably harder to hit (but also harder to fix): The whole format_array_alloc() code is one buggy piece of sh*t, since afaik there is nothing that guarantees that the values cannot change. So the notion of "let's format the output once to know how big it is, and then a second time to actually print things into the array we just allocated based on the first time" is pure and utter garbage, afaik." This patch fixes this by pre-allocating the buffer to the maximum size during debugfs initialization by the driver. We print %u values, so the math is pretty straightforward: 10 bytes for the maximum that %u can use (4294967295) + spaces for the number of elements and \n\0. We also add an extra byte to compensate for the data->len == size check which we would hit if all of the array entries were of their maximum size (-1U). If we end up using exactly up to the size we allocated (this includes the extra byte), then we allocate a new buffer twice the size. And if we fail again, we print a warning as snprintf is doing something silly. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> --- fs/debugfs/file.c | 66 ++++++++++++++++++++++------------------------------ 1 files changed, 28 insertions(+), 38 deletions(-) diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c index c6d9088..4c59200 100644 --- a/fs/debugfs/file.c +++ b/fs/debugfs/file.c @@ -524,12 +524,13 @@ EXPORT_SYMBOL_GPL(debugfs_create_blob); struct array_data { void *array; u32 elements; + ssize_t len; + char *buf; struct mutex lock; }; static int u32_array_open(struct inode *inode, struct file *file) { - file->private_data = NULL; return nonseekable_open(inode, file); } @@ -560,20 +561,6 @@ static size_t format_array(char *buf, size_t bufsize, const char *fmt, return ret; } -static char *format_array_alloc(const char *fmt, u32 *array, - u32 array_size) -{ - size_t len = format_array(NULL, 0, fmt, array, array_size); - char *ret; - - ret = kmalloc(len, GFP_KERNEL); - if (ret == NULL) - return NULL; - - format_array(ret, len, fmt, array, array_size); - return ret; -} - static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, loff_t *ppos) { @@ -582,37 +569,33 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, size_t size; mutex_lock(&data->lock); - if (*ppos == 0) { - if (file->private_data) { - kfree(file->private_data); - file->private_data = NULL; - } - - file->private_data = format_array_alloc("%u", data->array, - data->elements); + if (*ppos == 0) + size = format_array(data->buf, data->len - 1 /* for \0 */, + "%u", data->array, data->elements); + else + size = strlen(data->buf); + + if (size == data->len) { + char *p = krealloc(data->buf, data->len * 2, GFP_KERNEL); + if (ZERO_OR_NULL_PTR(p)) + goto out; + data->buf = p; + data->len *= 2; + size = format_array(data->buf, data->len - 1 /* for \0 */, + "%u", data->array, data->elements); + /* It keeps on growing! Early pre-allocation MUST be wrong. */ + WARN_ON(size == data->len); } - - size = 0; - if (file->private_data) - size = strlen(file->private_data); - +out: size = simple_read_from_buffer(buf, len, ppos, - file->private_data, size); + data->buf, size); mutex_unlock(&data->lock); return size; } -static int u32_array_release(struct inode *inode, struct file *file) -{ - kfree(file->private_data); - - return 0; -} - static const struct file_operations u32_array_fops = { .owner = THIS_MODULE, .open = u32_array_open, - .release = u32_array_release, .read = u32_array_read, .llseek = no_llseek, }; @@ -648,7 +631,14 @@ struct dentry *debugfs_create_u32_array(const char *name, umode_t mode, data->array = array; data->elements = elements; mutex_init(&data->lock); - + data->len = elements * 10 /* max value for %u */ + + elements - 1 /* spaces */ + 2 /* \n\0 */ + + 1 /* to thwart the size == data->len check. */; + data->buf = kmalloc(data->len, GFP_KERNEL); + if (!data->buf) { + kfree(data); + return ERR_PTR(-ENOMEM); + } return debugfs_create_file(name, mode, parent, data, &u32_array_fops); } EXPORT_SYMBOL_GPL(debugfs_create_u32_array); -- 1.7.7.6 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 19:16 ` Konrad Rzeszutek Wilk @ 2012-09-19 21:29 ` David Rientjes 2012-09-19 21:49 ` David Rientjes 2012-09-20 12:55 ` Raghavendra K T 1 sibling, 1 reply; 33+ messages in thread From: David Rientjes @ 2012-09-19 21:29 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Linus Torvalds, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Wed, 19 Sep 2012, Konrad Rzeszutek Wilk wrote: > From 0806b133b5b28081adf23d0d04a99636ed3b861b Mon Sep 17 00:00:00 2001 > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > Date: Wed, 19 Sep 2012 11:23:01 -0400 > Subject: [PATCH 1/2] debugfs: Add lock for u32_array_read > > Dave Jones spotted that the u32_array_read was doing something funny: > > ============================================================================= > BUG kmalloc-64 (Not tainted): Redzone overwritten > ----------------------------------------------------------------------------- > > INFO: 0xffff88001f4b4970-0xffff88001f4b4977. First byte 0xbb instead of 0xcc > INFO: Allocated in u32_array_read+0xd1/0x110 age=0 cpu=6 pid=32767 > __slab_alloc+0x516/0x5a5 > __kmalloc+0x213/0x2c0 > u32_array_read+0xd1/0x110 > .. snip.. > INFO: Freed in u32_array_read+0x99/0x110 age=0 cpu=0 pid=32749 > __slab_free+0x3f/0x3bf > kfree+0x2d5/0x310 > u32_array_read+0x99/0x110 > > Linus tracked it down and found out that "debugfs is racy for that case > [read calls in parallel on the debugfs]. At least the file->private_data > accesses are, for the case of that "u32_array" case. > > In fact it is racy in ... the whole "file->private_data" access .. > If you have multiple readers on the same file, the whole > > if (file->private_data) { > kfree(file->private_data); > file->private_data = NULL; > } > > file->private_data = format_array_alloc("%u", data->array, > data->elements); > > thing is just a disaster waiting to happen." He suggested > putting a lock which this patch does. > Since these are non-seekable files, it must also race to find *ppos == 0. > The consequence of this is that it will trigger more spinlock usage, > as this particular debugfs is used to provide a histogram of spinlock > contention. But memory corruption is a worst offender then that. > > Reported-by: Dave Jones <davej@redhat.com> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: David Rientjes <rientjes@google.com> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 21:29 ` David Rientjes @ 2012-09-19 21:49 ` David Rientjes 2012-09-19 23:01 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: David Rientjes @ 2012-09-19 21:49 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Linus Torvalds, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Wed, 19 Sep 2012, David Rientjes wrote: > > From 0806b133b5b28081adf23d0d04a99636ed3b861b Mon Sep 17 00:00:00 2001 > > From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > Date: Wed, 19 Sep 2012 11:23:01 -0400 > > Subject: [PATCH 1/2] debugfs: Add lock for u32_array_read > > > > Dave Jones spotted that the u32_array_read was doing something funny: > > > > ============================================================================= > > BUG kmalloc-64 (Not tainted): Redzone overwritten > > ----------------------------------------------------------------------------- > > > > INFO: 0xffff88001f4b4970-0xffff88001f4b4977. First byte 0xbb instead of 0xcc > > INFO: Allocated in u32_array_read+0xd1/0x110 age=0 cpu=6 pid=32767 > > __slab_alloc+0x516/0x5a5 > > __kmalloc+0x213/0x2c0 > > u32_array_read+0xd1/0x110 > > .. snip.. > > INFO: Freed in u32_array_read+0x99/0x110 age=0 cpu=0 pid=32749 > > __slab_free+0x3f/0x3bf > > kfree+0x2d5/0x310 > > u32_array_read+0x99/0x110 > > > > Linus tracked it down and found out that "debugfs is racy for that case > > [read calls in parallel on the debugfs]. At least the file->private_data > > accesses are, for the case of that "u32_array" case. > > > > In fact it is racy in ... the whole "file->private_data" access .. > > If you have multiple readers on the same file, the whole > > > > if (file->private_data) { > > kfree(file->private_data); > > file->private_data = NULL; > > } > > > > file->private_data = format_array_alloc("%u", data->array, > > data->elements); > > > > thing is just a disaster waiting to happen." He suggested > > putting a lock which this patch does. > > > > Since these are non-seekable files, it must also race to find *ppos == 0. > > > The consequence of this is that it will trigger more spinlock usage, > > as this particular debugfs is used to provide a histogram of spinlock > > contention. But memory corruption is a worst offender then that. > > > > Reported-by: Dave Jones <davej@redhat.com> > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > > Tested-by: David Rientjes <rientjes@google.com> > An alternative to this, though, might be to never test for *ppos == 0 in u32_array_read() and do the format_array_alloc() in u32_array_open() to initialize file->private_data. If that allocation fails, just return -ENOMEM. Then you never need to add a mutex in the read path. Any reason we can't do this? --- fs/debugfs/file.c | 33 +++++++++++---------------------- 1 file changed, 11 insertions(+), 22 deletions(-) diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c --- a/fs/debugfs/file.c +++ b/fs/debugfs/file.c @@ -526,12 +526,6 @@ struct array_data { u32 elements; }; -static int u32_array_open(struct inode *inode, struct file *file) -{ - file->private_data = NULL; - return nonseekable_open(inode, file); -} - static size_t format_array(char *buf, size_t bufsize, const char *fmt, u32 *array, u32 array_size) { @@ -573,26 +567,21 @@ static char *format_array_alloc(const char *fmt, u32 *array, return ret; } -static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, - loff_t *ppos) +static int u32_array_open(struct inode *inode, struct file *file) { - struct inode *inode = file->f_path.dentry->d_inode; struct array_data *data = inode->i_private; - size_t size; - if (*ppos == 0) { - if (file->private_data) { - kfree(file->private_data); - file->private_data = NULL; - } - - file->private_data = format_array_alloc("%u", data->array, - data->elements); - } + file->private_data = format_array_alloc("%u", data->array, + data->elements); + if (!file->private_data) + return -ENOMEM; + return nonseekable_open(inode, file); +} - size = 0; - if (file->private_data) - size = strlen(file->private_data); +static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, + loff_t *ppos) +{ + size_t size = strlen(file->private_data); return simple_read_from_buffer(buf, len, ppos, file->private_data, size); ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 21:49 ` David Rientjes @ 2012-09-19 23:01 ` Linus Torvalds 2012-09-19 23:20 ` David Rientjes 2012-09-20 2:29 ` Raghavendra K T 2012-09-20 12:57 ` 3.6rc6 slab corruption Raghavendra K T 2 siblings, 1 reply; 33+ messages in thread From: Linus Torvalds @ 2012-09-19 23:01 UTC (permalink / raw) To: David Rientjes Cc: Konrad Rzeszutek Wilk, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Wed, Sep 19, 2012 at 2:49 PM, David Rientjes <rientjes@google.com> wrote: > > An alternative to this, though, might be to never test for *ppos == 0 in > u32_array_read() and do the format_array_alloc() in u32_array_open() to > initialize file->private_data. If that allocation fails, just return > -ENOMEM. Then you never need to add a mutex in the read path. > > Any reason we can't do this? That does look simpler, and avoiding the lock is a good idea. Since we don't support lseek() (or pread/pwrite) on that thing anyway, there's no way to keep the fd open and just re-use it to read the data over and over, so populating it at open time sounds like a good solution with no real downsides. Linus ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 23:01 ` Linus Torvalds @ 2012-09-19 23:20 ` David Rientjes 2012-09-20 21:14 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 33+ messages in thread From: David Rientjes @ 2012-09-19 23:20 UTC (permalink / raw) To: Linus Torvalds Cc: Konrad Rzeszutek Wilk, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Wed, 19 Sep 2012, Linus Torvalds wrote: > That does look simpler, and avoiding the lock is a good idea. Since we > don't support lseek() (or pread/pwrite) on that thing anyway, there's > no way to keep the fd open and just re-use it to read the data over > and over, so populating it at open time sounds like a good solution > with no real downsides. > Yeah, my patch is functionally the same as what we currently have with the only exception being that it isn't racy. I'm wondering if that's what we really want, though, since the data read from the file will remain persistent as long as it is opened. That obviously happens in my patch because we allocate and copy the buffer at open(), but also happens implicitly with the old code precisely because it's a non-seekable file and *ppos == 0 only once (when not racy). So if the API for these xen files is to remain persistent after open() as it currently does, then my patch solves the issue. However, if the API wants to allow to only open() once and then read the spinlock_stats data continuously, then we'll need the mutex: allocate the file->private_data buffer once at open() for the maximum allowable size and then copy to the buffer from xen's spinlock_stats under the protection of the mutex to read(). Konrad? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 23:20 ` David Rientjes @ 2012-09-20 21:14 ` Konrad Rzeszutek Wilk 0 siblings, 0 replies; 33+ messages in thread From: Konrad Rzeszutek Wilk @ 2012-09-20 21:14 UTC (permalink / raw) To: David Rientjes Cc: Linus Torvalds, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T On Wed, Sep 19, 2012 at 04:20:25PM -0700, David Rientjes wrote: > On Wed, 19 Sep 2012, Linus Torvalds wrote: > > > That does look simpler, and avoiding the lock is a good idea. Since we > > don't support lseek() (or pread/pwrite) on that thing anyway, there's > > no way to keep the fd open and just re-use it to read the data over > > and over, so populating it at open time sounds like a good solution > > with no real downsides. > > > > Yeah, my patch is functionally the same as what we currently have with the > only exception being that it isn't racy. I'm wondering if that's what we > really want, though, since the data read from the file will remain > persistent as long as it is opened. That obviously happens in my patch > because we allocate and copy the buffer at open(), but also happens > implicitly with the old code precisely because it's a non-seekable file > and *ppos == 0 only once (when not racy). > > So if the API for these xen files is to remain persistent after open() as > it currently does, then my patch solves the issue. However, if the API Nah. It was initially a debug option to see how contended the spinlocks are. Nobody but developers should look at it - and they can deal with open/close cycle. Thought we should probably provide a nice little comment in the file mentioning the reason for stale data. > wants to allow to only open() once and then read the spinlock_stats data > continuously, then we'll need the mutex: allocate the file->private_data > buffer once at open() for the maximum allowable size and then copy to the > buffer from xen's spinlock_stats under the protection of the mutex to > read(). > > Konrad? Your patch is way simpler and it does the job better than mine. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 21:49 ` David Rientjes 2012-09-19 23:01 ` Linus Torvalds @ 2012-09-20 2:29 ` Raghavendra K T 2012-09-20 2:46 ` David Rientjes 2012-09-20 12:57 ` 3.6rc6 slab corruption Raghavendra K T 2 siblings, 1 reply; 33+ messages in thread From: Raghavendra K T @ 2012-09-20 2:29 UTC (permalink / raw) To: David Rientjes Cc: Konrad Rzeszutek Wilk, Linus Torvalds, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose On 09/20/2012 03:19 AM, David Rientjes wrote: > On Wed, 19 Sep 2012, David Rientjes wrote: > >>> From 0806b133b5b28081adf23d0d04a99636ed3b861b Mon Sep 17 00:00:00 2001 >>> From: Konrad Rzeszutek Wilk<konrad.wilk@oracle.com> >>> Date: Wed, 19 Sep 2012 11:23:01 -0400 >>> Subject: [PATCH 1/2] debugfs: Add lock for u32_array_read >>> >>> Dave Jones spotted that the u32_array_read was doing something funny: >>> >>> ============================================================================= >>> BUG kmalloc-64 (Not tainted): Redzone overwritten >>> ----------------------------------------------------------------------------- >>> >>> INFO: 0xffff88001f4b4970-0xffff88001f4b4977. First byte 0xbb instead of 0xcc >>> INFO: Allocated in u32_array_read+0xd1/0x110 age=0 cpu=6 pid=32767 >>> __slab_alloc+0x516/0x5a5 >>> __kmalloc+0x213/0x2c0 >>> u32_array_read+0xd1/0x110 >>> .. snip.. >>> INFO: Freed in u32_array_read+0x99/0x110 age=0 cpu=0 pid=32749 >>> __slab_free+0x3f/0x3bf >>> kfree+0x2d5/0x310 >>> u32_array_read+0x99/0x110 >>> >>> Linus tracked it down and found out that "debugfs is racy for that case >>> [read calls in parallel on the debugfs]. At least the file->private_data >>> accesses are, for the case of that "u32_array" case. >>> >>> In fact it is racy in ... the whole "file->private_data" access .. >>> If you have multiple readers on the same file, the whole >>> >>> if (file->private_data) { >>> kfree(file->private_data); >>> file->private_data = NULL; >>> } >>> >>> file->private_data = format_array_alloc("%u", data->array, >>> data->elements); >>> >>> thing is just a disaster waiting to happen." He suggested >>> putting a lock which this patch does. >>> >> >> Since these are non-seekable files, it must also race to find *ppos == 0. >> >>> The consequence of this is that it will trigger more spinlock usage, >>> as this particular debugfs is used to provide a histogram of spinlock >>> contention. But memory corruption is a worst offender then that. >>> >>> Reported-by: Dave Jones<davej@redhat.com> >>> Suggested-by: Linus Torvalds<torvalds@linux-foundation.org> >> >> Tested-by: David Rientjes<rientjes@google.com> >> > > An alternative to this, though, might be to never test for *ppos == 0 in > u32_array_read() and do the format_array_alloc() in u32_array_open() to > initialize file->private_data. If that allocation fails, just return > -ENOMEM. Then you never need to add a mutex in the read path. > > Any reason we can't do this? > --- > fs/debugfs/file.c | 33 +++++++++++---------------------- > 1 file changed, 11 insertions(+), 22 deletions(-) > > diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c > --- a/fs/debugfs/file.c > +++ b/fs/debugfs/file.c > @@ -526,12 +526,6 @@ struct array_data { > u32 elements; > }; > > -static int u32_array_open(struct inode *inode, struct file *file) > -{ > - file->private_data = NULL; > - return nonseekable_open(inode, file); > -} > - > static size_t format_array(char *buf, size_t bufsize, const char *fmt, > u32 *array, u32 array_size) > { > @@ -573,26 +567,21 @@ static char *format_array_alloc(const char *fmt, u32 *array, > return ret; > } > > -static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, > - loff_t *ppos) > +static int u32_array_open(struct inode *inode, struct file *file) > { > - struct inode *inode = file->f_path.dentry->d_inode; > struct array_data *data = inode->i_private; > - size_t size; > > - if (*ppos == 0) { > - if (file->private_data) { > - kfree(file->private_data); > - file->private_data = NULL; > - } > - > - file->private_data = format_array_alloc("%u", data->array, > - data->elements); > - } > + file->private_data = format_array_alloc("%u", data->array, > + data->elements); > + if (!file->private_data) > + return -ENOMEM; > + return nonseekable_open(inode, file); > +} > > - size = 0; > - if (file->private_data) > - size = strlen(file->private_data); > +static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, > + loff_t *ppos) > +{ > + size_t size = strlen(file->private_data); > > return simple_read_from_buffer(buf, len, ppos, > file->private_data, size); > > Only problem, I find is histogram data expands dynamically (because it changes). I think having static allocation of 352 bytes as suggested Linus is a good idea. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-20 2:29 ` Raghavendra K T @ 2012-09-20 2:46 ` David Rientjes 2012-09-20 2:55 ` Raghavendra K T 2012-09-20 21:18 ` Konrad Rzeszutek Wilk 0 siblings, 2 replies; 33+ messages in thread From: David Rientjes @ 2012-09-20 2:46 UTC (permalink / raw) To: Raghavendra K T Cc: Konrad Rzeszutek Wilk, Linus Torvalds, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose On Thu, 20 Sep 2012, Raghavendra K T wrote: > Only problem, I find is histogram data expands dynamically (because it > changes). I think having static allocation of 352 bytes as suggested > Linus is a good idea. > Certainly, but it's a different topic and would be a subsequent patch to either my patch or Konrad's patch. Before that's done, I think we should fix the race condition that currently exists either by: - merging my patch (which I can sign-off and write a changelog for if Konrad agrees), or - merging Konrad's patch and introducing a mutex so that it's possible to do many reads to collect statistics after opening the file a single time with a single fd. Since these files are incapable of doing lseek, it would seem that my patch would be best and you'd simply want to close() + open() to read new data, which also guarantees that all readers get the same information. The only reason I hesitate on that and will defer to Konrad's opinion is because the way the code is currently written looks like it was intended to copy the data are read() rather than open(); in other words, it almost seems as if they were made to be non-seekable after the u32_array_read() implementation was complete and it was at one time possible to do an lseek(SEEK_SET). After that's fixed, and to address your concern, we can simply do the allocation of file->private_data for the maximum size possible when the file is created as a follow-up patch. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-20 2:46 ` David Rientjes @ 2012-09-20 2:55 ` Raghavendra K T 2012-09-20 21:18 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 33+ messages in thread From: Raghavendra K T @ 2012-09-20 2:55 UTC (permalink / raw) To: David Rientjes Cc: Konrad Rzeszutek Wilk, Linus Torvalds, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose On 09/20/2012 08:16 AM, David Rientjes wrote: > On Thu, 20 Sep 2012, Raghavendra K T wrote: > >> Only problem, I find is histogram data expands dynamically (because it >> changes). I think having static allocation of 352 bytes as suggested >> Linus is a good idea. >> > > Certainly, but it's a different topic and would be a subsequent patch to > either my patch or Konrad's patch. Before that's done, I think we should > fix the race condition that currently exists either by: > > - merging my patch (which I can sign-off and write a changelog for if > Konrad agrees), or > > - merging Konrad's patch and introducing a mutex so that it's possible to > do many reads to collect statistics after opening the file a single > time with a single fd. > > Since these files are incapable of doing lseek, it would seem that my > patch would be best and you'd simply want to close() + open() to read new > data, which also guarantees that all readers get the same information. > The only reason I hesitate on that and will defer to Konrad's opinion is > because the way the code is currently written looks like it was intended > to copy the data are read() rather than open(); in other words, it almost > seems as if they were made to be non-seekable after the u32_array_read() > implementation was complete and it was at one time possible to do an > lseek(SEEK_SET). Yes. common use case is just open once,read and close. i.e. cat /sys/kernel/debug/../histoblocked I also don't see why we have to keep fd open to read it. But yes let Konrad comment on that. > After that's fixed, and to address your concern, we can simply do the > allocation of file->private_data for the maximum size possible when the > file is created as a follow-up patch. I agree, missed that once we do a open and then read it we are done. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-20 2:46 ` David Rientjes 2012-09-20 2:55 ` Raghavendra K T @ 2012-09-20 21:18 ` Konrad Rzeszutek Wilk 2012-09-21 9:16 ` [patch for-3.6] fs, debugfs: fix race in u32_array_read and allocate array at open David Rientjes 1 sibling, 1 reply; 33+ messages in thread From: Konrad Rzeszutek Wilk @ 2012-09-20 21:18 UTC (permalink / raw) To: David Rientjes Cc: Raghavendra K T, Konrad Rzeszutek Wilk, Linus Torvalds, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose On Wed, Sep 19, 2012 at 07:46:57PM -0700, David Rientjes wrote: > On Thu, 20 Sep 2012, Raghavendra K T wrote: > > > Only problem, I find is histogram data expands dynamically (because it > > changes). I think having static allocation of 352 bytes as suggested > > Linus is a good idea. > > > > Certainly, but it's a different topic and would be a subsequent patch to > either my patch or Konrad's patch. Before that's done, I think we should > fix the race condition that currently exists either by: > > - merging my patch (which I can sign-off and write a changelog for if > Konrad agrees), or Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > - merging Konrad's patch and introducing a mutex so that it's possible to > do many reads to collect statistics after opening the file a single > time with a single fd. > > Since these files are incapable of doing lseek, it would seem that my > patch would be best and you'd simply want to close() + open() to read new > data, which also guarantees that all readers get the same information. Yup. > The only reason I hesitate on that and will defer to Konrad's opinion is > because the way the code is currently written looks like it was intended > to copy the data are read() rather than open(); in other words, it almost > seems as if they were made to be non-seekable after the u32_array_read() > implementation was complete and it was at one time possible to do an > lseek(SEEK_SET). The "users" (looks at himself and Raghavendra) can deal with the open/close, open/close cycle. The only thing that I would add extra is to add the explanation you provided in the comment of the file in case somebody expects something else. > > After that's fixed, and to address your concern, we can simply do the > allocation of file->private_data for the maximum size possible when the > file is created as a follow-up patch. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [patch for-3.6] fs, debugfs: fix race in u32_array_read and allocate array at open 2012-09-20 21:18 ` Konrad Rzeszutek Wilk @ 2012-09-21 9:16 ` David Rientjes 2012-09-21 10:22 ` Raghavendra K T 0 siblings, 1 reply; 33+ messages in thread From: David Rientjes @ 2012-09-21 9:16 UTC (permalink / raw) To: Konrad Rzeszutek Wilk, Linus Torvalds, Greg Kroah-Hartman Cc: Raghavendra K T, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Srivatsa Vaddagiri, Suzuki Poulose u32_array_open() is racy when multiple threads read from a file with a seek position of zero, i.e. when two or more simultaneous reads are occurring after the non-seekable files are created. It is possible that file->private_data is double-freed because the threads races between kfree(file->private-data); and file->private_data = NULL; The fix is to only do format_array_alloc() when the file is opened and free it when it is closed. This means that any thread that holds the file open and reads multiple times will see persistent data; they must close() and open() the file again to see new data, like a simple cat of the file. Reported-by: Dave Jones <davej@redhat.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Raghavendra <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: David Rientjes <rientjes@google.com> --- fs/debugfs/file.c | 33 +++++++++++---------------------- 1 file changed, 11 insertions(+), 22 deletions(-) diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c --- a/fs/debugfs/file.c +++ b/fs/debugfs/file.c @@ -526,12 +526,6 @@ struct array_data { u32 elements; }; -static int u32_array_open(struct inode *inode, struct file *file) -{ - file->private_data = NULL; - return nonseekable_open(inode, file); -} - static size_t format_array(char *buf, size_t bufsize, const char *fmt, u32 *array, u32 array_size) { @@ -573,26 +567,21 @@ static char *format_array_alloc(const char *fmt, u32 *array, return ret; } -static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, - loff_t *ppos) +static int u32_array_open(struct inode *inode, struct file *file) { - struct inode *inode = file->f_path.dentry->d_inode; struct array_data *data = inode->i_private; - size_t size; - if (*ppos == 0) { - if (file->private_data) { - kfree(file->private_data); - file->private_data = NULL; - } - - file->private_data = format_array_alloc("%u", data->array, - data->elements); - } + file->private_data = format_array_alloc("%u", data->array, + data->elements); + if (!file->private_data) + return -ENOMEM; + return nonseekable_open(inode, file); +} - size = 0; - if (file->private_data) - size = strlen(file->private_data); +static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, + loff_t *ppos) +{ + size_t size = strlen(file->private_data); return simple_read_from_buffer(buf, len, ppos, file->private_data, size); ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [patch for-3.6] fs, debugfs: fix race in u32_array_read and allocate array at open 2012-09-21 9:16 ` [patch for-3.6] fs, debugfs: fix race in u32_array_read and allocate array at open David Rientjes @ 2012-09-21 10:22 ` Raghavendra K T 2012-09-24 22:26 ` David Rientjes 0 siblings, 1 reply; 33+ messages in thread From: Raghavendra K T @ 2012-09-21 10:22 UTC (permalink / raw) To: David Rientjes Cc: Konrad Rzeszutek Wilk, Linus Torvalds, Greg Kroah-Hartman, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Srivatsa Vaddagiri, Suzuki Poulose On 09/21/2012 02:46 PM, David Rientjes wrote: > u32_array_open() is racy when multiple threads read from a file with a > seek position of zero, i.e. when two or more simultaneous reads are > occurring after the non-seekable files are created. It is possible that > file->private_data is double-freed because the threads races between > > kfree(file->private-data); > > and > > file->private_data = NULL; > > The fix is to only do format_array_alloc() when the file is opened and > free it when it is closed. This means that any thread that holds the > file open and reads multiple times will see persistent data; I think you meant we can read data only once. second time onwards we don't see any data. (except when fd is forked by child/ races in threads). ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [patch for-3.6] fs, debugfs: fix race in u32_array_read and allocate array at open 2012-09-21 10:22 ` Raghavendra K T @ 2012-09-24 22:26 ` David Rientjes 2012-09-25 2:54 ` Raghavendra K T 0 siblings, 1 reply; 33+ messages in thread From: David Rientjes @ 2012-09-24 22:26 UTC (permalink / raw) To: Raghavendra K T Cc: Konrad Rzeszutek Wilk, Linus Torvalds, Greg Kroah-Hartman, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Srivatsa Vaddagiri, Suzuki Poulose On Fri, 21 Sep 2012, Raghavendra K T wrote: > I think you meant we can read data only once. second time onwards we don't see > any data. (except when fd is forked by child/ races in > threads). > You can read(2) as many times as you want after it's opened and will get persistent data. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [patch for-3.6] fs, debugfs: fix race in u32_array_read and allocate array at open 2012-09-24 22:26 ` David Rientjes @ 2012-09-25 2:54 ` Raghavendra K T 0 siblings, 0 replies; 33+ messages in thread From: Raghavendra K T @ 2012-09-25 2:54 UTC (permalink / raw) To: David Rientjes Cc: Konrad Rzeszutek Wilk, Linus Torvalds, Greg Kroah-Hartman, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Srivatsa Vaddagiri, Suzuki Poulose On 09/25/2012 03:56 AM, David Rientjes wrote: > On Fri, 21 Sep 2012, Raghavendra K T wrote: > >> I think you meant we can read data only once. second time onwards we don't see >> any data. (except when fd is forked by child/ races in >> threads). >> > > You can read(2) as many times as you want after it's opened and will get > persistent data. > > Okay, Sorry for the noise. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 21:49 ` David Rientjes 2012-09-19 23:01 ` Linus Torvalds 2012-09-20 2:29 ` Raghavendra K T @ 2012-09-20 12:57 ` Raghavendra K T 2012-09-20 21:18 ` Konrad Rzeszutek Wilk 2 siblings, 1 reply; 33+ messages in thread From: Raghavendra K T @ 2012-09-20 12:57 UTC (permalink / raw) To: David Rientjes Cc: Konrad Rzeszutek Wilk, Linus Torvalds, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose On 09/20/2012 03:19 AM, David Rientjes wrote: > On Wed, 19 Sep 2012, David Rientjes wrote: > >>> From 0806b133b5b28081adf23d0d04a99636ed3b861b Mon Sep 17 00:00:00 2001 >>> From: Konrad Rzeszutek Wilk<konrad.wilk@oracle.com> >>> Date: Wed, 19 Sep 2012 11:23:01 -0400 >>> Subject: [PATCH 1/2] debugfs: Add lock for u32_array_read >>> >>> Dave Jones spotted that the u32_array_read was doing something funny: >>> >>> ============================================================================= >>> BUG kmalloc-64 (Not tainted): Redzone overwritten >>> ----------------------------------------------------------------------------- >>> >>> INFO: 0xffff88001f4b4970-0xffff88001f4b4977. First byte 0xbb instead of 0xcc >>> INFO: Allocated in u32_array_read+0xd1/0x110 age=0 cpu=6 pid=32767 >>> __slab_alloc+0x516/0x5a5 >>> __kmalloc+0x213/0x2c0 >>> u32_array_read+0xd1/0x110 >>> .. snip.. >>> INFO: Freed in u32_array_read+0x99/0x110 age=0 cpu=0 pid=32749 >>> __slab_free+0x3f/0x3bf >>> kfree+0x2d5/0x310 >>> u32_array_read+0x99/0x110 >>> >>> Linus tracked it down and found out that "debugfs is racy for that case >>> [read calls in parallel on the debugfs]. At least the file->private_data >>> accesses are, for the case of that "u32_array" case. >>> >>> In fact it is racy in ... the whole "file->private_data" access .. >>> If you have multiple readers on the same file, the whole >>> >>> if (file->private_data) { >>> kfree(file->private_data); >>> file->private_data = NULL; >>> } >>> >>> file->private_data = format_array_alloc("%u", data->array, >>> data->elements); >>> >>> thing is just a disaster waiting to happen." He suggested >>> putting a lock which this patch does. >>> >> >> Since these are non-seekable files, it must also race to find *ppos == 0. >> >>> The consequence of this is that it will trigger more spinlock usage, >>> as this particular debugfs is used to provide a histogram of spinlock >>> contention. But memory corruption is a worst offender then that. >>> >>> Reported-by: Dave Jones<davej@redhat.com> >>> Suggested-by: Linus Torvalds<torvalds@linux-foundation.org> >> >> Tested-by: David Rientjes<rientjes@google.com> >> > > An alternative to this, though, might be to never test for *ppos == 0 in > u32_array_read() and do the format_array_alloc() in u32_array_open() to > initialize file->private_data. If that allocation fails, just return > -ENOMEM. Then you never need to add a mutex in the read path. > Tested-by: Raghavendra <raghavendra.kt@linux.vnet.ibm.com> ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-20 12:57 ` 3.6rc6 slab corruption Raghavendra K T @ 2012-09-20 21:18 ` Konrad Rzeszutek Wilk 0 siblings, 0 replies; 33+ messages in thread From: Konrad Rzeszutek Wilk @ 2012-09-20 21:18 UTC (permalink / raw) To: Raghavendra K T Cc: David Rientjes, Konrad Rzeszutek Wilk, Linus Torvalds, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose > >An alternative to this, though, might be to never test for *ppos == 0 in > >u32_array_read() and do the format_array_alloc() in u32_array_open() to > >initialize file->private_data. If that allocation fails, just return > >-ENOMEM. Then you never need to add a mutex in the read path. > > > > Tested-by: Raghavendra <raghavendra.kt@linux.vnet.ibm.com> Excellent. Thank you! > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: 3.6rc6 slab corruption. 2012-09-19 19:16 ` Konrad Rzeszutek Wilk 2012-09-19 21:29 ` David Rientjes @ 2012-09-20 12:55 ` Raghavendra K T 1 sibling, 0 replies; 33+ messages in thread From: Raghavendra K T @ 2012-09-20 12:55 UTC (permalink / raw) To: Konrad Rzeszutek Wilk Cc: Linus Torvalds, Konrad Rzeszutek Wilk, Dave Jones, Linux Kernel, Greg Kroah-Hartman, Srivatsa Vaddagiri, Suzuki Poulose On 09/20/2012 12:46 AM, Konrad Rzeszutek Wilk wrote: > On Tue, Sep 18, 2012 at 01:49:52PM -0700, Linus Torvalds wrote: >> On Tue, Sep 18, 2012 at 1:37 PM, Konrad Rzeszutek Wilk >> <konrad.wilk@oracle.com> wrote: >>> >>> 30 words. ~360 + 29 spaces + NULL = 390? >> >> Just allocate the max then. That's tiny. >> >> And it's actually just 330: max ten characters for an unsigned 32-bit number. > > Linus, > Could you take a look at these two patches to see if I missed anything? > Thank you. > >> From 0806b133b5b28081adf23d0d04a99636ed3b861b Mon Sep 17 00:00:00 2001 > From: Konrad Rzeszutek Wilk<konrad.wilk@oracle.com> > Date: Wed, 19 Sep 2012 11:23:01 -0400 > Subject: [PATCH 1/2] debugfs: Add lock for u32_array_read > > Dave Jones spotted that the u32_array_read was doing something funny: > > ============================================================================= > BUG kmalloc-64 (Not tainted): Redzone overwritten > ----------------------------------------------------------------------------- > > INFO: 0xffff88001f4b4970-0xffff88001f4b4977. First byte 0xbb instead of 0xcc > INFO: Allocated in u32_array_read+0xd1/0x110 age=0 cpu=6 pid=32767 > __slab_alloc+0x516/0x5a5 > __kmalloc+0x213/0x2c0 > u32_array_read+0xd1/0x110 > .. snip.. > INFO: Freed in u32_array_read+0x99/0x110 age=0 cpu=0 pid=32749 > __slab_free+0x3f/0x3bf > kfree+0x2d5/0x310 > u32_array_read+0x99/0x110 > > Linus tracked it down and found out that "debugfs is racy for that case > [read calls in parallel on the debugfs]. At least the file->private_data > accesses are, for the case of that "u32_array" case. > > In fact it is racy in ... the whole "file->private_data" access .. > If you have multiple readers on the same file, the whole > > if (file->private_data) { > kfree(file->private_data); > file->private_data = NULL; > } > > file->private_data = format_array_alloc("%u", data->array, > data->elements); > > thing is just a disaster waiting to happen." He suggested > putting a lock which this patch does. > > The consequence of this is that it will trigger more spinlock usage, > as this particular debugfs is used to provide a histogram of spinlock > contention. But memory corruption is a worst offender then that. > > Reported-by: Dave Jones<davej@redhat.com> > Suggested-by: Linus Torvalds<torvalds@linux-foundation.org> Tested-by: Raghavendra <raghavendra.kt@linux.vnet.ibm.com> > --- > fs/debugfs/file.c | 7 ++++++- > 1 files changed, 6 insertions(+), 1 deletions(-) > > diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c > index 2340f69..c6d9088 100644 > --- a/fs/debugfs/file.c > +++ b/fs/debugfs/file.c > @@ -524,6 +524,7 @@ EXPORT_SYMBOL_GPL(debugfs_create_blob); > struct array_data { > void *array; > u32 elements; > + struct mutex lock; > }; > > static int u32_array_open(struct inode *inode, struct file *file) > @@ -580,6 +581,7 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, > struct array_data *data = inode->i_private; > size_t size; > > + mutex_lock(&data->lock); > if (*ppos == 0) { > if (file->private_data) { > kfree(file->private_data); > @@ -594,8 +596,10 @@ static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len, > if (file->private_data) > size = strlen(file->private_data); > > - return simple_read_from_buffer(buf, len, ppos, > + size = simple_read_from_buffer(buf, len, ppos, > file->private_data, size); > + mutex_unlock(&data->lock); > + return size; > } > > static int u32_array_release(struct inode *inode, struct file *file) > @@ -643,6 +647,7 @@ struct dentry *debugfs_create_u32_array(const char *name, umode_t mode, > > data->array = array; > data->elements = elements; > + mutex_init(&data->lock); > > return debugfs_create_file(name, mode, parent, data,&u32_array_fops); > } I was also able to reproduce the problem with pthread usage and David's msleep trick. [ 65.438698] BUG kmalloc-96 (Not tainted): Object already free [ 65.534572] BUG kmalloc-96 (Not tainted): Poison overwritten [ 2488.195923] BUG kmalloc-96 (Not tainted): Redzone overwritten all the above BUG without David's array allocation in open patch and Konrad's mutex patch. (Both the patches tested separately) ^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2012-09-25 2:57 UTC | newest] Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-09-18 14:35 3.6rc6 slab corruption Dave Jones 2012-09-18 18:38 ` Linus Torvalds 2012-09-18 18:53 ` Dave Jones 2012-09-19 14:00 ` Raghavendra K T 2012-09-19 17:09 ` Linus Torvalds 2012-09-19 21:27 ` David Rientjes 2012-09-19 21:41 ` Dave Jones 2012-09-18 19:23 ` Konrad Rzeszutek Wilk 2012-09-18 20:24 ` David Rientjes 2012-09-18 20:31 ` Konrad Rzeszutek Wilk 2012-09-18 20:36 ` Linus Torvalds 2012-09-19 0:57 ` Raghavendra K T 2012-09-18 20:35 ` Linus Torvalds 2012-09-18 20:37 ` Konrad Rzeszutek Wilk 2012-09-18 20:49 ` Linus Torvalds 2012-09-19 1:16 ` Raghavendra K T 2012-09-19 19:16 ` Konrad Rzeszutek Wilk 2012-09-19 21:29 ` David Rientjes 2012-09-19 21:49 ` David Rientjes 2012-09-19 23:01 ` Linus Torvalds 2012-09-19 23:20 ` David Rientjes 2012-09-20 21:14 ` Konrad Rzeszutek Wilk 2012-09-20 2:29 ` Raghavendra K T 2012-09-20 2:46 ` David Rientjes 2012-09-20 2:55 ` Raghavendra K T 2012-09-20 21:18 ` Konrad Rzeszutek Wilk 2012-09-21 9:16 ` [patch for-3.6] fs, debugfs: fix race in u32_array_read and allocate array at open David Rientjes 2012-09-21 10:22 ` Raghavendra K T 2012-09-24 22:26 ` David Rientjes 2012-09-25 2:54 ` Raghavendra K T 2012-09-20 12:57 ` 3.6rc6 slab corruption Raghavendra K T 2012-09-20 21:18 ` Konrad Rzeszutek Wilk 2012-09-20 12:55 ` Raghavendra K T
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).