* Crashes with 874bbfe600a6 in 3.18.25 @ 2016-01-20 21:19 Jan Kara 2016-01-20 21:39 ` Shaohua Li 0 siblings, 1 reply; 61+ messages in thread From: Jan Kara @ 2016-01-20 21:19 UTC (permalink / raw) To: Shaohua Li; +Cc: LKML, stable, Tejun Heo, Daniel Bilik, Sasha Levin [-- Attachment #1: Type: text/plain, Size: 953 bytes --] Hello, a friend of mine started seeing crashes with the 3.18.25 kernel - once appropriate load is put on the machine it crashes within minutes. He tracked down that reverting commit 874bbfe600a6 (this is the commit ID from Linus' tree, in the stable tree the commit ID is 1e7af294dd03) "workqueue: make sure delayed work run in local cpu" makes the kernel stable again. I'm attaching a screenshot of the crash - sadly the initial part is missing but it seems that we crashed when processing timers on an otherwise idle CPU. This is a production machine so experimentation is not easy but if we really need more information it may be possible to reproduce the issue again and gather it. Does anyone have an idea what is going on? I was looking into the code for a while but so far I have no good explanation. It would be good to understand the cause instead of just blindly reverting the commit from the stable tree... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR [-- Attachment #2: delayed-work-oops.png --] [-- Type: image/png, Size: 23695 bytes --] ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-20 21:19 Crashes with 874bbfe600a6 in 3.18.25 Jan Kara @ 2016-01-20 21:39 ` Shaohua Li 2016-01-21 9:52 ` Jan Kara 0 siblings, 1 reply; 61+ messages in thread From: Shaohua Li @ 2016-01-20 21:39 UTC (permalink / raw) To: Jan Kara; +Cc: LKML, stable, Tejun Heo, Daniel Bilik, Sasha Levin On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote: > Hello, > > a friend of mine started seeing crashes with 3.18.25 kernel - once > appropriate load is put on the machine it crashes within minutes. He > tracked down that reverting commit 874bbfe600a6 (this is the commit ID from > Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make > sure delayed work run in local cpu" makes the kernel stable again. I'm > attaching screenshot of the crash - sadly the initial part is missing but > it seems that we crashed when processing timers on otherwise idle CPU. This > is a production machine so experimentation is not easy but if we really > need more information it may be possible to reproduce the issue again and > gather it. > > Anyone has idea what is going on? I was looking into the code for a while > but so far I have no good explanation. It would be good to understand the > cause instead of just blindly reverting the commit from stable tree... Tejun fixed a bug in the timer code: 22b886dd10180939. Is it included in 3.18.25? Thanks, Shaohua ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-20 21:39 ` Shaohua Li @ 2016-01-21 9:52 ` Jan Kara 2016-01-21 13:29 ` Sasha Levin 2016-01-22 1:10 ` Sasha Levin 0 siblings, 2 replies; 61+ messages in thread From: Jan Kara @ 2016-01-21 9:52 UTC (permalink / raw) To: Shaohua Li; +Cc: Jan Kara, LKML, stable, Tejun Heo, Daniel Bilik, Sasha Levin On Wed 20-01-16 13:39:01, Shaohua Li wrote: > On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote: > > Hello, > > > > a friend of mine started seeing crashes with 3.18.25 kernel - once > > appropriate load is put on the machine it crashes within minutes. He > > tracked down that reverting commit 874bbfe600a6 (this is the commit ID from > > Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make > > sure delayed work run in local cpu" makes the kernel stable again. I'm > > attaching screenshot of the crash - sadly the initial part is missing but > > it seems that we crashed when processing timers on otherwise idle CPU. This > > is a production machine so experimentation is not easy but if we really > > need more information it may be possible to reproduce the issue again and > > gather it. > > > > Anyone has idea what is going on? I was looking into the code for a while > > but so far I have no good explanation. It would be good to understand the > > cause instead of just blindly reverting the commit from stable tree... > > Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25? That doesn't seem to be included in 3.18-stable although it was CCed to stable. Sasha? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-21 9:52 ` Jan Kara @ 2016-01-21 13:29 ` Sasha Levin 2016-01-22 1:10 ` Sasha Levin 1 sibling, 0 replies; 61+ messages in thread From: Sasha Levin @ 2016-01-21 13:29 UTC (permalink / raw) To: Jan Kara, Shaohua Li; +Cc: LKML, stable, Tejun Heo, Daniel Bilik On 01/21/2016 04:52 AM, Jan Kara wrote: > On Wed 20-01-16 13:39:01, Shaohua Li wrote: >> > On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote: >>> > > Hello, >>> > > >>> > > a friend of mine started seeing crashes with 3.18.25 kernel - once >>> > > appropriate load is put on the machine it crashes within minutes. He >>> > > tracked down that reverting commit 874bbfe600a6 (this is the commit ID from >>> > > Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make >>> > > sure delayed work run in local cpu" makes the kernel stable again. I'm >>> > > attaching screenshot of the crash - sadly the initial part is missing but >>> > > it seems that we crashed when processing timers on otherwise idle CPU. This >>> > > is a production machine so experimentation is not easy but if we really >>> > > need more information it may be possible to reproduce the issue again and >>> > > gather it. >>> > > >>> > > Anyone has idea what is going on? I was looking into the code for a while >>> > > but so far I have no good explanation. It would be good to understand the >>> > > cause instead of just blindly reverting the commit from stable tree... >> > >> > Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25? > That doesn't seem to be included in 3.18-stable although it was CCed to stable. > Sasha? Yup, that's not in. I'll include it and release it along with 3.18.26 on the weekend. Thanks, Sasha ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-21 9:52 ` Jan Kara 2016-01-21 13:29 ` Sasha Levin @ 2016-01-22 1:10 ` Sasha Levin 2016-01-22 16:09 ` Tejun Heo 1 sibling, 1 reply; 61+ messages in thread From: Sasha Levin @ 2016-01-22 1:10 UTC (permalink / raw) To: Jan Kara, Shaohua Li, Tejun Heo; +Cc: LKML, stable, Tejun Heo, Daniel Bilik On 01/21/2016 04:52 AM, Jan Kara wrote: > On Wed 20-01-16 13:39:01, Shaohua Li wrote: >> On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote: >>> Hello, >>> >>> a friend of mine started seeing crashes with 3.18.25 kernel - once >>> appropriate load is put on the machine it crashes within minutes. He >>> tracked down that reverting commit 874bbfe600a6 (this is the commit ID from >>> Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make >>> sure delayed work run in local cpu" makes the kernel stable again. I'm >>> attaching screenshot of the crash - sadly the initial part is missing but >>> it seems that we crashed when processing timers on otherwise idle CPU. This >>> is a production machine so experimentation is not easy but if we really >>> need more information it may be possible to reproduce the issue again and >>> gather it. >>> >>> Anyone has idea what is going on? I was looking into the code for a while >>> but so far I have no good explanation. It would be good to understand the >>> cause instead of just blindly reverting the commit from stable tree... >> >> Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25? > > That doesn't seem to be included in 3.18-stable although it was CCed to stable. > Sasha? Looks like it requires more than trivial backport (I think). Tejun? Thanks, Sasha ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-22 1:10 ` Sasha Levin @ 2016-01-22 16:09 ` Tejun Heo 2016-01-23 2:20 ` Ben Hutchings 0 siblings, 1 reply; 61+ messages in thread From: Tejun Heo @ 2016-01-22 16:09 UTC (permalink / raw) To: Sasha Levin Cc: Jan Kara, Shaohua Li, LKML, stable, Daniel Bilik, Thomas Gleixner (cc'ing Thomas) On Thu, Jan 21, 2016 at 08:10:20PM -0500, Sasha Levin wrote: > On 01/21/2016 04:52 AM, Jan Kara wrote: > > On Wed 20-01-16 13:39:01, Shaohua Li wrote: > >> On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote: > >>> Hello, > >>> > >>> a friend of mine started seeing crashes with 3.18.25 kernel - once > >>> appropriate load is put on the machine it crashes within minutes. He > >>> tracked down that reverting commit 874bbfe600a6 (this is the commit ID from > >>> Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make > >>> sure delayed work run in local cpu" makes the kernel stable again. I'm > >>> attaching screenshot of the crash - sadly the initial part is missing but > >>> it seems that we crashed when processing timers on otherwise idle CPU. This > >>> is a production machine so experimentation is not easy but if we really > >>> need more information it may be possible to reproduce the issue again and > >>> gather it. > >>> > >>> Anyone has idea what is going on? I was looking into the code for a while > >>> but so far I have no good explanation. It would be good to understand the > >>> cause instead of just blindly reverting the commit from stable tree... > >> > >> Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25? > > > > That doesn't seem to be included in 3.18-stable although it was CCed to stable. > > Sasha? > > Looks like it requires more than trivial backport (I think). Tejun? The timer migration has changed quite a bit. Given that we've never seen vmstat work crashing in 3.18 era, I wonder whether the right thing to do here is reverting 874bbfe600a6 from 3.18 stable? Thanks. 
-- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-22 16:09 ` Tejun Heo @ 2016-01-23 2:20 ` Ben Hutchings 2016-01-23 16:11 ` Thomas Gleixner 0 siblings, 1 reply; 61+ messages in thread From: Ben Hutchings @ 2016-01-23 2:20 UTC (permalink / raw) To: Tejun Heo, Sasha Levin Cc: Jan Kara, Shaohua Li, LKML, stable, Daniel Bilik, Thomas Gleixner [-- Attachment #1: Type: text/plain, Size: 2196 bytes --] On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote: > (cc'ing Thomas) > > On Thu, Jan 21, 2016 at 08:10:20PM -0500, Sasha Levin wrote: > > On 01/21/2016 04:52 AM, Jan Kara wrote: > > > On Wed 20-01-16 13:39:01, Shaohua Li wrote: > > > > On Wed, Jan 20, 2016 at 10:19:26PM +0100, Jan Kara wrote: > > > > > Hello, > > > > > > > > > > a friend of mine started seeing crashes with 3.18.25 kernel - once > > > > > appropriate load is put on the machine it crashes within minutes. He > > > > > tracked down that reverting commit 874bbfe600a6 (this is the commit ID from > > > > > Linus' tree, in stable tree the commit ID is 1e7af294dd03) "workqueue: make > > > > > sure delayed work run in local cpu" makes the kernel stable again. I'm > > > > > attaching screenshot of the crash - sadly the initial part is missing but > > > > > it seems that we crashed when processing timers on otherwise idle CPU. This > > > > > is a production machine so experimentation is not easy but if we really > > > > > need more information it may be possible to reproduce the issue again and > > > > > gather it. > > > > > > > > > > Anyone has idea what is going on? I was looking into the code for a while > > > > > but so far I have no good explanation. It would be good to understand the > > > > > cause instead of just blindly reverting the commit from stable tree... > > > > > > > > Tejun fixed a bug in timer: 22b886dd10180939. is it included in 3.18.25? > > > > > > That doesn't seem to be included in 3.18-stable although it was CCed to stable. > > > Sasha? 
> > > > Looks like it requires more than trivial backport (I think). Tejun? > > The timer migration has changed quite a bit. Given that we've never > seen vmstat work crashing in 3.18 era, I wonder whether the right > thing to do here is reverting 874bbfe600a6 from 3.18 stable? It's not just 3.18 that has this; 874bbfe600a6 was backported to all stable branches from 3.10 onward. Only the 4.2-ckt branch has 22b886dd10180939. Ben. -- Ben Hutchings Life is what happens to you while you're busy making other plans. - John Lennon [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 811 bytes --] ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-23 2:20 ` Ben Hutchings @ 2016-01-23 16:11 ` Thomas Gleixner 2016-01-26 9:34 ` Jan Kara 0 siblings, 1 reply; 61+ messages in thread From: Thomas Gleixner @ 2016-01-23 16:11 UTC (permalink / raw) To: Ben Hutchings Cc: Tejun Heo, Sasha Levin, Jan Kara, Shaohua Li, LKML, stable, Daniel Bilik [-- Attachment #1: Type: TEXT/PLAIN, Size: 690 bytes --] On Sat, 23 Jan 2016, Ben Hutchings wrote: > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote: > > > Looks like it requires more than trivial backport (I think). Tejun? > > > > The timer migration has changed quite a bit. Given that we've never > > seen vmstat work crashing in 3.18 era, I wonder whether the right > > thing to do here is reverting 874bbfe600a6 from 3.18 stable? > > It's not just 3.18 that has this; 874bbfe600a6 was backported to all > stable branches from 3.10 onward. Only the 4.2-ckt branch has > 22b886dd10180939. 22b886dd10180939 fixes a bug which was introduced with the timer wheel overhaul in 4.2. So only 4.2/3 should have it backported. Thanks, tglx ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-23 16:11 ` Thomas Gleixner @ 2016-01-26 9:34 ` Jan Kara 0 siblings, 0 replies; 61+ messages in thread From: Jan Kara @ 2016-01-26 9:34 UTC (permalink / raw) To: Thomas Gleixner Cc: Ben Hutchings, Tejun Heo, Sasha Levin, Jan Kara, Shaohua Li, LKML, stable, Daniel Bilik On Sat 23-01-16 17:11:54, Thomas Gleixner wrote: > On Sat, 23 Jan 2016, Ben Hutchings wrote: > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote: > > > > Looks like it requires more than trivial backport (I think). Tejun? > > > > > > The timer migration has changed quite a bit. Given that we've never > > > seen vmstat work crashing in 3.18 era, I wonder whether the right > > > thing to do here is reverting 874bbfe600a6 from 3.18 stable? > > > > It's not just 3.18 that has this; 874bbfe600a6 was backported to all > > stable branches from 3.10 onward. Only the 4.2-ckt branch has > > 22b886dd10180939. > > 22b886dd10180939 fixes a bug which was introduced with the timer wheel > overhaul in 4.2. So only 4.2/3 should have it backported. Thanks for the explanation. So do I understand right that timers are always run on the calling CPU in kernels prior to 4.2, and thus commit 874bbfe600a6 (to run the timer for delayed work on the calling CPU) doesn't make sense there? If that is true then reverting the commit from older stable kernels is probably the easiest way to resolve the crashes. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-26 9:34 ` Jan Kara @ 2016-01-26 9:49 ` Thomas Gleixner -1 siblings, 0 replies; 61+ messages in thread From: Thomas Gleixner @ 2016-01-26 9:49 UTC (permalink / raw) To: Jan Kara Cc: Ben Hutchings, Tejun Heo, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik [-- Attachment #1: Type: TEXT/PLAIN, Size: 1923 bytes --] On Tue, 26 Jan 2016, Jan Kara wrote: > On Sat 23-01-16 17:11:54, Thomas Gleixner wrote: > > On Sat, 23 Jan 2016, Ben Hutchings wrote: > > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote: > > > > > Looks like it requires more than trivial backport (I think). Tejun? > > > > > > > > The timer migration has changed quite a bit. Given that we've never > > > > seen vmstat work crashing in 3.18 era, I wonder whether the right > > > > thing to do here is reverting 874bbfe600a6 from 3.18 stable? > > > > > > It's not just 3.18 that has this; 874bbfe600a6 was backported to all > > > stable branches from 3.10 onward. Only the 4.2-ckt branch has > > > 22b886dd10180939. > > > > 22b886dd10180939 fixes a bug which was introduced with the timer wheel > > overhaul in 4.2. So only 4.2/3 should have it backported. > > Thanks for explanation. So do I understand right that timers are always run > on the calling CPU in kernels prior to 4.2 and thus commit 874bbfe600a6 (to > run timer for delayed work on the calling CPU) doesn't make sense there? If > that is true than reverting the commit from older stable kernels is > probably the easiest way to resolve the crashes. I was merely referring to 22b886dd10180939, which is a bug fix for things we reworked in the timer wheel core code in 4.2. It's completely unrelated to the problem at hand. Non-pinned timers can be migrated due to power saving decisions since 2.6.36. What changed over time is how the decision is made, but the general principle still applies. 
The timer code was completely unchanged between 3.18 and 4.0, and even with the larger overhaul in 4.2 we did not change the migration logic. We merely changed the internal implementation of the timer wheel. I have no idea how 874bbfe600a6 can result in crashing on older kernels. Can you ask the reporter to enable DEBUG_OBJECTS so we might get an idea what goes wrong with that timer. Thanks, tglx ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-26 9:34 ` Jan Kara @ 2016-01-26 11:14 ` Petr Mladek -1 siblings, 0 replies; 61+ messages in thread From: Petr Mladek @ 2016-01-26 11:14 UTC (permalink / raw) To: Jan Kara Cc: Thomas Gleixner, Ben Hutchings, Tejun Heo, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Tue 2016-01-26 10:34:00, Jan Kara wrote: > On Sat 23-01-16 17:11:54, Thomas Gleixner wrote: > > On Sat, 23 Jan 2016, Ben Hutchings wrote: > > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote: > > > > > Looks like it requires more than trivial backport (I think). Tejun? > > > > > > > > The timer migration has changed quite a bit. Given that we've never > > > > seen vmstat work crashing in 3.18 era, I wonder whether the right > > > > thing to do here is reverting 874bbfe600a6 from 3.18 stable? > > > > > > It's not just 3.18 that has this; 874bbfe600a6 was backported to all > > > stable branches from 3.10 onward. Only the 4.2-ckt branch has > > > 22b886dd10180939. > > > > 22b886dd10180939 fixes a bug which was introduced with the timer wheel > > overhaul in 4.2. So only 4.2/3 should have it backported. > > Thanks for explanation. So do I understand right that timers are always run > on the calling CPU in kernels prior to 4.2 and thus commit 874bbfe600a6 (to > run timer for delayed work on the calling CPU) doesn't make sense there? If > that is true than reverting the commit from older stable kernels is > probably the easiest way to resolve the crashes. The commit 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") forces the timer to run on the local CPU. It might be correct for vmstat. But I wonder if it might break some other delayed work user that depends on running on a different CPU. Best Regards, Petr ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-26 11:14 ` Petr Mladek @ 2016-01-26 13:09 ` Thomas Gleixner -1 siblings, 0 replies; 61+ messages in thread From: Thomas Gleixner @ 2016-01-26 13:09 UTC (permalink / raw) To: Petr Mladek Cc: Jan Kara, Ben Hutchings, Tejun Heo, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik [-- Attachment #1: Type: TEXT/PLAIN, Size: 2333 bytes --] On Tue, 26 Jan 2016, Petr Mladek wrote: > On Tue 2016-01-26 10:34:00, Jan Kara wrote: > > On Sat 23-01-16 17:11:54, Thomas Gleixner wrote: > > > On Sat, 23 Jan 2016, Ben Hutchings wrote: > > > > On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote: > > > > > > Looks like it requires more than trivial backport (I think). Tejun? > > > > > > > > > > The timer migration has changed quite a bit. Given that we've never > > > > > seen vmstat work crashing in 3.18 era, I wonder whether the right > > > > > thing to do here is reverting 874bbfe600a6 from 3.18 stable? > > > > > > > > It's not just 3.18 that has this; 874bbfe600a6 was backported to all > > > > stable branches from 3.10 onward. Only the 4.2-ckt branch has > > > > 22b886dd10180939. > > > > > > 22b886dd10180939 fixes a bug which was introduced with the timer wheel > > > overhaul in 4.2. So only 4.2/3 should have it backported. > > > > Thanks for explanation. So do I understand right that timers are always run > > on the calling CPU in kernels prior to 4.2 and thus commit 874bbfe600a6 (to > > run timer for delayed work on the calling CPU) doesn't make sense there? If > > that is true than reverting the commit from older stable kernels is > > probably the easiest way to resolve the crashes. > > The commit 874bbfe600a6 ("workqueue: make sure delayed work run in > local cpu") forces the timer to run on the local CPU. It might be correct > for vmstat. But I wonder if it might break some other delayed work > user that depends on running on different CPU. The default of add_timer() is to run on the current cpu. 
It only moves the timer to a different cpu when the power saving code says so. So 874bbfe600a6 enforces that the timer runs on the cpu on which queue_delayed_work() is called, but before that commit it was likely that the timer was queued on the calling cpu. So there is nothing which can depend on running on a different CPU, except callers of queue_delayed_work_on() which provide the target cpu explicitly. 874bbfe600a6 does not affect those callers at all. Now, what's different is:

+	if (cpu == WORK_CPU_UNBOUND)
+		cpu = raw_smp_processor_id();
	dwork->cpu = cpu;

So before that change dwork->cpu was set to WORK_CPU_UNBOUND. Now it's set to the current cpu, but I can't see how that matters. Thanks, tglx ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-01-26 13:09 ` Thomas Gleixner (?) @ 2016-02-03 9:35 ` Jiri Slaby 2016-02-03 10:41 ` Thomas Gleixner 2016-02-03 12:28 ` Michal Hocko -1 siblings, 2 replies; 61+ messages in thread From: Jiri Slaby @ 2016-02-03 9:35 UTC (permalink / raw) To: Thomas Gleixner, Petr Mladek Cc: Jan Kara, Ben Hutchings, Tejun Heo, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On 01/26/2016, 02:09 PM, Thomas Gleixner wrote: > On Tue, 26 Jan 2016, Petr Mladek wrote: >> On Tue 2016-01-26 10:34:00, Jan Kara wrote: >>> On Sat 23-01-16 17:11:54, Thomas Gleixner wrote: >>>> On Sat, 23 Jan 2016, Ben Hutchings wrote: >>>>> On Fri, 2016-01-22 at 11:09 -0500, Tejun Heo wrote: >>>>>>> Looks like it requires more than trivial backport (I think). Tejun? >>>>>> >>>>>> The timer migration has changed quite a bit. Given that we've never >>>>>> seen vmstat work crashing in 3.18 era, I wonder whether the right >>>>>> thing to do here is reverting 874bbfe600a6 from 3.18 stable? >>>>> >>>>> It's not just 3.18 that has this; 874bbfe600a6 was backported to all >>>>> stable branches from 3.10 onward. Only the 4.2-ckt branch has >>>>> 22b886dd10180939. >>>> >>>> 22b886dd10180939 fixes a bug which was introduced with the timer wheel >>>> overhaul in 4.2. So only 4.2/3 should have it backported. >>> >>> Thanks for explanation. So do I understand right that timers are always run >>> on the calling CPU in kernels prior to 4.2 and thus commit 874bbfe600a6 (to >>> run timer for delayed work on the calling CPU) doesn't make sense there? If >>> that is true than reverting the commit from older stable kernels is >>> probably the easiest way to resolve the crashes. >> >> The commit 874bbfe600a6 ("workqueue: make sure delayed work run in >> local cpu") forces the timer to run on the local CPU. It might be correct >> for vmstat. But I wonder if it might break some other delayed work >> user that depends on running on different CPU. 
>
> The default of add_timer() is to run on the current cpu. It only moves the
> timer to a different cpu when the power saving code says so. So 874bbfe600a6
> enforces that the timer runs on the cpu on which queue_delayed_work() is
> called, but before that commit it was likely that the timer was queued on the
> calling cpu. So there is nothing which can depend on running on a different
> CPU, except callers of queue_delayed_work_on() which provide the target cpu
> explicitly. 874bbfe600a6 does not affect those callers at all.
>
> Now, what's different is:
>
> +	if (cpu == WORK_CPU_UNBOUND)
> +		cpu = raw_smp_processor_id();
> 	dwork->cpu = cpu;
>
> So before that change dwork->cpu was set to WORK_CPU_UNBOUND. Now it's set to
> the current cpu, but I can't see how that matters.

What happens in later kernels, when the cpu is offlined before the
delayed_work timer ticks? In stable 3.12, with the patch, this scenario
results in an oops:

 #5 [ffff8c03fdd63d80] page_fault at ffffffff81523a88
    [exception RIP: __queue_work+121]
    RIP: ffffffff81071989  RSP: ffff8c03fdd63e30  RFLAGS: 00010086
    RAX: ffff88048b96bc00  RBX: ffff8c03e9bcc800  RCX: ffff880473820478
    RDX: 0000000000000400  RSI: 0000000000000004  RDI: ffff880473820458
    RBP: 0000000000000000   R8: ffff8c03fdd71f40   R9: ffff8c03ea4c4002
    R10: 0000000000000000  R11: 0000000000000005  R12: ffff880473820458
    R13: 00000000000000a8  R14: 000000000000e328  R15: 00000000000000a8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff8c03fdd63e68] call_timer_fn at ffffffff81065611
 #7 [ffff8c03fdd63e98] run_timer_softirq at ffffffff810663b7
 #8 [ffff8c03fdd63f00] __do_softirq at ffffffff8105e2c5
 #9 [ffff8c03fdd63f68] call_softirq at ffffffff8152cf9c
#10 [ffff8c03fdd63f80] do_softirq at ffffffff81004665
#11 [ffff8c03fdd63fa0] smp_apic_timer_interrupt at ffffffff8152d835
#12 [ffff8c03fdd63fb0] apic_timer_interrupt at ffffffff8152c2dd

The CPU was 168, and that one was offlined in the meantime.
So __queue_work fails at:

	if (!(wq->flags & WQ_UNBOUND))
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	else
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
		^^^                           ^^^^ NODE is -1
		 \ pwq is NULL

	if (last_pool && last_pool != pwq->pool) {	<--- BOOM

Any ideas?

thanks,
--
js
suse labs

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 9:35 ` Jiri Slaby @ 2016-02-03 10:41 ` Thomas Gleixner 2016-02-03 12:28 ` Michal Hocko 1 sibling, 0 replies; 61+ messages in thread From: Thomas Gleixner @ 2016-02-03 10:41 UTC (permalink / raw) To: Jiri Slaby Cc: Petr Mladek, Jan Kara, Ben Hutchings, Tejun Heo, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed, 3 Feb 2016, Jiri Slaby wrote: > On 01/26/2016, 02:09 PM, Thomas Gleixner wrote: > What happens in later kernels, when the cpu is offlined before the > delayed_work timer ticks? In stable 3.12, with the patch, this scenario > results in an oops: > #5 [ffff8c03fdd63d80] page_fault at ffffffff81523a88 > [exception RIP: __queue_work+121] > RIP: ffffffff81071989 RSP: ffff8c03fdd63e30 RFLAGS: 00010086 > RAX: ffff88048b96bc00 RBX: ffff8c03e9bcc800 RCX: ffff880473820478 > RDX: 0000000000000400 RSI: 0000000000000004 RDI: ffff880473820458 > RBP: 0000000000000000 R8: ffff8c03fdd71f40 R9: ffff8c03ea4c4002 > R10: 0000000000000000 R11: 0000000000000005 R12: ffff880473820458 > R13: 00000000000000a8 R14: 000000000000e328 R15: 00000000000000a8 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #6 [ffff8c03fdd63e68] call_timer_fn at ffffffff81065611 > #7 [ffff8c03fdd63e98] run_timer_softirq at ffffffff810663b7 > #8 [ffff8c03fdd63f00] __do_softirq at ffffffff8105e2c5 > #9 [ffff8c03fdd63f68] call_softirq at ffffffff8152cf9c > #10 [ffff8c03fdd63f80] do_softirq at ffffffff81004665 > #11 [ffff8c03fdd63fa0] smp_apic_timer_interrupt at ffffffff8152d835 > #12 [ffff8c03fdd63fb0] apic_timer_interrupt at ffffffff8152c2dd > > The CPU was 168, and that one was offlined in the meantime. So > __queue_work fails at: > if (!(wq->flags & WQ_UNBOUND)) > pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); > else > pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); > ^^^ ^^^^ NODE is -1 > \ pwq is NULL > > if (last_pool && last_pool != pwq->pool) { <--- BOOM I don't see how that works on later kernels. 
If cpu_to_node() returns -1 we access outside of the array bounds.... Thanks, tglx ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 9:35 ` Jiri Slaby 2016-02-03 10:41 ` Thomas Gleixner @ 2016-02-03 12:28 ` Michal Hocko 2016-02-03 16:24 ` Tejun Heo 1 sibling, 1 reply; 61+ messages in thread From: Michal Hocko @ 2016-02-03 12:28 UTC (permalink / raw) To: Jiri Slaby Cc: Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Tejun Heo, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik [I wasn't aware of this email thread before so I am jumping in late] On Wed 03-02-16 10:35:32, Jiri Slaby wrote: > On 01/26/2016, 02:09 PM, Thomas Gleixner wrote: > > On Tue, 26 Jan 2016, Petr Mladek wrote: [...] > >> The commit 874bbfe600a6 ("workqueue: make sure delayed work run in > >> local cpu") forces the timer to run on the local CPU. It might be correct > >> for vmstat. But I wonder if it might break some other delayed work > >> user that depends on running on different CPU. > > > > The default of add_timer() is to run on the current cpu. It only moves the > > timer to a different cpu when the power saving code says so. So 874bbfe600a6 > > enforces that the timer runs on the cpu on which queue_delayed_work() is > > called, but before that commit it was likely that the timer was queued on the > > calling cpu. So there is nothing which can depend on running on a different > > CPU, except callers of queue_delayed_work_on() which provide the target cpu > > explicitely. 874bbfe600a6 does not affect those callers at all. > > > > Now, what's different is: > > > > + if (cpu == WORK_CPU_UNBOUND) > > + cpu = raw_smp_processor_id(); > > dwork->cpu = cpu; > > > > So before that change dwork->cpu was set to WORK_CPU_UNBOUND. Now it's set to > > the current cpu, but I can't see how that matters. It matters because if somebody did queue_delayed_work() and the current cpu gets offlined then even though the associated timer gets migrated the __queue_work wouldn't recognize the associated cpu as WORK_CPU_UNBOUND anymore and won't reset the following path will go kaboom... 
> The CPU was 168, and that one was offlined in the meantime. So
> __queue_work fails at:
> 	if (!(wq->flags & WQ_UNBOUND))
> 		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> 	else
> 		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> 		^^^                           ^^^^ NODE is -1
> 		 \ pwq is NULL
>
> 	if (last_pool && last_pool != pwq->pool) {	<--- BOOM

So I think 874bbfe600a6 is really bogus. It should be reverted. We
already have a proper fix for vmstat: 176bed1de5bf ("vmstat: explicitly
schedule per-cpu work on the CPU we need it to run on"). This is what
should be used for the stable trees as a replacement.
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 12:28 ` Michal Hocko @ 2016-02-03 16:24 ` Tejun Heo 2016-02-03 16:48 ` Michal Hocko ` (3 more replies) 0 siblings, 4 replies; 61+ messages in thread
From: Tejun Heo @ 2016-02-03 16:24 UTC (permalink / raw)
To: Michal Hocko
Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote:
> > The CPU was 168, and that one was offlined in the meantime. So
> > __queue_work fails at:
> > 	if (!(wq->flags & WQ_UNBOUND))
> > 		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> > 	else
> > 		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> > 		^^^                           ^^^^ NODE is -1
> > 		 \ pwq is NULL
> >
> > 	if (last_pool && last_pool != pwq->pool) {	<--- BOOM

So, the proper fix here is keeping the cpu <-> node mapping stable across
cpu on/offlining, which has been worked on for a long time now. The
patchset is pending and it fixes other issues too.

> So I think 874bbfe600a6 is really bogus. It should be reverted. We
> already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly
> schedule per-cpu work on the CPU we need it to run on"). This is what
> should be used for the stable trees as a replacement.

It's not bogus. We can't flip a property that has been guaranteed
without any provision for verification. Why do you think vmstat blew
up in the first place? vmstat would be the canary case as it runs
frequently on all systems. It's exactly the sign that we can't break
this guarantee willy-nilly.

Thanks.
--
tejun

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 16:24 ` Tejun Heo @ 2016-02-03 16:48 ` Michal Hocko 2016-02-03 16:59 ` Tejun Heo 2016-02-03 17:01 ` Mike Galbraith ` (2 subsequent siblings) 3 siblings, 1 reply; 61+ messages in thread From: Michal Hocko @ 2016-02-03 16:48 UTC (permalink / raw) To: Tejun Heo Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed 03-02-16 11:24:41, Tejun Heo wrote: > On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote: > > > The CPU was 168, and that one was offlined in the meantime. So > > > __queue_work fails at: > > > if (!(wq->flags & WQ_UNBOUND)) > > > pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); > > > else > > > pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); > > > ^^^ ^^^^ NODE is -1 > > > \ pwq is NULL > > > > > > if (last_pool && last_pool != pwq->pool) { <--- BOOM > > So, the proper fix here is keeping cpu <-> node mapping stable across > cpu on/offlining which has been being worked on for a long time now. > The patchst is pending and it fixes other issues too. What if that node was memory offlined as well? It just doesn't make any sense to stick to the old node when the old cpu went away already. If anything and add_timer_on also for WORK_CPU_UNBOUND is really required then we should at least preserve WORK_CPU_UNBOUND in dwork->cpu so that __queue_work can actually move on to the local CPU properly and handle the offline cpu properly. > > So I think 874bbfe600a6 is really bogus. It should be reverted. We > > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly > > schedule per-cpu work on the CPU we need it to run on"). This which > > should be used for the stable trees as a replacement. > > It's not bogus. We can't flip a property that has been guaranteed > without any provision for verification. Why do you think vmstat blow > up in the first place? 
Because it wants to have a strong per-cpu guarantee while it used to fail to tell so. My understanding was that this is exactly what queue_delayed_work_on is for while WORK_CPU_UNBOUND tells that the caller doesn't really insist on any particular CPU (just local CPU is preferred). -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 16:48 ` Michal Hocko @ 2016-02-03 16:59 ` Tejun Heo 2016-02-04 6:37 ` Michal Hocko 0 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2016-02-03 16:59 UTC (permalink / raw)
To: Michal Hocko
Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, Feb 03, 2016 at 05:48:52PM +0100, Michal Hocko wrote:
> > So, the proper fix here is keeping the cpu <-> node mapping stable across
> > cpu on/offlining, which has been worked on for a long time now.
> > The patchset is pending and it fixes other issues too.
>
> What if that node was memory offlined as well? It just doesn't make any
> sense to stick to the old node when the old cpu went away already. If

Whether a memory node is offlined or not doesn't affect how cpus map to
the node. The mapping is something which is fixed at the physical and
firmware level while the system is running. If the node becomes
memory-less, what changes is the memory allocation strategy for the
node, not how cpus map to nodes. The only problem here is that we
currently lose how we mapped logical IDs to physical ones across
off/online cycles.

> anything and add_timer_on also for WORK_CPU_UNBOUND is really required
> then we should at least preserve WORK_CPU_UNBOUND in dwork->cpu so that
> __queue_work can actually move on to the local CPU properly and handle
> the offline cpu properly.

delayed_work->cpu is determined at queueing time. Dealing with
offlined cpus at execution is completely fine. There's no need to
"preserve" anything.

> > It's not bogus. We can't flip a property that has been guaranteed
> > without any provision for verification. Why do you think vmstat blew
> > up in the first place?
>
> Because it wants to have a strong per-cpu guarantee while it used to fail
> to tell so.
My understanding was that this is exactly what > queue_delayed_work_on is for while WORK_CPU_UNBOUND tells that the > caller doesn't really insist on any particular CPU (just local CPU is > preferred). What you said just doesn't fit the reality. Again, think about why vmstat crashed. Why is this difficult to understand? -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 16:59 ` Tejun Heo @ 2016-02-04 6:37 ` Michal Hocko 2016-02-04 7:40 ` Michal Hocko 0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-02-04 6:37 UTC (permalink / raw)
To: Tejun Heo
Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed 03-02-16 11:59:01, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 05:48:52PM +0100, Michal Hocko wrote:
[...]
> > anything and add_timer_on also for WORK_CPU_UNBOUND is really required
> > then we should at least preserve WORK_CPU_UNBOUND in dwork->cpu so that
> > __queue_work can actually move on to the local CPU properly and handle
> > the offline cpu properly.
>
> delayed_work->cpu is determined at queueing time. Dealing with
> offlined cpus at execution is completely fine. There's no need to
> "preserve" anything.

I've seen you have posted a fix in the meantime, but just for my
understanding: why is the following not an appropriate fix?

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c579dbab2e36..52bb11cf20d1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1459,9 +1459,9 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,

 	dwork->wq = wq;
 	/* timer isn't guaranteed to run in this cpu, record earlier */
+	dwork->cpu = cpu;
 	if (cpu == WORK_CPU_UNBOUND)
 		cpu = raw_smp_processor_id();
-	dwork->cpu = cpu;
 	timer->expires = jiffies + delay;

 	add_timer_on(timer, cpu);

Thanks!
--
Michal Hocko
SUSE Labs

^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-04 6:37 ` Michal Hocko @ 2016-02-04 7:40 ` Michal Hocko 0 siblings, 0 replies; 61+ messages in thread From: Michal Hocko @ 2016-02-04 7:40 UTC (permalink / raw) To: Tejun Heo Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Thu 04-02-16 07:37:23, Michal Hocko wrote: > On Wed 03-02-16 11:59:01, Tejun Heo wrote: > > On Wed, Feb 03, 2016 at 05:48:52PM +0100, Michal Hocko wrote: > [...] > > > anything and add_timer_on also for WORK_CPU_UNBOUND is really required > > > then we should at least preserve WORK_CPU_UNBOUND in dwork->cpu so that > > > __queue_work can actually move on to the local CPU properly and handle > > > the offline cpu properly. > > > > delayed_work->cpu is determined on queueing time. Dealing with > > offlined cpus at execution is completley fine. There's no need to > > "preserve" anything. > > I've seen you have posted a fix in the mean time but just for my > understading. Why the following is not an appropriate fix? > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index c579dbab2e36..52bb11cf20d1 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -1459,9 +1459,9 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq, > > dwork->wq = wq; > /* timer isn't guaranteed to run in this cpu, record earlier */ > + dwork->cpu = cpu; > if (cpu == WORK_CPU_UNBOUND) > cpu = raw_smp_processor_id(); > - dwork->cpu = cpu; > timer->expires = jiffies + delay; > > add_timer_on(timer, cpu); Ok, so after some more thinking about that, this won't really help for memory less CPU which would still have NUMA_NO_NODE associated with it AFAIU. So this is definitely better to be handled at unbound_pwq_by_node level. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 16:24 ` Tejun Heo 2016-02-03 16:48 ` Michal Hocko @ 2016-02-03 17:01 ` Mike Galbraith 2016-02-03 17:06 ` Tejun Heo 2016-02-03 18:46 ` Thomas Gleixner 2016-02-05 5:44 ` Mike Galbraith 3 siblings, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-03 17:01 UTC (permalink / raw) To: Tejun Heo, Michal Hocko Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed, 2016-02-03 at 11:24 -0500, Tejun Heo wrote: > On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote: > > > The CPU was 168, and that one was offlined in the meantime. So > > > __queue_work fails at: > > > if (!(wq->flags & WQ_UNBOUND)) > > > pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); > > > else > > > pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); > > > ^^^ ^^^^ NODE is -1 > > > \ pwq is NULL > > > > > > if (last_pool && last_pool != pwq->pool) { <--- BOOM > > So, the proper fix here is keeping cpu <-> node mapping stable across > cpu on/offlining which has been being worked on for a long time now. > The patchst is pending and it fixes other issues too. Hm, so it's ok to queue work to an offline CPU? What happens if it doesn't come back for an eternity or two? -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 17:01 ` Mike Galbraith @ 2016-02-03 17:06 ` Tejun Heo 2016-02-03 17:13 ` Mike Galbraith ` (2 more replies) 0 siblings, 3 replies; 61+ messages in thread
From: Tejun Heo @ 2016-02-03 17:06 UTC (permalink / raw)
To: Mike Galbraith
Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> Hm, so it's ok to queue work to an offline CPU? What happens if it
> doesn't come back for an eternity or two?

Right now, it just loses affinity. A more interesting case is a cpu
going offline while work items bound to the cpu are still running, and
the root problem is that we've never distinguished between affinity
for correctness and optimization and thus can't flush or warn on the
stragglers. The plan is to ensure that all correctness users specify
the CPU explicitly. Once we're there, we can warn on illegal usages.

Thanks.
--
tejun

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 17:06 ` Tejun Heo @ 2016-02-03 17:13 ` Mike Galbraith 2016-02-03 17:15 ` Tejun Heo 2016-02-04 2:00 ` Mike Galbraith 2016-02-04 10:04 ` Mike Galbraith 2 siblings, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-03 17:13 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote: > On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote: > > Hm, so it's ok to queue work to an offline CPU? What happens if it > > doesn't come back for an eternity or two? > > Right now, it just loses affinity. A more interesting case is a cpu > going offline whlie work items bound to the cpu are still running and > the root problem is that we've never distinguished between affinity > for correctness and optimization and thus can't flush or warn on the > stagglers. The plan is to ensure that all correctness users specify > the CPU explicitly. Once we're there, we can warn on illegal usages. Ah, and the rest (the vast majority) can then be safely deflected away from nohz_full cpus. -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 17:13 ` Mike Galbraith @ 2016-02-03 17:15 ` Tejun Heo 0 siblings, 0 replies; 61+ messages in thread From: Tejun Heo @ 2016-02-03 17:15 UTC (permalink / raw) To: Mike Galbraith Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed, Feb 03, 2016 at 06:13:15PM +0100, Mike Galbraith wrote: > Ah, and the rest (the vast majority) can then be safely deflected away > from nohz_full cpus. Yeap, it should be possible to bounce majority of work items across CPUs all we want. -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 17:06 ` Tejun Heo 2016-02-03 17:13 ` Mike Galbraith @ 2016-02-04 2:00 ` Mike Galbraith 2016-02-05 16:49 ` Tejun Heo 2016-02-04 10:04 ` Mike Galbraith 2 siblings, 1 reply; 61+ messages in thread
From: Mike Galbraith @ 2016-02-04 2:00 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote:
> On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote:
> > Hm, so it's ok to queue work to an offline CPU? What happens if it
> > doesn't come back for an eternity or two?
>
> Right now, it just loses affinity. A more interesting case is a cpu
> going offline while work items bound to the cpu are still running, and
> the root problem is that we've never distinguished between affinity
> for correctness and optimization and thus can't flush or warn on the
> stragglers. The plan is to ensure that all correctness users specify
> the CPU explicitly. Once we're there, we can warn on illegal usages.

Isn't it the case that, currently at least, each and every spot that
requires execution on a specific CPU yet does not take active measures
to deal with hotplug events is in fact buggy? The timer code clearly
states that the user is responsible, and so do both workqueue.[ch].
I was surprised to hear that some think they have an iron clad
guarantee, given the null and void clause is prominently displayed.

-Mike

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-04 2:00 ` Mike Galbraith @ 2016-02-05 16:49 ` Tejun Heo 2016-02-05 20:47 ` Mike Galbraith 0 siblings, 1 reply; 61+ messages in thread From: Tejun Heo @ 2016-02-05 16:49 UTC (permalink / raw) To: Mike Galbraith Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik Hello, Mike. On Thu, Feb 04, 2016 at 03:00:17AM +0100, Mike Galbraith wrote: > Isn't it the case that, currently at least, each and every spot that > requires execution on a specific CPU yet does not take active measures > to deal with hotplug events is in fact buggy? The timer code clearly > states that the user is responsible, and so do both workqueue.[ch]. Yeah, the usages which require affinity for correctness must flush the work items from a cpu down callback. > I was surprised me to hear that some think they have an iron clad > guarantee, given the null and void clause is prominently displayed. Nobody is (or at least should be) expecting workqueue to handle affinity across CPU offlining events. That is not the problem. The problem is that currently queue_work(work) and queue_work_on(smp_processor_id(), work) are identical and there likely are affinity-for-correctness users which are doing the former. Thanks. -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-05 16:49 ` Tejun Heo @ 2016-02-05 20:47 ` Mike Galbraith 2016-02-05 20:54 ` Tejun Heo 0 siblings, 1 reply; 61+ messages in thread
From: Mike Galbraith @ 2016-02-05 20:47 UTC (permalink / raw)
To: Tejun Heo
Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

On Fri, 2016-02-05 at 11:49 -0500, Tejun Heo wrote:
> Hello, Mike.
>
> On Thu, Feb 04, 2016 at 03:00:17AM +0100, Mike Galbraith wrote:
> > Isn't it the case that, currently at least, each and every spot that
> > requires execution on a specific CPU yet does not take active measures
> > to deal with hotplug events is in fact buggy? The timer code clearly
> > states that the user is responsible, and so do both workqueue.[ch].
>
> Yeah, the usages which require affinity for correctness must flush the
> work items from a cpu down callback.

Good, we agree. Now bear with me a moment..

That very point is what makes it wrong for the workqueue code to ever
target a work item. The instant it does target selection, correctness
may be at stake; it doesn't know, thus it must assume the full onus,
which it has neither the knowledge nor the time to do. That's how we
exploded on node = -1, trying to help out the user by doing his job,
but then not doing the whole job. IMHO, a better plan is to let the
user screw it up all by himself.

-Mike

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-05 20:47 ` Mike Galbraith @ 2016-02-05 20:54 ` Tejun Heo 2016-02-05 20:59 ` Mike Galbraith 0 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2016-02-05 20:54 UTC (permalink / raw)
To: Mike Galbraith
Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik

Hello, Mike.

On Fri, Feb 05, 2016 at 09:47:11PM +0100, Mike Galbraith wrote:
> That very point is what makes it wrong for the workqueue code to ever
> target a work item. The instant it does target selection, correctness
> may be at stake; it doesn't know, thus it must assume the full onus,
> which it has neither the knowledge nor the time to do. That's how we
> exploded on node = -1, trying to help out the user by doing his job,

I have a hard time seeing the NUMA_NO_NODE bug as something that is
indicative of anything. It is a dumb bug from the mm side which puts
everyone using cpu_to_node() at risk.

> but then not doing the whole job. IMHO, a better plan is to let the
> user screw it up all by himself.

What are you suggesting?

Thanks.
--
tejun

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-05 20:54 ` Tejun Heo @ 2016-02-05 20:59 ` Mike Galbraith 2016-02-05 21:06 ` Tejun Heo 0 siblings, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-05 20:59 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote: > What are you suggesting? That 874bbfe6 should die. -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-05 20:59 ` Mike Galbraith @ 2016-02-05 21:06 ` Tejun Heo 2016-02-06 13:07 ` Henrique de Moraes Holschuh 2016-02-09 15:31 ` Mike Galbraith 0 siblings, 2 replies; 61+ messages in thread From: Tejun Heo @ 2016-02-05 21:06 UTC (permalink / raw) To: Mike Galbraith Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote: > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote: > > > What are you suggesting? > > That 874bbfe6 should die. Yeah, it's gonna be killed. The commit is there because the behavior change broke things. We don't want to guarantee it but have been and can't change it right away just because we don't like it when things may break from it. The plan is to implement a debug option to force workqueue to always execute these work items on a foreign cpu to weed out breakages. Thanks. -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-05 21:06 ` Tejun Heo @ 2016-02-06 13:07 ` Henrique de Moraes Holschuh 2016-02-07 5:19 ` Mike Galbraith 2016-02-09 15:31 ` Mike Galbraith 1 sibling, 1 reply; 61+ messages in thread From: Henrique de Moraes Holschuh @ 2016-02-06 13:07 UTC (permalink / raw) To: Tejun Heo Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Fri, 05 Feb 2016, Tejun Heo wrote: > On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote: > > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote: > > > > > What are you suggesting? > > > > That 874bbfe6 should die. > > Yeah, it's gonna be killed. The commit is there because the behavior > change broke things. We don't want to guarantee it but have been and > can't change it right away just because we don't like it when things > may break from it. The plan is to implement a debug option to force > workqueue to always execute these work items on a foreign cpu to weed > out breakages. Is there a path to filter down sane behavior (whichever one it might be) to the affected stable/LTS kernels? -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-06 13:07 ` Henrique de Moraes Holschuh @ 2016-02-07 5:19 ` Mike Galbraith 2016-02-07 5:59 ` Mike Galbraith 0 siblings, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-07 5:19 UTC (permalink / raw) To: Henrique de Moraes Holschuh, Tejun Heo Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Sat, 2016-02-06 at 11:07 -0200, Henrique de Moraes Holschuh wrote: > On Fri, 05 Feb 2016, Tejun Heo wrote: > > On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote: > > > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote: > > > > > > > What are you suggesting? > > > > > > That 874bbfe6 should die. > > > > Yeah, it's gonna be killed. The commit is there because the behavior > > change broke things. We don't want to guarantee it but have been and > > can't change it right away just because we don't like it when things > > may break from it. The plan is to implement a debug option to force > > workqueue to always execute these work items on a foreign cpu to weed > > out breakages. > > Is there a path to filter down sane behavior (whichever one it might be) to > the affected stable/LTS kernels? What Michal said, replace 874bbfe6 with 176bed1d. Without 22b886dd, 874bbfe6 is a landmine, uses add_timer_on() as if it were mod_timer(), which it is not, or rather was not until 22b886dd came along, and still does not look like the mod_timer() alias that add_timer() is. -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-07 5:19 ` Mike Galbraith @ 2016-02-07 5:59 ` Mike Galbraith 0 siblings, 0 replies; 61+ messages in thread From: Mike Galbraith @ 2016-02-07 5:59 UTC (permalink / raw) To: Henrique de Moraes Holschuh, Tejun Heo Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Sun, 2016-02-07 at 06:19 +0100, Mike Galbraith wrote: > On Sat, 2016-02-06 at 11:07 -0200, Henrique de Moraes Holschuh wrote: > > On Fri, 05 Feb 2016, Tejun Heo wrote: > > > On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote: > > > > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote: > > > > > > > > > What are you suggesting? > > > > > > > > That 874bbfe6 should die. > > > > > > Yeah, it's gonna be killed. The commit is there because the behavior > > > change broke things. We don't want to guarantee it but have been and > > > can't change it right away just because we don't like it when things > > > may break from it. The plan is to implement a debug option to force > > > workqueue to always execute these work items on a foreign cpu to weed > > > out breakages. > > > > Is there a path to filter down sane behavior (whichever one it might be) to > > the affected stable/LTS kernels? > > What Michal said, replace 874bbfe6 with 176bed1d. Without 22b886dd, > 874bbfe6 is a landmine, uses add_timer_on() as if it were mod_timer(), > which it is not, or rather was not until 22b886dd came along, and still > does not look like the mod_timer() alias that add_timer() is. BTW, with the 874bbfe6 22b886dd pair, mundane workqueue timers are no longer deflected to housekeeper CPUs, so NO_HZ_FULL regresses. -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-05 21:06 ` Tejun Heo 2016-02-06 13:07 ` Henrique de Moraes Holschuh @ 2016-02-09 15:31 ` Mike Galbraith 2016-02-09 16:39 ` Linus Torvalds 1 sibling, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-09 15:31 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman, Linus Torvalds On Fri, 2016-02-05 at 16:06 -0500, Tejun Heo wrote: > On Fri, Feb 05, 2016 at 09:59:49PM +0100, Mike Galbraith wrote: > > On Fri, 2016-02-05 at 15:54 -0500, Tejun Heo wrote: > > > > > What are you suggesting? > > > > That 874bbfe6 should die. > > Yeah, it's gonna be killed. The commit is there because the behavior > change broke things. We don't want to guarantee it but have been and > can't change it right away just because we don't like it when things > may break from it. The plan is to implement a debug option to force > workqueue to always execute these work items on a foreign cpu to weed > out breakages. A niggling question remaining is when is it gonna be killed?

1. Meanwhile, 874bbfe6 was sent to 2.6.31+, meaning that every stable tree where it landed which did not ALSO receive 22b886dd has become destabilized. We have two 3.12-stability reports, one of the hotplug explosion that you provided a workaround for and one of corruption, plus one corruption report for 3.18. Both breakage types would be sort of fixed up by getting 22b886dd and your hotplug workaround (which does _not_ guarantee survival) applied everywhere, however...

2. We also have a report from the 3.18 corruption victim that adding 22b886dd did NOT restore the stable status quo; rather it replaced the corruption that 874bbfe6 caused with a performance regression.

3. 874bbfe6 + 22b886dd also inflicts a NO_HZ_FULL regression. Admittedly not a huge deal, but another regression nonetheless.
The only evidence I've seen that anything at all was broken by the changes that triggered the inception of 874bbfe6 in the first place was the b0rked vmstat thing that Linus had already fixed with 176bed1d. So where is the breakage you mention that makes keeping 874bbfe6 the prudent thing to do vs just reverting 874bbfe6 immediately, perhaps 22b886dd as well given it is fallout thereof, and getting that sent off to stable? It looks for all the world as if the sole excuse for either to exist is to prevent any other stupid mistakes like the vmstat thing from being exposed for what they are by actively hiding them, when in fact, that hiding doesn't survive a hotplug event (as we saw in the crash analysis I showed you). Surely there's a better reason to keep that commit than hiding bugs that can only remain hidden until they meet hotplug. What is it? -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 15:31 ` Mike Galbraith @ 2016-02-09 16:39 ` Linus Torvalds 2016-02-09 16:50 ` Tejun Heo 0 siblings, 1 reply; 61+ messages in thread From: Linus Torvalds @ 2016-02-09 16:39 UTC (permalink / raw) To: Mike Galbraith Cc: Tejun Heo, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman On Tue, Feb 9, 2016 at 7:31 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote: > On Fri, 2016-02-05 at 16:06 -0500, Tejun Heo wrote: >> > >> > That 874bbfe6 should die. >> >> Yeah, it's gonna be killed. The commit is there because the behavior >> change broke things. We don't want to guarantee it but have been and >> can't change it right away just because we don't like it when things >> may break from it. The plan is to implement a debug option to force >> workqueue to always execute these work items on a foreign cpu to weed >> out breakages. > > A niggling question remaining is when is it gonna be killed? It probably should be killed sooner rather than later. Just document that if you need something to run on a _particular_ cpu, you need to use "schedule_delayed_work_on()" and "add_timer_on()". The proper fix was 176bed1de5bf, and 874bbfe6 was just wrong. Linus ^ permalink raw reply [flat|nested] 61+ messages in thread
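[Editor's note: for reference, the two queueing idioms Linus contrasts look roughly like this in kernel code. This is an illustrative fragment only, not from any patch in this thread; `rebuild_dwork`, `pinned_dwork`, and `poll_timer` are made-up names, the items are assumed to be initialized elsewhere, and the fragment is kernel-module code, not standalone-runnable.]

```c
#include <linux/workqueue.h>
#include <linux/timer.h>

static struct delayed_work rebuild_dwork;   /* hypothetical work item */
static struct delayed_work pinned_dwork;    /* hypothetical work item */
static struct timer_list poll_timer;        /* hypothetical, set up elsewhere */

static void queueing_examples(void)
{
	/*
	 * No CPU guarantee: the work may run wherever the workqueue
	 * code finds convenient.  This is the case that 874bbfe6
	 * tried to quietly pin to the queueing CPU.
	 */
	schedule_delayed_work(&rebuild_dwork, HZ);

	/*
	 * Explicit CPU binding: the caller asks for CPU 1 and owns
	 * the consequences, e.g. cleanup when CPU 1 goes offline.
	 */
	schedule_delayed_work_on(1, &pinned_dwork, HZ);
	add_timer_on(&poll_timer, 1);
}
```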
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 16:39 ` Linus Torvalds @ 2016-02-09 16:50 ` Tejun Heo 2016-02-09 17:04 ` Mike Galbraith 2016-02-09 17:04 ` Linus Torvalds 0 siblings, 2 replies; 61+ messages in thread From: Tejun Heo @ 2016-02-09 16:50 UTC (permalink / raw) To: Linus Torvalds Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman Hello, On Tue, Feb 09, 2016 at 08:39:15AM -0800, Linus Torvalds wrote: > > A niggling question remaining is when is it gonna be killed? > > It probably should be killed sooner rather than later. > > Just document that if you need something to run on a _particular_ cpu, > you need to use "schedule_delayed_work_on()" and "add_timer_on()". I'll queue a patch to put unbound work items on foreign cpus (maybe every Nth to reduce perf impact). Wanted to align it to rc1 and then let it get tested during the devel cycle but missed this window. It's a bit late in devel cycle but we can still do it in this cycle. > The proper fix was 176bed1de5bf, and 874bbfe6 was just wrong. idk, not doing so is likely to cause subtle bugs which are difficult to track down. The problem with -stable is 874bbfe6 being backported without the matching timer fix. The right thing to do now probably is reverting 874bbfe6 for -stable kernels which don't get the timer fix. Thanks. -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 16:50 ` Tejun Heo @ 2016-02-09 17:04 ` Mike Galbraith 2016-02-09 17:54 ` Tejun Heo 2016-02-09 17:04 ` Linus Torvalds 1 sibling, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-09 17:04 UTC (permalink / raw) To: Tejun Heo, Linus Torvalds Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman On Tue, 2016-02-09 at 11:50 -0500, Tejun Heo wrote: > Hello, > > On Tue, Feb 09, 2016 at 08:39:15AM -0800, Linus Torvalds wrote: > > > A niggling question remaining is when is it gonna be killed? > > > > It probably should be killed sooner rather than later. > > > > Just document that if you need something to run on a _particular_ > > cpu, > > you need to use "schedule_delayed_work_on()" and "add_timer_on()". > > I'll queue a patch to put unbound work items on foreign cpus (maybe > every Nth to reduce perf impact). Wanted to align it to rc1 and then > let it get tested during the devel cycle but missed this window. It's > a bit late in devel cycle but we can still do it in this cycle. Or do something like the below, and get guinea pigs for free. workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs WORK_CPU_UNBOUND work items queued to a bound workqueue always run locally. This is a good thing normally, but not when the user has asked us to keep unbound work away from certain CPUs. Round robin these to wq_unbound_cpumask CPUs instead, as perturbation avoidance trumps performance. 
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
---
 kernel/workqueue.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -303,6 +303,9 @@ static bool workqueue_freezing;	/* PL:
 static cpumask_var_t wq_unbound_cpumask; /* PL: low level cpumask for all unbound wqs */
 
+/* CPU where WORK_CPU_UNBOUND work was last round robin scheduled from this CPU */
+static DEFINE_PER_CPU(unsigned int, wq_unbound_rr_cpu_last);
+
 /* the per-cpu worker pools */
 static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS],
 				     cpu_worker_pools);
@@ -1298,6 +1301,28 @@ static bool is_chained_work(struct workq
 	return worker && worker->current_pwq->wq == wq;
 }
 
+/*
+ * When queueing WORK_CPU_UNBOUND work to a !WQ_UNBOUND queue, round
+ * robin among wq_unbound_cpumask to avoid perturbing sensitive tasks.
+ */
+static unsigned int select_round_robin_cpu(unsigned int cpu)
+{
+	int new_cpu;
+
+	if (cpumask_test_cpu(cpu, wq_unbound_cpumask))
+		return cpu;
+	if (cpumask_empty(wq_unbound_cpumask))
+		return cpu;
+	new_cpu = __this_cpu_read(wq_unbound_rr_cpu_last);
+	new_cpu = cpumask_next_and(new_cpu, wq_unbound_cpumask, cpu_online_mask);
+	if (unlikely(new_cpu >= nr_cpu_ids))
+		new_cpu = cpumask_first_and(wq_unbound_cpumask, cpu_online_mask);
+	if (unlikely(WARN_ON_ONCE(new_cpu >= nr_cpu_ids)))
+		return cpu;
+	__this_cpu_write(wq_unbound_rr_cpu_last, new_cpu);
+	return new_cpu;
+}
+
 static void __queue_work(int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
@@ -1323,7 +1348,7 @@ static void __queue_work(int cpu, struct
 	return;
 retry:
 	if (req_cpu == WORK_CPU_UNBOUND)
-		cpu = raw_smp_processor_id();
+		cpu = select_round_robin_cpu(raw_smp_processor_id());
 
 	/* pwq which will be used unless @work is executing elsewhere */
 	if (!(wq->flags & WQ_UNBOUND))

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 17:04 ` Mike Galbraith @ 2016-02-09 17:54 ` Tejun Heo 2016-02-09 17:56 ` Mike Galbraith 0 siblings, 1 reply; 61+ messages in thread From: Tejun Heo @ 2016-02-09 17:54 UTC (permalink / raw) To: Mike Galbraith Cc: Linus Torvalds, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman Hello, Mike. On Tue, Feb 09, 2016 at 06:04:04PM +0100, Mike Galbraith wrote: > workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs > > WORK_CPU_UNBOUND work items queued to a bound workqueue always run > locally. This is a good thing normally, but not when the user has > asked us to keep unbound work away from certain CPUs. Round robin > these to wq_unbound_cpumask CPUs instead, as perturbation avoidance > trumps performance. I don't think doing this by default for everyone is a good idea. A lot of workqueue usages tend to touch whatever the scheduler was touching after all. Doing things per-cpu is generally a pretty good thing. Thanks. -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 17:54 ` Tejun Heo @ 2016-02-09 17:56 ` Mike Galbraith 2016-02-09 18:02 ` Mike Galbraith 0 siblings, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-09 17:56 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman On Tue, 2016-02-09 at 12:54 -0500, Tejun Heo wrote: > Hello, Mike. > > On Tue, Feb 09, 2016 at 06:04:04PM +0100, Mike Galbraith wrote: > > workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask > > CPUs > > > > WORK_CPU_UNBOUND work items queued to a bound workqueue always run > > locally. This is a good thing normally, but not when the user has > > asked us to keep unbound work away from certain CPUs. Round robin > > these to wq_unbound_cpumask CPUs instead, as perturbation avoidance > > trumps performance. > > I don't think doing this by default for everyone is a good idea. A > lot of workqueue usages tend to touch whatever the scheduler was > touching after all. Doing things per-cpu is generally a pretty good > thing. It doesn't do anything unless the user twiddles the mask to exclude certain (think no_hz_full) CPUs, so there are no clueless victims. -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 17:56 ` Mike Galbraith @ 2016-02-09 18:02 ` Mike Galbraith 2016-02-09 18:27 ` Tejun Heo 0 siblings, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-09 18:02 UTC (permalink / raw) To: Tejun Heo Cc: Linus Torvalds, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman On Tue, 2016-02-09 at 18:56 +0100, Mike Galbraith wrote: > On Tue, 2016-02-09 at 12:54 -0500, Tejun Heo wrote: > > Hello, Mike. > > > > On Tue, Feb 09, 2016 at 06:04:04PM +0100, Mike Galbraith wrote: > > > workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask > > > CPUs > > > > > > WORK_CPU_UNBOUND work items queued to a bound workqueue always > > > run > > > locally. This is a good thing normally, but not when the user > > > has > > > asked us to keep unbound work away from certain CPUs. Round > > > robin > > > these to wq_unbound_cpumask CPUs instead, as perturbation > > > avoidance > > > trumps performance. > > > > I don't think doing this by default for everyone is a good idea. A > > lot of workqueue usages tend to touch whatever the scheduler was > > touching after all. Doing things per-cpu is generally a pretty > > good > > thing. > > It doesn't do anything unless the user twiddles the mask to exclude > certain (think no_hz_full) CPUs, so there are no clueless victims. (a plus: testers/robots can twiddle mask to help find bugs, _and_ nohz_full people can use it if they so choose) ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 18:02 ` Mike Galbraith @ 2016-02-09 18:27 ` Tejun Heo 0 siblings, 0 replies; 61+ messages in thread From: Tejun Heo @ 2016-02-09 18:27 UTC (permalink / raw) To: Mike Galbraith Cc: Linus Torvalds, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman Hello, Mike. On Tue, Feb 09, 2016 at 07:02:35PM +0100, Mike Galbraith wrote: > > It doesn't do anything unless the user twiddles the mask to exclude > > certain (think no_hz_full) CPUs, so there are no clueless victims. > > (a plus: testers/robots can twiddle mask to help find bugs, _and_ > nohz_full people can use it if they so choose) Ah, yeah, it makes sense then. I'm gonna play with it a bit and add a debug option to always force the behavior. Thanks. -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 16:50 ` Tejun Heo 2016-02-09 17:04 ` Mike Galbraith @ 2016-02-09 17:04 ` Linus Torvalds 2016-02-09 17:51 ` Tejun Heo 1 sibling, 1 reply; 61+ messages in thread From: Linus Torvalds @ 2016-02-09 17:04 UTC (permalink / raw) To: Tejun Heo Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman On Tue, Feb 9, 2016 at 8:50 AM, Tejun Heo <tj@kernel.org> wrote: > > idk, not doing so is likely to cause subtle bugs which are difficult > to track down. The problem with -stable is 874bbfe6 being backported > without the matching timer fix. Well, according to this thread, even with the timer fix the end result then shows odd problems, _and_ has a NO_HZ_FULL regression. I do agree about subtle bugs, but we haven't actually seen any other ones than the vmstat breakage so far. Also, I suspect that to flush out any bugs, we might want to

(a) actually dequeue timers and work queues that are bound to a particular CPU when a CPU goes down. Sure, we *could* make it a rule that everybody who binds a timer to a particular CPU should just register the cpu-down thing, but why make a rule that you have to make extra work? People who do per-cpu work should have a setup function for when a new CPU comes _up_, but why make people do pointless extra crap for the cpu-down case when the generic code could just do it for them.

(b) maybe one of the test-bots could be encouraged to do a lot of cpu offlining/onlining as a stress test.

That (a) part is important in that it avoids the subtle bug where some timer or workqueue entry ends up being run on the wrong CPU after all, just because the target CPU went down. And the (b) part would hopefully flush out things that didn't start things properly when a new cpu comes online. Hmm? 
The above is obviously a longer-term thing and a bigger change, but I think we should be able to just revert 874bbfe6 without anything else going on, since I don't think we ever found anything else than vmstat that had issues. Linus ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 17:04 ` Linus Torvalds @ 2016-02-09 17:51 ` Tejun Heo 2016-02-09 18:06 ` Linus Torvalds 0 siblings, 1 reply; 61+ messages in thread From: Tejun Heo @ 2016-02-09 17:51 UTC (permalink / raw) To: Linus Torvalds Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman Hello, On Tue, Feb 09, 2016 at 09:04:18AM -0800, Linus Torvalds wrote: > On Tue, Feb 9, 2016 at 8:50 AM, Tejun Heo <tj@kernel.org> wrote: > > idk, not doing so is likely to cause subtle bugs which are difficult > > to track down. The problem with -stable is 874bbfe6 being backported > > without the matching timer fix. > > Well, according to this thread, even with the timer fix the end result > then shows odd problems, _and_ has a NO_HZ_FULL regression. I don't know what that odd problem is indicating but it's likely we're seeing another issue exposed by these changes or a bug during backport, but yeah it's problematic. > I do agree about subtle bugs, but we haven't actually seen any other > ones than the vmstat breakage so far. The thing with vmstat is that it's a work item which is most likely to expose the issue as it runs constantly on all systems and we started seeing it triggering soon after timer migration becomes more common. I'd be surprised if we don't discover a lot more subtle ones down the road. Maybe it's that most of them won't trigger often enough to matter much but it's a bit scary. > Also, I suspect that to flush out any bugs, we might want to > > (a) actually dequeue timers and work queues that are bound to a > particular CPU when a CPU goes down. > > Sure, we *could* make it a rule that everybody who binds a timer > to a particular CPU should just register the cpu-down thing, but why > make a rule that you have to make extra work? 
People who do per-cpu > work should have a setup function for when a new CPU comes _up_, but > why make people do pointless extra crap for the cpu-down case when the > generic code could just do it for them. This goes the same for work items and timers. If we want to do explicit dequeueing or flushing of cpu-bound stuff on cpu down, we'll have to either dedicate *_on() interfaces for correctness or introduce a separate set of interfaces to use for optimization and correctness. The current situation is that work items which are explicitly shut down on cpu-down are correctness usages while the ones which are not are optimization usages. I'll try to scan through the usages and see what the actual proportions are like. Maybe we can get away with declaring that _on() usages are absolute. > (b) maybe one of the test-bots could be encouraged to do a lot of cpu > offlining/onlining as a stress test. > > That (a) part is important in that it avoids the subtle bug where some > timer or workqueue entry ends up being run on the wrong CPU after all, > just because the target CPU went down. > > And the (b) part would hopefully flush out things that didn't start > things properly when a new cpu comes online. > > Hmm? The above is obviously a longer-term thing and a bigger change, > but I think we should be able to just revert 874bbfe6 without anything > else going on, since I don't think we ever found anything else than > vmstat that had issues. So, how about reverting 874bbfe6 and performing random foreign queueing during -rc's for a couple cycles so that we can at least find out the broken ones quickly in devel branch and backport fixes as they're found? Thanks. -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-09 17:51 ` Tejun Heo @ 2016-02-09 18:06 ` Linus Torvalds 0 siblings, 0 replies; 61+ messages in thread From: Linus Torvalds @ 2016-02-09 18:06 UTC (permalink / raw) To: Tejun Heo Cc: Mike Galbraith, Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik, Greg Kroah-Hartman On Tue, Feb 9, 2016 at 9:51 AM, Tejun Heo <tj@kernel.org> wrote: >> >> (a) actually dequeue timers and work queues that are bound to a >> particular CPU when a CPU goes down. >> > This goes the same for work items and timers. If we want to do > explicit dequeueing or flushing of cpu-bound stuff on cpu down, we'll > have to either dedicate *_on() interfaces for correctness or introduce > a separate set of interfaces to use for optimization and correctness. We already do that. "add_timer_on()" for timers, and cpu != WORK_CPU_UNBOUND for work items. > Maybe we can get away with > declaring that _on() usages are absolute. I really think that anything else would be odd as hell. If you asked for a timer (or work) on a particular CPU, and you get it on another one, that's a bug. It's much better to just dequeue those entries and say "sorry, your CPU went away". Of course, we could play around with just run them early at CPU-down time (and anybody trying to requeue would get an error because the CPU is in the process of going down), but that sounds like more work for any users, and like a much more fundamental difference. The "just silently dequeue" makes more sense, and pairs well with anything that sets things up on CPU-up time (which a percpu entity will have to do anyway). > So, how about reverting 874bbfe6 and performing random foreign > queueing during -rc's for a couple cycles so that we can at least find > out the broken ones quickly in devel branch and backport fixes as > they're found? Yeah, that sounds good to me. 
Having some "cpu work/timer debug" config option that ends up spreading out non-cpu-specific timers and work in order to find bugs sounds like a good idea. And I don't think it should be limited to rc releases, I think lots of people might be willing to run that (the same way we had people - and even distributions - that did PAGEALLOC_DEBUG which is a lot bigger hammer). Linus ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 17:06 ` Tejun Heo 2016-02-03 17:13 ` Mike Galbraith 2016-02-04 2:00 ` Mike Galbraith @ 2016-02-04 10:04 ` Mike Galbraith 2016-02-04 10:46 ` Thomas Gleixner 2 siblings, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-04 10:04 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote: > On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote: > > Hm, so it's ok to queue work to an offline CPU? What happens if it > > doesn't come back for an eternity or two? > > Right now, it just loses affinity.... WRT affinity... Somebody somewhere queues a delayed work, a timer is started on CPUX, work is targeted at CPUX. Now wash/rinse/repeat mod_delayed_work() along with migrations. Should __queue_delayed_work() not refrain from altering dwork->cpu once set? I'm also wondering why 22b886dd only applies to kernels >= 4.2.

<quote>
Regardless of the previous CPU a timer was on, add_timer_on() currently simply sets timer->flags to the new CPU. As the caller must be seeing the timer as idle, this is locally fine, but the timer leaving the old base while unlocked can lead to race conditions as follows.

Let's say timer was on cpu 0.

cpu 0                                   cpu 1
-----------------------------------------------------------------------------
del_timer(timer) succeeds
                                        del_timer(timer)
                                         lock_timer_base(timer) locks cpu_0_base
add_timer_on(timer, 1)
 spin_lock(&cpu_1_base->lock)
 timer->flags set to cpu_1_base
 operates on @timer                      operates on @timer
</quote>

What's the difference between... timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu; and... timer_set_base(timer, base); ...that makes that fix unneeded prior to 4.2? We take the same locks in < 4.2 kernels, so seemingly both will diddle concurrently above. -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-04 10:04 ` Mike Galbraith @ 2016-02-04 10:46 ` Thomas Gleixner 2016-02-04 11:07 ` Mike Galbraith 2016-02-04 11:20 ` Jan Kara 0 siblings, 2 replies; 61+ messages in thread From: Thomas Gleixner @ 2016-02-04 10:46 UTC (permalink / raw) To: Mike Galbraith Cc: Tejun Heo, Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Thu, 4 Feb 2016, Mike Galbraith wrote: > On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote: > > On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote: > > > Hm, so it's ok to queue work to an offline CPU? What happens if it > > > doesn't come back for an eternity or two? > > > > Right now, it just loses affinity.... > > WRT affinity... > > Somebody somewhere queues a delayed work, a timer is started on CPUX, > work is targeted at CPUX. Now wash/rinse/repeat mod_delayed_work() > along with migrations. Should __queue_delayed_work() not refrain from > altering dwork->cpu once set? > > I'm also wondering why 22b886dd only applies to kernels >= 4.2. > > <quote> > Regardless of the previous CPU a timer was on, add_timer_on() > currently simply sets timer->flags to the new CPU. As the caller must > be seeing the timer as idle, this is locally fine, but the timer > leaving the old base while unlocked can lead to race conditions as > follows. > > Let's say timer was on cpu 0. > > cpu 0 cpu 1 > ----------------------------------------------------------------------------- > del_timer(timer) succeeds > del_timer(timer) > lock_timer_base(timer) locks cpu_0_base > add_timer_on(timer, 1) > spin_lock(&cpu_1_base->lock) > timer->flags set to cpu_1_base > operates on @timer operates on @timer > </quote> > > What's the difference between... > timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu; > and... > timer_set_base(timer, base); > > ...that makes that fix unneeded prior to 4.2? 
We take the same locks in < 4.2 kernels, so seemingly both will diddle concurrently above. Indeed, you are right. The same can happen on pre 4.2, just the fix does not apply as we changed the internals of how the base is managed in the timer itself. Backport below. Thanks, tglx

8<----------------------------

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -956,13 +956,26 @@ EXPORT_SYMBOL(add_timer);
  */
 void add_timer_on(struct timer_list *timer, int cpu)
 {
-	struct tvec_base *base = per_cpu(tvec_bases, cpu);
+	struct tvec_base *new_base = per_cpu(tvec_bases, cpu);
+	struct tvec_base *base;
 	unsigned long flags;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(timer_pending(timer) || !timer->function);
-	spin_lock_irqsave(&base->lock, flags);
-	timer_set_base(timer, base);
+
+	/*
+	 * If @timer was on a different CPU, it must be migrated with the
+	 * old base locked to prevent other operations proceeding with the
+	 * wrong base locked. See lock_timer_base().
+	 */
+	base = lock_timer_base(timer, &flags);
+	if (base != new_base) {
+		timer_set_base(timer, NULL);
+		spin_unlock(&base->lock);
+		base = new_base;
+		spin_lock(&base->lock);
+		timer_set_base(timer, base);
+	}
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
 	spin_unlock_irqrestore(&base->lock, flags);

^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-04 10:46 ` Thomas Gleixner @ 2016-02-04 11:07 ` Mike Galbraith 2016-02-04 11:20 ` Jan Kara 1 sibling, 0 replies; 61+ messages in thread From: Mike Galbraith @ 2016-02-04 11:07 UTC (permalink / raw) To: Thomas Gleixner Cc: Tejun Heo, Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Thu, 2016-02-04 at 11:46 +0100, Thomas Gleixner wrote: > On Thu, 4 Feb 2016, Mike Galbraith wrote: > > I'm also wondering why 22b886dd only applies to kernels >= 4.2. > > > > > > Regardless of the previous CPU a timer was on, add_timer_on() > > currently simply sets timer->flags to the new CPU. As the caller must > > be seeing the timer as idle, this is locally fine, but the timer > > leaving the old base while unlocked can lead to race conditions as > > follows. > > > > Let's say timer was on cpu 0. > > > > cpu 0 cpu 1 > > ----------------------------------------------------------------------------- > > del_timer(timer) succeeds > > del_timer(timer) > > lock_timer_base(timer) locks cpu_0_base > > add_timer_on(timer, 1) > > spin_lock(&cpu_1_base->lock) > > timer->flags set to cpu_1_base > > operates on @timer operates on @timer > > > > > > What's the difference between... > > timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu; > > and... > > timer_set_base(timer, base); > > > > ...that makes that fix unneeded prior to 4.2? We take the same locks > > in < 4.2 kernels, so seemingly both will diddle concurrently above. > > Indeed, you are right. Whew, thanks for confirming, looking for what the hell I was missing wasn't going well at all, ate most of my day. > The same can happen on pre 4.2, just the fix does not apply as we changed the > internals how the base is managed in the timer itself. Backport below. Exactly what I did locally. -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-04 10:46 ` Thomas Gleixner 2016-02-04 11:07 ` Mike Galbraith @ 2016-02-04 11:20 ` Jan Kara 2016-02-04 16:39 ` Daniel Bilik 1 sibling, 1 reply; 61+ messages in thread From: Jan Kara @ 2016-02-04 11:20 UTC (permalink / raw) To: Thomas Gleixner Cc: Mike Galbraith, Tejun Heo, Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Thu 04-02-16 11:46:47, Thomas Gleixner wrote: > On Thu, 4 Feb 2016, Mike Galbraith wrote: > > On Wed, 2016-02-03 at 12:06 -0500, Tejun Heo wrote: > > > On Wed, Feb 03, 2016 at 06:01:53PM +0100, Mike Galbraith wrote: > > > > Hm, so it's ok to queue work to an offline CPU? What happens if it > > > > doesn't come back for an eternity or two? > > > > > > Right now, it just loses affinity.... > > > > WRT affinity... > > > > Somebody somewhere queues a delayed work, a timer is started on CPUX, > > work is targeted at CPUX. Now wash/rinse/repeat mod_delayed_work() > > along with migrations. Should __queue_delayed_work() not refrain from > > altering dwork->cpu once set? > > > > I'm also wondering why 22b886dd only applies to kernels >= 4.2. > > > > <quote> > > Regardless of the previous CPU a timer was on, add_timer_on() > > currently simply sets timer->flags to the new CPU. As the caller must > > be seeing the timer as idle, this is locally fine, but the timer > > leaving the old base while unlocked can lead to race conditions as > > follows. > > > > Let's say timer was on cpu 0. > > > > cpu 0 cpu 1 > > ----------------------------------------------------------------------------- > > del_timer(timer) succeeds > > del_timer(timer) > > lock_timer_base(timer) locks cpu_0_base > > add_timer_on(timer, 1) > > spin_lock(&cpu_1_base->lock) > > timer->flags set to cpu_1_base > > operates on @timer operates on @timer > > </quote> > > > > What's the difference between... > > timer->flags = (timer->flags & ~TIMER_BASEMASK) | cpu; > > and... 
> > timer_set_base(timer, base); > > > > ...that makes that fix unneeded prior to 4.2? We take the same locks > > in < 4.2 kernels, so seemingly both will diddle concurrently above. > > Indeed, you are right. > > The same can happen on pre 4.2, just the fix does not apply as we changed the > internals how the base is managed in the timer itself. Backport below. Thanks for backport Thomas and to Mike for persistence :). I've asked my friend seeing crashes with 3.18.25 to try whether this patch fixes the issues. It may take some time so stay tuned... Honza > 8<---------------------------- > > --- a/kernel/time/timer.c > +++ b/kernel/time/timer.c > @@ -956,13 +956,26 @@ EXPORT_SYMBOL(add_timer); > */ > void add_timer_on(struct timer_list *timer, int cpu) > { > - struct tvec_base *base = per_cpu(tvec_bases, cpu); > + struct tvec_base *new_base = per_cpu(tvec_bases, cpu); > + struct tvec_base *base; > unsigned long flags; > > timer_stats_timer_set_start_info(timer); > BUG_ON(timer_pending(timer) || !timer->function); > - spin_lock_irqsave(&base->lock, flags); > - timer_set_base(timer, base); > + > + /* > + * If @timer was on a different CPU, it must be migrated with the > + * old base locked to prevent other operations proceeding with the > + * wrong base locked. See lock_timer_base(). > + */ > + base = lock_timer_base(timer, &flags); > + if (base != new_base) { > + timer_set_base(timer, NULL); > + spin_unlock(&base->lock); > + base = new_base; > + spin_lock(&base->lock); > + timer_set_base(timer, base); > + } > debug_activate(timer, timer->expires); > internal_add_timer(base, timer); > spin_unlock_irqrestore(&base->lock, flags); > > > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-04 11:20 ` Jan Kara @ 2016-02-04 16:39 ` Daniel Bilik 2016-02-05 2:40 ` Mike Galbraith 0 siblings, 1 reply; 61+ messages in thread From: Daniel Bilik @ 2016-02-04 16:39 UTC (permalink / raw) To: Jan Kara; +Cc: Thomas Gleixner, Mike Galbraith, LKML, stable On Thu, 4 Feb 2016 12:20:44 +0100 Jan Kara <jack@suse.cz> wrote: > Thanks for backport Thomas and to Mike for persistence :). I've asked my > friend seeing crashes with 3.18.25 to try whether this patch fixes the > issues. It may take some time so stay tuned... Patch tested, and it really fixes the crash we were experiencing on 3.18.25 with commit 874bbfe+. But it seems to introduce a (rather scary) regression. The tested host shows abnormal cpu usage in both kernel and userland under the same load and traffic pattern. One picture is worth a thousand words, so I've taken snapshots of our graphs, see here: http://neosystem.cz/test/linux-3.18.25/ The host was running 3.18.25 with commit 874bbfe+ (1e7af29+ on 3.18-stable) reverted. With this commit included, it crashed within minutes. Around 13:30 we booted 3.18.25 with commit 874bbfe+ included and with the patch from Thomas. And around 15:40 we booted the host with the previous kernel, just to ensure this abnormal behaviour was really caused by the test kernel. Also interesting: in addition to high cpu usage, there is an abnormally high number of zombie processes reported by the system. HTH. -- Daniel Bilik neosystem.cz ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-04 16:39 ` Daniel Bilik @ 2016-02-05 2:40 ` Mike Galbraith 2016-02-05 8:11 ` Daniel Bilik 0 siblings, 1 reply; 61+ messages in thread From: Mike Galbraith @ 2016-02-05 2:40 UTC (permalink / raw) To: Daniel Bilik, Jan Kara; +Cc: Thomas Gleixner, LKML, stable On Thu, 2016-02-04 at 17:39 +0100, Daniel Bilik wrote: > On Thu, 4 Feb 2016 12:20:44 +0100 > Jan Kara <jack@suse.cz> wrote: > > > Thanks for backport Thomas and to Mike for persistence :). I've asked my > > friend seeing crashes with 3.18.25 to try whether this patch fixes the > > issues. It may take some time so stay tuned... > > Patch tested and it really fixes the crash we were experiencing on 3.18.25 > with commit 874bbfe+. But it seem to introduce (rather scary) regression. > Tested host shows abnormal cpu usage in both kernel and userland under the > same load and traffic pattern. One picture is worth a thousand words, so > I've taken snapshots of our graphs, see here: > http://neosystem.cz/test/linux-3.18.25/ > The host was running 3.18.25 with commit 874bbfe+ (1e7af29+ on > 3.18-stable) reverted. With this commit included, it crashed within > minutes. Around 13:30 we booted 3.18.25 with commit 874bbfe+ included and > with the patch from Thomas. And around 15:40 we've booted the host with > previous kernel, just to ensure this abnormal behaviour was really caused > by the test kernel. > Also interesting, in addition to high cpu usage, there is abnormally high > number of zombie processes reported by the system. IMHO you should restore the CC list and re-post. (If I were the maintainer of either the workqueue code or 3.18-stable, I'd be highly interested in this finding). -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-05 2:40 ` Mike Galbraith @ 2016-02-05 8:11 ` Daniel Bilik 2016-02-05 8:33 ` Mike Galbraith 0 siblings, 1 reply; 61+ messages in thread From: Daniel Bilik @ 2016-02-05 8:11 UTC (permalink / raw) To: Mike Galbraith Cc: Jan Kara, Thomas Gleixner, Tejun Heo, Michal Hocko, Jiri Slaby, Petr Mladek, Sasha Levin, Shaohua Li, LKML, stable On Fri, 05 Feb 2016 03:40:46 +0100 Mike Galbraith <umgwanakikbuti@gmail.com> wrote: > On Thu, 2016-02-04 at 17:39 +0100, Daniel Bilik wrote: > > On Thu, 4 Feb 2016 12:20:44 +0100 > > Jan Kara <jack@suse.cz> wrote: > > > > > Thanks for backport Thomas and to Mike for persistence :). I've > > > asked my friend seeing crashes with 3.18.25 to try whether this > > > patch fixes the issues. It may take some time so stay tuned... > > > > Patch tested and it really fixes the crash we were experiencing on > > 3.18.25 with commit 874bbfe+. But it seem to introduce (rather scary) > > regression. Tested host shows abnormal cpu usage in both kernel and > > userland under the same load and traffic pattern. One picture is worth > > a thousand words, so I've taken snapshots of our graphs, see here: > > http://neosystem.cz/test/linux-3.18.25/ > > The host was running 3.18.25 with commit 874bbfe+ (1e7af29+ on > > 3.18-stable) reverted. With this commit included, it crashed within > > minutes. Around 13:30 we booted 3.18.25 with commit 874bbfe+ included > > and with the patch from Thomas. And around 15:40 we've booted the host > > with previous kernel, just to ensure this abnormal behaviour was > > really caused by the test kernel. > > Also interesting, in addition to high cpu usage, there is abnormally > > high number of zombie processes reported by the system. > > IMHO you should restore the CC list and re-post. (If I were the > maintainer of either the workqueue code or 3.18-stable, I'd be highly > interested in this finding). 
Sorry, I hadn't realized the patch proposed by Thomas is already on its way to stable. CC restored and re-posting. -- Daniel Bilik neosystem.cz ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-05 8:11 ` Daniel Bilik @ 2016-02-05 8:33 ` Mike Galbraith 0 siblings, 0 replies; 61+ messages in thread From: Mike Galbraith @ 2016-02-05 8:33 UTC (permalink / raw) To: Daniel Bilik Cc: Jan Kara, Thomas Gleixner, Tejun Heo, Michal Hocko, Jiri Slaby, Petr Mladek, Sasha Levin, Shaohua Li, LKML, stable On Fri, 2016-02-05 at 09:11 +0100, Daniel Bilik wrote: > On Fri, 05 Feb 2016 03:40:46 +0100 > Mike Galbraith <umgwanakikbuti@gmail.com> wrote: > > IMHO you should restore the CC list and re-post. (If I were the > > maintainer of either the workqueue code or 3.18-stable, I'd be highly > > interested in this finding). > > Sorry, I haven't realized tha patch proposed by Thomas is already on its > way to stable. CC restored and re-posting. I don't know where it's at, but where things stand is this: the fix is needed, but when combined with the patch that uncovered the need for it, the two aren't playing well together according to your test result. Given both patches are already in upstream kernels, and presumably Thomas's patch will eventually wander to stable to fix them up, there might be some maintainer interest. -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 16:24 ` Tejun Heo 2016-02-03 16:48 ` Michal Hocko 2016-02-03 17:01 ` Mike Galbraith @ 2016-02-03 18:46 ` Thomas Gleixner 2016-02-03 19:01 ` Tejun Heo 2016-02-05 5:44 ` Mike Galbraith 3 siblings, 1 reply; 61+ messages in thread From: Thomas Gleixner @ 2016-02-03 18:46 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed, 3 Feb 2016, Tejun Heo wrote: > On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote: > > > The CPU was 168, and that one was offlined in the meantime. So > > > __queue_work fails at: > > > if (!(wq->flags & WQ_UNBOUND)) > > > pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); > > > else > > > pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); > > > ^^^ ^^^^ NODE is -1 > > > \ pwq is NULL > > > > > > if (last_pool && last_pool != pwq->pool) { <--- BOOM > > So, the proper fix here is keeping cpu <-> node mapping stable across > cpu on/offlining which has been being worked on for a long time now. > The patchst is pending and it fixes other issues too. > > > So I think 874bbfe600a6 is really bogus. It should be reverted. We > > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly > > schedule per-cpu work on the CPU we need it to run on"). This which > > should be used for the stable trees as a replacement. > > It's not bogus. We can't flip a property that has been guaranteed > without any provision for verification. Why do you think vmstat blow > up in the first place? vmstat would be the canary case as it runs > frequently on all systems. It's exactly the sign that we can't break > this guarantee willy-nilly. You're in complete failure denial mode once again. Fact is: That patch breaks stuff because there is no stable cpu -> node mapping across cpu on/offlining. As a result this selects unbound_pwq_by_node() on node -1. 
The reason why you need to do that work->cpu assignment might be legitimate, but that does not justify that you expose systems to a lurking out of bounds access which results in a NULL pointer dereference. As long as cpu_to_node(cpu) can return -1, we need a sanity check there. And we need that now and not at some point in the future when the patches establishing a stable cpu -> node mapping are finished. Stop arguing around a bug which really exists and was exposed by this patch. Thanks, tglx ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 18:46 ` Thomas Gleixner @ 2016-02-03 19:01 ` Tejun Heo 2016-02-03 19:05 ` Thomas Gleixner 0 siblings, 1 reply; 61+ messages in thread From: Tejun Heo @ 2016-02-03 19:01 UTC (permalink / raw) To: Thomas Gleixner Cc: Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik Hello, Thomas. On Wed, Feb 03, 2016 at 07:46:11PM +0100, Thomas Gleixner wrote: > > > So I think 874bbfe600a6 is really bogus. It should be reverted. We > > > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly > > > schedule per-cpu work on the CPU we need it to run on"). This which > > > should be used for the stable trees as a replacement. > > > > It's not bogus. We can't flip a property that has been guaranteed > > without any provision for verification. Why do you think vmstat blow > > up in the first place? vmstat would be the canary case as it runs > > frequently on all systems. It's exactly the sign that we can't break > > this guarantee willy-nilly. > > You're in complete failure denial mode once again. Well, you're in an unnecessary escalation mode as usual. Was the attitude really necessary? Chill out and read the thread again. Michal is saying the dwork->cpu assignment was bogus and I was refuting that. > Fact is: > > That patch breaks stuff because there is no stable cpu -> node mapping > accross cpu on/offlining. As a result this selects unbound_pwq_by_node() on > node -1. > > The reason why you need to do that work->cpu assignment might be legitimate, > but that does not justify that you expose systems to a lurking out of bounds > access which results in a NULL pointer dereference. > > As long as cpu_to_node(cpu) can return -1, we need a sanity check there. And > we need that now and not at some point in the future when the patches > establishing a stable cpu -> node mapping are finished. 
> > Stop arguing around a bug which really exists and was exposed by this patch. Michal brought it up here but there's a different thread where Mike reported the NUMA_NO_NODE issue and I already posted the fix. http://lkml.kernel.org/g/20160203185425.GK14091@mtj.duckdns.org Thanks. -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 19:01 ` Tejun Heo @ 2016-02-03 19:05 ` Thomas Gleixner 2016-02-03 19:15 ` Tejun Heo 0 siblings, 1 reply; 61+ messages in thread From: Thomas Gleixner @ 2016-02-03 19:05 UTC (permalink / raw) To: Tejun Heo Cc: Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed, 3 Feb 2016, Tejun Heo wrote: > On Wed, Feb 03, 2016 at 07:46:11PM +0100, Thomas Gleixner wrote: > > > > So I think 874bbfe600a6 is really bogus. It should be reverted. We > > > > already have a proper fix for vmstat 176bed1de5bf ("vmstat: explicitly > > > > schedule per-cpu work on the CPU we need it to run on"). This which > > > > should be used for the stable trees as a replacement. > > > > > > It's not bogus. We can't flip a property that has been guaranteed > > > without any provision for verification. Why do you think vmstat blow > > > up in the first place? vmstat would be the canary case as it runs > > > frequently on all systems. It's exactly the sign that we can't break > > > this guarantee willy-nilly. > > > > You're in complete failure denial mode once again. > > Well, you're in an unnecessary escalation mode as usual. Was the > attitude really necessary? Chill out and read the thread again. > Michal is saying the dwork->cpu assignment was bogus and I was > refuting that. Right, but at the same time you could have admitted that the current state is buggy and needs a sanity check in unbound_pwq_by_node(). > Michal brought it up here but there's a different thread where Mike > reported NUMA_NO_NODE issue and I already posted the fix. > > http://lkml.kernel.org/g/20160203185425.GK14091@mtj.duckdns.org 5 minutes ago w/o cc'ing the people who participated in that discussion. Thanks, tglx ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 19:05 ` Thomas Gleixner @ 2016-02-03 19:15 ` Tejun Heo 0 siblings, 0 replies; 61+ messages in thread From: Tejun Heo @ 2016-02-03 19:15 UTC (permalink / raw) To: Thomas Gleixner Cc: Michal Hocko, Jiri Slaby, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed, Feb 03, 2016 at 08:05:57PM +0100, Thomas Gleixner wrote: > > Well, you're in an unnecessary escalation mode as usual. Was the > > attitude really necessary? Chill out and read the thread again. > > Michal is saying the dwork->cpu assignment was bogus and I was > > refuting that. > > Right, but at the same time you could have admitted, that the current state is > buggy and needs a sanity check in unbound_pwq_by_node(). It's crashing. Of course it's buggy. The main discussion on the bug was on the other thread and I was trying to put out the confusions posted on this thread. > > Michal brought it up here but there's a different thread where Mike > > reported NUMA_NO_NODE issue and I already posted the fix. > > > > http://lkml.kernel.org/g/20160203185425.GK14091@mtj.duckdns.org > > 5 minute ago w/o cc'ing the people who participated in that discussion. Yeah, I got to the thread this morning and got your email right after sending out the patch. I don't know. The handling of this wasn't out of the norm. I asked this multiple times but let me try again. Can we please try to stay civil and technical? There are times where escalation is necessary but this one was gratuitous. Thanks. -- tejun ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: Crashes with 874bbfe600a6 in 3.18.25 2016-02-03 16:24 ` Tejun Heo ` (2 preceding siblings ...) 2016-02-03 18:46 ` Thomas Gleixner @ 2016-02-05 5:44 ` Mike Galbraith 3 siblings, 0 replies; 61+ messages in thread From: Mike Galbraith @ 2016-02-05 5:44 UTC (permalink / raw) To: Tejun Heo, Michal Hocko Cc: Jiri Slaby, Thomas Gleixner, Petr Mladek, Jan Kara, Ben Hutchings, Sasha Levin, Shaohua Li, LKML, stable, Daniel Bilik On Wed, 2016-02-03 at 11:24 -0500, Tejun Heo wrote: > On Wed, Feb 03, 2016 at 01:28:56PM +0100, Michal Hocko wrote: > > > The CPU was 168, and that one was offlined in the meantime. So > > > __queue_work fails at: > > > if (!(wq->flags & WQ_UNBOUND)) > > > pwq = per_cpu_ptr(wq->cpu_pwqs, cpu); > > > else > > > pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu)); > > > ^^^ ^^^^ NODE is -1 > > > \ pwq is NULL > > > > > > if (last_pool && last_pool != pwq->pool) { <--- BOOM > > So, the proper fix here is keeping cpu <-> node mapping stable across > cpu on/offlining which has been being worked on for a long time now. > The patchst is pending and it fixes other issues too. > > > So I think 874bbfe600a6 is really bogus. It should be reverted. We > > already have a proper fix for vmstat 176bed1de5bf ("vmstat: > > explicitly > > schedule per-cpu work on the CPU we need it to run on"). This which > > should be used for the stable trees as a replacement. > > It's not bogus. We can't flip a property that has been guaranteed > without any provision for verification. Why do you think vmstat blow > up in the first place? vmstat would be the canary case as it runs > frequently on all systems. It's exactly the sign that we can't break > this guarantee willy-nilly. If the intent of the below is to fulfill a guarantee... 
+ /* timer isn't guaranteed to run in this cpu, record earlier */ + if (cpu == WORK_CPU_UNBOUND) + cpu = raw_smp_processor_id(); dwork->cpu = cpu; timer->expires = jiffies + delay; - if (unlikely(cpu != WORK_CPU_UNBOUND)) - add_timer_on(timer, cpu); - else - add_timer(timer); + add_timer_on(timer, cpu); ...it appears to be incomplete. Hotplug aside, when adding a timer with the expectation that it stay put, should it not also be pinned? -Mike ^ permalink raw reply [flat|nested] 61+ messages in thread
end of thread, other threads:[~2016-02-09 18:28 UTC | newest] Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-01-20 21:19 Crashes with 874bbfe600a6 in 3.18.25 Jan Kara 2016-01-20 21:39 ` Shaohua Li 2016-01-21 9:52 ` Jan Kara 2016-01-21 13:29 ` Sasha Levin 2016-01-22 1:10 ` Sasha Levin 2016-01-22 16:09 ` Tejun Heo 2016-01-23 2:20 ` Ben Hutchings 2016-01-23 16:11 ` Thomas Gleixner 2016-01-26 9:34 ` Jan Kara 2016-01-26 9:34 ` Jan Kara 2016-01-26 9:49 ` Thomas Gleixner 2016-01-26 9:49 ` Thomas Gleixner 2016-01-26 11:14 ` Petr Mladek 2016-01-26 11:14 ` Petr Mladek 2016-01-26 13:09 ` Thomas Gleixner 2016-01-26 13:09 ` Thomas Gleixner 2016-02-03 9:35 ` Jiri Slaby 2016-02-03 10:41 ` Thomas Gleixner 2016-02-03 12:28 ` Michal Hocko 2016-02-03 16:24 ` Tejun Heo 2016-02-03 16:48 ` Michal Hocko 2016-02-03 16:59 ` Tejun Heo 2016-02-04 6:37 ` Michal Hocko 2016-02-04 7:40 ` Michal Hocko 2016-02-03 17:01 ` Mike Galbraith 2016-02-03 17:06 ` Tejun Heo 2016-02-03 17:13 ` Mike Galbraith 2016-02-03 17:15 ` Tejun Heo 2016-02-04 2:00 ` Mike Galbraith 2016-02-05 16:49 ` Tejun Heo 2016-02-05 20:47 ` Mike Galbraith 2016-02-05 20:54 ` Tejun Heo 2016-02-05 20:59 ` Mike Galbraith 2016-02-05 21:06 ` Tejun Heo 2016-02-06 13:07 ` Henrique de Moraes Holschuh 2016-02-07 5:19 ` Mike Galbraith 2016-02-07 5:59 ` Mike Galbraith 2016-02-09 15:31 ` Mike Galbraith 2016-02-09 16:39 ` Linus Torvalds 2016-02-09 16:50 ` Tejun Heo 2016-02-09 17:04 ` Mike Galbraith 2016-02-09 17:54 ` Tejun Heo 2016-02-09 17:56 ` Mike Galbraith 2016-02-09 18:02 ` Mike Galbraith 2016-02-09 18:27 ` Tejun Heo 2016-02-09 17:04 ` Linus Torvalds 2016-02-09 17:51 ` Tejun Heo 2016-02-09 18:06 ` Linus Torvalds 2016-02-04 10:04 ` Mike Galbraith 2016-02-04 10:46 ` Thomas Gleixner 2016-02-04 11:07 ` Mike Galbraith 2016-02-04 11:20 ` Jan Kara 2016-02-04 16:39 ` Daniel Bilik 2016-02-05 2:40 ` Mike Galbraith 2016-02-05 8:11 ` Daniel Bilik 2016-02-05 8:33 ` Mike 
Galbraith 2016-02-03 18:46 ` Thomas Gleixner 2016-02-03 19:01 ` Tejun Heo 2016-02-03 19:05 ` Thomas Gleixner 2016-02-03 19:15 ` Tejun Heo 2016-02-05 5:44 ` Mike Galbraith