* Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
       [not found]       ` <20140108104340.GC27046@suse.de>
@ 2014-01-08 13:48         ` Mel Gorman
  2014-01-09  4:17           ` Greg KH
                             ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Mel Gorman @ 2014-01-08 13:48 UTC
  To: Greg KH
  Cc: athorlton, riel, chegu_vinod, Len Brown, H. Peter Anvin, LKML, stable

Adding LKML to the list as this -stable snifftest has identified an
upstream regression.

On Wed, Jan 08, 2014 at 10:43:40AM +0000, Mel Gorman wrote:
> On Tue, Jan 07, 2014 at 08:30:12PM +0000, Mel Gorman wrote:
> > On Tue, Jan 07, 2014 at 10:54:40AM -0800, Greg KH wrote:
> > > On Tue, Jan 07, 2014 at 06:17:15AM -0800, Greg KH wrote:
> > > > On Tue, Jan 07, 2014 at 02:00:35PM +0000, Mel Gorman wrote:
> > > > > A number of NUMA balancing patches were tagged for -stable but I got a
> > > > > number of rejected mails from either Greg or his robot minion.  The list
> > > > > of relevant patches is
> > > > > 
> > > > > FAILED: patch "[PATCH] mm: numa: serialise parallel get_user_page against THP"
> > > > > FAILED: patch "[PATCH] mm: numa: call MMU notifiers on THP migration"
> > > > > MERGED: Patch "mm: clear pmd_numa before invalidating"
> > > > > FAILED: patch "[PATCH] mm: numa: do not clear PMD during PTE update scan"
> > > > > FAILED: patch "[PATCH] mm: numa: do not clear PTE for pte_numa update"
> > > > > MERGED: Patch "mm: numa: ensure anon_vma is locked to prevent parallel THP splits"
> > > > > MERGED: Patch "mm: numa: avoid unnecessary work on the failure path"
> > > > > MERGED: Patch "sched: numa: skip inaccessible VMAs"
> > > > > FAILED: patch "[PATCH] mm: numa: clear numa hinting information on mprotect"
> > > > > FAILED: patch "[PATCH] mm: numa: avoid unnecessary disruption of NUMA hinting during"
> > > > > Patch "mm: fix TLB flush race between migration, and change_protection_range"
> > > > > Patch "mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates"
> > > > > FAILED: patch "[PATCH] mm: numa: defer TLB flush for THP migration as long as"
> > > > > 
> > > > > Fixing the rejects one at a time may cause other conflicts due to ordering
> > > > > issues. Instead, this patch series against 3.12.6 is the full list of
> > > > > backported patches in the expected order. Greg, unfortunately this means
> > > > > you may have to drop some patches already in your stable tree and reapply
> > > > > but on the plus side they should be then in the correct order for bisection
> > > > > purposes and you'll know I've tested this combination of patches.
> > > > 
> > > > Many thanks for these, I'll go queue them up in a bit and drop the
> > > > others to ensure I got all of this correct.
> > > 
> > > Ok, I've now queued all of these up, in this order, so we should be
> > > good.
> > > 
> > > I'll do a -rc2 in a bit as it needs some testing.
> > > 
> > 
> > Thanks a million. I should be cc'd on some of those so I'll pick up the
> > final result and run it through the same tests just to be sure.
> > 
> 
> Ok, tests completed and look more or less as expected. This is not to
> say the performance results are *good* as such.  Workloads that normally
> demonstrate automatic numa balancing suffered because of other patches that
> were merged (primarily fair zone allocation policy) that had interesting
> side-effects. However, it now does not crash under heavy stress and I
> prefer working a little slowly than crashing fast. NAS at least looks
> better.
> 
> Other workloads like kernel builds, page fault microbench looked good as
> expected from the fair zone allocation policy fixes.
> 
> Big downside is that ebizzy performance is *destroyed* in that RC2 patch
> somewhere
> 
> ebizzy
>                          3.12.6                3.12.6            3.12.7-rc2
>                         vanilla         backport-v1r2             stablerc2
> Mean   1      3278.67 (  0.00%)     3180.67 ( -2.99%)     3212.00 ( -2.03%)
> Mean   2      2322.67 (  0.00%)     2294.67 ( -1.21%)     1839.00 (-20.82%)
> Mean   3      2257.00 (  0.00%)     2218.67 ( -1.70%)     1664.00 (-26.27%)
> Mean   4      2268.00 (  0.00%)     2224.67 ( -1.91%)     1629.67 (-28.15%)
> Mean   5      2247.67 (  0.00%)     2255.67 (  0.36%)     1582.33 (-29.60%)
> Mean   6      2263.33 (  0.00%)     2251.33 ( -0.53%)     1547.67 (-31.62%)
> Mean   7      2273.67 (  0.00%)     2222.67 ( -2.24%)     1545.67 (-32.02%)
> Mean   8      2254.67 (  0.00%)     2232.33 ( -0.99%)     1535.33 (-31.90%)
> Mean   12     2237.67 (  0.00%)     2266.33 (  1.28%)     1543.33 (-31.03%)
> Mean   16     2201.33 (  0.00%)     2252.67 (  2.33%)     1540.33 (-30.03%)
> Mean   20     2205.67 (  0.00%)     2229.33 (  1.07%)     1537.33 (-30.30%)
> Mean   24     2162.33 (  0.00%)     2168.67 (  0.29%)     1535.33 (-29.00%)
> Mean   28     2139.33 (  0.00%)     2107.67 ( -1.48%)     1535.00 (-28.25%)
> Mean   32     2084.67 (  0.00%)     2089.00 (  0.21%)     1537.33 (-26.26%)
> Mean   36     2002.00 (  0.00%)     2020.00 (  0.90%)     1530.33 (-23.56%)
> Mean   40     1972.67 (  0.00%)     1978.67 (  0.30%)     1530.33 (-22.42%)
> Mean   44     1951.00 (  0.00%)     1953.67 (  0.14%)     1531.00 (-21.53%)
> Mean   48     1931.67 (  0.00%)     1930.67 ( -0.05%)     1526.67 (-20.97%)
> 
> Figures are records/sec, more is better for increasing numbers of threads
> up to 48 which is the number of logical CPUs in the machine. Three kernels
> tested
> 
> 3.12.6        is self-explanatory
> backport-v1r2 is the backported series I sent you
> stablerc2     is the rc2 patch I pulled from kernel.org
> 
> I'm not that familiar with the stable workflow but stable-queue.git looked
> like it had the correct quilt tree so bisection is in progress. If I had
> to bet money on it, I'd bet it's going to be scheduler or power management
> related mostly because problems in both of those areas have tended to
> screw ebizzy recently.
> 

I was not far off. Bisection identified the following commit

3d97ea0816589c818ac62fb401e61c3b6a59f351 is the first bad commit
commit 3d97ea0816589c818ac62fb401e61c3b6a59f351
Author: Len Brown <len.brown@intel.com>
Date:   Wed Dec 18 16:44:57 2013 -0500

    x86 idle: Repair large-server 50-watt idle-power regression

    commit 40e2d7f9b5dae048789c64672bf3027fbb663ffa upstream.

    Linux 3.10 changed the timing of how thread_info->flags is touched:

        x86: Use generic idle loop
        (7d1a941731fabf27e5fb6edbebb79fe856edb4e5)

    This caused Intel NHM-EX and WSM-EX servers to experience a large number
    of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
    from deep C-states to shallow C-states, which caused these platforms
    to experience a significant increase in idle power.

    Note that this issue was already present before the commit above,
    however, it wasn't seen often enough to be noticed in power measurements.

    Here we extend an errata workaround from the Core2 EX "Dunnington"
    to extend to NHM-EX and WSM-EX, to prevent these immediate
    returns from MWAIT, reducing idle power on these platforms.

    While only acpi_idle ran on Dunnington, intel_idle
    may also run on these two newer systems.
    As of today, there are no other models that are known
    to need this tweak.

    Link: http://lkml.kernel.org/r/CAJvTdK=%2BaNN66mYpCGgbHGCHhYQAKx-vB0kJSWjVpsNb_hOAtQ@mail.gmail.com
    Signed-off-by: Len Brown <len.brown@intel.com>
    Link: http://lkml.kernel.org/r/baff264285f6e585df757d58b17788feabc68918.1387403066.git.len.brown@intel.com
    Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Len, HPA, the x86 idle regression fix fubars ebizzy as a consequence and I
don't know why. I know the workload is not that important (and I expected
ebizzy to be unaffected in this test) but it is probably indicative of
other performance regressions hiding in there. It was caught via -stable
testing by accident but I checked and upstream is also affected. This is
a snippet from the bisection log:

Wed 8 Jan 09:53:59 GMT 2014 compass ebizzy v3.12.6 mean-4:2317 good
Wed 8 Jan 10:13:04 GMT 2014 compass ebizzy v3.12.7-rc2 mean-4:1631 bad
Wed 8 Jan 10:27:45 GMT 2014 compass ebizzy a202b4808e500f4fd53b6cec150c8fe214c70183 mean-4:1620 bad
Wed 8 Jan 10:41:36 GMT 2014 compass ebizzy c915b8fa860e189cb84898a30f135399baa827fa mean-4:2290 good
Wed 8 Jan 10:55:14 GMT 2014 compass ebizzy c915b8fa860e189cb84898a30f135399baa827fa mean-4:2266 good
Wed 8 Jan 11:09:04 GMT 2014 compass ebizzy c62a6f8a28bf8897ba0903cf332d761c1132e48d mean-4:1624 bad
Wed 8 Jan 11:22:46 GMT 2014 compass ebizzy 346679aad15c3608844f6b433b8d8ba56ad03802 mean-4:2280 good
Wed 8 Jan 11:36:32 GMT 2014 compass ebizzy 36b9512dc19b535d72c1035048a95ec1c765d403 mean-4:1641 bad
Wed 8 Jan 11:50:22 GMT 2014 compass ebizzy 1a82fc9ab8bb6b4a5ee5cd32d570d6ff0b77efb2 mean-4:1627 bad
Wed 8 Jan 12:04:15 GMT 2014 compass ebizzy 3d97ea0816589c818ac62fb401e61c3b6a59f351 mean-4:1619 bad
Wed 8 Jan 13:10:03 GMT 2014 compass ebizzy v3.13-rc7 mean-4:1619 bad
Wed 8 Jan 13:39:19 GMT 2014 compass ebizzy v3.12.7-rc2-revert mean-4:2276 good

mean-4 figures are records/sec as recorded by the bisection test. The
bisection points are based on the -stable quilt tree so the commit ids are
not meaningful outside that tree, but the good/bad figures are consistent
(good runs are 2266-2317, bad runs are 1619-1641), leading me to conclude
the bisection is valid.

v3.12.6 was 2317 records/second and considered "good". The 3.12.7-rc2
stable candidate and 3.13-rc7 are both "bad". Reverting the single patch
from v3.12.7-rc2 restores performance.
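
As a sanity check on the figures above, here is a minimal sketch that parses
log lines of the shape shown in the snippet and computes the change relative
to the v3.12.6 baseline. It is not the bisection harness used for these runs
(that is not shown in this thread); the regex, the function names and the 5%
tolerance are assumptions made purely for illustration.

```python
import re

# Assumed line shape, taken from the log snippet above:
#   <timestamp> compass ebizzy <bisection point> mean-4:<records/sec> <verdict>
LOG_RE = re.compile(
    r"ebizzy (?P<point>\S+) mean-4:(?P<rate>\d+) (?P<verdict>good|bad)$"
)


def parse_line(line):
    """Return (point, rate, verdict) for one log line, or None if it does not match."""
    m = LOG_RE.search(line.strip())
    if m is None:
        return None
    return m.group("point"), int(m.group("rate")), m.group("verdict")


def relative_change(baseline, value):
    """Percentage change of value relative to baseline (negative means slower)."""
    return (value - baseline) / baseline * 100.0


def classify(rate, baseline, tolerance_pct=5.0):
    """Hypothetical rule: 'bad' if more than tolerance_pct below the baseline."""
    return "good" if relative_change(baseline, rate) >= -tolerance_pct else "bad"


# A few lines copied verbatim from the bisection log above.
LOG = """\
Wed 8 Jan 09:53:59 GMT 2014 compass ebizzy v3.12.6 mean-4:2317 good
Wed 8 Jan 10:13:04 GMT 2014 compass ebizzy v3.12.7-rc2 mean-4:1631 bad
Wed 8 Jan 13:10:03 GMT 2014 compass ebizzy v3.13-rc7 mean-4:1619 bad
Wed 8 Jan 13:39:19 GMT 2014 compass ebizzy v3.12.7-rc2-revert mean-4:2276 good
"""

if __name__ == "__main__":
    entries = [e for e in map(parse_line, LOG.splitlines()) if e is not None]
    baseline = entries[0][1]  # v3.12.6 is the known-good reference
    for point, rate, verdict in entries:
        change = relative_change(baseline, rate)
        print(f"{point:20s} {rate:5d} {change:+7.2f}% "
              f"logged={verdict} computed={classify(rate, baseline)}")
```

The same relative_change() helper reproduces the percentages in the ebizzy
table quoted earlier: the "Mean 4" row, 2268.00 against 1629.67, comes out
at -28.15%. With the assumed 5% tolerance every point in the log snippet is
classified the same way it was recorded, which matches the observation that
good and bad runs sit in two well-separated bands.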

Greg, this does not affect your -stable release as such because upstream is
also affected. If you release with the patch merged then the upstream fix
(whatever that is) will also need to be included in -stable later. If you
release without the patch then both upstream fixes will be required later
and some Intel machines will continue to consume excessive amounts of
power in the meantime.

-- 
Mel Gorman
SUSE Labs

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
  2014-01-08 13:48         ` Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches) Mel Gorman
@ 2014-01-09  4:17           ` Greg KH
  2014-01-09 20:07           ` Len Brown
  2014-01-13 19:24           ` Mel Gorman
  2 siblings, 0 replies; 11+ messages in thread
From: Greg KH @ 2014-01-09  4:17 UTC
  To: Mel Gorman
  Cc: athorlton, riel, chegu_vinod, Len Brown, H. Peter Anvin, LKML, stable

On Wed, Jan 08, 2014 at 01:48:58PM +0000, Mel Gorman wrote:
> Greg, this does not affect your -stable release as such because upstream is
> also affected. If you release with the patch merged then the upstream fix
> (whatever that is) will also need to be included in -stable later. If you
> release without the patch then both upstream fixes will be later required
> and some Intel machines will continue to consume excessive amounts of
> power in the meantime.

Thanks, I'll just leave -stable as-is, and pick up the fix from upstream
when it hits there.

greg k-h

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
  2014-01-08 13:48         ` Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches) Mel Gorman
  2014-01-09  4:17           ` Greg KH
@ 2014-01-09 20:07           ` Len Brown
  2014-01-10 10:14             ` Mel Gorman
       [not found]             ` <CAJvTdK=vJxYgtLOYZZPrwGNgYQrFVeCq18RwzEfh5n_tZyeP9g@mail.gmail.com>
  2014-01-13 19:24           ` Mel Gorman
  2 siblings, 2 replies; 11+ messages in thread
From: Len Brown @ 2014-01-09 20:07 UTC
  To: Mel Gorman
  Cc: Greg KH, athorlton, Rik van Riel, chegu_vinod, Len Brown,
	H. Peter Anvin, LKML, stable

Hi Mel,
Thanks for the bisect.
What is the cpuid of the machine that sees the regression?

thanks,
-Len


-- 
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
  2014-01-09 20:07           ` Len Brown
@ 2014-01-10 10:14             ` Mel Gorman
       [not found]             ` <CAJvTdK=vJxYgtLOYZZPrwGNgYQrFVeCq18RwzEfh5n_tZyeP9g@mail.gmail.com>
  1 sibling, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2014-01-10 10:14 UTC (permalink / raw)
  To: Len Brown
  Cc: Greg KH, athorlton, Rik van Riel, chegu_vinod, Len Brown,
	H. Peter Anvin, LKML, stable

On Thu, Jan 09, 2014 at 03:07:00PM -0500, Len Brown wrote:
> Hi Mel,
> Thanks for the bisect.
> What is the cpuid of the machine that sees the regression?
> 

cpuid information for CPU 0. Machine is 4 socket, 48 threads in total.

CPU 0:
   vendor_id = "GenuineIntel"
   version information (1/eax):
      processor type  = primary processor (0)
      family          = Intel Pentium Pro/II/III/Celeron/Core/Core 2/Atom, AMD Athlon/Duron, Cyrix M2, VIA C3 (6)
      model           = 0xf (15)
      stepping id     = 0x2 (2)
      extended family = 0x0 (0)
      extended model  = 0x2 (2)
      (simple synth)  = Intel Xeon E7-8800 / Xeon E7-4800 / Xeon E7-2800 (Westmere-EX A2), 32nm
   miscellaneous (1/ebx):
      process local APIC physical ID = 0x0 (0)
      cpu count                      = 0x40 (64)
      CLFLUSH line size              = 0x8 (8)
      brand index                    = 0x0 (0)
   brand id = 0x00 (0): unknown
   feature information (1/edx):
      x87 FPU on chip                        = true
      virtual-8086 mode enhancement          = true
      debugging extensions                   = true
      page size extensions                   = true
      time stamp counter                     = true
      RDMSR and WRMSR support                = true
      physical address extensions            = true
      machine check exception                = true
      CMPXCHG8B inst.                        = true
      APIC on chip                           = true
      SYSENTER and SYSEXIT                   = true
      memory type range registers            = true
      PTE global bit                         = true
      machine check architecture             = true
      conditional move/compare instruction   = true
      page attribute table                   = true
      page size extension                    = true
      processor serial number                = false
      CLFLUSH instruction                    = true
      debug store                            = true
      thermal monitor and clock ctrl         = true
      MMX Technology                         = true
      FXSAVE/FXRSTOR                         = true
      SSE extensions                         = true
      SSE2 extensions                        = true
      self snoop                             = true
      hyper-threading / multi-core supported = true
      therm. monitor                         = true
      IA64                                   = false
      pending break event                    = true
   feature information (1/ecx):
      PNI/SSE3: Prescott New Instructions     = true
      PCLMULDQ instruction                    = true
      64-bit debug store                      = true
      MONITOR/MWAIT                           = true
      CPL-qualified debug store               = true
      VMX: virtual machine extensions         = true
      SMX: safer mode extensions              = true
      Enhanced Intel SpeedStep Technology     = true
      thermal monitor 2                       = true
      SSSE3 extensions                        = true
      context ID: adaptive or shared L1 data  = false
      FMA instruction                         = false
      CMPXCHG16B instruction                  = true
      xTPR disable                            = true
      perfmon and debug                       = true
      process context identifiers             = true
      direct cache access                     = true
      SSE4.1 extensions                       = true
      SSE4.2 extensions                       = true
      extended xAPIC support                  = true
      MOVBE instruction                       = false
      POPCNT instruction                      = true
      time stamp counter deadline             = false
      AES instruction                         = true
      XSAVE/XSTOR states                      = false
      OS-enabled XSAVE/XSTOR                  = false
      AVX: advanced vector extensions         = false
      F16C half-precision convert instruction = false
      RDRAND instruction                      = false
      hypervisor guest status                 = false
   cache and TLB information (2):
      0x5a: data TLB: 2M/4M pages, 4-way, 32 entries
      0x03: data TLB: 4K pages, 4-way, 64 entries
      0x55: instruction TLB: 2M/4M pages, fully, 7 entries
      0xeb: L3 cache: 18M, 24-way, 64 byte lines
      0xb2: instruction TLB: 4K, 4-way, 64 entries
      0xf0: 64 byte prefetching
      0x2c: L1 data cache: 32K, 8-way, 64 byte lines
      0x21: L2 cache: 256K MLC, 8-way, 64 byte lines
      0xca: L2 TLB: 4K, 4-way, 512 entries
      0x09: L1 instruction cache: 32K, 4-way, 64-byte lines
   processor serial number: 0002-06F2-0000-0000-0000-0000
   deterministic cache parameters (4):
      --- cache 0 ---
      cache type                           = data cache (1)
      cache level                          = 0x1 (1)
      self-initializing cache level        = true
      fully associative cache              = false
      extra threads sharing this cache     = 0x1 (1)
      extra processor cores on this die    = 0x1f (31)
      system coherency line size           = 0x3f (63)
      physical line partitions             = 0x0 (0)
      ways of associativity                = 0x7 (7)
      WBINVD/INVD behavior on lower caches = false
      inclusive to lower caches            = false
      complex cache indexing               = false
      number of sets - 1 (s)               = 63
      --- cache 1 ---
      cache type                           = instruction cache (2)
      cache level                          = 0x1 (1)
      self-initializing cache level        = true
      fully associative cache              = false
      extra threads sharing this cache     = 0x1 (1)
      extra processor cores on this die    = 0x1f (31)
      system coherency line size           = 0x3f (63)
      physical line partitions             = 0x0 (0)
      ways of associativity                = 0x3 (3)
      WBINVD/INVD behavior on lower caches = false
      inclusive to lower caches            = false
      complex cache indexing               = false
      number of sets - 1 (s)               = 127
      --- cache 2 ---
      cache type                           = unified cache (3)
      cache level                          = 0x2 (2)
      self-initializing cache level        = true
      fully associative cache              = false
      extra threads sharing this cache     = 0x1 (1)
      extra processor cores on this die    = 0x1f (31)
      system coherency line size           = 0x3f (63)
      physical line partitions             = 0x0 (0)
      ways of associativity                = 0x7 (7)
      WBINVD/INVD behavior on lower caches = false
      inclusive to lower caches            = false
      complex cache indexing               = false
      number of sets - 1 (s)               = 511
      --- cache 3 ---
      cache type                           = unified cache (3)
      cache level                          = 0x3 (3)
      self-initializing cache level        = true
      fully associative cache              = false
      extra threads sharing this cache     = 0x3f (63)
      extra processor cores on this die    = 0x1f (31)
      system coherency line size           = 0x3f (63)
      physical line partitions             = 0x0 (0)
      ways of associativity                = 0x17 (23)
      WBINVD/INVD behavior on lower caches = false
      inclusive to lower caches            = true
      complex cache indexing               = true
      number of sets - 1 (s)               = 12287
   MONITOR/MWAIT (5):
      smallest monitor-line size (bytes)       = 0x40 (64)
      largest monitor-line size (bytes)        = 0x40 (64)
      enum of Monitor-MWAIT exts supported     = true
      supports intrs as break-event for MWAIT  = true
      number of C0 sub C-states using MWAIT    = 0x0 (0)
      number of C1 sub C-states using MWAIT    = 0x2 (2)
      number of C2 sub C-states using MWAIT    = 0x1 (1)
      number of C3/C6 sub C-states using MWAIT = 0x1 (1)
      number of C4/C7 sub C-states using MWAIT = 0x0 (0)
   Thermal and Power Management Features (6):
      digital thermometer                     = true
      Intel Turbo Boost Technology            = false
      ARAT always running APIC timer          = true
      PLN power limit notification            = false
      ECMD extended clock modulation duty     = false
      PTM package thermal management          = false
      digital thermometer thresholds          = 0x1 (1)
      ACNT/MCNT supported performance measure = true
      ACNT2 available                         = false
      performance-energy bias capability      = false
   extended feature flags (7):
      FSGSBASE instructions                   = false
      BMI instruction                         = false
      SMEP support                            = false
      enhanced REP MOVSB/STOSB                = false
      INVPCID instruction                     = false
   Direct Cache Access Parameters (9):
      PLATFORM_DCA_CAP MSR bits = 0
   Architecture Performance Monitoring Features (0xa/eax):
      version ID                               = 0x3 (3)
      number of counters per logical processor = 0x4 (4)
      bit width of counter                     = 0x30 (48)
      length of EBX bit vector                 = 0x7 (7)
   Architecture Performance Monitoring Features (0xa/ebx):
      core cycle event not available           = false
      instruction retired event not available  = false
      reference cycles event not available     = true
      last-level cache ref event not available = false
      last-level cache miss event not avail    = false
      branch inst retired event not available  = false
      branch mispred retired event not avail   = false
   Architecture Performance Monitoring Features (0xa/edx):
      number of fixed counters    = 0x3 (3)
      bit width of fixed counters = 0x30 (48)
   x2APIC features / processor topology (0xb):
      --- level 0 (thread) ---
      bits to shift APIC ID to get next = 0x1 (1)
      logical processors at this level  = 0x2 (2)
      level number                      = 0x0 (0)
      level type                        = thread (1)
      extended APIC ID                  = 0
      --- level 1 (core) ---
      bits to shift APIC ID to get next = 0x6 (6)
      logical processors at this level  = 0xc (12)
      level number                      = 0x1 (1)
      level type                        = core (2)
      extended APIC ID                  = 0
   extended feature flags (0x80000001/edx):
      SYSCALL and SYSRET instructions        = true
      execution disable                      = true
      1-GB large page support                = true
      RDTSCP                                 = true
      64-bit extensions technology available = true
   Intel feature flags (0x80000001/ecx):
      LAHF/SAHF supported in 64-bit mode = true
   brand = "       Intel(R) Xeon(R) CPU E7- 4807  @ 1.87GHz"
   L1 TLB/cache information: 2M/4M pages & L1 TLB (0x80000005/eax):
      instruction # entries     = 0x0 (0)
      instruction associativity = 0x0 (0)
      data # entries            = 0x0 (0)
      data associativity        = 0x0 (0)
   L1 TLB/cache information: 4K pages & L1 TLB (0x80000005/ebx):
      instruction # entries     = 0x0 (0)
      instruction associativity = 0x0 (0)
      data # entries            = 0x0 (0)
      data associativity        = 0x0 (0)
   L1 data cache information (0x80000005/ecx):
      line size (bytes) = 0x0 (0)
      lines per tag     = 0x0 (0)
      associativity     = 0x0 (0)
      size (Kb)         = 0x0 (0)
   L1 instruction cache information (0x80000005/edx):
      line size (bytes) = 0x0 (0)
      lines per tag     = 0x0 (0)
      associativity     = 0x0 (0)
      size (Kb)         = 0x0 (0)
   L2 TLB/cache information: 2M/4M pages & L2 TLB (0x80000006/eax):
      instruction # entries     = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries            = 0x0 (0)
      data associativity        = L2 off (0)
   L2 TLB/cache information: 4K pages & L2 TLB (0x80000006/ebx):
      instruction # entries     = 0x0 (0)
      instruction associativity = L2 off (0)
      data # entries            = 0x0 (0)
      data associativity        = L2 off (0)
   L2 unified cache information (0x80000006/ecx):
      line size (bytes) = 0x40 (64)
      lines per tag     = 0x0 (0)
      associativity     = 8-way (6)
      size (Kb)         = 0x100 (256)
   L3 cache information (0x80000006/edx):
      line size (bytes)     = 0x0 (0)
      lines per tag         = 0x0 (0)
      associativity         = L2 off (0)
      size (in 512Kb units) = 0x0 (0)
   Advanced Power Management Features (0x80000007/edx):
      temperature sensing diode      = false
      frequency ID (FID) control     = false
      voltage ID (VID) control       = false
      thermal trip (TTP)             = false
      thermal monitor (TM)           = false
      software thermal control (STC) = false
      100 MHz multiplier control     = false
      hardware P-State control       = false
      TscInvariant                   = true
   Physical Address and Linear Address Size (0x80000008/eax):
      maximum physical address bits         = 0x2c (44)
      maximum linear (virtual) address bits = 0x30 (48)
      maximum guest physical address bits   = 0x0 (0)
   Logical CPU cores (0x80000008/ecx):
      number of CPU cores - 1 = 0x0 (0)
      ApicIdCoreIdSize        = 0x0 (0)
   (multi-processing synth): multi-core (c=6), hyper-threaded (t=2)
   (multi-processing method): Intel leaf 0xb
   (APIC widths synth): CORE_width=6 SMT_width=1
   (APIC synth): PKG_ID=0 CORE_ID=0 SMT_ID=0
   (synth) = Intel Xeon E7-8800 / Xeon E7-4800 / Xeon E7-2800 (Westmere-EX A2), 32nm

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
       [not found]             ` <CAJvTdK=vJxYgtLOYZZPrwGNgYQrFVeCq18RwzEfh5n_tZyeP9g@mail.gmail.com>
@ 2014-01-10 10:26               ` Mel Gorman
  2014-01-10 14:38                 ` Mel Gorman
  0 siblings, 1 reply; 11+ messages in thread
From: Mel Gorman @ 2014-01-10 10:26 UTC (permalink / raw)
  To: Len Brown
  Cc: Greg KH, athorlton, Rik van Riel, chegu_vinod, Len Brown,
	H. Peter Anvin, LKML, stable

On Fri, Jan 10, 2014 at 01:04:55AM -0500, Len Brown wrote:
> Hi Mel,
> 
> I downloaded ebizzy and ran on an 80-thread WSM-EX.

Default parameters? If so, the default is 2xNR_CPUs. My initial tests only
ran up to NR_CPUs but I was seeing regressions throughout so I doubt it
levelled out for higher numbers of clients.

I used mmtests to run ebizzy based on the
configs/config-global-dhp__pagealloc-performance config file with the
following relevant lines changed just for the bisection itself

export MMTESTS="ebizzy"
export EBIZZY_MAX_THREADS=5
export EBIZZY_DURATION=20
export EBIZZY_ITERATIONS=3

Even though the test ran up to 5 threads, I was only using the result
for 4 threads for the bisection.
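
As a rough illustration of how a 4-thread mean turns into a good/bad call
at each bisection step, here is a minimal sketch. The 2317 and 1620 figures
come from the results quoted earlier in the thread; the midpoint cutoff and
both function names are assumptions made for the example, not something
mmtests provides.

```python
# Hypothetical helper: classify one bisection step from its mean-4 figure.
GOOD_BASELINE = 2317.0  # v3.12.6 records/sec, reported as "good"
BAD_TYPICAL = 1620.0    # typical "bad" figure from the bisection log

# Assumed cutoff: the midpoint between the two clusters (not from the thread).
THRESHOLD = (GOOD_BASELINE + BAD_TYPICAL) / 2


def classify(mean_records_per_sec, threshold=THRESHOLD):
    """Return "good" or "bad" for a measured records/sec figure."""
    return "good" if mean_records_per_sec >= threshold else "bad"


def bisect_exit_code(mean_records_per_sec):
    """Map the verdict to the status `git bisect run` expects (0 good, 1 bad)."""
    return 0 if classify(mean_records_per_sec) == "good" else 1
```

With the two clusters this far apart, any cutoff between roughly 1700 and
2200 would give the same verdicts, which is why the results look stable.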

> But I got quite different numbers than you, so I'm wondering if there is
> something special I need to do to get the same results you see.  I
> generally see scores around 6900 - 7000.
> My reference kernel is built on top of
> b0031f227e47919797dc0e1c1990f3ef151ff0cc
> which is upstream as of 12/17, which is when I wrote that patch -- if it
> matters.
> 
> But worse, I don't see any difference in ebizzy performance with/without
> the CLFLUSH patch.
> 
> Please let me know what I can do to reproduce the results you see.
> 

You could try running within mmtests and see what falls out? I don't think
I am doing anything weird in there but it wouldn't be the first time there
was a mistake in testing methodology that led to inconsistent results
between testers.

git clone https://github.com/gormanm/mmtests
cd mmtests
vi configs/config-global-dhp__pagealloc-performance
# edit file to set the lines above to match my bisection
./run-mmtests.sh --no-monitor --config configs/config-global-dhp__pagealloc-performance baseline
# boot new kernel
./run-mmtests.sh --no-monitor --config configs/config-global-dhp__pagealloc-performance patched
cd work/log
../../compare-kernels.sh

Of course, we could also be differing on kernel config in some relevant
way or it might be some other unfortunate timing issue.

> Also, can you try this attached incremental patch to see if it helps?

I'll fire it up after pushing send on this mail.

Thanks

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
  2014-01-10 10:26               ` Mel Gorman
@ 2014-01-10 14:38                 ` Mel Gorman
  0 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2014-01-10 14:38 UTC (permalink / raw)
  To: Len Brown
  Cc: Greg KH, athorlton, Rik van Riel, chegu_vinod, Len Brown,
	H. Peter Anvin, LKML, stable

On Fri, Jan 10, 2014 at 10:26:17AM +0000, Mel Gorman wrote:
> > Also, can you try this attached incremental patch to see if it helps?
> 
> I'll fire it up after pushing send on this mail.
> 

Relevant parts of the mmtests config file used

export MMTESTS="ebizzy"
export EBIZZY_MAX_THREADS=$((NUM_CPU*2))
export EBIZZY_DURATION=10
export EBIZZY_ITERATIONS=10

Three kernels were tested: vanilla is the unmodified kernel, fixidle is
your patch and revert is a revert of 40e2d7f9b5dae048789c64672bf3027fbb663ffa

ebizzy
                     3.13.0-rc7            3.13.0-rc7            3.13.0-rc7
                        vanilla          fixidle-v1r1           revert-v1r1
Mean   1      3153.70 (  0.00%)     3271.40 (  3.73%)     3170.40 (  0.53%)
Mean   2      1725.90 (  0.00%)     1714.90 ( -0.64%)     2366.90 ( 37.14%)
Mean   3      1659.70 (  0.00%)     1654.30 ( -0.33%)     2301.10 ( 38.65%)
Mean   4      1636.10 (  0.00%)     1638.20 (  0.13%)     2287.30 ( 39.80%)
Mean   5      1586.90 (  0.00%)     1598.80 (  0.75%)     2312.10 ( 45.70%)
Mean   6      1544.00 (  0.00%)     1555.00 (  0.71%)     2264.00 ( 46.63%)
Mean   7      1543.10 (  0.00%)     1543.70 (  0.04%)     2296.50 ( 48.82%)
Mean   8      1541.70 (  0.00%)     1548.40 (  0.43%)     2284.70 ( 48.19%)
Mean   12     1542.50 (  0.00%)     1550.80 (  0.54%)     2268.20 ( 47.05%)
Mean   16     1543.00 (  0.00%)     1546.10 (  0.20%)     2261.70 ( 46.58%)
Mean   20     1541.60 (  0.00%)     1551.20 (  0.62%)     2262.60 ( 46.77%)
Mean   24     1548.00 (  0.00%)     1546.20 ( -0.12%)     2240.30 ( 44.72%)
Mean   28     1537.00 (  0.00%)     1544.20 (  0.47%)     2172.40 ( 41.34%)
Mean   32     1542.70 (  0.00%)     1552.70 (  0.65%)     2118.80 ( 37.34%)
Mean   36     1538.70 (  0.00%)     1548.80 (  0.66%)     2074.40 ( 34.82%)
Mean   40     1536.40 (  0.00%)     1539.90 (  0.23%)     2041.20 ( 32.86%)
Mean   44     1535.00 (  0.00%)     1542.90 (  0.51%)     2011.60 ( 31.05%)
Mean   48     1534.90 (  0.00%)     1544.00 (  0.59%)     2002.60 ( 30.47%)
Mean   52     1530.40 (  0.00%)     1531.90 (  0.10%)     1994.80 ( 30.35%)
Mean   56     1531.50 (  0.00%)     1527.10 ( -0.29%)     1980.90 ( 29.34%)
Mean   60     1528.90 (  0.00%)     1527.40 ( -0.10%)     1995.60 ( 30.53%)
Mean   64     1527.10 (  0.00%)     1526.50 ( -0.04%)     1985.50 ( 30.02%)
Mean   68     1527.80 (  0.00%)     1522.50 ( -0.35%)     1983.70 ( 29.84%)
Mean   72     1524.50 (  0.00%)     1523.50 ( -0.07%)     1976.70 ( 29.66%)
Mean   76     1520.80 (  0.00%)     1525.20 (  0.29%)     1964.10 ( 29.15%)
Mean   80     1522.30 (  0.00%)     1519.30 ( -0.20%)     1966.20 ( 29.16%)
Mean   84     1522.60 (  0.00%)     1520.00 ( -0.17%)     1948.30 ( 27.96%)
Mean   88     1521.40 (  0.00%)     1521.40 (  0.00%)     1949.00 ( 28.11%)
Mean   92     1515.80 (  0.00%)     1517.10 (  0.09%)     1938.00 ( 27.85%)
Mean   96     1516.00 (  0.00%)     1517.40 (  0.09%)     1930.50 ( 27.34%)

The latest patch makes little difference. Reverting makes a massive
difference. I didn't include the standard deviations but they are very
small and the performance gain from the revert is far outside the noise.
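
For anyone checking the table, the bracketed percentages are consistent
with a plain relative change against the vanilla column. A minimal sketch,
assuming that is how the comparison is computed (the function name is made
up for the example):

```python
def percent_gain(baseline, value):
    """Relative change versus baseline in percent, for higher-is-better metrics."""
    return round((value - baseline) / baseline * 100.0, 2)


# Mean 4 row: vanilla 1636.10 records/sec versus revert 2287.30 records/sec.
print(percent_gain(1636.10, 2287.30))  # 39.8
```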

ebizzy Thread spread
                     3.13.0-rc7            3.13.0-rc7            3.13.0-rc7
                        vanilla          fixidle-v1r1           revert-v1r1
Mean   1         0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Mean   2        39.80 (  0.00%)      203.40 (-411.06%)        0.10 ( 99.75%)
Mean   3        90.90 (  0.00%)       70.10 ( 22.88%)        0.30 ( 99.67%)
Mean   4        83.00 (  0.00%)       79.30 (  4.46%)        0.30 ( 99.64%)
Mean   5        16.20 (  0.00%)       37.90 (-133.95%)        0.40 ( 97.53%)
Mean   6        54.40 (  0.00%)       45.40 ( 16.54%)        0.50 ( 99.08%)
Mean   7        45.90 (  0.00%)       37.90 ( 17.43%)        0.30 ( 99.35%)
Mean   8        36.30 (  0.00%)       43.90 (-20.94%)        0.50 ( 98.62%)
Mean   12       31.80 (  0.00%)       29.80 (  6.29%)        0.60 ( 98.11%)
Mean   16       26.20 (  0.00%)       26.70 ( -1.91%)        0.70 ( 97.33%)
Mean   20       20.70 (  0.00%)       19.00 (  8.21%)        1.00 ( 95.17%)
Mean   24       20.10 (  0.00%)       18.00 ( 10.45%)        1.40 ( 93.03%)
Mean   28       17.50 (  0.00%)       15.80 (  9.71%)        3.50 ( 80.00%)
Mean   32       15.50 (  0.00%)       16.20 ( -4.52%)        4.20 ( 72.90%)
Mean   36       14.60 (  0.00%)       14.60 (  0.00%)        3.80 ( 73.97%)
Mean   40       13.40 (  0.00%)       12.50 (  6.72%)        3.70 ( 72.39%)
Mean   44       12.20 (  0.00%)       13.50 (-10.66%)        3.20 ( 73.77%)
Mean   48       11.80 (  0.00%)       13.00 (-10.17%)        2.70 ( 77.12%)
Mean   52       11.10 (  0.00%)       11.20 ( -0.90%)        2.60 ( 76.58%)
Mean   56       10.00 (  0.00%)       10.50 ( -5.00%)        2.10 ( 79.00%)
Mean   60       10.00 (  0.00%)       10.00 (  0.00%)        2.30 ( 77.00%)
Mean   64        9.30 (  0.00%)        9.30 (  0.00%)        2.60 ( 72.04%)
Mean   68        9.80 (  0.00%)        9.70 (  1.02%)        2.00 ( 79.59%)
Mean   72        9.80 (  0.00%)        9.00 (  8.16%)        2.00 ( 79.59%)
Mean   76        8.80 (  0.00%)        9.60 ( -9.09%)        2.00 ( 77.27%)
Mean   80        8.20 (  0.00%)        8.40 ( -2.44%)        2.00 ( 75.61%)
Mean   84        8.30 (  0.00%)        8.00 (  3.61%)        2.10 ( 74.70%)
Mean   88        8.20 (  0.00%)        7.90 (  3.66%)        2.00 ( 75.61%)
Mean   92        8.40 (  0.00%)        7.50 ( 10.71%)        1.90 ( 77.38%)
Mean   96        8.10 (  0.00%)        7.60 (  6.17%)        2.20 ( 72.84%)

This shows the difference in performance between threads. It's
interesting to note that reverting the patch gives almost equal
performance to each thread.
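
Note that the sign convention in this table is flipped relative to the
throughput table: a smaller spread is better, so a positive percentage means
the spread shrank. A minimal sketch that reproduces the bracketed figures,
assuming that convention (the function name and the zero-baseline handling
are assumptions for the example):

```python
def percent_reduction(baseline, value):
    """Relative reduction versus baseline in percent, for lower-is-better metrics."""
    if baseline == 0:
        # Assumed handling: the single-thread row has no spread to compare.
        return 0.0
    return round((baseline - value) / baseline * 100.0, 2)
```

This is why the 2-thread fixidle figure shows up as a large negative number:
the spread went from 39.80 to 203.40, which is a 411.06% increase.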

It's worth noting that automatic NUMA balancing is enabled and active
during these tests which would be one large potential difference between
our configs. I do not think it would be enough to explain the large
performance differences though.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
  2014-01-08 13:48         ` Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches) Mel Gorman
  2014-01-09  4:17           ` Greg KH
  2014-01-09 20:07           ` Len Brown
@ 2014-01-13 19:24           ` Mel Gorman
  2014-01-13 21:12             ` Greg KH
  2014-01-14  7:31             ` Len Brown
  2 siblings, 2 replies; 11+ messages in thread
From: Mel Gorman @ 2014-01-13 19:24 UTC (permalink / raw)
  To: Len Brown
  Cc: athorlton, riel, chegu_vinod, Greg KH, H. Peter Anvin, LKML, stable

On Wed, Jan 08, 2014 at 01:48:58PM +0000, Mel Gorman wrote:
> Adding LKML to the list as this -stable snifftest has identified an
> upstream regression.
> 

This is a false alarm.

The test machine in question was originally installed from a beta version
of openSUSE 13.1. By default it included a package that set default malloc
parameters, which I was not aware of. Normally the package is there to
catch bugs during beta testing and is removed before a GA release, but it's
left in place if a user does a distribution update.

With the debugging RPM installed, the free paths contended on a global
mutex in glibc.  Ebizzy had been classified as a CPU-intensive and memory
free-intensive benchmark (not that common) but turbostat showed that the
CPUs spent over 95% of the time in C6 and mpstat verified that the CPUs
were mostly idle. It did not take long to see that everything was blocked
waiting on a futex and to identify where it was in glibc. It's only a
factor when malloc debugging is enabled so normally people would not see it.
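
For anyone wanting a quick cross-check without turbostat, the kernel's own
idle accounting is exposed through the cpuidle sysfs files. The sketch below
takes two snapshots of the cumulative per-state residency (the "time" file,
in microseconds) and converts the difference to a percentage. It is only an
illustration: the function names are made up, and the numbers are the
kernel's bookkeeping rather than the hardware residency counters that
turbostat reads, so the two can differ.

```python
from pathlib import Path


def read_cpuidle_times(cpu=0, root="/sys/devices/system/cpu"):
    """Return {state_name: cumulative_residency_us} for one CPU."""
    times = {}
    for state in sorted(Path(root, f"cpu{cpu}", "cpuidle").glob("state*")):
        name = (state / "name").read_text().strip()
        times[name] = int((state / "time").read_text())
    return times


def residency_percent(before, after, interval_us):
    """Percent of the interval spent in each idle state between two snapshots."""
    return {
        name: 100.0 * (after[name] - before[name]) / interval_us
        for name in after
        if name in before
    }
```

Taking one snapshot, sleeping for a known interval while the benchmark runs,
then taking a second snapshot and passing both to residency_percent() gives
a per-state breakdown for that CPU.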

The "regression" is because CPUs are reaching C6 as they should and there
is a delay when exiting it. This is behaving as designed and fixing this
would involve doing something stupid. Once the problem RPM was removed
ebizzy performed as expected. 3.13-rc7, the revert and forcing max_cstate=1
all have similar performance.

Sorry about the noise.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
  2014-01-13 19:24           ` Mel Gorman
@ 2014-01-13 21:12             ` Greg KH
  2014-01-14  7:31             ` Len Brown
  1 sibling, 0 replies; 11+ messages in thread
From: Greg KH @ 2014-01-13 21:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Len Brown, athorlton, riel, chegu_vinod, H. Peter Anvin, LKML, stable

On Mon, Jan 13, 2014 at 07:24:06PM +0000, Mel Gorman wrote:
> On Wed, Jan 08, 2014 at 01:48:58PM +0000, Mel Gorman wrote:
> > Adding LKML to the list as this -stable snifftest has identified an
> > upstream regression.
> > 
> 
> This is a false alarm.

Thanks for tracking this down and letting us know.

greg k-h

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
  2014-01-13 19:24           ` Mel Gorman
  2014-01-13 21:12             ` Greg KH
@ 2014-01-14  7:31             ` Len Brown
  2014-01-14  8:01               ` Mike Galbraith
  1 sibling, 1 reply; 11+ messages in thread
From: Len Brown @ 2014-01-14  7:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Len Brown, athorlton, Rik van Riel, chegu_vinod, Greg KH,
	H. Peter Anvin, LKML, stable

> This is a false alarm.

Thanks for the follow-up, Mel.

Agreed, it makes no sense for ebizzy to measure 'throughput' when a
library debug bottleneck prevents it from scaling past 3% CPU utilization.

Still, the broken configuration did find a difference due to the
addition of CLFLUSH on this box. It makes me wonder if we will find
issues on workloads that depend on the latency of idle entry/exit, or
that are sensitive to the state of the cache line containing
thread_info->flags.

If somebody runs into such a workload, please try changing this one line
of intel_idle.c to limit the CLFLUSH to C-states deeper than C1E, and
let me know what you see.

- if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+ if ((eax > 1) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
          clflush((void *)&current_thread_info()->flags);
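
To spell out what the eax > 1 test does: eax is the MWAIT hint for the
target state, with the C-state index in bits 7:4 and the sub-state in bits
3:0. The sketch below models the proposed condition in Python. The hint
table is an assumption based on typical Nehalem/Westmere values, and the
function names are made up for the example, so check the table in your own
tree.

```python
# Assumed MWAIT hints for a Nehalem/Westmere-style table (verify in your tree).
ASSUMED_HINTS = {"C1": 0x00, "C1E": 0x01, "C3": 0x10, "C6": 0x20}


def decode_mwait_hint(eax):
    """Split an MWAIT hint into (cstate_index, substate)."""
    return (eax >> 4) & 0xF, eax & 0xF


def would_clflush(eax, has_clflush_monitor=True):
    """Model of the proposed check: flush only for hints above C1E."""
    return eax > 1 and has_clflush_monitor
```

Under those assumed hints, C1 and C1E skip the flush while C3 and C6 still
get it.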

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
  2014-01-14  7:31             ` Len Brown
@ 2014-01-14  8:01               ` Mike Galbraith
  2014-01-14  8:24                 ` Mike Galbraith
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Galbraith @ 2014-01-14  8:01 UTC (permalink / raw)
  To: Len Brown
  Cc: Mel Gorman, Len Brown, athorlton, Rik van Riel, chegu_vinod,
	Greg KH, H. Peter Anvin, LKML, stable

On Tue, 2014-01-14 at 02:31 -0500, Len Brown wrote: 
> > This is a false alarm.
> 
> Thanks for the follow-up, Mel.
> 
> Agreed, it makes no sense for ebizzy to measure 'throughput', when a
> library debug bottleneck
> prevents it from scaling past 3% CPU utilization.
> 
> Still, the broken configuration did find a difference due to the
> addition of CLFLUSH on this box.
> It makes me wonder if we will find issues on workloads that may depend
> on the latency
> of idle entry/exit, or perhaps sensitivity to the state of the cache
> line containing thread_info->flags.
> 
> If somebody runs into such a workload, please try changing this 1 line
> of intel_idle.c to limit
> the CLFLUSH to C-states deeper than C1E, and let me know what you see.
> 
> - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> + if ((eax > 1) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
>           clflush((void *)&current_thread_info()->flags);

Hm, seems any high frequency switcher scheduling cross-core (pipe-test,
or maybe a tbench pair) should show the cost to an affected box.

-Mike


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Idle power fix regresses ebizzy performance (was 3.12-stable backport of NUMA balancing patches)
  2014-01-14  8:01               ` Mike Galbraith
@ 2014-01-14  8:24                 ` Mike Galbraith
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Galbraith @ 2014-01-14  8:24 UTC (permalink / raw)
  To: Len Brown
  Cc: Mel Gorman, Len Brown, athorlton, Rik van Riel, chegu_vinod,
	Greg KH, H. Peter Anvin, LKML, stable

On Tue, 2014-01-14 at 09:01 +0100, Mike Galbraith wrote: 
> On Tue, 2014-01-14 at 02:31 -0500, Len Brown wrote: 
> > > This is a false alarm.
> > 
> > Thanks for the follow-up, Mel.
> > 
> > Agreed, it makes no sense for ebizzy to measure 'throughput', when a
> > library debug bottleneck
> > prevents it from scaling past 3% CPU utilization.
> > 
> > Still, the broken configuration did find a difference due to the
> > addition of CLFLUSH on this box.
> > It makes me wonder if we will find issues on workloads that may depend
> > on the latency
> > of idle entry/exit, or perhaps sensitivity to the state of the cache
> > line containing thread_info->flags.
> > 
> > If somebody runs into such a workload, please try changing this 1 line
> > of intel_idle.c to limit
> > the CLFLUSH to C-states deeper than C1E, and let me know what you see.
> > 
> > - if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> > + if ((eax > 1) && this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> >           clflush((void *)&current_thread_info()->flags);
> 
> Hm, seems any high frequency switcher scheduling cross-core (pipe-test,
> or maybe a tbench pair) should show the cost to an affected box.

Oh yeah.. :) unless of course it's a Q6600 (poke poke).


^ permalink raw reply	[flat|nested] 11+ messages in thread
