linux-mm.kvack.org archive mirror
* Re: [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
       [not found] <bug-64121-27@https.bugzilla.kernel.org/>
@ 2013-10-31 20:46 ` Andrew Morton
  2013-11-01 18:43   ` Johannes Weiner
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2013-10-31 20:46 UTC (permalink / raw)
  To: thomas.jarosch; +Cc: bugzilla-daemon, linux-mm, Johannes Weiner


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 31 Oct 2013 10:53:47 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=64121
> 
>             Bug ID: 64121
>            Summary: [BISECTED] "mm" performance regression updating from
>                     3.2 to 3.3
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 3.3
>           Hardware: i386
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: akpm@linux-foundation.org
>           Reporter: thomas.jarosch@intra2net.com
>         Regression: No
> 
> Created attachment 112881
>   --> https://bugzilla.kernel.org/attachment.cgi?id=112881&action=edit
> Dmesg output
> 
> Hi,
> 
> I've updated a production box running kernel 3.0.x to 3.4.67.
> This caused a severe I/O performance regression.
> 
> After some hours I've bisected it down to this commit:
> 
> ---------------------------
> # git bisect good
> ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d is the first bad commit
> commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> Author: Johannes Weiner <jweiner@redhat.com>
> Date:   Tue Jan 10 15:07:42 2012 -0800
> 
>     mm: exclude reserved pages from dirtyable memory
> 
>     Per-zone dirty limits try to distribute page cache pages allocated for
>     writing across zones in proportion to the individual zone sizes, to reduce
>     the likelihood of reclaim having to write back individual pages from the
>     LRU lists in order to make progress.
> 
>     ...
> ---------------------------
> 
> With the "problematic" patch:
> # dd_rescue -A /dev/zero img.disk
> dd_rescue: (info): ipos:     15296.0k, opos:     15296.0k, xferd:     15296.0k
>                    errs:      0, errxfer:         0.0k, succxfer:     15296.0k
>              +curr.rate:      681kB/s, avg.rate:      681kB/s, avg.load:  0.3%
> 
> 
> Without the patch (using 25bd91bd27820d5971258cecd1c0e64b0e485144):
> # dd_rescue -A /dev/zero img.disk
> dd_rescue: (info): ipos:    293888.0k, opos:    293888.0k, xferd:    293888.0k
>                    errs:      0, errxfer:         0.0k, succxfer:    293888.0k
>              +curr.rate:    99935kB/s, avg.rate:    51625kB/s, avg.load:  3.3%
> 
> 
> 
> The kernel is 32bit using PAE mode. The system has 32GB of RAM.
> (compiled with "gcc (GCC) 4.4.4 20100630 (Red Hat 4.4.4-10)")
> 
> Interestingly, if I limit the amount of RAM to roughly 20GB
> via the "mem=20000m" boot parameter, the performance is fine.
> When I increase it to e.g. "mem=23000m", performance is bad.
> 
> Also tested kernel 3.10.17 in 32bit + PAE mode;
> it was fine out of the box.
> 
> 
> So basically we need a fix for the 3.4 LTS kernel; I can work around
> this issue with "mem=20000m" until I upgrade to 3.10.
> 
> I'll probably have access to the hardware for one more week
> to test patches; it was lent to me to debug this specific problem.
> 
> The same issue appeared on a completely different machine in July
> using the same 3.4.x kernel. The box had 16GB of RAM.
> I didn't get a chance to access the hardware back then.
> 
> Attached is the dmesg output and my kernel config.

32GB of memory on a highmem machine just isn't going to work well,
sorry.  Our rule of thumb is that 16G is the max.  If it was previously
working OK with 32G then you were very lucky!

That being said, we should try to work out exactly why that commit
caused the big slowdown - perhaps there is something we can do to
restore things.  It appears that the (small?) increase in the per-zone
dirty limit is what kicked things over - perhaps we can permit that to
be tuned back again.  Or something.  Johannes, could you please have a
think about it?



* Re: [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
  2013-10-31 20:46 ` [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3 Andrew Morton
@ 2013-11-01 18:43   ` Johannes Weiner
  2013-11-04 11:32     ` Thomas Jarosch
  2016-07-18 22:23     ` Thomas Jarosch
  0 siblings, 2 replies; 9+ messages in thread
From: Johannes Weiner @ 2013-11-01 18:43 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Linus Torvalds, thomas.jarosch, bugzilla-daemon, linux-mm

On Thu, Oct 31, 2013 at 01:46:10PM -0700, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Thu, 31 Oct 2013 10:53:47 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=64121
> > 
> >             Bug ID: 64121
> >            Summary: [BISECTED] "mm" performance regression updating from
> >                     3.2 to 3.3
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 3.3
> >           Hardware: i386
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Other
> >           Assignee: akpm@linux-foundation.org
> >           Reporter: thomas.jarosch@intra2net.com
> >         Regression: No
> > 
> > Created attachment 112881
> >   --> https://bugzilla.kernel.org/attachment.cgi?id=112881&action=edit
> > Dmesg output
> > 
> > Hi,
> > 
> > I've updated a production box running kernel 3.0.x to 3.4.67.
> > This caused a severe I/O performance regression.
> > 
> > After some hours I've bisected it down to this commit:
> > 
> > ---------------------------
> > # git bisect good
> > ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d is the first bad commit
> > commit ab8fabd46f811d5153d8a0cd2fac9a0d41fb593d
> > Author: Johannes Weiner <jweiner@redhat.com>
> > Date:   Tue Jan 10 15:07:42 2012 -0800
> > 
> >     mm: exclude reserved pages from dirtyable memory
> > 
> >     Per-zone dirty limits try to distribute page cache pages allocated for
> >     writing across zones in proportion to the individual zone sizes, to reduce
> >     the likelihood of reclaim having to write back individual pages from the
> >     LRU lists in order to make progress.
> > 
> >     ...
> > ---------------------------
> > 
> > With the "problematic" patch:
> > # dd_rescue -A /dev/zero img.disk
> > dd_rescue: (info): ipos:     15296.0k, opos:     15296.0k, xferd:     15296.0k
> >                    errs:      0, errxfer:         0.0k, succxfer:     15296.0k
> >              +curr.rate:      681kB/s, avg.rate:      681kB/s, avg.load:  0.3%
> > 
> > 
> > Without the patch (using 25bd91bd27820d5971258cecd1c0e64b0e485144):
> > # dd_rescue -A /dev/zero img.disk
> > dd_rescue: (info): ipos:    293888.0k, opos:    293888.0k, xferd:    293888.0k
> >                    errs:      0, errxfer:         0.0k, succxfer:    293888.0k
> >              +curr.rate:    99935kB/s, avg.rate:    51625kB/s, avg.load:  3.3%
> > 
> > 
> > 
> > The kernel is 32bit using PAE mode. The system has 32GB of RAM.
> > (compiled with "gcc (GCC) 4.4.4 20100630 (Red Hat 4.4.4-10)")
> > 
> > Interestingly, if I limit the amount of RAM to roughly 20GB
> > via the "mem=20000m" boot parameter, the performance is fine.
> > When I increase it to e.g. "mem=23000m", performance is bad.
> > 
> > Also tested kernel 3.10.17 in 32bit + PAE mode;
> > it was fine out of the box.
> > 
> > 
> > So basically we need a fix for the 3.4 LTS kernel; I can work around
> > this issue with "mem=20000m" until I upgrade to 3.10.
> > 
> > I'll probably have access to the hardware for one more week
> > to test patches; it was lent to me to debug this specific problem.
> > 
> > The same issue appeared on a completely different machine in July
> > using the same 3.4.x kernel. The box had 16GB of RAM.
> > I didn't get a chance to access the hardware back then.
> > 
> > Attached is the dmesg output and my kernel config.
> 
> 32GB of memory on a highmem machine just isn't going to work well,
> sorry.  Our rule of thumb is that 16G is the max.  If it was previously
> working OK with 32G then you were very lucky!
> 
> That being said, we should try to work out exactly why that commit
> caused the big slowdown - perhaps there is something we can do to
> restore things.  It appears that the (small?) increase in the per-zone
> dirty limit is what kicked things over - perhaps we can permit that to
> be tuned back again.  Or something.  Johannes, could you please have a
> think about it?

It is a combination of two separate things on these setups.

Traditionally, only lowmem is considered dirtyable so that dirty pages
don't scale with highmem and the kernel doesn't overburden itself with
lowmem pressure from buffers etc.  This is purely about accounting.

My patches on the other hand were about dirty page placement and
avoiding writeback from page reclaim: by subtracting the watermark and
the lowmem reserve (memory not available for user memory / cache) from
each zone's dirtyable memory, we make sure that the zone can always be
rebalanced without writeback.

The problem now is that the lowmem reserves scale with highmem and
there is a point where they entirely overshadow the Normal zone.  This
means that no page cache at all is allowed in lowmem.  Combine this
with how dirtyable memory excludes highmem, and the sum of all
dirtyable memory is nil.  This effectively disables the writeback
cache.
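
To make the failure mode concrete, here is a minimal sketch of the
accounting described above, using illustrative numbers (the ~900M Normal
zone, the watermark and ~31G of HighMem are assumptions for a 32GB PAE
box, not values taken from this report):

# Per-zone dirtyable memory = managed memory - watermark - lowmem reserve,
# clamped at zero; HighMem itself is excluded from the sum entirely.
normal_kb=$(( 900 * 1024 ))                  # assumed Normal zone size
watermark_kb=$(( 4 * 1024 ))                 # assumed watermark
reserve_kb=$(( 31 * 1024 * 1024 / 32 ))      # lowmem reserve = highmem/32

dirtyable_kb=$(( normal_kb - watermark_kb - reserve_kb ))
[ "$dirtyable_kb" -lt 0 ] && dirtyable_kb=0

echo "dirtyable memory: ${dirtyable_kb} kB"  # prints 0: no writeback cache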

I figure that if anything should be fixed, it is the full exclusion
of highmem from dirtyable memory; we should find a better way to
calculate a minimum.

HOWEVER,

the lowmem reserve is highmem/32 by default.  With a Normal zone of
around 900M, this requires 28G+ worth of HighMem to eclipse lowmem
entirely.  This is almost double what you consider still okay...
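
A quick back-of-the-envelope check of that crossover, again assuming the
~900M Normal zone mentioned above:

# reserve = highmem/32, so it eclipses a ~900M Normal zone once HighMem
# exceeds roughly 900M * 32:
echo "$(( 900 * 32 )) MB"    # 28800 MB, i.e. the 28G+ mentioned above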

So how would we even pick a sane minimum of dirtyable memory on these
machines?  It's impossible to pick something and say this should work
for most people; those setups are barely working to begin with.  Plus,
people can always set the vm.highmem_is_dirtyable sysctl to 1 or just
set dirty memory limits with dirty_bytes and dirty_background_bytes to
something that gets their crazy setups limping again.
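
For reference, a sketch of those two workarounds as they could be applied
on such a box (the byte limits below are placeholder values to illustrate
the knobs, not recommendations):

# Let highmem count towards dirtyable memory again:
sysctl -w vm.highmem_is_dirtyable=1

# ...or pin absolute dirty limits instead of the ratio-based defaults:
sysctl -w vm.dirty_bytes=$(( 200 * 1024 * 1024 ))            # example: 200 MB
sysctl -w vm.dirty_background_bytes=$(( 50 * 1024 * 1024 ))  # example: 50 MB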

Maybe we should just ignore everything above 16G on 32 bit, but that
would mean actively breaking setups that _individually_ worked before
and never actually hit problems due to their specific circumstances.

On the other hand, I don't think it's reasonable to support this
anymore and it should be more clear that people doing these things are
on their own.

What makes it worse is that all of these reports have been modern 64
bit machines, with modern amounts of memory, running 32 bit kernels.
I'd be more inclined to seriously look into this if it were hardware
that couldn't just run a 64 bit kernel...


* Re: [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
  2013-11-01 18:43   ` Johannes Weiner
@ 2013-11-04 11:32     ` Thomas Jarosch
  2016-07-18 22:23     ` Thomas Jarosch
  1 sibling, 0 replies; 9+ messages in thread
From: Thomas Jarosch @ 2013-11-04 11:32 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Linus Torvalds, bugzilla-daemon, linux-mm

On Friday, 1. November 2013 14:43:32 Johannes Weiner wrote:
> Maybe we should just ignore everything above 16G on 32 bit, but that
> would mean actively breaking setups that _individually_ worked before
> and never actually hit problems due to their specific circumstances.
> 
> On the other hand, I don't think it's reasonable to support this
> anymore and it should be more clear that people doing these things are
> on their own.
> 
> What makes it worse is that all of these reports have been modern 64
> bit machines, with modern amounts of memory, running 32 bit kernels.
> I'd be more inclined to seriously look into this if it were hardware
> that couldn't just run a 64 bit kernel...

Thanks for your detailed analysis!

It's good to know the exact cause of this. Other people with
the same symptoms can now stumble upon this problem report.

We run the same distribution on 32 bit and 64 bit CPUs; that's why we've
avoided upgrading to 64 bit so far. For our purposes, 16 GB of RAM is more
than enough, so I've implemented a small hack to limit the memory to 16 GB.
That gives way better performance than e.g. a memory limit of 20 GB.


Limit to 20 GB (for comparison):
# dd_rescue /dev/zero disk.img
dd_rescue: (info): ipos:    293888.0k, opos:    293888.0k, xferd:    293888.0k
                   errs:      0, errxfer:         0.0k, succxfer:    293888.0k
             +curr.rate:    99935kB/s, avg.rate:    51625kB/s, avg.load:  3.3%


With the new 16GB limit:
dd_rescue: (info): ipos:   1638400.0k, opos:   1638400.0k, xferd:   1638400.0k
                   errs:      0, errxfer:         0.0k, succxfer:   1638400.0k
             +curr.rate:    83685kB/s, avg.rate:    81205kB/s, avg.load:  6.1%


-> Limiting to 16GB with an "override" boot parameter for people
who really need more RAM might be a good idea even for mainline.


---hackish patch----------------------------------------------------------
Limit memory to 16 GB. See kernel bugzilla #64121.

diff -u -r -p linux.orig/arch/x86/mm/init_32.c linux.i2n/arch/x86/mm/init_32.c
--- linux.orig/arch/x86/mm/init_32.c	2013-11-04 11:52:55.881152576 +0100
+++ linux.i2n/arch/x86/mm/init_32.c	2013-11-04 11:52:01.309151985 +0100
@@ -621,6 +621,13 @@ void __init highmem_pfn_init(void)
 	}
 #endif /* !CONFIG_HIGHMEM64G */
 #endif /* !CONFIG_HIGHMEM */
+#ifdef CONFIG_HIGHMEM64G
+	/* Intra2net: Limit memory to 16GB */
+	if (max_pfn > MAX_NONPAE_PFN * 4) {
+		max_pfn = MAX_NONPAE_PFN * 4;
+		printk(KERN_WARNING "Limited memory to 16GB. See kernel bugzilla #64121\n");
+	}
+#endif
 }
 
 /*
--------------------------------------------------------------------------

Thanks again for your help,
Thomas


* Re: [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
  2013-11-01 18:43   ` Johannes Weiner
  2013-11-04 11:32     ` Thomas Jarosch
@ 2016-07-18 22:23     ` Thomas Jarosch
  2016-07-21 14:02       ` Vlastimil Babka
  1 sibling, 1 reply; 9+ messages in thread
From: Thomas Jarosch @ 2016-07-18 22:23 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Andrew Morton, Linus Torvalds, bugzilla-daemon, linux-mm

Hi Johannes,

referring to an old kernel bugzilla issue:
https://bugzilla.kernel.org/show_bug.cgi?id=64121

On 01.11.2013 at 19:43, Johannes Weiner wrote:
> It is a combination of two separate things on these setups.
> 
> Traditionally, only lowmem is considered dirtyable so that dirty pages
> don't scale with highmem and the kernel doesn't overburden itself with
> lowmem pressure from buffers etc.  This is purely about accounting.
> 
> My patches on the other hand were about dirty page placement and
> avoiding writeback from page reclaim: by subtracting the watermark and
> the lowmem reserve (memory not available for user memory / cache) from
> each zone's dirtyable memory, we make sure that the zone can always be
> rebalanced without writeback.
> 
> The problem now is that the lowmem reserves scale with highmem and
> there is a point where they entirely overshadow the Normal zone.  This
> means that no page cache at all is allowed in lowmem.  Combine this
> with how dirtyable memory excludes highmem, and the sum of all
> dirtyable memory is nil.  This effectively disables the writeback
> cache.
> 
> I figure that if anything should be fixed, it is the full exclusion
> of highmem from dirtyable memory; we should find a better way to
> calculate a minimum.

Recently we've updated our production mail server from 3.14.69
to 3.14.73 and it worked fine for a few days. When the box is really
busy (= incoming malware via email), the I/O speed drops to a crawl;
write speed is about 5 MB/s on Intel SSDs. Yikes.

The box has 16GB RAM, so it should be a safe HIGHMEM configuration.

Downgrading to 3.14.69 or booting with "mem=15000M" works. I've tested
both approaches and the box was stable. Booting 3.14.73 again triggered
the problem within minutes.

Clearly something in the automatic calculation of the lowmem reserve
crossed a tipping point again, even with 16GB of RAM, which was previously
considered a safe amount for HIGHMEM configs. I don't see anything obvious
in the changelogs from 3.14.69 to 3.14.73, but I might have missed it.

> HOWEVER,
> 
> the lowmem reserve is highmem/32 by default.  With a Normal zone of
> around 900M, this requires 28G+ worth of HighMem to eclipse lowmem
> entirely.  This is almost double what you consider still okay...

Is there a way to read out the calculated lowmem reserve via /proc?

It might be interesting to see the lowmem reserve
when booted with mem=15000M or kernel 3.14.69 for comparison.

Do you think it might be worth tinkering with "lowmem_reserve_ratio"?
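
As a hedged sketch (assuming the standard procfs files of this era are
present), the reserve and its ratio can normally be inspected like this:

# Each zone's lowmem reserve shows up as the "protection:" array, in pages:
grep -E 'zone|protection' /proc/zoneinfo

# The ratio the reserve is derived from; larger values mean a smaller reserve:
cat /proc/sys/vm/lowmem_reserve_ratio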


/proc/meminfo from the box using "mem=15000M" + kernel 3.14.73:

MemTotal:       15001512 kB
HighTotal:      14219160 kB
HighFree:        9468936 kB
LowTotal:         782352 kB
LowFree:          117696 kB
Slab:             430612 kB
SReclaimable:     416752 kB
SUnreclaim:        13860 kB


/proc/meminfo from a similar machine with 16GB RAM + kernel 3.14.73:
(though that machine is just a firewall, so no real disk I/O)

MemTotal:       16407652 kB
HighTotal:      15636376 kB
HighFree:       14415472 kB
LowTotal:         771276 kB
LowFree:          562852 kB
Slab:              34712 kB
SReclaimable:      20888 kB
SUnreclaim:        13824 kB


Any help is appreciated,
Thomas


* Re: [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
  2016-07-18 22:23     ` Thomas Jarosch
@ 2016-07-21 14:02       ` Vlastimil Babka
  2016-07-27  9:18         ` Thomas Jarosch
  0 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2016-07-21 14:02 UTC (permalink / raw)
  To: Thomas Jarosch, Johannes Weiner
  Cc: Andrew Morton, Linus Torvalds, bugzilla-daemon, linux-mm

On 07/19/2016 12:23 AM, Thomas Jarosch wrote:
> Hi Johannes,
>
> referring to an old kernel bugzilla issue:
> https://bugzilla.kernel.org/show_bug.cgi?id=64121
>
> On 01.11.2013 at 19:43, Johannes Weiner wrote:
>> It is a combination of two separate things on these setups.
>>
>> Traditionally, only lowmem is considered dirtyable so that dirty pages
>> don't scale with highmem and the kernel doesn't overburden itself with
>> lowmem pressure from buffers etc.  This is purely about accounting.
>>
>> My patches on the other hand were about dirty page placement and
>> avoiding writeback from page reclaim: by subtracting the watermark and
>> the lowmem reserve (memory not available for user memory / cache) from
>> each zone's dirtyable memory, we make sure that the zone can always be
>> rebalanced without writeback.
>>
>> The problem now is that the lowmem reserves scale with highmem and
>> there is a point where they entirely overshadow the Normal zone.  This
>> means that no page cache at all is allowed in lowmem.  Combine this
>> with how dirtyable memory excludes highmem, and the sum of all
>> dirtyable memory is nil.  This effectively disables the writeback
>> cache.
>>
>> I figure that if anything should be fixed, it is the full exclusion
>> of highmem from dirtyable memory; we should find a better way to
>> calculate a minimum.
>
> Recently we've updated our production mail server from 3.14.69
> to 3.14.73 and it worked fine for a few days. When the box is really
> busy (= incoming malware via email), the I/O speed drops to a crawl;
> write speed is about 5 MB/s on Intel SSDs. Yikes.
>
> The box has 16GB RAM, so it should be a safe HIGHMEM configuration.
>
> Downgrading to 3.14.69 or booting with "mem=15000M" works. I've tested
> both approaches and the box was stable. Booting 3.14.73 again triggered
> the problem within minutes.
>
> Clearly something in the automatic calculation of the lowmem reserve
> crossed a tipping point again, even with 16GB of RAM, which was previously
> considered a safe amount for HIGHMEM configs. I don't see anything obvious
> in the changelogs from 3.14.69 to 3.14.73, but I might have missed it.

I don't see anything either; it might be some change under fs/, for example.
How about a git bisect?

>> HOWEVER,
>>
>> the lowmem reserve is highmem/32 by default.  With a Normal zone of
>> around 900M, this requires 28G+ worth of HighMem to eclipse lowmem
>> entirely.  This is almost double what you consider still okay...
>
> Is there a way to read out the calculated lowmem reserve via /proc?

Probably not, but it might be possible with a live crash session.

> It might be interesting to see the lowmem reserve
> when booted with mem=15000M or kernel 3.14.69 for comparison.
>
> Do you think it might be worth tinkering with "lowmem_reserve_ratio"?
>
>
> /proc/meminfo from the box using "mem=15000M" + kernel 3.14.73:
>
> MemTotal:       15001512 kB
> HighTotal:      14219160 kB
> HighFree:        9468936 kB
> LowTotal:         782352 kB
> LowFree:          117696 kB
> Slab:             430612 kB
> SReclaimable:     416752 kB
> SUnreclaim:        13860 kB
>
>
> /proc/meminfo from a similar machine with 16GB RAM + kernel 3.14.73:
> (though that machine is just a firewall, so no real disk I/O)
>
> MemTotal:       16407652 kB
> HighTotal:      15636376 kB
> HighFree:       14415472 kB
> LowTotal:         771276 kB
> LowFree:          562852 kB
> Slab:              34712 kB
> SReclaimable:      20888 kB
> SUnreclaim:        13824 kB
>
>
> Any help is appreciated,
> Thomas
>


* Re: Re: [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
  2016-07-21 14:02       ` Vlastimil Babka
@ 2016-07-27  9:18         ` Thomas Jarosch
  2016-07-27  9:21           ` Thomas Jarosch
  2016-07-27 16:44           ` Linus Torvalds
  0 siblings, 2 replies; 9+ messages in thread
From: Thomas Jarosch @ 2016-07-27  9:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Johannes Weiner, Andrew Morton, Linus Torvalds, bugzilla-daemon,
	linux-mm

On Thursday, 21. July 2016 16:02:06 Vlastimil Babka wrote:
> > Recently we've updated our production mail server from 3.14.69
> > to 3.14.73 and it worked fine for a few days. When the box is really
> > busy (= incoming malware via email), the I/O speed drops to a crawl;
> 
> I don't see anything either; it might be some change under fs/, for example.
> How about a git bisect?

One day later I failed to trigger it, so no easy git bisect.

Yesterday another busy mail server showed the same problem during backup 
creation. This time I knew about slabtop and could see that the 
ext4_inode_cache occupied about 393MB of the 776MB total low memory.
Write speed was down to 25 MB/s.

"sysctl -w vm.drop_caches=3" cleared the inode cache
and the write speed was back to 300 MB/s.
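
A sketch of the kind of inspection described above, for anyone reproducing
this (the exact slabtop flags may vary between procps versions):

# Largest slab caches, sorted by cache size:
slabtop -o -s c | head -n 15

# Raw counters for the usual suspects:
grep -E 'ext4_inode_cache|dentry' /proc/slabinfo

# Drop clean page cache, dentries and inodes (as done above):
sync && sysctl -w vm.drop_caches=3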

It might be related to memory fragmentation of low memory due to the 
inode cache, the mail server has over 1.400.000 millions files.

I suspect the problem is unrelated to 3.14.73 per se; it seems to trigger
depending on how busy the machine is and on the memory layout.

A 64 bit kernel (even with a 32 bit userspace) is the proper solution here.
Still, that would mean deprecating working 32 bit only boxes.

Cheers,
Thomas


* Re: Re: Re: [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
  2016-07-27  9:18         ` Thomas Jarosch
@ 2016-07-27  9:21           ` Thomas Jarosch
  2016-07-27 16:44           ` Linus Torvalds
  1 sibling, 0 replies; 9+ messages in thread
From: Thomas Jarosch @ 2016-07-27  9:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Johannes Weiner, Andrew Morton, Linus Torvalds, bugzilla-daemon,
	linux-mm

On Wednesday, 27. July 2016 11:18:36 Thomas Jarosch wrote:
> It might be related to memory fragmentation of low memory due to the
> inode cache, the mail server has over 1.400.000 millions files.

1.400.000 files of course. Millions would be a bit much :)

Thomas


* Re: Re: [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
  2016-07-27  9:18         ` Thomas Jarosch
  2016-07-27  9:21           ` Thomas Jarosch
@ 2016-07-27 16:44           ` Linus Torvalds
  2016-07-29 17:00             ` Thomas Jarosch
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2016-07-27 16:44 UTC (permalink / raw)
  To: Thomas Jarosch
  Cc: Vlastimil Babka, Johannes Weiner, Andrew Morton, bugzilla-daemon,
	linux-mm

On Wed, Jul 27, 2016 at 2:18 AM, Thomas Jarosch
<thomas.jarosch@intra2net.com> wrote:
>
> Yesterday another busy mail server showed the same problem during backup
> creation. This time I knew about slabtop and could see that the
> ext4_inode_cache occupied about 393MB of the 776MB total low memory.

Honestly, we're never going to really fix the problem with low memory
on 32-bit kernels. PAE is a horrible hardware hack, and it was always
very fragile. It's only going to get more fragile as fewer and fewer
people are running 32-bit environments in any big way.

Quite frankly, 32GB of RAM on a 32-bit kernel is so crazy as to be
ludicrous, and nobody sane will support that. Run 32-bit user space by
all means, but the kernel needs to be 64-bit if you have more than 8GB
of RAM.

Realistically, PAE is "workable" up to approximately 4GB of physical
RAM, where the exact limit depends on your workload.

So if the bulk of your memory use is just user-space processes, then
you can more comfortably run with more memory (so 8GB or even 16GB of
RAM might work quite well).

And as mentioned, things are getting worse, and not better. We cared
much more deeply about PAE back in the 2.x timeframe. Back then, it
was a primary target, and you would find people who cared. These days,
it simply isn't. These days, the technical solution to PAE literally
is "just run a 64-bit kernel".

                   Linus


* Re: Re: Re: [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3
  2016-07-27 16:44           ` Linus Torvalds
@ 2016-07-29 17:00             ` Thomas Jarosch
  0 siblings, 0 replies; 9+ messages in thread
From: Thomas Jarosch @ 2016-07-29 17:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vlastimil Babka, Johannes Weiner, Andrew Morton, bugzilla-daemon,
	linux-mm

On Wednesday, 27. July 2016 09:44:00 Linus Torvalds wrote:
> Quite frankly, 32GB of RAM on a 32-bit kernel is so crazy as to be
> ludicrous, and nobody sane will support that. Run 32-bit user space by
> all means, but the kernel needs to be 64-bit if you have more than 8GB
> of RAM.

Thanks for the detailed explanation.

Upgrading to a 64-bit kernel with a 32-bit userspace is the mid-term plan,
which might turn into a short-term plan given the occasional hiccup
with PAE / low memory pressure.

Something tells me there might be issues with mISDN on a 64-bit kernel
with a 32-bit userspace, since ISDN is a feature that's not used much
nowadays either. But that should be more or less easy to solve.

-> I consider the issue "fixed" from my side.

Cheers,
Thomas


end of thread, other threads:[~2016-07-29 17:00 UTC | newest]

Thread overview: 9+ messages
     [not found] <bug-64121-27@https.bugzilla.kernel.org/>
2013-10-31 20:46 ` [Bug 64121] New: [BISECTED] "mm" performance regression updating from 3.2 to 3.3 Andrew Morton
2013-11-01 18:43   ` Johannes Weiner
2013-11-04 11:32     ` Thomas Jarosch
2016-07-18 22:23     ` Thomas Jarosch
2016-07-21 14:02       ` Vlastimil Babka
2016-07-27  9:18         ` Thomas Jarosch
2016-07-27  9:21           ` Thomas Jarosch
2016-07-27 16:44           ` Linus Torvalds
2016-07-29 17:00             ` Thomas Jarosch
