* More memory more jitters?
@ 2015-11-14 14:11 CHENG Yuk-Pong, Daniel 
  2015-11-14 14:31 ` Hugo Mills
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: CHENG Yuk-Pong, Daniel  @ 2015-11-14 14:11 UTC (permalink / raw)
  To: linux-btrfs

Hi List,


I have read the Gotcha[1] page:

   Files with a lot of random writes can become heavily fragmented
(10000+ extents) causing thrashing on HDDs and excessive multi-second
spikes of CPU load on systems with an SSD or **large amount of RAM**.

Why could a large amount of memory worsen the problem?

If **too much** memory is a problem, is it possible to limit the
memory btrfs uses?

Background info:

I am running a heavy-write database server with 96GB of RAM. In the worst
case it causes minutes of high CPU load. Systemd keeps killing and
restarting services, and old jobs don't die because they are stuck in
uninterruptible waits... etc.

I tried nodatacow, but it seems to only affect new files. It is not a
subvolume option either...


Regards,
Daniel


[1] https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More memory more jitters?
  2015-11-14 14:11 More memory more jitters? CHENG Yuk-Pong, Daniel 
@ 2015-11-14 14:31 ` Hugo Mills
  2015-11-14 16:37   ` Duncan
  2015-11-15 16:58 ` Patrik Lundquist
  2015-11-16 13:34 ` Austin S Hemmelgarn
  2 siblings, 1 reply; 6+ messages in thread
From: Hugo Mills @ 2015-11-14 14:31 UTC (permalink / raw)
  To: CHENG Yuk-Pong, Daniel ; +Cc: linux-btrfs

On Sat, Nov 14, 2015 at 10:11:31PM +0800, CHENG Yuk-Pong, Daniel  wrote:
> Hi List,
> 
> 
> I have read the Gotcha[1] page:
> 
>    Files with a lot of random writes can become heavily fragmented
> (10000+ extents) causing thrashing on HDDs and excessive multi-second
> spikes of CPU load on systems with an SSD or **large amount of RAM**.
> 
> Why could a large amount of memory worsen the problem?

   Because the kernel will hang on to lots of changes in RAM for
longer. With less memory, there's more pressure to write out dirty
pages to disk, so the changes get written out in smaller pieces more
often. With more memory, the changes being written out get "lumpier".

> If **too much** memory is a problem, is it possible to limit the
> memory btrfs uses?

   There are some VM knobs you can twiddle, I believe, but I haven't
really played with them myself -- I'm sure there are more knowledgeable
people around here who can suggest suitable things to play with.

   Hugo.

> Background info:
> 
> I am running a heavy-write database server with 96GB of RAM. In the worst
> case it causes minutes of high CPU load. Systemd keeps killing and
> restarting services, and old jobs don't die because they are stuck in
> uninterruptible waits... etc.
> 
> I tried nodatacow, but it seems to only affect new files. It is not a
> subvolume option either...
> 
> 
> Regards,
> Daniel
> 
> 
> [1] https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

-- 
Hugo Mills             | Anyone who says their system is completely secure
hugo@... carfax.org.uk | understands neither systems nor security.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                        Bruce Schneier


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More memory more jitters?
  2015-11-14 14:31 ` Hugo Mills
@ 2015-11-14 16:37   ` Duncan
  2015-11-15  7:40     ` Duncan
  0 siblings, 1 reply; 6+ messages in thread
From: Duncan @ 2015-11-14 16:37 UTC (permalink / raw)
  To: linux-btrfs

Hugo Mills posted on Sat, 14 Nov 2015 14:31:12 +0000 as excerpted:

>> I have read the Gotcha[1] page:
>> 
>>    Files with a lot of random writes can become heavily fragmented
>> (10000+ extents) causing thrashing on HDDs and excessive multi-second
>> spikes of CPU load on systems with an SSD or **large amount of RAM**.
>> 
>> Why could a large amount of memory worsen the problem?
> 
>    Because the kernel will hang on to lots of changes in RAM for
> longer. With less memory, there's more pressure to write out dirty pages
> to disk, so the changes get written out in smaller pieces more often.
> With more memory, the changes being written out get "lumpier".
> 
>> If **too much** memory is a problem, is it possible to limit the memory
>> btrfs uses?
> 
>    There are some VM knobs you can twiddle, I believe, but I haven't
> really played with them myself -- I'm sure there are more knowledgeable
> people around here who can suggest suitable things to play with.

Yes.  Don't have time to explain now, but I will later, if nobody beats 
me to it.



-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More memory more jitters?
  2015-11-14 16:37   ` Duncan
@ 2015-11-15  7:40     ` Duncan
  0 siblings, 0 replies; 6+ messages in thread
From: Duncan @ 2015-11-15  7:40 UTC (permalink / raw)
  To: linux-btrfs

Duncan posted on Sat, 14 Nov 2015 16:37:14 +0000 as excerpted:

> Hugo Mills posted on Sat, 14 Nov 2015 14:31:12 +0000 as excerpted:
> 
>>> I have read the Gotcha[1] page:
>>> 
>>>    Files with a lot of random writes can become heavily fragmented
>>> (10000+ extents) causing thrashing on HDDs and excessive multi-second
>>> spikes of CPU load on systems with an SSD or **large amount of RAM**.
>>> 
>>> Why could a large amount of memory worsen the problem?
>> 
>>    Because the kernel will hang on to lots of changes in RAM for
>> longer. With less memory, there's more pressure to write out dirty
>> pages to disk, so the changes get written out in smaller pieces more
>> often. With more memory, the changes being written out get "lumpier".
>> 
>>> If **too much** memory is a problem, is it possible to limit the
>>> memory btrfs uses?
>> 
>>    There are some VM knobs you can twiddle, I believe, but I haven't
>> really played with them myself -- I'm sure there are more knowledgeable
>> people around here who can suggest suitable things to play with.
> 
> Yes.  Don't have time to explain now, but I will later, if nobody beats
> me to it.

And now it's later... =:^)

The official kernel documentation for this is in
$KERNELDIR/Documentation/filesystems/proc.txt, in
CHAPTER 2: MODIFYING SYSTEM PARAMETERS
(starting at line 1378 in the file as it exists in kernel 4.3), tho 
that's little more than an intro.  As it states,
$KERNELDIR/Documentation/sysctl/* contains rather more information.

Of course there are also various resources on the net covering this 
material, and if google finds this post I suppose it might become one of 
them. =:^]


So in that Documentation/sysctl dir, the README file contains an intro, 
but what we're primarily interested in is covered in vm.txt.  The files 
discussed there are found in /proc/sys/vm, tho your distro almost 
certainly has an init service, sysctl (the systemd-sysctl.service on 
systemd based systems, configured with *.conf files in /usr/lib/sysctl.d/ 
and /etc/sysctl.d/), that pokes non-kernel-default distro-configured and 
admin-configured values into the appropriate /proc/sys/vm/* files at 
boot.  Also check /etc/sysctl.conf, which at least here is symlinked 
from /etc/sysctl.d/99-sysctl.conf so systemd-sysctl loads it.  That's 
actually the file with my settings, here.

So (as root) you can poke the files directly for experimentation, and 
when you've settled on values that work for you, you can put them in /etc/
sysctl.d/*.conf or in /etc/sysctl.conf, or whatever your distro uses 
instead.  But keep in mind that (for systemd based systems anyway) the 
settings in /usr/lib/sysctl.d/*.conf will be loaded first and thus will 
apply if not overridden by your own config, so you might want to check 
there too, to see what's being applied there, before going too wild on 
your overrides.
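
By way of illustration, here's the sort of thing I mean (the inspection 
commands are safe to run as-is; the direct poke needs root and is lost 
at reboot):

  # See the current writeback-related settings (either form works):
  sysctl -a 2>/dev/null | grep '^vm\.dirty'
  grep . /proc/sys/vm/dirty_*

  # Poke a value directly for experimentation:
  echo 5 > /proc/sys/vm/dirty_background_ratio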

Of course the sysctl mechanism loads various other settings as well, 
network, core-file, magic-sysrq, others, but what we're focused on here are 
the vm files and settings.

In particular, our files of interest are the /proc/sys/vm/dirty_* files 
and corresponding vm.dirty_* settings, tho while we're here, I'll mention 
that /proc/sys/vm/swappiness and the corresponding vm.swappiness setting 
is also quite commonly changed by users.

Basically, these dirty_* files control the amount of cached writes that 
can accumulate before the kernel will start writing them to storage at 
two different priority levels, the maximum time they are allowed to age 
before they're written back regardless, and the balance between these two 
writeback priorities.

Now, one thing that's important to keep in mind here is that the kernel 
defaults were originally set up back when 128 MiB RAM was a *LOT* of 
memory, and they aren't necessarily appropriate for systems with the GiB 
or often double-digit GiB RAM that most non-embedded systems come with 
today, particularly where people are still using legacy spinning rust -- 
SSDs are enough faster that the problem doesn't show up to the same 
degree, tho admins may still want to tweak the defaults in some cases.

Another thing to keep in mind for mobile systems in particular is that 
writing data out will of course spin up the drives, so you might want 
rather larger caches and longer timeouts on laptops and the like, and/or 
if you spin down your drives.  But balance that against the knowledge 
that data still in the write cache will be lost if the system crashes 
before it hits storage, so don't go /too/ overboard on extending your 
timeouts.  Timeouts of an hour could well save quite a bit of power, but 
they also risk losing an hour's worth of writes!


OK, from that rather high level view, let's jump to the lower level 
actual settings, tho not yet the actual values.  I'll group the settings 
in my discussion, but you can read the description for each individual 
setting in the vm.txt file mentioned above, if you like.

Note that there's a two-dimensional parallel among the four files/settings, 
dirty*_bytes and dirty*_ratio:

dirty_background_bytes
dirty_background_ratio
dirty_bytes
dirty_ratio

In one dimension you have ratio vs. bytes.  Choose one to use and 
ignore the other.  The kernel defaults to the ratio settings, a percentage 
of /available/ memory that's dirty (write-cached data waiting to be written 
to storage), but if you prefer to deal in specific sizes, you can write 
your settings to the bytes files, and the kernel will use them instead.  
It uses whichever of the two files/settings, ratio vs. bytes, was written 
last, and the one not in effect will always read as zero, indicating that 
its counterpart in the pair is being used.
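
A quick sketch of that behaviour, using the background pair as the 
example (the 256 MiB figure is arbitrary):

  cat /proc/sys/vm/dirty_background_ratio     # kernel default: 5
  echo 268435456 > /proc/sys/vm/dirty_background_bytes    # 256 MiB
  cat /proc/sys/vm/dirty_background_ratio     # now reads 0; bytes is in effect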

Note with the ratio files/settings that it's percentage of total 
/available/ memory, which will be rather less than total /system/ 
memory.  But for most modern systems you can estimate initial settings 
using total memory, and then tweak a bit from there if you need to.

In the other dimension you have background (low priority: start writing, 
but let other things come first) vs. foreground (higher priority: get 
much more pushy about the writes as they build up).

But sizes/ratios don't make a lot of sense unless we know the time frame 
we're dealing with, so before we discuss size values, let's talk about 
time.

The other two dirty_* files/settings deal with time (in hundredths of a 
second), not size, and are:

dirty_expire_centisecs
dirty_writeback_centisecs

The expire setting/file controls how long data is cached before the low 
priority writeback kicks in due to time, but does NOT actually trigger 
the writeback itself.

The writeback setting/file controls how often the kernel wakes those low 
priority flusher threads to see if they have anything to do.  If 
something has expired or the background size is too large, they'll start 
working, otherwise they go back to sleep and wait until the next time 
around.

Expire defaults to 3000 centiseconds, i.e. 30 seconds.
Writeback defaults to 500, i.e. 5 seconds.

So unless enough is being written to trigger the size settings, writes 
are allowed to age for 30 seconds by default, then the next time the low 
priority flusher threads wake up, within another five seconds by default, 
they'll start actually writing the data -- at low priority -- back to 
storage.

Here, on my line-powered workstation, I decided that I'm willing to risk 
losing 30 seconds or so of data, so kept the defaults for expire.  
However, I decided I probably didn't need the flushers waking up every 
five seconds to see if there's anything to do, so doubled that to 10 
seconds, 1000 centiseconds.
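
As a one-off command, that's simply (the sysctl.conf equivalent comes 
further down):

  sysctl -w vm.dirty_writeback_centisecs=1000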

On a laptop, people are very likely to want to power down the storage in 
order to save power, and will probably be willing to risk losing a bit 
more time's worth of data if a crash happens, both to allow that and to 
ensure that when the storage does power up, they have as much to write 
as possible.  Here, perhaps a five or ten minute (300 or 600 seconds, 
30000 or 60000 centiseconds) expire might be appropriate, if they're 
willing to risk losing that much work in the event of a crash in order 
to save power.  In that case, waking up the flushers every five seconds 
to check if there's something to do doesn't make much sense either, so 
setting that to something like 30 seconds or a minute (3000 or 6000 
centiseconds) might make sense.  Few folks will want to risk a full 
hour's worth of work, tho, or even a half hour, no matter the power 
savings it might allow.  Still, I've read of people doing it, and if 
you're for instance playing a game that would be lost on crash anyway (or 
watching a movie that's either coming in off the net or already cached in 
memory, so you're not spinning up to /read/ from storage) and not writing 
a paper, it might even make sense.


OK, with the time frame established, we can now look at what sizes make 
sense, and here's where the age of the defaults, which are arguably not 
particularly appropriate on modern hardware, comes into the picture.

As I said, the kernel defaults to using ratios, not bytes.  As I also 
said, the ratios are percentages of available memory, not total memory, 
that can be dirty write cache, before the corresponding low or high 
priority writeback to actual storage kicks off, but for first estimates, 
total memory (RAM, not including swap) works just fine.

dirty_ratio is the foreground (high priority) setting, defaulting to 10%.
dirty_background_ratio (low priority) defaults to 5%.

For discussion, I'll use as an example my own workstation, with its 16 
gig of RAM.  I'll also give the 2 gig figure, for those with older 
systems or chromebooks, etc, and use the 64 meg figure as an example of 
what the figures might have looked like when the defaults were picked, 
tho for all I know 16 meg or 256 meg might have been more common at the 
time.

Here's a table.  Approximate figures, rounded down a bit due to available 
vs. total.

Memory size		10% foreground		5% background
-------------------------------------------------------------
64 MiB			6 MiB			3 MiB
2 GiB			200 MiB			100 MiB
16 GiB			1500 MiB		750 MiB

Now I don't remember and am not going to attempt to look up what disk 
speeds were back then, but we know that (for non-SSDs) while they've 
increased, the increases in disk speed have nowhere near kept up with 
size increases, either of disks or of memory.  But we're really only 
concerned with the modern numbers anyway, so we'll look at those.

A reasonably fast disk (not SSD) today can do, ballpark maybe
120 MiB/sec, sequential, average across the disk.  (At the edge, speeds 
are higher, near the center, they'll be lower.)  But make that random, 
like a well used and rather fragmented disk, and speed will be much 
lower.  A few years ago I used to figure 30 MiB/sec, so we'll pick a nice 
round 50 MiB/sec from between the two.


At 50 MiB/sec, that default 10% foreground will take four seconds to 
write out that 200 MiB, a full 30 seconds to write out that 1500 MiB.  
Remember, that's the foreground "high priority do this first" level, so 
as soon as it's hit...

Well, let's just say we know where those system pauses come from, the 
ones where writes to disk block reading in whatever the user is actually 
waiting for!

And of course at the 16 GiB RAM level that's also about a gig and a half 
of dirty writes that can be lost in the event of a crash, tho the low 
priority flusher should obviously have kicked in before that, writing 
some of the data at lower priority.

The question then becomes, OK, how much system delay while it writes out 
all that accumulated data are you willing to suffer, vs. writing it out 
sooner, before the backlog gets too big and the pause to write it out 
gets too long?

Meanwhile, until the backlog hits the background number, unless the 
expire timer discussed above expires first, the system will be just 
sitting there, not attempting to write anything at the lower priority 
level.

On a 2 GiB memory system it'll accumulate about 100 MiB, a couple of 
seconds' worth of writeout, before it kicks off even low priority flusher 
writes.  On a 16 GiB system, that's already close to 15 seconds worth of 
writing, half the expiry time, for even *LOW* priority writes!!

So particularly as memory sizes increase, we need to lower the background 
number so low priority writes kick off sooner and hopefully get things 
taken care of before high priority writes kick in, and we need to lower 
the foreground number so the backlog doesn't take so long to write out, 
blocking almost all other access to the disk for tens of seconds at a 
time, if the high priority threshold /is/ reached.

What I settled on here, again, with 16 GiB memory, was
1% dirty_background_ratio or about 150 MiB, about 3 seconds worth of 
writes, and 3% dirty_ratio, about 450 MiB or 9 seconds worth of writes.  
9 seconds... I'll tolerate that if I need to.
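
In sysctl.conf form, that works out to something like this (a sketch of 
the relevant lines, not a verbatim dump of my file):

  # /etc/sysctl.conf (or a file under /etc/sysctl.d/)
  vm.dirty_background_ratio = 1
  vm.dirty_ratio = 3
  vm.dirty_writeback_centisecs = 1000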

Note that with background already at 1%, about 150 MiB, if I wanted to go 
lower I'd have to switch to dirty_background_bytes, as I've read nothing 
indicating the kernel will take fractions of a percent here, and I 
suspect trying one would simply leave whatever was set before (the 
defaults, if I tried to set it via sysctl at boot).

As a result, I don't really feel comfortable lowering dirty_ratio below 
3%, because it'd be getting uncomfortably close to the background value, 
tho arguably 2%, double the background value, should be fine, as the 
default is double the background value.

So if I decided to upgrade to say 32 GiB RAM or more (and hadn't switched 
to SSD already), I'd probably switch to the bytes settings and try to 
keep it near say 128 MiB background, half a GiB foreground (which would 
give me a 4X ratio between them, while I now have 3X).
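
In bytes form, that hypothetical config would be something like:

  # 128 MiB background, half a GiB foreground
  vm.dirty_background_bytes = 134217728
  vm.dirty_bytes = 536870912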


Obviously those on laptops may want to increase these numbers, instead, 
tho again, consider how much data you're willing to lose in a crash, and 
don't go hog wild unless you really are willing to lose that data.

Meanwhile, it's also worth noting that there's laptop-mode-tools for 
commandline use, and various graphical tools as well, that can be 
configured to toggle between plugged-in and battery power mode, and 
sometimes have a whole set of different profiles, for toggling these and 
many (many!) other settings between save-power-mode and performance-mode, 
if you'd rather not have your laptop set to a 10 minute expiry and 
gigabytes worth of write-cache /all/ the time, but still want it /some/ 
of the time, when you're really trying to save that power!


OK, but what about those on SSD?  Obviously many SSDs are FAR faster, and 
what's more, they don't suffer the same dropoff between sequential and 
random access modes that spinning rust does.  Here, I upgraded the main 
system to SSD a couple years ago or so, but I do still keep my multimedia 
files on spinning rust.  And while I probably don't need those tight 1% 
background, 3% foreground ratios any more, the SSD writes fast enough 
that it's not hurting things, and it still helps when, for example, doing 
backups to the media drive.  So I've kept them where I had them, tho I'd 
probably not bother changing them from kernel defaults on an all-SSD 
system, or if I upgraded to 32 GiB RAM or something.  (Tho with a mostly 
SSD system, the pressure to upgrade RAM beyond my existing 16 GiB is 
pretty much non-existent, tho I do wonder sometimes what it'd be like to 
go to, say, 256 GiB of battery-backed RAM and access stuff at RAM speed 
instead of SSD speed.  It's not really cost-effective, so I can but dream...)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More memory more jitters?
  2015-11-14 14:11 More memory more jitters? CHENG Yuk-Pong, Daniel 
  2015-11-14 14:31 ` Hugo Mills
@ 2015-11-15 16:58 ` Patrik Lundquist
  2015-11-16 13:34 ` Austin S Hemmelgarn
  2 siblings, 0 replies; 6+ messages in thread
From: Patrik Lundquist @ 2015-11-15 16:58 UTC (permalink / raw)
  To: CHENG Yuk-Pong, Daniel; +Cc: linux-btrfs

On 14 November 2015 at 15:11, CHENG Yuk-Pong, Daniel <j16sdiz@gmail.com> wrote:
>
> Background info:
>
> I am running a heavy-write database server with 96GB of RAM. In the worst
> case it causes minutes of high CPU load. Systemd keeps killing and
> restarting services, and old jobs don't die because they are stuck in
> uninterruptible waits... etc.
>
> I tried nodatacow, but it seems to only affect new files. It is not a
> subvolume option either...

How about nocow (chattr +C) on the database directories? You will have
to copy the files to make nocow versions of them.
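
Something along these lines should do it (paths are only examples; stop
the database first):

  mkdir /srv/db.nocow
  chattr +C /srv/db.nocow        # new files created here inherit nocow
  cp -a --reflink=never /srv/db/. /srv/db.nocow/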

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: More memory more jitters?
  2015-11-14 14:11 More memory more jitters? CHENG Yuk-Pong, Daniel 
  2015-11-14 14:31 ` Hugo Mills
  2015-11-15 16:58 ` Patrik Lundquist
@ 2015-11-16 13:34 ` Austin S Hemmelgarn
  2 siblings, 0 replies; 6+ messages in thread
From: Austin S Hemmelgarn @ 2015-11-16 13:34 UTC (permalink / raw)
  To: CHENG Yuk-Pong, Daniel , linux-btrfs

On 2015-11-14 09:11, CHENG Yuk-Pong, Daniel  wrote:
> Hi List,
>
>
> I have read the Gotcha[1] page:
>
>     Files with a lot of random writes can become heavily fragmented
> (10000+ extents) causing thrashing on HDDs and excessive multi-second
> spikes of CPU load on systems with an SSD or **large amount of RAM**.
>
> Why could a large amount of memory worsen the problem?
>
> If **too much** memory is a problem, is it possible to limit the
> memory btrfs uses?
As Duncan already replied, your issue is probably with the kernel's 
ancient defaults for write-back buffering.  It defaults to waiting for 
10% of system RAM to be pages that need to be written to disk before 
forcing anything to be flushed.  This worked fine when you had systems 
where 256M was a lot of RAM, but is absolutely inane when you get above 
about 4G (the actual point at which it becomes a problem is highly 
dependent on your storage hardware, however).  I find that on most 
single disk systems with a fast disk, you start to get slowdowns when 
trying to cache more than about 256M for writeback.
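
A sketch of what pinning it to a fixed size rather than a percentage 
could look like (an illustration of the idea, not a tuned recommendation):

  sysctl -w vm.dirty_background_bytes=$((128*1024*1024))
  sysctl -w vm.dirty_bytes=$((256*1024*1024))
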
>
> Background info:
>
> I am running a heavy-write database server with 96GB of RAM. In the worst
> case it causes minutes of high CPU load. Systemd keeps killing and
> restarting services, and old jobs don't die because they are stuck in
> uninterruptible waits... etc.
>
> I tried nodatacow, but it seems to only affect new files. It is not a
> subvolume option either...
This is a known limitation, although NOCOW is still something that 
should be used for database files.  The trick to get it set on an 
existing file is to create a new, empty file, set the attribute on that, 
copy the existing file into the new one, and then rename the new one 
over the old one.
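
For a single file, that is roughly the following (names are just examples;
make sure nothing has the file open while you do it):

  touch dbfile.new
  chattr +C dbfile.new         # the attribute must be set while the file is empty
  cat dbfile > dbfile.new      # plain copy, so the data lands in nocow extents
  mv dbfile.new dbfile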




^ permalink raw reply	[flat|nested] 6+ messages in thread
