* Severe slowdown caused by jbd2 process
@ 2011-01-21  0:13 Jon Leighton
  2011-01-21  1:31 ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Jon Leighton @ 2011-01-21  0:13 UTC (permalink / raw)
  To: linux-ext4


Hi there,

I have been experiencing some slowness with an ext4 filesystem. I will
try to explain and hopefully somebody can identify whether this is
"normal" or not. Sorry if I am in any sense unscientific - filesystems
are somewhere near the edge of my computer science knowledge :)

Basically I am involved with doing some development on the Ruby on Rails
web app framework, and the automated tests for one component (Active
Record) do a lot of reading from and writing to a database.

I realised that the test suite was running significantly slower for me
than for another developer, so I started to investigate. First I created
an unencrypted partition and put my databases on it, as I had previously
had everything encrypted.

This made it somewhat faster, but not massively.

I then used iotop to see what was going on when I ran the tests. I
discovered that the process jbd2/sda3-8 was doing *lots* of IO when I
ran these tests.

I did some googling and tried a few things. Removing the journal solved
the problem (as would be expected, I guess), but recreating the
partition as ext3 rather than ext4 also solved it (which perhaps
indicates a regression?). When I say 'solved', I mean it took a single
run of this particular test suite from, say, 4.5 minutes down to more
like 60-80 seconds.

I found some other people reporting a similar problem:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/607560

They complain of the jbd2 process running every few seconds. This is
something I had not noticed before, but I can observe this on my system
too. It runs and uses a lot of IO for a short period of time maybe every
2 seconds. So I think I am experiencing the same problem.

FWIW, using the noatime option does not help at all. Also, I have tried
using a very recent kernel build with no success. And I have run iotop
on another laptop (which also has an ext4 partition) and I cannot
observe this frequent running of jbd2.

So: does this sound like a bug, and if so, what can be done? I'm very
happy to provide any additional information as needed.

Many thanks,

Jon



* Re: Severe slowdown caused by jbd2 process
  2011-01-21  0:13 Severe slowdown caused by jbd2 process Jon Leighton
@ 2011-01-21  1:31 ` Josef Bacik
       [not found]   ` <1295601083.5799.3.camel@tybalt>
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2011-01-21  1:31 UTC (permalink / raw)
  To: Jon Leighton; +Cc: linux-ext4

On Fri, Jan 21, 2011 at 12:13:02AM +0000, Jon Leighton wrote:
> Hi there,
> 
> I have been experiencing some slowness with an ext4 filesystem. I will
> try to explain and hopefully somebody can identify whether this is
> "normal" or not. Sorry if I am in any sense unscientific - filesystems
> are somewhere near the edge of my computer science knowledge :)
> 
> Basically I am involved with doing some development on the Ruby on Rails
> web app framework, and the automated tests for one component (Active
> Record) do a lot of reading from and writing to a database.
> 
> I realised that the test suite was running significantly slower for me
> than for another developer, so I started to investigate. First I created
> an unencrypted partition and put my databases on it, as I had previously
> had everything encrypted.
> 
> This made it somewhat faster, but not massively.
> 
> I then used iotop to see what was going on when I ran the tests. I
> discovered that the process jbd2/sda3-8 was doing *lots* of IO when I
> ran these tests.
> 
> I did some googling and tried a few things. Removing the journal solved
> the problem (as would be expected, I guess), but recreating the
> partition as ext3 rather than ext4 also solved it (which perhaps
> indicates a regression?). When I say 'solved', I mean it took a single
> run of this particular test suite from, say, 4.5 minutes down to more
> like 60-80 seconds.
> 
> I found some other people reporting a similar problem:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/607560
> 
> They complain of the jbd2 process running every few seconds. This is
> something I had not noticed before, but I can observe this on my system
> too. It runs and uses a lot of IO for a short period of time maybe every
> 2 seconds. So I think I am experiencing the same problem.
> 
> FWIW, using the noatime option does not help at all. Also, I have tried
> using a very recent kernel build with no success. And I have run iotop
> on another laptop (which also has an ext4 partition) and I cannot
> observe this frequent running of jbd2.
> 
> So: does this sound like a bug, and if so, what can be done? I'm very
> happy to provide any additional information as needed.
> 

What kind of database is this?  Does it use lots of files?  When it's being
particularly slow could you run

echo w > /proc/sysrq-trigger

a couple of times, spread out.  This will give us an idea of what everybody is
doing when things are going slow.  Thanks,

Josef
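
P.S. A rough sketch for collecting a few spaced-out dumps; the loop
count, sleep interval and output file are just placeholders:

for i in 1 2 3; do
    echo w > /proc/sysrq-trigger    # dump blocked (D-state) tasks
    sleep 10                        # space the samples out
done
dmesg > sysrq-dumps.txt             # the traces land in the kernel log

Then just attach sysrq-dumps.txt.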



* Re: Severe slowdown caused by jbd2 process
       [not found]   ` <1295601083.5799.3.camel@tybalt>
@ 2011-01-21 12:59     ` Josef Bacik
  2011-01-21 14:03       ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2011-01-21 12:59 UTC (permalink / raw)
  To: Jon Leighton; +Cc: Josef Bacik, linux-ext4

On Fri, Jan 21, 2011 at 09:11:23AM +0000, Jon Leighton wrote:
> Hi Josef,
> 
> Thanks for the reply.
> 
> On Thu, 2011-01-20 at 20:31 -0500, Josef Bacik wrote:
> > What kind of database is this?  Does it use lots of files?
> 
> This happens with all databases that I test with: sqlite3, mysql and
> postgresql, which would seem to indicate that the issue is not actually
> related to the databases, but is being made evident by them when they do
> lots of reads/writes. (The jbd2 every 2 seconds thing happens even when
> all databases are completely shut down.)
>

Right, I'm not trying to blame the database; I'm more trying to get an
idea of the kind of IO they are generating so we can figure out what is
being slow.
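
(If you want to count the sync calls yourself, strace can isolate them;
the pid below is a placeholder for whichever database process you are
testing:

strace -c -f -e trace=fsync,fdatasync -p <pid>

It prints a per-call tally when you interrupt it with ^C.)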
 
> > When it's being
> > particularly slow could you run
> > 
> > echo w > /proc/sysrq-trigger
> > 
> > a couple of times, spread out.  This will give us an idea of what everybody is
> > doing when things are going slow.  Thanks,
> 
> Cool, I have done that and attached the results. The partition in
> question (the one with the databases on) is /dev/sda4.
>

Hrm, so it looks like an fsync-heavy workload.  I'll run some fsync
tests locally and see if I can reproduce the kind of slowdowns you are
experiencing.  Thanks,

Josef 


* Re: Severe slowdown caused by jbd2 process
  2011-01-21 12:59     ` Josef Bacik
@ 2011-01-21 14:03       ` Josef Bacik
  2011-01-21 14:28         ` Jon Leighton
  0 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2011-01-21 14:03 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Jon Leighton, linux-ext4

On Fri, Jan 21, 2011 at 07:59:22AM -0500, Josef Bacik wrote:
> On Fri, Jan 21, 2011 at 09:11:23AM +0000, Jon Leighton wrote:
> > Hi Josef,
> > 
> > Thanks for the reply.
> > 
> > On Thu, 2011-01-20 at 20:31 -0500, Josef Bacik wrote:
> > > What kind of database is this?  Does it use lots of files?
> > 
> > This happens with all databases that I test with: sqlite3, mysql and
> > postgresql, which would seem to indicate that the issue is not actually
> > related to the databases, but is being made evident by them when they do
> > lots of reads/writes. (The jbd2 every 2 seconds thing happens even when
> > all databases are completely shut down.)
> >
> 
> Right, I'm not trying to blame the database; I'm more trying to get an
> idea of the kind of IO they are generating so we can figure out what is
> being slow.
>  
> > > When it's being
> > > particularly slow could you run
> > > 
> > > echo w > /proc/sysrq-trigger
> > > 
> > > a couple of times, spread out.  This will give us an idea of what everybody is
> > > doing when things are going slow.  Thanks,
> > 
> > Cool, I have done that and attached the results. The partition in
> > question (the one with the databases on) is /dev/sda4.
> >
> 
> Hrm, so it looks like an fsync-heavy workload.  I'll run some fsync
> tests locally and see if I can reproduce the kind of slowdowns you are
> experiencing.  Thanks,
>

Heh so now that I've had a moment to wake up, how about running your test on
ext3, but mount the partition with

mount -o barrier

and see if it's still faster than ext4.  Thanks,

Josef 
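
P.S. On ext3 the barrier option is usually spelled with an explicit
value, and a remount is enough to flip it; the mount point here is just
an example:

mount -o remount,barrier=1 /dev/sda4 /srv/db

No need to recreate the filesystem for the test.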


* Re: Severe slowdown caused by jbd2 process
  2011-01-21 14:03       ` Josef Bacik
@ 2011-01-21 14:28         ` Jon Leighton
  2011-01-21 14:31           ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Jon Leighton @ 2011-01-21 14:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-ext4


Hi there,

On Fri, 2011-01-21 at 09:03 -0500, Josef Bacik wrote:
> Heh so now that I've had a moment to wake up, how about running your test on
> ext3, but mount the partition with
> 
> mount -o barrier
> 
> and see if it's still faster than ext4.  Thanks,

You're right, that slows it down significantly on the ext3 partition. Is
this expected behaviour with ext4 then?

Thanks

Jon



* Re: Severe slowdown caused by jbd2 process
  2011-01-21 14:28         ` Jon Leighton
@ 2011-01-21 14:31           ` Josef Bacik
  2011-01-21 23:56             ` Ted Ts'o
  2011-01-24 20:41             ` Darrick J. Wong
  0 siblings, 2 replies; 17+ messages in thread
From: Josef Bacik @ 2011-01-21 14:31 UTC (permalink / raw)
  To: Jon Leighton; +Cc: Josef Bacik, linux-ext4

On Fri, Jan 21, 2011 at 02:28:29PM +0000, Jon Leighton wrote:
> Hi there,
> 
> On Fri, 2011-01-21 at 09:03 -0500, Josef Bacik wrote:
> > Heh so now that I've had a moment to wake up, how about running your test on
> > ext3, but mount the partition with
> > 
> > mount -o barrier
> > 
> > and see if it's still faster than ext4.  Thanks,
> 
> You're right, that slows it down significantly on the ext3 partition. Is
> this expected behaviour with ext4 then?
>

Yup, whatever you are doing in your webapp is making your database do lots of
fsyncs, which is going to suck.  If you are on a battery-backed system, or just
don't care if you lose your database and would rather it be faster, you can
mount your ext4 fs with -o nobarrier.  Thanks,

Josef 
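
P.S. A remount is enough to try it out (mount point again an example):

mount -o remount,nobarrier /dev/sda4 /srv/db

Add nobarrier to the options in /etc/fstab if you want it to stick.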




* Re: Severe slowdown caused by jbd2 process
  2011-01-21 14:31           ` Josef Bacik
@ 2011-01-21 23:56             ` Ted Ts'o
  2011-01-22  1:11               ` torn5
  2011-01-22 13:05               ` Ric Wheeler
  2011-01-24 20:41             ` Darrick J. Wong
  1 sibling, 2 replies; 17+ messages in thread
From: Ted Ts'o @ 2011-01-21 23:56 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Jon Leighton, linux-ext4

On Fri, Jan 21, 2011 at 09:31:45AM -0500, Josef Bacik wrote:
> 
> Yup, whatever you are doing in your webapp is making your database do lots of
> fsyncs, which is going to suck.  If you are on a battery-backed system, or just
> don't care if you lose your database and would rather it be faster, you can
> mount your ext4 fs with -o nobarrier.  Thanks,

Note that if you don't use -o barrier on ext3, or use -o nobarrier on
ext4, there is a chance of significant file system damage if you have a
power failure, since without the barrier, the file system doesn't wait
for the disk to acknowledge that the data has hit the platter.  The
problem is that if you are using a barrier operation, you're not going
to be able to get more than about 30-50 non-trivial[1] fsync's per
second on a standard HDD; barriers are inherently slow.

[1] Where there was some kind of data write between the two fsync's.
You may be able to get faster back-to-back fsync() with no intervening
data writes, but that's not terribly interesting.  :-)

A UPS should protect you against most of the dangers of not using
barriers.  The other choice is to be more intelligent with your coding
(and/or with your database choice) to avoid needing a huge number of
fsync's, as they are going to be costly.  If you can batch multiple
database operations under a single commit, for example, you should be
able to eliminate the need for so many fsync's.
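
As a sketch of what batching looks like with sqlite3 (the database and
table here are hypothetical), wrapping the statements in one
transaction:

sqlite3 test.db <<'EOF'
BEGIN;
INSERT INTO users VALUES (1, 'alice');
INSERT INTO users VALUES (2, 'bob');
-- ... hundreds more ...
COMMIT;
EOF

costs you one journal commit instead of one per INSERT.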

		        	       	       - Ted




* Re: Severe slowdown caused by jbd2 process
  2011-01-21 23:56             ` Ted Ts'o
@ 2011-01-22  1:11               ` torn5
  2011-01-22  1:34                 ` Ted Ts'o
  2011-01-22 13:05               ` Ric Wheeler
  1 sibling, 1 reply; 17+ messages in thread
From: torn5 @ 2011-01-22  1:11 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Josef Bacik, Jon Leighton, linux-ext4


On 01/22/2011 12:56 AM, Ted Ts'o wrote:
> On Fri, Jan 21, 2011 at 09:31:45AM -0500, Josef Bacik wrote:
>    
>> Yup, whatever you are doing in your webapp is making your database do lots of
>> fsyncs, which is going to suck.  If you are on a battery-backed system, or just
>> don't care if you lose your database and would rather it be faster, you can
>> mount your ext4 fs with -o nobarrier.  Thanks,
>>      
> Note that if you don't use -o barrier on ext3, or use -o nobarrier on
> ext4, there is a chance of significant file system damage if you have a
> power failure, since without the barrier, the file system doesn't wait
> for the disk to acknowledge that the data has hit the platter.  The
> problem is that if you are using a barrier operation, you're not going
> to be able to get more than about 30-50 non-trivial[1] fsync's per
> second on a standard HDD; barriers are inherently slow.
>    

I think that currently the fsyncs have a double meaning: they are used
to make a filesystem operation happen before another filesystem
operation, and to make a filesystem operation happen before a network
operation. I don't think the second case can be sped up (there can be
a distributed transaction involved), but the first probably could be,
and I'm thinking about how...

Do you think nobarrier + data=journal would provide the same guarantees
as barrier and almost the same performance as nobarrier (for random I/O)?

Hmm, maybe you need the barriers enabled to make even data=journal work
reliably?
But then there should be a mount option (barriersonlyjournal?) so that
barriers are only generated every so many seconds, and only for
committing a big transaction to the journal, while applications' fsyncs
would be issued without barriers.
This should provide the benefits I mentioned, for disk-to-disk
sequentiality (not disk-to-network), shouldn't it?




* Re: Severe slowdown caused by jbd2 process
  2011-01-22  1:11               ` torn5
@ 2011-01-22  1:34                 ` Ted Ts'o
  2011-01-22 16:21                   ` torn5
  0 siblings, 1 reply; 17+ messages in thread
From: Ted Ts'o @ 2011-01-22  1:34 UTC (permalink / raw)
  To: torn5; +Cc: Josef Bacik, Jon Leighton, linux-ext4

On Sat, Jan 22, 2011 at 02:11:34AM +0100, torn5 wrote:
> I think that currently the fsyncs have a double meaning: they are
> used to make a filesystem operation happen before another filesystem
> operation, and to make a filesystem operation happen before a
> network operation. I don't think the second case can be sped up
> (there can be a distributed transaction involved) 

It all depends on the application.  If you have many simultaneous
transactions with different peers (say, SMTP for example), you could
simply have the server batch the commits for multiple incoming mail
messages into the database before sending the 200 acknowledgement
(meaning "yes, I have this mail message") to the various MTAs.  In
other cases, if you are sending a huge number of transactions from one
server to another, maybe you change things so that your transactions
get acknowledged in batches.  That might require an application
protocol change, but it could be done (if you have control of both
ends of the connection).

At the end of the day, though, if the application protocol design is
stupid, there's not much you can do.  That's like the difference
between XMODEM (for those who are old enough to remember it), and
ZMODEM (which had a sliding window acknowledgement system).

> Do you think nobarrier + data=journal would provide the same
> guarantees as barrier and almost the same performance as nobarrier
> (for random I/O)?

No.  Fundamentally barriers are about making sure the data actually
hits the disk platters.  If you don't use a barrier operation, the
hard drive could potentially delay writing disk sectors for seconds,
perhaps even minutes, in order to try to optimize disk head movements.
So without barriers, even though you *think* you had sent the commit
to disk, and had told your network partner, "I have it, and commit not
to lose it", if you drop power at precisely the wrong time, data could
be lost.  Using data=journal doesn't change this fact.

> But then there should be a mount option (barriersonlyjournal?) so
> that barriers are only generated every so many seconds, and only for
> committing a big transaction to the journal, while applications'
> fsyncs would be issued without barriers.

In general, an fsync() has to force a journal commit.  There are a few
cases where an fdatasync() could avoid needing a journal commit, but
usually when an application uses fdatasync(), it really wants to assure
that its data writes are really pushed out to the disk platter, and
a barriersonlyjournal option would defeat that need for a database
which is trying to provide ACID semantics.
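
(On kernels that expose the jbd2 statistics you can actually watch the
commits happen; the device name here is whatever your journal thread
is called:

cat /proc/fs/jbd2/sda3-8/info

This shows how many transactions have been committed and the average
commit time, which makes the fsync-to-journal-commit relationship
visible.)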

      	 					- Ted



* Re: Severe slowdown caused by jbd2 process
  2011-01-21 23:56             ` Ted Ts'o
  2011-01-22  1:11               ` torn5
@ 2011-01-22 13:05               ` Ric Wheeler
  1 sibling, 0 replies; 17+ messages in thread
From: Ric Wheeler @ 2011-01-22 13:05 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Josef Bacik, Jon Leighton, linux-ext4

On 01/21/2011 06:56 PM, Ted Ts'o wrote:
> On Fri, Jan 21, 2011 at 09:31:45AM -0500, Josef Bacik wrote:
>> Yup, whatever you are doing in your webapp is making your database do lots of
>> fsyncs, which is going to suck.  If you are on a battery-backed system, or just
>> don't care if you lose your database and would rather it be faster, you can
>> mount your ext4 fs with -o nobarrier.  Thanks,
> Note that if you don't use -o barrier on ext3, or use -o nobarrier on
> ext4, there is a chance of significant file system damage if you have a
> power failure, since without the barrier, the file system doesn't wait
> for the disk to acknowledge that the data has hit the platter.  The
> problem is that if you are using a barrier operation, you're not going
> to be able to get more than about 30-50 non-trivial[1] fsync's per
> second on a standard HDD; barriers are inherently slow.
>
> [1] Where there was some kind of data write between the two fsync's.
> You may be able to get faster back-to-back fsync() with no intervening
> data writes, but that's not terribly interesting.  :-)
>
> A UPS should protect you against most of the dangers of not using
> barriers.  The other choice is to be more intelligent with your coding
> (and/or with your database choice) to avoid needing a huge number of
> fsync's, as they are going to be costly.  If you can batch multiple
> database operations under a single commit, for example, you should be
> able to eliminate the need for so many fsync's.
>
> 		        	       	       - Ted
>

Just a note that databases usually already think hard about batching updates 
into transactions which all go to disk on a commit.

Various databases have statistics to show the average size of a
transaction, etc., and that can help you tune your workload.

Ric



* Re: Severe slowdown caused by jbd2 process
  2011-01-22  1:34                 ` Ted Ts'o
@ 2011-01-22 16:21                   ` torn5
  2011-01-22 19:37                     ` Theodore Tso
  0 siblings, 1 reply; 17+ messages in thread
From: torn5 @ 2011-01-22 16:21 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: torn5, Josef Bacik, Jon Leighton, linux-ext4

On 01/22/2011 02:34 AM, Ted Ts'o wrote:
> ....
>
> At the end of the day, though, if the application protocol design is
> stupid, there's not much you can do.
> ....

Thanks for your reply.
You are right; now I'm starting to understand that what I was trying to
achieve was actually a change in the application logic...

I have a different question now:
Is the fsync in a nobarrier mount totally swallowed?
If not:
a) what guarantees does it provide in a nobarrier situation and
b) is there a "fakefsync" mount option or some other way to make it a 
no-op? (I understand the risk, and the fact that this is actually a 
change in the application's logic)


* Re: Severe slowdown caused by jbd2 process
  2011-01-22 16:21                   ` torn5
@ 2011-01-22 19:37                     ` Theodore Tso
  2011-01-22 23:22                       ` torn5
  0 siblings, 1 reply; 17+ messages in thread
From: Theodore Tso @ 2011-01-22 19:37 UTC (permalink / raw)
  To: torn5; +Cc: Josef Bacik, Jon Leighton, linux-ext4


On Jan 22, 2011, at 11:21 AM, torn5 wrote:
> 
> I have a different question now:
> Is the fsync in a nobarrier mount totally swallowed?

No.   It will still cause a journal commit, and send disk writes down to the HDD.   How those disk writes will be interpreted by the HDD is completely up to the HDD's firmware.   It could seek like mad and try to write all of those disk blocks as they arrive, or it could try to batch writes which are farther away to minimize disk head movement, and perhaps combine writes that arrive potentially seconds or minutes apart. 

> If not:
> a) what guarantees does it provide in a nobarrier situation and

As long as there is not a power failure (or disk failure, of course), those disk writes will eventually hit the platter.   The data should be consistent on disk if the kernel were to panic, or someone were to hit the reset button.   So you will have at least that level of guarantee.  But if the power cord gets kicked out of the wall, or the floor waxer in the data center causes the circuit breaker to pop, or the flood waters in Queensland start pouring into the underground car park and the transformer located in said car park shorts out, you have no guarantees at all.

> b) is there a "fakefsync" mount option or some other way to make it a no-op? (I understand the risk, and the fact that this is actually a change in the application's logic)

No, sorry.   Usually the fsync is there for a good reason, and if fsync's are completely eliminated, you have absolutely no guarantees at all.   (Kernel panics, reset buttons, etc., all will cause the database to be totally scrambled.)   Providing such a knob to system administrators who might use it to "speed up" their application is considered a bit of an attractive nuisance --- sort of like providing a button with an LED display that says in bright green friendly colors, "push to test", and then once pushed, changes to an angry red color, "release to detonate".   :-)

You can hack the kernel to do that, though.  Someone who is bright enough to figure out how to create their own fake-fsync mount option is hopefully smart enough to understand the consequences of doing that.  (Just like someone who can figure out how to defeat the safety mechanisms on a lawn mower, and then uses the lawn mower to trim a hedge, is hopefully smart enough to understand the consequences of what happens if he drops said lawn mower on his foot and loses it.)
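
(In fact you don't even need to hack the kernel; a userspace shim gets
you the same thing.  A minimal sketch, assuming the application is
dynamically linked -- the file names and the final command are
placeholders:

cat > nofsync.c <<'EOF'
/* LD_PRELOAD shim: turn fsync/fdatasync into no-ops.
   Same caveats as a fake-fsync mount option would have. */
int fsync(int fd)     { (void)fd; return 0; }
int fdatasync(int fd) { (void)fd; return 0; }
EOF
gcc -shared -fPIC -o nofsync.so nofsync.c
LD_PRELOAD=$PWD/nofsync.so ./run-test-suite

Anyone who runs a production database this way deserves what they get,
of course.)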

The smart thing, of course, is to write your application logic in a way that doesn't cause so many database transactions.

-- Ted






* Re: Severe slowdown caused by jbd2 process
  2011-01-22 19:37                     ` Theodore Tso
@ 2011-01-22 23:22                       ` torn5
  2011-01-23  5:17                         ` Ted Ts'o
  0 siblings, 1 reply; 17+ messages in thread
From: torn5 @ 2011-01-22 23:22 UTC (permalink / raw)
  To: Theodore Tso; +Cc: torn5, Josef Bacik, Jon Leighton, linux-ext4

On 01/22/2011 08:37 PM, Theodore Tso wrote:
> On Jan 22, 2011, at 11:21 AM, torn5 wrote:
>> Is the fsync in a nobarrier mount totally swallowed?
>>      
> No.   It will still cause a journal commit, and send disk writes down to the HDD.   How those disk writes will be interpreted by the HDD is completely up to the HDD's firmware.
...
>> If not:
>> a) what guarantees does it provide in a nobarrier situation and
>>      
> As long as there is not a power failure (or disk failure, of course), those disk writes will eventually hit the platter. ....

VERY interesting, thanks for the explanation.

>> b) is there a "fakefsync" mount option or some other way to make it a no-op? (I understand the risk, and the fact that this is actually a change in the application's logic)
>>      
> No, sorry.   Usually the fsync is there for a good reason, and if fsync's are completely eliminated, you have absolutely no guarantees at all.   (Kernel panics, reset buttons, etc., all will cause the database to be totally scrambled.)   Providing such a knob to system administrators who might use it to "speed up" their application is considered a bit of an attractive nuisance
>    

Sometimes it's useful, and that's the reason why Postgresql and Mysql
both have a no-fsync mode.
Sometimes you have to do something for which intermediate state doesn't
matter. Think of it as a computation: if it fails, you restart it from
the beginning. In scientific research this is often the case. Often, to
save time, you use software already written, which might have an
excessively conservative behaviour for a "computation", and this slows
down your computation. But rewriting such an application is simply too
much, so you end up waiting patiently... that's why a fakefsync mount
option would be nice to have.

Anyway, you said fsyncs in nobarrier mode (only?) generate a journal
commit and push writes to the HDD.
Then if I also disable the journal, the only thing that remains is the
push of data to the HDD, right?
This is near to a no-op, I would say, because the data would have gone
to the disk sooner or later... Oh no, it's not, because you wait for
the disk to return a completion, and in the meantime you cannot use the
CPU. Right? OK, so for a single-threaded app there is indeed a
difference.

May I ask how this "push of data to the disk" is implemented: does it
skip the request queue for the disk (i.e. jump ahead of the queue),
does it have some other kind of special priority, or is it submitted
to the tail like normal, with the fsync waiting patiently for it to
reach the disk?

Thank you for all these explanations


* Re: Severe slowdown caused by jbd2 process
  2011-01-22 23:22                       ` torn5
@ 2011-01-23  5:17                         ` Ted Ts'o
  2011-01-23 18:43                           ` torn5
  0 siblings, 1 reply; 17+ messages in thread
From: Ted Ts'o @ 2011-01-23  5:17 UTC (permalink / raw)
  To: torn5; +Cc: Josef Bacik, Jon Leighton, linux-ext4

On Sun, Jan 23, 2011 at 12:22:19AM +0100, torn5 wrote:
> 
> Sometimes it's useful, and that's the reason why Postgresql and
> Mysql both have a no-fsync mode.

Yes, and that's why the application is the right place to decide
whether or not to do fsync.

> Sometimes you have to do something for which intermediate state
> doesn't matter. Think of it as a computation: if it fails, you
> restart it from the beginning. In scientific research this is often
> the case. Often, to save time, you use software already written,
> which might have an excessively conservative behaviour for a
> "computation", and this slows down your computation. But rewriting
> such an application is simply too much, so you end up waiting
> patiently...

You're using open source software, right?  If so, you can edit the
source and recompile it.  :-)

Oh, you're using proprietary software?  That doesn't have a no-fsync
mode?  Now you know one of the serious downsides of buying a car whose
hood is welded shut.

> that's why a fakefsync mount option would be nice to have.

Yes, except the file system developers don't want to take on the moral
liability of system administrators using such a mount option
incorrectly.  Might as well ask why lawn mower manufacturers don't
make lawn mowers where you can disable the safety device that prevents
the blade from spinning when the wheels are lifted off the ground.
Just "it could be useful because you could trim hedges with the lawn
mower" isn't going to be sufficient justification....

> Anyway, you said fsyncs in nobarrier mode (only?) generate a
> journal commit and push writes to the HDD.
> Then if I also disable the journal, the only thing that remains is
> the push of data to the HDD, right?
> This is near to a no-op, I would say, because the data would have
> gone to the disk sooner or later... Oh no, it's not, because you
> wait for the disk to return a completion, and in the meantime you
> cannot use the CPU. Right?

We wait for the blocks queued for I/O to be sent to the disk.  That's
not quite the same thing, but yes, it can cause delay if you have a
lot of writes pending to be sent to the disk.

> May I ask how this "push of data to the disk" is implemented: does
> it skip the request queue for the disk (i.e. jump ahead of the
> queue), does it have some other kind of special priority, or is it
> submitted to the tail like normal, with the fsync waiting patiently
> for it to reach the disk?

The fsync waits for all data to be sent to disk.  It has to, since we
can't easily, given the current disk protocols, distinguish between
the 5 MB of I/O that pertains to file A, which is being fsync'ed, and
the 20 MB of I/O pertaining to file B, which is going on in the
background.  There is a way, for some newer disk drives, to do what's
called a FUA (Force Unit Attention) write, where a single block write
request bypasses all caches, including the track buffer, and goes
straight to disk.  But since a FUA write bypasses all HDD
optimizations, you can't really use it for bulk file data (well, you
could, but you'd regret it).  You could use it if there were a few
blocks that needed to be sent to the disk *now*, bypassing all other
I/O requests, but in practice you need to do a lot more than that when
fulfilling an fsync() request.

Again, the right answer is for the application to be smart.  And if
it's not smart, and it's open source, fix the application.  If it's a
crappy proprietary userspace application, open a bug report; that's
why you pay the manufacturer $$$ for support, right?  And if they
won't fix it, well, then vote with your wallet, and go elsewhere.
Preferably to a properly written open source application.  :-)

						- Ted


* Re: Severe slowdown caused by jbd2 process
  2011-01-23  5:17                         ` Ted Ts'o
@ 2011-01-23 18:43                           ` torn5
  2011-01-24 20:16                             ` Ted Ts'o
  0 siblings, 1 reply; 17+ messages in thread
From: torn5 @ 2011-01-23 18:43 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: torn5, Josef Bacik, Jon Leighton, linux-ext4

On 01/23/2011 06:17 AM, Ted Ts'o wrote:
>
>> that's why a fakefsync mount option would be nice to have.
>>      
> Yes, except the file system developers don't want to take on the moral
> liability of system administrators using such a mount option
> incorrectly.

I understand

> The fsync waits for all data to be sent to disk.  It has to, since we
> can't easily, given the current disk protocols, distinguish between
> the 5 MB of I/O that pertains to file A, which is being fsync'ed, and
> the 20 MB of I/O pertaining to file B, which is going on in the
> background.

So it's a queue drain + cache flush, right?

> There is a way, for some newer disk drives, to do what's
> called a FUA (Force Unit Attention) ...
>    

I thought it was possible via the completion notifications from the disk.
AFAIK, if a disk is in NCQ mode it will return completion for a command
only when the write has really been delivered to the platters, while in
non-NCQ mode the disk immediately returns completion and caches the
write. Is this correct?

Oh, OK, but that's not the problem, I understand now: the problem is
that you want to see all 5 MB of data delivered to the platters, not
only one write command...
So the only way is a queue drain.

So if we want to see faster fsyncs, we have to reduce the nr_requests
of a disk, so that the request_queue is short, right?


There were ideas around for an API for dependencies among BIOs,
e.g. here:
https://lwn.net/Articles/399148/
This would solve the problem of needing a queue drain for an fsync,
right? Ext4 could make the last BIO of the file being synced depend
on all the other BIOs related to the same file, and then wait for the
NCQ completion notification for the last BIO. There wouldn't be a need
to drain the queue any more.
At that point it could even make sense to make all fsync-related I/O
jump to the head of the request_queue, so that fsyncs (hopefully
related to small amounts of data) could return quickly even when there
is a large file streaming or a copy in the background filling the whole
request_queue...
Does what I'm saying make sense?
I understand this feature would require major changes in Linux though...


Thank you for all these explanations;
they really help us ignorant ext4 users understand...



* Re: Severe slowdown caused by jbd2 process
  2011-01-23 18:43                           ` torn5
@ 2011-01-24 20:16                             ` Ted Ts'o
  0 siblings, 0 replies; 17+ messages in thread
From: Ted Ts'o @ 2011-01-24 20:16 UTC (permalink / raw)
  To: torn5; +Cc: Josef Bacik, Jon Leighton, linux-ext4

On Sun, Jan 23, 2011 at 07:43:10PM +0100, torn5 wrote:
> I thought it was possible via the completion notifications from the disk.
> AFAIK, if a disk is in NCQ mode it will return completion for a
> command only when the write has really been delivered to the platters,
> while in non-NCQ mode the disk immediately returns completion and
> caches the write. Is this correct?

No, that's not correct.  The completion notification from the disk is
merely that the DMA has completed.  It does not mean that the data has
hit the platters.  This is true in both NCQ and non-NCQ mode.

You can disable the write cache (which is what I think you're thinking
about), but the performance hit is pretty significant on standard
HDD's.
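
(On Linux that's usually done with hdparm; the device name is an
example:

hdparm -W0 /dev/sda    # turn the drive's write cache off
hdparm -W1 /dev/sda    # turn it back on

Expect random write throughput to drop sharply with the cache off.)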

						- Ted


* Re: Severe slowdown caused by jbd2 process
  2011-01-21 14:31           ` Josef Bacik
  2011-01-21 23:56             ` Ted Ts'o
@ 2011-01-24 20:41             ` Darrick J. Wong
  1 sibling, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2011-01-24 20:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Jon Leighton, linux-ext4

On Fri, Jan 21, 2011 at 09:31:45AM -0500, Josef Bacik wrote:
> On Fri, Jan 21, 2011 at 02:28:29PM +0000, Jon Leighton wrote:
> > Hi there,
> > 
> > On Fri, 2011-01-21 at 09:03 -0500, Josef Bacik wrote:
> > > Heh so now that I've had a moment to wake up, how about running your test on
> > > ext3, but mount the partition with
> > > 
> > > mount -o barrier
> > > 
> > > and see if it's still faster than ext4.  Thanks,
> > 
> > You're right, that slows it down significantly on the ext3 partition. Is
> > this expected behaviour with ext4 then?
> >
> 
> Yup, whatever you are doing in your webapp is making your database do lots of
> fsyncs, which is going to suck.  If you are on a battery-backed system, or just
> don't care if you lose your database and would rather it be faster, you can
> mount your ext4 fs with -o nobarrier.  Thanks,

If for some reason you can't change the database to be less fsync-happy and
there are a lot of threads issuing fsync in parallel, you might try building a
kernel with the patch set that Tejun Heo has put together
(https://lkml.org/lkml/2011/1/21/251) to merge flush requests coming from
multiple threads.  It might speed things up for you.

(I don't know how much that patch set will change between now and the time it
goes upstream...)

--D

