* brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
@ 2012-01-30  0:37 Marc MERLIN
  2012-02-01 17:56 ` Chris Mason
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-01-30  0:37 UTC (permalink / raw)
  To: linux-btrfs

Howdy,

I'm considering using brtfs for my new laptop install.

Encryption is however a requirement, and ecryptfs doesn't quite cut it for
me, so that leaves me with dmcrypt which is what I've been using with
ext3/ext4 for years.

https://btrfs.wiki.kernel.org/articles/g/o/t/Gotchas.html 
still states that 
'dm-crypt block devices require write-caching to be turned off on the
underlying HDD'
While the report was for 2.6.33, I'll assume it's still true.


I was considering migrating to a recent 256GB SSD and 3.2.x kernel.

First, I'd like to check if the 'turn off write cache' comment is still
accurate and if it does apply to SSDs too.

Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
and what the performance is like with write cache turned off (I'm actually
not too sure what the impact is for SSDs considering that writing to flash
can actually be slower than writing to a hard drive).

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-01-30  0:37 brtfs on top of dmcrypt with SSD. No corruption iff write cache off? Marc MERLIN
@ 2012-02-01 17:56 ` Chris Mason
  2012-02-02  3:23   ` Marc MERLIN
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Mason @ 2012-02-01 17:56 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Sun, Jan 29, 2012 at 04:37:54PM -0800, Marc MERLIN wrote:
> Howdy,
> 
> I'm considering using brtfs for my new laptop install.
> 
> Encryption is however a requirement, and ecryptfs doesn't quite cut it for
> me, so that leaves me with dmcrypt which is what I've been using with
> ext3/ext4 for years.
> 
> https://btrfs.wiki.kernel.org/articles/g/o/t/Gotchas.html 
> still states that 
> 'dm-crypt block devices require write-caching to be turned off on the
> underlying HDD'
> While the report was for 2.6.33, I'll assume it's still true.
> 
> 
> I was considering migrating to a recent 256GB SSD and 3.2.x kernel.
> 
> First, I'd like to check if the 'turn off write cache' comment is still
> accurate and if it does apply to SSDs too.
> 
> Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
> and what the performance is like with write cache turned off (I'm actually
> not too sure what the impact is for SSDs considering that writing to flash
> can actually be slower than writing to a hard drive).

Performance without the cache on is going to vary wildly from one SSD to
another.  Some really need it to give them nice fat writes while others
do better on smaller writes.  It's best to just test yours and see.
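For reference, a minimal sketch of how you might test that yourself
(assuming the SSD shows up as /dev/sda and the filesystem is mounted on
/mnt; both are placeholders):

  # show whether the drive's volatile write cache is currently on
  hdparm -W /dev/sda

  # turn the write cache off (hdparm -W1 /dev/sda turns it back on;
  # the setting does not survive a power cycle)
  hdparm -W0 /dev/sda

  # rough, flush-heavy write test to compare the two settings
  dd if=/dev/zero of=/mnt/testfile bs=1M count=1024 oflag=sync
  rm /mnt/testfile
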

With a 3.2 kernel (it really must be 3.2 or higher), both btrfs and dm
are doing the right thing for barriers.

-chris


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-01 17:56 ` Chris Mason
@ 2012-02-02  3:23   ` Marc MERLIN
  2012-02-02 12:42     ` Chris Mason
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-02-02  3:23 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

On Wed, Feb 01, 2012 at 12:56:24PM -0500, Chris Mason wrote:
> > Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
> > and what the performance is like with write cache turned off (I'm actually
> > not too sure what the impact is for SSDs considering that writing to flash
> > can actually be slower than writing to a hard drive).
> 
> Performance without the cache on is going to vary wildly from one SSD to
> another.  Some really need it to give them nice fat writes while others
> do better on smaller writes.  It's best to just test yours and see.
> 
> With a 3.2 kernel (it really must be 3.2 or higher), both btrfs and dm
> are doing the right thing for barriers.

Thanks for the answer.
Can you confirm that I still must disable write cache on the SSD to avoid
corruption with btrfs on top of dmcrypt, or is there a chance that it just
works now?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-02  3:23   ` Marc MERLIN
@ 2012-02-02 12:42     ` Chris Mason
       [not found]       ` <20120202152722.GI12429@merlins.org>
  0 siblings, 1 reply; 68+ messages in thread
From: Chris Mason @ 2012-02-02 12:42 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Wed, Feb 01, 2012 at 07:23:45PM -0800, Marc MERLIN wrote:
> On Wed, Feb 01, 2012 at 12:56:24PM -0500, Chris Mason wrote:
> > > Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
> > > and what the performance is like with write cache turned off (I'm actually
> > > not too sure what the impact is for SSDs considering that writing to flash
> > > can actually be slower than writing to a hard drive).
> > 
> > Performance without the cache on is going to vary wildly from one SSD to
> > another.  Some really need it to give them nice fat writes while others
> > do better on smaller writes.  It's best to just test yours and see.
> > 
> > With a 3.2 kernel (it really must be 3.2 or higher), both btrfs and dm
> > are doing the right thing for barriers.
> 
> Thanks for the answer.
> Can you confirm that I still must disable write cache on the SSD to avoid
> corruption with btrfs on top of dmcrypt, or is there a chance that it just
> works now?

No, with 3.2 or higher it is expected to work.  dm-crypt is doing the
barriers correctly and as of 3.2 btrfs is sending them down correctly.

-chris


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
       [not found]       ` <20120202152722.GI12429@merlins.org>
@ 2012-02-12 22:32         ` Marc MERLIN
  2012-02-12 23:47           ` Milan Broz
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-02-12 22:32 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs, mbroz; +Cc: jeff

On Thu, Feb 02, 2012 at 07:27:22AM -0800, Marc MERLIN wrote:
> On Thu, Feb 02, 2012 at 07:42:41AM -0500, Chris Mason wrote:
> > On Wed, Feb 01, 2012 at 07:23:45PM -0800, Marc MERLIN wrote:
> > > On Wed, Feb 01, 2012 at 12:56:24PM -0500, Chris Mason wrote:
> > > > > Second, I was wondering if anyone is running btrfs over dmcrypt on an SSD
> > > > > and what the performance is like with write cache turned off (I'm actually
> > > > > not too sure what the impact is for SSDs considering that writing to flash
> > > > > can actually be slower than writing to a hard drive).
> > > > 
> > > > Performance without the cache on is going to vary wildly from one SSD to
> > > > another.  Some really need it to give them nice fat writes while others
> > > > do better on smaller writes.  It's best to just test yours and see.
> > > > 
> > > > With a 3.2 kernel (it really must be 3.2 or higher), both btrfs and dm
> > > > are doing the right thing for barriers.
> > > 
> > > Thanks for the answer.
> > > Can you confirm that I still must disable write cache on the SSD to avoid
> > > corruption with btrfs on top of dmcrypt, or is there a chance that it just
> > > works now?
> > 
> > No, with 3.2 or higher it is expected to work.  dm-crypt is doing the
> > barriers correctly and as of 3.2 btrfs is sending them down correctly.
> 
> Thanks for confirming, I'll give this a shot.
> (no warranty implied of course :) ).

Actually I had one more question.

I read this page:
http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html

I'm not super clear if with 3.2.5 kernel, I need to pass the special
allow_discards option for brtfs and dm-crypt to be safe together, or whether
they now talk through an API and everything "just works" :)
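
(As far as I understand it, you can check what an already-open dm-crypt
mapping is doing by looking at its device-mapper table; the mapping name
below is only an example:

  # when discards are passed through, the crypt target line ends with
  # the optional parameters "1 allow_discards"
  dmsetup table cryptroot

and a recent cryptsetup should also report a "discards" flag in
"cryptsetup status cryptroot" when the device was opened with
--allow-discards.)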

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-12 22:32         ` Marc MERLIN
@ 2012-02-12 23:47           ` Milan Broz
  2012-02-13  0:14             ` Marc MERLIN
  2012-02-18 16:07             ` brtfs on top of dmcrypt with SSD. No corruption iff write cache off? Marc MERLIN
  0 siblings, 2 replies; 68+ messages in thread
From: Milan Broz @ 2012-02-12 23:47 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Chris Mason, linux-btrfs, jeff

On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> Actually I had one more question.
>
> I read this page:
> http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
>
> I'm not super clear if with 3.2.5 kernel, I need to pass the special
> allow_discards option for brtfs and dm-crypt to be safe together, or whether
> they now talk through an API and everything "just works" :)

If you want discards to be supported in dmcrypt, you have to enable it manually.

The most comfortable way is just use recent cryptsetup and add
--allow-discards option to luksOpen command.

It will be never enabled by default in dmcrypt for security reasons
http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html
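
A minimal sketch of the luksOpen variant mentioned above (the partition and
mapping name are only examples):

  # open the LUKS container with discard/TRIM requests passed through
  cryptsetup luksOpen --allow-discards /dev/sda2 cryptroot

For mappings set up without cryptsetup, the same thing is expressed as the
allow_discards optional parameter on the dm-crypt target line.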

Milan

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-12 23:47           ` Milan Broz
@ 2012-02-13  0:14             ` Marc MERLIN
  2012-02-15 15:42               ` Calvin Walton
                                 ` (4 more replies)
  2012-02-18 16:07             ` brtfs on top of dmcrypt with SSD. No corruption iff write cache off? Marc MERLIN
  1 sibling, 5 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-02-13  0:14 UTC (permalink / raw)
  To: Milan Broz; +Cc: Chris Mason, linux-btrfs, jeff

On Mon, Feb 13, 2012 at 12:47:54AM +0100, Milan Broz wrote:
> On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> >Actually I had one more question.
> >
> >I read this page:
> >http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
> >
> >I'm not super clear if with 3.2.5 kernel, I need to pass the special
> >allow_discards option for brtfs and dm-crypt to be safe together, or 
> >whether
> >they now talk through an API and everything "just works" :)
> 
> If you want discards to be supported in dmcrypt, you have to enable it 
> manually.
> 
> The most comfortable way is just use recent cryptsetup and add
> --allow-discards option to luksOpen command.
> 
> It will be never enabled by default in dmcrypt for security reasons
> http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html

Thanks for the answer.
I knew that it created some security problems but I had not yet found the
page you just gave, which effectively states that TRIM isn't actually that
big a win on recent SSDs (I thought it was actually pretty important to use
it on them until now).

Considering that I have a fairly new Crucial 256GB SSD, I'm going to assume
that this bit applies to me:
"On the other side, TRIM is usually overrated. Drive itself should keep good
performance even without TRIM, either by using internal garbage collecting
process or by some area reserved for optimal writes handling."

So it sounds like I should just not give the "ssd" mount option to btrfs,
and not worry about TRIM. 
My main concern was to make sure I wasn't risking corruption with btrfs on
top of dm-crypt, which is now addressed with 3.2.x, and I now understand that
this holds regardless of whether I use --allow-discards in cryptsetup, so I'm
just not going to use it given the warning page you just posted.

Thanks for the answer.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-13  0:14             ` Marc MERLIN
@ 2012-02-15 15:42               ` Calvin Walton
  2012-02-15 16:55                 ` Marc MERLIN
  2012-02-16  6:33               ` Chris Samuel
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 68+ messages in thread
From: Calvin Walton @ 2012-02-15 15:42 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Milan Broz, Chris Mason, linux-btrfs, jeff

On Sun, 2012-02-12 at 16:14 -0800, Marc MERLIN wrote:
> Considering that I have a fairly new Crucial 256GB SSD, I'm going to assume
> that this bit applies to me:
> "On the other side, TRIM is usually overrated. Drive itself should keep good
> performance even without TRIM, either by using internal garbage collecting
> process or by some area reserved for optimal writes handling."
> 
> So it sounds like I should just not give the "ssd" mount option to btrfs,
> and not worry about TRIM. 

The 'ssd' option on btrfs is actually completely unrelated to trim
support. Instead, it changes how blocks are allocated on the device,
taking advantage of the improved random read/write speed. The 'ssd'
option should be autodetected on most SSDs, but I don't know if this is
handled correctly when you're using dm-crypt. (Btrfs prints a message at
mount time when it autodetects this.) It shouldn't hurt to leave it.

Discard is handled with a separate mount option on btrfs (called
'discard'), and is disabled by default even if you have the 'ssd' option
enabled, because of the negative performance impact it has had on some
SSDs.
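
A quick sketch of how the two options look in practice (device and mount
point are just examples):

  # ssd only tunes the allocator; it is normally autodetected
  mount -o ssd /dev/mapper/cryptroot /mnt

  # discard additionally sends TRIM as space is freed; off by default
  mount -o ssd,discard /dev/mapper/cryptroot /mnt

or the equivalent options in /etc/fstab.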

-- 
Calvin Walton <calvin.walton@kepstin.ca>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-15 15:42               ` Calvin Walton
@ 2012-02-15 16:55                 ` Marc MERLIN
  2012-02-15 16:59                   ` Hugo Mills
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-02-15 16:55 UTC (permalink / raw)
  To: Calvin Walton; +Cc: Milan Broz, Chris Mason, linux-btrfs, jeff

On Wed, Feb 15, 2012 at 10:42:43AM -0500, Calvin Walton wrote:
> On Sun, 2012-02-12 at 16:14 -0800, Marc MERLIN wrote:
> > Considering that I have a fairly new Crucial 256GB SSD, I'm going to assume
> > that this bit applies to me:
> > "On the other side, TRIM is usually overrated. Drive itself should keep good
> > performance even without TRIM, either by using internal garbage collecting
> > process or by some area reserved for optimal writes handling."
> > 
> > So it sounds like I should just not give the "ssd" mount option to btrfs,
> > and not worry about TRIM. 
> 
> The 'ssd' option on btrfs is actually completely unrelated to trim
> support. Instead, it changes how blocks are allocated on the device,
> taking advantage of the improved random read/write speed. The 'ssd'
> option should be autodetected on most SSDs, but I don't know if this is
> handled correctly when you're using dm-crypt. (Btrfs prints a message at
> mount time when it autodetects this.) It shouldn't hurt to leave it.
 
Yes, I found out more after I got my laptop back up (I had limited search
while I was rebuilding it). Thanks for clearing up my improper guess at the
time :)

The good news is that ssd mode is autodetected through dmcrypt:
[   23.130486] device label btrfs_pool1 devid 1 transid 732 /dev/mapper/cryptroot
[   23.130854] btrfs: disk space caching is enabled
[   23.175547] Btrfs detected SSD devices, enabling SSD mode

> Discard is handled with a separate mount option on btrfs (called
> 'discard'), and is disabled by default even if you have the 'ssd' option
> enabled, because of the negative performance impact it has had on some
> SSDs.

That's what I read up on later. It's a bit counter-intuitive, after all the
work that went into TRIM, to then figure out that there are actually more
reasons not to bother with it than to use it :)
On the plus side, it means SSDs are getting better and don't need special
code that makes data recovery harder should you ever need it.

I tried updating the wiki pages, because:
https://btrfs.wiki.kernel.org/articles/f/a/q/FAQ_1fe9.html
says nothing about
- trim/discard
- dmcrypt

while
https://btrfs.wiki.kernel.org/articles/g/o/t/Gotchas.html
still states 'btrfs volumes on top of dm-crypt block devices (and possibly
LVM) require write-caching to be turned off on the underlying HDD. Failing
to do so, in the event of a power failure, may result in corruption not yet
handled by btrfs code. (2.6.33) '

I'm happy to fix both pages, but the login link of course doesn't work and
I'm not sure where the canonical copy to edit actually is or if I can get
access.

That said if someone else can fix it too, that's great :)

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-15 16:55                 ` Marc MERLIN
@ 2012-02-15 16:59                   ` Hugo Mills
  2012-02-22 10:28                     ` Justin Ossevoort
  0 siblings, 1 reply; 68+ messages in thread
From: Hugo Mills @ 2012-02-15 16:59 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Calvin Walton, Milan Broz, Chris Mason, linux-btrfs, jeff

[-- Attachment #1: Type: text/plain, Size: 903 bytes --]

On Wed, Feb 15, 2012 at 08:55:40AM -0800, Marc MERLIN wrote:
> https://btrfs.wiki.kernel.org/articles/g/o/t/Gotchas.html
> still states 'btrfs volumes on top of dm-crypt block devices (and possibly
> LVM) require write-caching to be turned off on the underlying HDD. Failing
> to do so, in the event of a power failure, may result in corruption not yet
> handled by btrfs code. (2.6.33) '
> 
> I'm happy to fix both pages, but the login link of course doesn't work and
> I'm not sure where the canonical copy to edit actually is or if I can get
> access.

   btrfs.ipv5.de

   It's a "temporary" location (6 months and counting) while
kernel.org is static-pages-only.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- The English language has the mot juste for every occasion. ---    

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-13  0:14             ` Marc MERLIN
  2012-02-15 15:42               ` Calvin Walton
@ 2012-02-16  6:33               ` Chris Samuel
  2012-02-18 12:33               ` Martin Steigerwald
                                 ` (2 subsequent siblings)
  4 siblings, 0 replies; 68+ messages in thread
From: Chris Samuel @ 2012-02-16  6:33 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: Text/Plain, Size: 822 bytes --]

On Monday 13 February 2012 11:14:00 Marc MERLIN wrote:

> I knew that it created some security problems but I had not yet
> found the page you just gave, which effectively states that TRIM
> isn't actually that big a win on recent SSDs (I thought it was
> actually pretty important to use it on them until now).

In addition, TRIM has always been problematic for SATA SSDs as it was 
initially specified as a non-queueable command:

https://lwn.net/Articles/347511/

However, it appears that a queueable TRIM command has been introduced 
for SATA 3.1:

http://techreport.com/discussions.x/21311

cheers!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

This email may come with a PGP signature as a file. Do not panic.
For more info see: http://en.wikipedia.org/wiki/OpenPGP

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 482 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-13  0:14             ` Marc MERLIN
  2012-02-15 15:42               ` Calvin Walton
  2012-02-16  6:33               ` Chris Samuel
@ 2012-02-18 12:33               ` Martin Steigerwald
  2012-02-18 12:39               ` Martin Steigerwald
  2012-07-18 18:13               ` brtfs on top of dmcrypt with SSD -> Trim or no Trim Marc MERLIN
  4 siblings, 0 replies; 68+ messages in thread
From: Martin Steigerwald @ 2012-02-18 12:33 UTC (permalink / raw)
  To: linux-btrfs

On Monday, 13 February 2012, Marc MERLIN wrote:
> On Mon, Feb 13, 2012 at 12:47:54AM +0100, Milan Broz wrote:
> > On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> > >Actually I had one more question.
> > >
> > >I read this page:
> > >http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
> > >
> > >I'm not super clear if with 3.2.5 kernel, I need to pass the special
> > >allow_discards option for brtfs and dm-crypt to be safe together, or
> > >whether
> > >they now talk through an API and everything "just works" :)
> > 
> > If you want discards to be supported in dmcrypt, you have to enable
> > it manually.
> > 
> > The most comfortable way is just use recent cryptsetup and add
> > --allow-discards option to luksOpen command.
> > 
> > It will be never enabled by default in dmcrypt for security reasons
> > http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html
> 
> Thanks for the answer.
> I knew that it created some security problems but I had not yet found
> the page you just gave, which effectively states that TRIM isn't
> actually that big a win on recent SSDs (I thought it was actually
> pretty important to use it on them until now).

Well I find

"On the other side, TRIM is usually overrated. Drive itself should keep
good performance even without TRIM, either by using internal garbage
collecting process or by some area reserved for optimal writes handling."

a very, very weak statement on the matter, cause it lacks any links to any
evidence for the statement made. What I knew up to now is that wear
leveling of the SSD can be more effective the more space the SSD
controller/firmware can use freely for wear leveling. Thus when I give
space back to the SSD via fstrim it has more space for wear leveling which
should lead to more evenly distributed write patterns and more evenly
distributed write accesses to flash cells and thus a longer life time. I do
not see any other means by which the SSD can learn that data has been
freed again, except for it being overwritten; then the old write location
can be freed of course. But then BTRFS does COW and thus, when I understand
this correctly, the SSD wouldn't even be told when a file is overwritten,
cause actually it isn't, but is written to a new location. Thus especially
for BTRFS I see even more reasons to use fstrim.

I have no scientific backing either, but at least I tried to think of an
explanation that appears logical to me instead of just saying it is so
without providing any explanation at all. Yes, I dislike bold statements
without any backing at all. (If I overread something in the text, please
hint me to it, but I did not see any explanation or link to support the
statement.)

I use ecryptfs and I use fstrim occasionally with BTRFS and Ext4, but I
do not use online discard.
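
For reference, batched discard is just (mount point as an example; the
filesystem and kernel need FITRIM support):

  # trim all unused space of the filesystem mounted at /
  fstrim -v /

run by hand or from an occasional cron job, instead of mounting with
-o discard.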

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-13  0:14             ` Marc MERLIN
                                 ` (2 preceding siblings ...)
  2012-02-18 12:33               ` Martin Steigerwald
@ 2012-02-18 12:39               ` Martin Steigerwald
  2012-02-18 12:49                 ` Martin Steigerwald
  2012-07-18 18:13               ` brtfs on top of dmcrypt with SSD -> Trim or no Trim Marc MERLIN
  4 siblings, 1 reply; 68+ messages in thread
From: Martin Steigerwald @ 2012-02-18 12:39 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc MERLIN, Milan Broz, Chris Mason, jeff

Sorry, I forgot to keep Cc's, different lists, different habits.

To Milan Broz: Well now I noticed that you linked to your own blog entry.
Please do not take my below statements personally - I might have written
them a bit harshly. Actually I do not really know whether your statement
that TRIM is overrated is correct, but before believing that TRIM does not
give much advantage, I would like to see at least some evidence of any
sort, cause for me my explanation below that it should make a difference
at least seems logical to me.

On Monday, 13 February 2012, Marc MERLIN wrote:
> On Mon, Feb 13, 2012 at 12:47:54AM +0100, Milan Broz wrote:
> > On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> > >Actually I had one more question.
> > >
> > >I read this page:
> > >http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
> > >
> > >I'm not super clear if with 3.2.5 kernel, I need to pass the special
> > >allow_discards option for brtfs and dm-crypt to be safe together, or
> > >whether
> > >they now talk through an API and everything "just works" :)
> > 
> > If you want discards to be supported in dmcrypt, you have to enable
> > it manually.
> > 
> > The most comfortable way is just use recent cryptsetup and add
> > --allow-discards option to luksOpen command.
> > 
> > It will be never enabled by default in dmcrypt for security reasons
> > http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html
> 
> Thanks for the answer.
> I knew that it created some security problems but I had not yet found
> the page you just gave, which effectively states that TRIM isn't
> actually that big a win on recent SSDs (I thought it was actually
> pretty important to use it on them until now).

Well I find

"On the other side, TRIM is usually overrated. Drive itself should keep
good performance even without TRIM, either by using internal garbage
collecting process or by some area reserved for optimal writes handling."

a very, very weak statement on the matter, cause it lacks any links to any
evidence for the statement made. What I knew up to now is that wear
leveling of the SSD can be more effective the more space the SSD
controller/firmware can use freely for wear leveling. Thus when I give
space back to the SSD via fstrim it has more space for wear leveling which
should lead to more evenly distributed write patterns and more evenly
distributed write accesses to flash cells and thus a longer life time. I do
not see any other means by which the SSD can learn that data has been
freed again, except for it being overwritten; then the old write location
can be freed of course. But then BTRFS does COW and thus, when I understand
this correctly, the SSD wouldn't even be told when a file is overwritten,
cause actually it isn't, but is written to a new location. Thus especially
for BTRFS I see even more reasons to use fstrim.

I have no scientific backing either, but at least I tried to think of an
explanation that appears logical to me instead of just saying it is so
without providing any explanation at all. Yes, I dislike bold statements
without any backing at all. (If I overread something in the text, please
hint me to it, but I did not see any explanation or link to support the
statement.)

I use ecryptfs and I use fstrim occasionally with BTRFS and Ext4, but I
do not use online discard.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-18 12:39               ` Martin Steigerwald
@ 2012-02-18 12:49                 ` Martin Steigerwald
  0 siblings, 0 replies; 68+ messages in thread
From: Martin Steigerwald @ 2012-02-18 12:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc MERLIN, Milan Broz, Chris Mason, jeff

On Saturday, 18 February 2012, Martin Steigerwald wrote:
[…]
> On Monday, 13 February 2012, Marc MERLIN wrote:
> > On Mon, Feb 13, 2012 at 12:47:54AM +0100, Milan Broz wrote:
> > > On 02/12/2012 11:32 PM, Marc MERLIN wrote:
> > > >Actually I had one more question.
> > > >
> > > >I read this page:
> > > >http://www.redhat.com/archives/dm-devel/2011-July/msg00042.html
> > > >
> > > >I'm not super clear if with 3.2.5 kernel, I need to pass the
> > > >special allow_discards option for brtfs and dm-crypt to be safe
> > > >together, or whether
> > > >they now talk through an API and everything "just works" :)
> > > 
> > > If you want discards to be supported in dmcrypt, you have to enable
> > > it manually.
> > > 
> > > The most comfortable way is just use recent cryptsetup and add
> > > --allow-discards option to luksOpen command.
> > > 
> > > It will be never enabled by default in dmcrypt for security reasons
> > > http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html
> > 
> > Thanks for the answer.
> > I knew that it created some security problems but I had not yet found
> > the page you just gave, which effectively states that TRIM isn't
> > actually that big a win on recent SSDs (I thought it was actually
> > pretty important to use it on them until now).
> 
> Well I find
> 
> "On the other side, TRIM is usually overrated. Drive itself should keep
> good performance even without TRIM, either by using internal garbage
> collecting process or by some area reserved for optimal writes handling."
> 
> a very, very weak statement on the matter, cause it lacks any links to any
> evidence for the statement made. What I knew up to now is that wear
> leveling of the SSD can be more effective the more space the SSD
> controller/firmware can use freely for wear leveling. Thus when I give
> space back to the SSD via fstrim it has more space for wear leveling which
> should lead to more evenly distributed write patterns and more evenly
> distributed write accesses to flash cells and thus a longer life time. I do
> not see any other means by which the SSD can learn that data has been
> freed again, except for it being overwritten; then the old write location
> can be freed of course. But then BTRFS does COW and thus, when I understand
> this correctly, the SSD wouldn't even be told when a file is overwritten,
> cause actually it isn't, but is written to a new location. Thus especially
> for BTRFS I see even more reasons to use fstrim.

Hmmm, even with COW old data is overwritten eventually, cause especially
on SSDs free space might be sparse. So even though overwriting a file or
some parts of it writes to a new location, at some time in the future
BTRFS will likely overwrite the old one.

So seen in that light, an occasional fstrim might just help to make free
space available sooner.

Heck, I find it so difficult to know what's best when SSD / flash is
involved.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-12 23:47           ` Milan Broz
  2012-02-13  0:14             ` Marc MERLIN
@ 2012-02-18 16:07             ` Marc MERLIN
  2012-02-19  0:53               ` Clemens Eisserer
  1 sibling, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-02-18 16:07 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Milan Broz, Chris Mason, jeff

On Sat, Feb 18, 2012 at 01:39:24PM +0100, Martin Steigerwald wrote:
> To Milan Broz: Well now I noticed that you linked to your own blog entry. 

He did not, I'm the one who did.
I asked a bunch of questions since the online docs didn't address them for
me. Some of you answered those for me, I asked access to the wiki and I
updated the wiki to have the information you gave me.

While I have no inherent bias one way or another, obviously I did put some
of your opinions on the wiki :)

> Please do not take my below statements personally - I might have written 
> them a bit harshly. Actually I do not really know whether your statement 
> that TRIM is overrated is correct, but before believing that TRIM does not 
> give much advantage, I would like to see at least some evidence of any 
> sort, cause for me my explaination below that it should make a difference 
> at least seems logical to me.

That sounds like a reasonable request to me.

In the meantime I changed the page to 
'Does Btrfs support TRIM/discard?

"-o discard" is supported, but can have some negative consequences on
performance on some SSDs or at least whether it adds worthwhile performance
is up for debate depending on who you ask, and makes undeletion/recovery
near impossible while being a security problem if you use dm-crypt
underneath (see
http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html ), therefore
it is not enabled by default. You are welcome to run your own benchmarks and
post them here, with the caveat that they'll be very SSD firmware specific.'

I'll leave further edits to others ;)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-18 16:07             ` brtfs on top of dmcrypt with SSD. No corruption iff write cache off? Marc MERLIN
@ 2012-02-19  0:53               ` Clemens Eisserer
  0 siblings, 0 replies; 68+ messages in thread
From: Clemens Eisserer @ 2012-02-19  0:53 UTC (permalink / raw)
  To: linux-btrfs

Hi,

> "-o discard" is supported, but can have some negative consequences on
> performance on some SSDs or at least whether it adds worthwhile performance
> is up for debate ....

The problem is that TRIM can't be benchmarked in a traditional way -
by running one run with TRIM and another one without and later
comparing numbers.
Sure, enabling TRIM causes your benchmarks to run a bit slower in the
first place (as the TRIM commands need to be generated and interpreted
by the SSD, which is not free) - however without TRIM the SSD won't
have any clue what data is "dead" and what needs to be rescued until
you overwrite it.

On my SandForce SSD I had disabled TRIM (because those drives are
known for their good recovery without TRIM), and for now the SSD is in
the following state:

dd oflag=sync if=/some_large_file_in_page_cache of=testfile bs=4M
567558236 bytes (568 MB) copied, 23.7798 s, 23.9 MB/s

233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       3968
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       1280


So that drive is writing at 24 MB/s and has a write amplification of
about 3.1 (3968 GiB written internally vs. 1280 GiB written by the host) -
usually you see values between one and two.
Completely "normal" desktop usage.

- Clemens

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-15 16:59                   ` Hugo Mills
@ 2012-02-22 10:28                     ` Justin Ossevoort
  2012-02-22 11:07                       ` Hugo Mills
  0 siblings, 1 reply; 68+ messages in thread
From: Justin Ossevoort @ 2012-02-22 10:28 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

On 15/02/12 17:59, Hugo Mills wrote:
>    It's a "temporary" location (6 months and counting) while
> kernel.org is static-pages-only.
> 
>    Hugo.
> 

Couldn't btrfs.wiki.kernel.org be made a CNAME for btrfs.ipv5.de for the
time being until kernel.org is up and running?
Or alternatively all requests could be
proxied/frame-forwarded/redirected (in order of preference, to make sure
everybody keeps using the original url as much as possible) to
btrfs.ipv5.de?

Kernel.org is a complex site and I can easily foresee them having to
spend months still fixing/hardening all of its internals and backends
before getting around to these "extra" project facilities.

Regards,

    justin....

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD. No corruption iff write cache off?
  2012-02-22 10:28                     ` Justin Ossevoort
@ 2012-02-22 11:07                       ` Hugo Mills
  0 siblings, 0 replies; 68+ messages in thread
From: Hugo Mills @ 2012-02-22 11:07 UTC (permalink / raw)
  To: Justin Ossevoort; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1394 bytes --]

On Wed, Feb 22, 2012 at 11:28:13AM +0100, Justin Ossevoort wrote:
> On 15/02/12 17:59, Hugo Mills wrote:
> >    It's a "temporary" location (6 months and counting) while
> > kernel.org is static-pages-only.
> > 
> >    Hugo.
> > 
> 
> Couldn't btrfs.wiki.kernel.org be made a CNAME for btrfs.ipv5.de for the
> time being until kernel.org is up and running?
> Or alternatively all requests could be
> proxied/frame-forwarded/redirected (in order of preference, to make sure
> everybody keeps using the original url as much as possible) to
> btrfs.ipv5.de?

   That would require additional effort on the part of kernel.org,
which probably won't be forthcoming. I don't really have any authority
to ask kernel.org people to do anything anyway -- it'll probably have
to come from Chris.

> Kernel.org is a complex site and I can easily foresee them having to
> spend months still fixing/hardening all of its internals and backends
> before getting around to these "extra" project facilities.

   Indeed. It's not clear whether we'll get the kernel.org wiki back
in writable form at all.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
              --- Python is executable pseudocode; perl ---              
                        is executable line-noise.                        

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
  2012-02-13  0:14             ` Marc MERLIN
                                 ` (3 preceding siblings ...)
  2012-02-18 12:39               ` Martin Steigerwald
@ 2012-07-18 18:13               ` Marc MERLIN
  2012-07-18 20:04                 ` Fajar A. Nugraha
  2012-07-18 21:49                 ` Martin Steigerwald
  4 siblings, 2 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-07-18 18:13 UTC (permalink / raw)
  To: linux-btrfs, Chris Mason, mbroz, Calvin Walton, Martin Steigerwald; +Cc: jeff


On Sat, Feb 18, 2012 at 08:07:02AM -0800, Marc MERLIN wrote:
> On Sat, Feb 18, 2012 at 01:39:24PM +0100, Martin Steigerwald wrote:
> > To Milan Broz: Well now I noticed that you linked to your own blog entry. 
> 
> He did not, I'm the one who did.
> I asked a bunch of questions since the online docs didn't address them for
> me. Some of you answered those for me, I asked access to the wiki and I
> updated the wiki to have the information you gave me.
> 
> While I have no inherent bias one way or another, obviously I did put some
> of your opinions on the wiki :)
> 
> > Please do not take my below statements personally - I might have written 
> > them a bit harshly. Actually I do not really know whether your statement 
> > that TRIM is overrated is correct, but before believing that TRIM does not 
> > give much advantage, I would like to see at least some evidence of any 
> > sort, cause for me my explaination below that it should make a difference 
> > at least seems logical to me.
> 
> That sounds like a reasonable request to me.
> 
> In the meantime I changed the page to 
> 'Does Btrfs support TRIM/discard?
> 
> "-o discard" is supported, but can have some negative consequences on
> performance on some SSDs or at least whether it adds worthwhile performance
> is up for debate depending on who you ask, and makes undeletion/recovery
> near impossible while being a security problem if you use dm-crypt
> underneath (see
> http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html ), therefore
> it is not enabled by default. You are welcome to run your own benchmarks and
> post them here, with the caveat that they'll be very SSD firmware specific.'
> 
> I'll leave further edits to others ;)

TL;DR:
I'm going to change the FAQ to say people should use TRIM with dmcrypt
because not doing so definitely causes some lesser SSDs to suck, or 
possibly even fail and lose our data.


Longer version:
Ok, so several months later I can report back with useful info.

Not using TRIM on my Crucial RealSSD C300 256GB is most likely what caused
its garbage collection algorithm to fail (killing the drive and all its
data), and it was also causing BRTFS to hang badly when I was getting
within 10GB of the drive getting full.

I reported some problems I had with btrfs being very slow and hanging when I
only had 10GB free, and I'm now convinced that it was the SSD that was at
fault.

On the Crucial RealSSD C300 256GB, and from talking to their tech support
and other folks who happened to have gotten that 'drive' at work and also
got weird unexplained failures, I'm convinced that even its latest 007
firmware (the firmware it shipped with would just hang the system for a few
seconds every so often so I did upgrade to 007 early on), the drive does
very poorly without TRIM when it's getting close to full.
In my case, I hit a bug that is likely firmware and caused the drive to die
(not show up on the bus anymore). I went through special reset, power on but
don't talk to it procedures that eventually brought the drive back after 4
hours of trying, except the drive is now 'blank', all the partitions and
data are gone as far as the controller is concerned.
(yes, I had backups, thanks for asking :) ).

From talking to their techs and other folks, it seems clear that TRIM is
greatly encouraged, and I'm pretty sure that had I used TRIM, I would not
have hit the problems that caused my drive to fail and suck so much when it
was getting full.

Now, I still think that the Crucial RealSSD C300 256GB has poor firmware
compared to some of its competition and there is no excuse for it just dying 
like that, but I'm going to change the FAQ to recommend that people do use
TRIM if they can, even if they use dm-crypt (assuming their kernel is at
least 3.2.0).

Any objections and/or comments?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
  2012-07-18 18:13               ` brtfs on top of dmcrypt with SSD -> Trim or no Trim Marc MERLIN
@ 2012-07-18 20:04                 ` Fajar A. Nugraha
  2012-07-18 20:37                   ` Marc MERLIN
  2012-07-18 21:34                   ` Clemens Eisserer
  2012-07-18 21:49                 ` Martin Steigerwald
  1 sibling, 2 replies; 68+ messages in thread
From: Fajar A. Nugraha @ 2012-07-18 20:04 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Thu, Jul 19, 2012 at 1:13 AM, Marc MERLIN <marc@merlins.org> wrote:
> TL;DR:
> I'm going to change the FAQ to say people should use TRIM with dmcrypt
> because not doing so definitely causes some lesser SSDs to suck, or
> possibly even fail and lose our data.
>
>
> Longer version:
> Ok, so several months later I can report back with useful info.
>
> Not using TRIM on my Crucial RealSSD C300 256GB is most likely what caused
> its garbage collection algorithm to fail (killing the drive and all its
> data), and it was also causing BRTFS to hang badly when I was getting
> within 10GB of the drive getting full.
>
> I reported some problems I had with btrfs being very slow and hanging when I
> only had 10GB free, and I'm now convinced that it was the SSD that was at
> fault.
>
> On the Crucial RealSSD C300 256GB, and from talking to their tech support
> and other folks who happened to have gotten that 'drive' at work and also
> got weird unexplained failures, I'm convinced that even its latest 007
> firmware (the firmware it shipped with would just hang the system for a few
> seconds every so often so I did upgrade to 007 early on), the drive does
> very poorly without TRIM when it's getting close to full.


If you're going to edit the wiki, I'd suggest you say "SOME SSDs might
need to use TRIM with dmcrypt". That's because some SSD controllers
(e.g. SandForce) perform just fine without TRIM, and in my case TRIM
made performance worse.

-- 
Fajar

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
  2012-07-18 20:04                 ` Fajar A. Nugraha
@ 2012-07-18 20:37                   ` Marc MERLIN
  2012-07-18 21:34                   ` Clemens Eisserer
  1 sibling, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-07-18 20:37 UTC (permalink / raw)
  To: Fajar A. Nugraha; +Cc: linux-btrfs

On Thu, Jul 19, 2012 at 03:04:29AM +0700, Fajar A. Nugraha wrote:
> > On the Crucial RealSSD C300 256GB, and from talking to their tech support
> > and other folks who happened to have gotten that 'drive' at work and also
> > got weird unexplained failures, I'm convinced that even its latest 007
> > firmware (the firmware it shipped with would just hang the system for a few
> > seconds every so often so I did upgrade to 007 early on), the drive does
> > very poorly without TRIM when it's getting close to full.
> 
> If you're going to edit the wiki, I'd suggest you say "SOME SSDs might
> need to use TRIM with dmcrypt". That's because some SSD controllers
> (e.g. SandForce) perform just fine without TRIM, and in my case TRIM
> made performance worse.

Right. I'm definitely planning on writing that it is controller dependent.

By the way, I read that SF performs worse with dmcrypt because:
http://superuser.com/questions/250626/sandforce-ssd-encryption-security-and-support
'Using software full-disk-encryption on an SSD seriously impacts the
performance of the drive - up to a point where it can be slower than a
regular hard disk.'
http://c089.wordpress.com/2011/02/28/encrypting-solid-state-disks/
http://www.madshrimps.be/articles/article/965#axzz1FGfzD15q

So, really, it seems that one should not buy a drive with a SandForce
controller if you plan to use dmcrypt.
I haven't found clear data to say which controllers other than sandforce
have the same problem.

https://docs.google.com/a/google.com/viewer?a=v&q=cache:eyoe3gL10qcJ:www.samsung.com/global/business/semiconductor/file/product/ssd/SamsungSSD_Encryption_Benchmarks_201011.pdf+&hl=en&gl=us&pid=bl&srcid=ADGEEShHEj1jkxdwa7JREmubW6iAyc2RGqQt8MoGxMgQrRNDifoGqTYWjYdaUypnaRtkjKrjiOt1JCr4dDt4ycD2rjWO51PtAo67JvnZGe6Gx1s-9yjkRVYZWsUHZApkjm7dWVzkdQg6&sig=AHIEtbRaViqG4cYC_RYajAW04JRjNhQq6g
seems to imply that software encryption on Samsung drives (which is what I
just got as a replacement for my Crucial) does not have the same penalty
as you have with the SF controller.

But back to the point about "yes, it depends".

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
  2012-07-18 20:04                 ` Fajar A. Nugraha
  2012-07-18 20:37                   ` Marc MERLIN
@ 2012-07-18 21:34                   ` Clemens Eisserer
  2012-07-18 21:48                     ` Marc MERLIN
  1 sibling, 1 reply; 68+ messages in thread
From: Clemens Eisserer @ 2012-07-18 21:34 UTC (permalink / raw)
  To: linux-btrfs

Hi,

> Not using TRIM on my Crucial RealSSD C300 256GB is most likely what caused
> its garbage collection algorithm to fail (killing the drive and all its
> data), and it was also causing BRTFS to hang badly when I was getting
> within 10GB of the drive getting full.

Funny thing, as without TRIM the drive does not even know how "full" it is ;)

- Clemens

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
  2012-07-18 21:34                   ` Clemens Eisserer
@ 2012-07-18 21:48                     ` Marc MERLIN
  0 siblings, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-07-18 21:48 UTC (permalink / raw)
  To: Clemens Eisserer; +Cc: linux-btrfs

On Wed, Jul 18, 2012 at 11:34:32PM +0200, Clemens Eisserer wrote:
> Hi,
> 
> > Not using TRIM on my Crucial RealSSD C300 256GB is most likely what caused
> > its garbage collection algorithm to fail (killing the drive and all its
> > data), and it was also causing BRTFS to hang badly when I was getting
> > within 10GB of the drive getting full.
> 
> Funny thing, as without TRIM the drive does not even know how "full" it is ;)

That's very true. To be honest, I don't fully know what happened to that
drive, but I actually heard the following from Crucial tech support twice:
"without TRIM, the drive needs to do garbage collection (correct), and it
only happens when the drive has been fully idle for about 15mn (oh boy,
why?). We recommend you let your drive sit at the BIOS or just with SATA
power from time to time so that it can do garbage collection"

Then, there is also this gem:
http://forums.crucial.com/t5/Solid-State-Drives-SSD/Bios-is-not-recognize-crucail-m4-SSD-64GB/td-p/63835  
If the drive cannot garbage collect, it will shut off and appear dead.
When that happens, it needs to be powered on for 20mn with no data, turned
off, and on again for 20mn, and off.
Then, it's supposed to come back and be usable again.
(well, mine not so much. It came back eventually, but empty).

This is not a btrfs problem per se, but since we had a discussion on TRIM
and SSDs and what to do with dm-crypt, I thought I'd report back since it's
actually not a simple situation at all, and very SSD dependent after all.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
  2012-07-18 18:13               ` brtfs on top of dmcrypt with SSD -> Trim or no Trim Marc MERLIN
  2012-07-18 20:04                 ` Fajar A. Nugraha
@ 2012-07-18 21:49                 ` Martin Steigerwald
  2012-07-18 22:04                   ` Marc MERLIN
  1 sibling, 1 reply; 68+ messages in thread
From: Martin Steigerwald @ 2012-07-18 21:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc MERLIN, Chris Mason, mbroz, Calvin Walton, jeff

On Wednesday, 18 July 2012, Marc MERLIN wrote:
> On Sat, Feb 18, 2012 at 08:07:02AM -0800, Marc MERLIN wrote:
> > On Sat, Feb 18, 2012 at 01:39:24PM +0100, Martin Steigerwald wrote:
> > > To Milan Broz: Well now I noticed that you linked to your own blog
> > > entry.
> > 
> > He did not, I'm the one who did.
> > I asked a bunch of questions since the online docs didn't address
> > them for me. Some of you answered those for me, I asked access to
> > the wiki and I updated the wiki to have the information you gave me.
> > 
> > While I have no inherent bias one way or another, obviously I did put
> > some of your opinions on the wiki :)
> > 
> > > Please do not take my below statements personally - I might have
> > > written them a bit harshly. Actually I do not really know whether
> > > your statement that TRIM is overrated is correct, but before
> > > believing that TRIM does not give much advantage, I would like to
> > > see at least some evidence of any sort, cause for me my
> > > explaination below that it should make a difference at least seems
> > > logical to me.
> > 
> > That sounds like a reasonable request to me.
> > 
> > In the meantime I changed the page to
> > 'Does Btrfs support TRIM/discard?
> > 
> > "-o discard" is supported, but can have some negative consequences on
> > performance on some SSDs or at least whether it adds worthwhile
> > performance is up for debate depending on who you ask, and makes
> > undeletion/recovery near impossible while being a security problem
> > if you use dm-crypt underneath (see
> > http://asalor.blogspot.com/2011/08/trim-dm-crypt-problems.html ),
> > therefore it is not enabled by default. You are welcome to run your
> > own benchmarks and post them here, with the caveat that they'll be
> > very SSD firmware specific.'
> > 
> > I'll leave further edits to others ;)
> 
> TL;DR:
> I'm going to change the FAQ to say people should use TRIM with dmcrypt
> because not doing so definitely causes some lesser SSDs to suck, or
> possibly even fail and lose our data.

I can't comment on dm-crypt since I do not use it.

I am still not convinced that dm-crypt is the best way to go about 
encryption, especially for SSDs. But it's more of a gut feeling than 
anything that I can explain easily.

I use ecryptfs, formerly encfs, but encfs is much slower. The advantage 
for me is that I can back up my data without decrypting and re-encrypting it 
just with rsync, as long as I exclude the unencrypted view of the data. I 
also always thought not being block device based would be an advantage 
too, but I am not so sure about it anymore.

The disadvantage would be some information leakage, like the number of files, 
the directory layout, and the approximate sizes of files. And the need to 
handle the swap partition separately, which I do not even do right now.

> Longer version:
> Ok, so several months later I can report back with useful info.
> 
> Not using TRIM on my Crucial RealSSD C300 256GB is most likely what
> caused its garbage collection algorithm to fail (killing the drive and
> all its data), and it was also causing BRTFS to hang badly when I was
> getting within 10GB of the drive getting full.

How did you know that it was its garbage collection algorithm?

> I reported some problems I had with btrfs being very slow and hanging
> when I only had 10GB free, and I'm now convinced that it was the SSD
> that was at fault.
> 
> On the Crucial RealSSD C300 256GB, and from talking to their tech
> support and other folks who happened to have gotten that 'drive' at
> work and also got weird unexplained failures, I'm convinced that even
> its latest 007 firmware (the firmware it shipped with would just hang
> the system for a few seconds every so often so I did upgrade to 007
> early on), the drive does very poorly without TRIM when it's getting
> close to full.
> In my case, I hit a bug that is likely in the firmware and that caused the
> drive to die (not show up on the bus anymore). I went through special
> 'reset, power on but don't talk to it' procedures that eventually brought the
> drive back after 4 hours of trying, except the drive is now 'blank',
> all the partitions and data are gone as far as the controller is
> concerned.
> (yes, I had backups, thanks for asking :) ).
> 
> From talking to their techs and other folks, it seems clear that TRIM
> is greatly encouraged, and I'm pretty sure that had I used TRIM, I
> would not have hit the problems that caused my drive to fail and suck
> so much when it was getting full.

I still think that telling the SSD about free space is helping it to 
balance its wear leveling process.

Say I have a 10 GB file that I never touch again. The SSD firmware will 
likely begin to move it onto blocks that have already seen more write 
accesses, so that the little-used blocks it was occupying can be used for 
further writes. For that wear leveling it's beneficial if the SSD has free 
blocks to move stuff to; the more the better, it seems to me. Because if it 
only has a little free space, it possibly has to do more moving around to 
distribute write accesses equally.

> Now, I still think that the Crucial RealSSD C300 256GB has poor
> firmware compared to some of its competition and there is no excuse
> for it just dying like that, but I'm going to change the FAQ to
> recommend that people do use TRIM if they can, even if they use
> dm-crypt (assuming their kernel is at least 3.2.0).
> 
> Any objections and/or comments?

I still only use fstrim from time to time, about once a week or after lots 
of drive churning or removing lots of data. I also have a logical volume 
of about 20 GiB that I keep free most of the time. The other filesystems 
are quite full, but together they still have some free space of about 20-30 
GiB. So it should be about 40-50 GiB free most of the time.
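
Roughly, a weekly cron job like this does it for me (just a sketch, the
mount points are from my setup):

#!/bin/sh
# /etc/cron.weekly/fstrim (sketch): batched TRIM instead of mounting with -o discard
fstrim -v /
fstrim -v /home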

The 300 GB Intel SSD 320 in this ThinkPad T520 is still fine after about 1 
year and 2-3 months. I do not see any performance degradation whatsoever so 
far. Last time I looked, the SMART data looked fine as well, but I do not 
have much experience with SMART on SSDs so far.

Regards,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
  2012-07-18 21:49                 ` Martin Steigerwald
@ 2012-07-18 22:04                   ` Marc MERLIN
  2012-07-19 10:40                     ` Martin Steigerwald
  2012-07-22 18:58                     ` brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance? Marc MERLIN
  0 siblings, 2 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-07-18 22:04 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

On Wed, Jul 18, 2012 at 11:49:36PM +0200, Martin Steigerwald wrote:
> I am still not convinced that dm-crypt is the best way to go about 
> encryption especially for SSDs. But its more of a gut feeling than 
> anything that I can explain easily.
 
I agree that dmcrypt is not great, and it even makes some SSDs slower than
hard drives as per some reports I just posted in another mail.
But:

> I use ecryptfs, formerly encfs, but encfs is much slower. The advantage 

ecryptfs is:
1) painfully slow compared to dmcrypt in my tests. As in so slow that I
don't even need to benchmark it.
2) unable to encrypt very long filenames, so when I copy my archive onto an
ecryptfs volume, some files won't copy unless I rename them.

I would love for ecryptfs to have the performance of dmcrypt, because it'd
be easier for me to use it, but it didn't even come close.
 
> > Not using TRIM on my Crucial RealSSD C300 256GB is most likely what
> > caused its garbage collection algorithm to fail (killing the drive and
> > all its data), and it was also causing btrfs to hang badly when I was
> > getting within 10GB of the drive getting full.
> 
> How did you know that it was its garbage collection algorithm?
 
I don't, hence "most likely". The failure I got was likely garbage
collection related, according to the techs I talked to.

> > From talking to their techs and other folks, it seems clear that TRIM
> > is greatly encouraged, and I'm pretty sure that had I used TRIM, I
> > would not have hit the problems that caused my drive to fail and suck
> > so much when it was getting full.
> 
> I still think that telling the SSD about free space is helping it to 
> balance its wear leveling process.
 
Yes.

> > Any objections and/or comments?
> 
> I still only use fstrim from time to time. About once a week or after lots 
> of drive churning or removing lots of data. I also have a logical volume 
> of about 20 GiB that I keep free for most of the time. And other filesystem 
> are quite full, but there is also some little free space of about 20-30 
> GiB together. So it should be about 40-50 GiB free most of the time.
 
I'm curious. If your filesystem supports trim (e.g. ext4 and btrfs), is
there ever a reason to turn off trim in the FS and use fstrim instead?

> The 300 GB Intel SSD 320 in this ThinkPad T520 is still fine after about 1 
> year and 2-3 months. I do not see any performance degradation whatsover so 
> far. Last time I looked also SMART data looked fine, but I have not much 
> experience with SMART on SSDs so far.

My experience and what I read online is that SMART on SSDs doesn't seem to
help much in many cases. I've seen too many reports of SSDs dying very
suddenly with absolutely no warning.
Hard drives, if you look at SMART data over time, typically give you plenty
of warning before they die (as long as you don't drop them off a table
without parking their heads first).
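
If you want that kind of history for an SSD too, even something as simple as
a daily cron line that appends the attribute table to a log lets you diff the
values over time (a sketch; the device and log path are whatever fits your
setup):

( date; smartctl -A /dev/sda ) >> /var/log/smart-sda.log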

If you're curious, here's the last dump of my SMART data on the SSD that
died:
=== START OF INFORMATION SECTION ===
Model Family:     Crucial RealSSD C300
Device Model:     C300-CTFDDAC256MAG
Serial Number:    00000000111003044A9C
LU WWN Device Id: 5 00a075 103044a9c
Firmware Version: 0007
User Capacity:    256,060,514,304 bytes [256 GB]

  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   000    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2977
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       472
170 Grown_Failing_Block_Ct  0x0033   100   100   000    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Wear_Levelling_Count    0x0033   100   100   000    Pre-fail  Always       -       35
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       1
181 Non4k_Aligned_Access    0x0022   100   100   000    Old_age   Always       -       2757376737803
183 SATA_Iface_Downshift    0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   000    Old_age   Always       -       422
195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0036   100   100   000    Old_age   Always       -       0
202 Perc_Rated_Life_Used    0x0018   100   100   000    Old_age   Offline      -       0
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2958         -
# 2  Extended offline    Completed without error       00%      2937         -
# 3  Short offline       Completed without error       00%      2936         -
# 4  Short offline       Completed without error       00%      2915         -
# 5  Short offline       Completed without error       00%      2892         -
# 6  Short offline       Completed without error       00%      2881         -
# 7  Short offline       Completed without error       00%      2859         -
# 8  Vendor (0xff)       Completed without error       00%      2852         -

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> Trim or no Trim
  2012-07-18 22:04                   ` Marc MERLIN
@ 2012-07-19 10:40                     ` Martin Steigerwald
  2012-07-22 18:58                     ` brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance? Marc MERLIN
  1 sibling, 0 replies; 68+ messages in thread
From: Martin Steigerwald @ 2012-07-19 10:40 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

On Thursday, 19 July 2012, Marc MERLIN wrote:
> On Wed, Jul 18, 2012 at 11:49:36PM +0200, Martin Steigerwald wrote:
> > I am still not convinced that dm-crypt is the best way to go about
> > encryption especially for SSDs. But its more of a gut feeling than
> > anything that I can explain easily.
> 
> I agree that dmcrypt is not great, and it even makes some SSDs slower
> than hard drives as per some reports I just posted in another mail.
> 
> But:
> > I use ecryptfs, formerly encfs, but encfs is much slower. The
> > advantage
> 
> ecryptfs is:
> 1) painfully slow compared to dmcrypt in my tests. As in so slow that I
> don't even need to benchmark it.

Huh! It's an order of magnitude faster than encfs here (well, no wonder,
considering encfs uses FUSE), and it's definitely fast enough for my work
account on this machine.

> 2) unable to encrypt very long filenames, so when I copy my archive on
> an ecryptfs volume, some files won't copy unless I rename them.

Hmmm, okay, that's an issue. I do not seem to have such long names in that
directory though.

> I would love for ecryptfs to have the performance of dmcrypt, because
> it'd be easier for me to use it, but it didn't even come close.

I never benchmarked it except for the time it needs to build my training
slides. It was about twice as fast as encfs and it was fast enough for
my case. It might be that dm-crypt is even faster, but then I do not see any
visible performance difference between my unencrypted private account
and the encrypted work account. There might still be a difference, it is
actually likely, but I have not noticed it so far.

One difference is: my /home is still on Ext4 here. Only / is on BTRFS. So
maybe ecryptfs is only that slow on BTRFS?

> > > Not using TRIM on my Crucial RealSSD C300 256GB is most likely what
> > > caused its garbage collection algorithm to fail (killing the drive
> > > and all its data), and it was also causing btrfs to hang badly
> > > when I was getting within 10GB of the drive getting full.
> > 
> > How did you know that it was its garbage collection algorithm?
> 
> I don't, hence "most likely". The failure of the drive I got was likely
> garbage collection related from what I got from the techs I talked to.

Ah okay. I wouldn't know it either. It's almost a black box for me, like
my car: if something is wrong with it, I drive it to the dealer's garage
and am done with it.

As long as the SSD works, that is fine with me. I try to treat it somewhat gently.

[…]
> > > Any objections and/or comments?
> > 
> > I still only use fstrim from time to time. About once a week or after
> > lots of drive churning or removing lots of data. I also have a
> > logical volume of about 20 GiB that I keep free for most of the
> > time. And other filesystem are quite full, but there is also some
> > little free space of about 20-30 GiB together. So it should be about
> > 40-50 GiB free most of the time.
> 
> I'm curious. If your filesystem supports trim (i.e. ext4 and btrfs), is
> there every a reason to turn off trim in the FS and use fstrim instead?

Frankly, I do not know exactly. AFAIR it has been reported here and
elsewhere that constant trimming will yield a performance penalty (maybe
that's why your ecryptfs was so slow? some stacking effect combined
with constant trimming) and might even harm some cheaper SSDs.

Thus I do batched trimming. I am not sure whether some filesystems have
been changed to do some intermediate batching with the "discard" mount option
as well. AFAIR XFS has been enhanced to provide performant trimming in batches
with the "discard" mount option. I do not know about BTRFS so far.

Regarding SSDs, much seems like guesswork.

> > The 300 GB Intel SSD 320 in this ThinkPad T520 is still fine after
> > about 1 year and 2-3 months. I do not see any performance
> > degradation whatsover so far. Last time I looked also SMART data
> > looked fine, but I have not much experience with SMART on SSDs so
> > far.
> 
> My experience and what I read online is that SMART on SSDs doesn't seem
> to help much in many cases. I've seen too many reports of SSDs dying
> very suddenly with absolutely no warning.
> Hard drives, if you look at smart data over time, typically give you
> plenty of warning before they die (as long as you don't drop them off
> a table without parking their heads).

Hmmm, so as always it's good to have backups. I plan to refresh my backup
in the next few days. ;)

> If you're curious, here's the last dump of my SMART data on the SSD
> that died:

Thanks. Interesting. I do not see anything obvious that would tell me
that it might fail. But then there is no guarantee for hard disks either.

I got:

merkaba:~> smartctl -a /dev/sda
smartctl 5.43 2012-06-05 r3561 [x86_64-linux-3.5.0-rc7-tp520-toi-3.3-timekeeping+] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Intel 320 Series SSDs
Device Model:     INTEL SSDSA2CW300G3
Serial Number:    […]
LU WWN Device Id: […]
Firmware Version: 4PC10362
User Capacity:    300.069.052.416 bytes [300 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Jul 19 12:28:43 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[…]

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2443
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1263
170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       169
183 Runtime_Bad_Block       0x0030   100   100   000    Old_age   Offline      -       0
184 End-to-End_Error        0x0032   100   100   090    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       120
199 UDMA_CRC_Error_Count    0x0030   100   100   000    Old_age   Offline      -       0
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       127984
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       314
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       61
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       146593
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       127984
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       201548

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Vendor (0xe0)       Completed without error       00%      2443         -
# 2  Vendor (0xd8)       Completed without error       00%       271         -
# 3  Vendor (0xd8)       Completed without error       00%       271         -
# 4  Vendor (0xa8)       Completed without error       00%       324         -

0xd8 is a long test, 0xa8 is a short test. I don't know about 0xe0. It seems
that smartctl does not know these test numbers yet.

Except for that unsafe shutdown count I do not see anything interesting in
there. Oh, and that Erase_Fail_Count: so some cells have already gone bad? Let's
see how that looked initially:

martin@merkaba:~/Computer/Merkaba> diff -u smartctl-a-2011-05-19-nach-secure-erase.txt  smartctl-a-2012-07-19.txt
--- smartctl-a-2011-05-19-nach-secure-erase.txt 2011-05-19 16:20:50.000000000 +0200
+++ smartctl-a-2011-07-19.txt   2012-07-19 12:34:22.512228427 +0200
@@ -1,15 +1,18 @@
-smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
-Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
+smartctl 5.43 2012-06-05 r3561 [x86_64-linux-3.5.0-rc7-tp520-toi-3.3-timekeeping+] (local build)
+Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
 
 === START OF INFORMATION SECTION ===
+Model Family:     Intel 320 Series SSDs
 Device Model:     INTEL SSDSA2CW300G3
 Serial Number:    […]
-Firmware Version: 4PC10302
-User Capacity:    300,069,052,416 bytes
-Device is:        Not in smartctl database [for details use: -P showall]
+LU WWN Device Id: […]
+Firmware Version: 4PC10362
+User Capacity:    300.069.052.416 bytes [300 GB]
+Sector Size:      512 bytes logical/physical
+Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   8
 ATA Standard is:  ATA-8-ACS revision 4
-Local Time is:    Thu May 19 16:20:49 2011 CEST
+Local Time is:    Thu Jul 19 12:34:22 2012 CEST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled
 
[…]
@@ -56,29 +59,34 @@
   3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
   4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
   5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
-  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1
- 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
-170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
-171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
-172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
-184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
+  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2443
+ 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1263
+170 Reserve_Block_Count     0x0033   100   100   010    Pre-fail  Always       -       0
+171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
+172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       169
+183 Runtime_Bad_Block       0x0030   100   100   000    Old_age   Offline      -       0
+184 End-to-End_Error        0x0032   100   100   090    Old_age   Always       -       0
 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
-192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
-225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       995
-226 Load-in_Time            0x0032   100   100   000    Old_age   Always       -       2203323
-227 Torq-amp_Count          0x0032   100   100   000    Old_age   Always       -       49
-228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       12587069
+192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       120
+199 UDMA_CRC_Error_Count    0x0030   100   100   000    Old_age   Offline      -       0
+225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       127984
+226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       314
+227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       61
+228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       146593
 232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
 233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
-241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       995
-242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       466
+241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       127984
+242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       201549
 
 SMART Error Log Version: 1
 No Errors Logged
 
 SMART Self-test log structure revision number 1
-No self-tests have been logged.  [To run self-tests, use: smartctl -t]
-
+Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
+# 1  Vendor (0xe0)       Completed without error       00%      2443         -
+# 2  Vendor (0xd8)       Completed without error       00%       271         -
+# 3  Vendor (0xd8)       Completed without error       00%       271         -
+# 4  Vendor (0xa8)       Completed without error       00%       324         -
 
 Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
 SMART Selective self-test log data structure revision number 0


Yup, the erase fail count has risen. Whatever that means, I think I will
keep an eye on it.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
  2012-07-18 22:04                   ` Marc MERLIN
  2012-07-19 10:40                     ` Martin Steigerwald
@ 2012-07-22 18:58                     ` Marc MERLIN
  2012-07-22 19:35                       ` Martin Steigerwald
  1 sibling, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-07-22 18:58 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

I'm still getting a bit more data before updating the btrfs wiki with
my best recommendations for today.

First, everything I've read so far says that the ssd btrfs mount option
makes btrfs slower in benchmarks.
What gives? 
Is anyone using it, or does anyone know of a reason not to mount my SSD with nossd?


Next, I got a new Samsung 830 512GB SSD which is supposed to be very
high performance.
The raw device seems fast enough on a quick hdparm test (numbers further
down).


But once I encrypt it, it becomes about 5 times slower than my 1TB spinning
disk in the same laptop:
gandalfthegreat:~# hdparm -tT /dev/mapper/ssdcrypt 
/dev/mapper/ssdcrypt:
 Timing cached reads:   15412 MB in  2.00 seconds = 7715.37 MB/sec
 Timing buffered disk reads:  70 MB in  3.06 seconds =  22.91 MB/sec <<<<

gandalfthegreat:~# hdparm -tT /dev/mapper/cryptroot (spinning disk)
/dev/mapper/cryptroot:
 Timing cached reads:   16222 MB in  2.00 seconds = 8121.03 MB/sec
 Timing buffered disk reads: 308 MB in  3.01 seconds = 102.24 MB/sec <<<<


The non encrypted SSD device gives me:
/dev/sda4:
 Timing cached reads:   14258 MB in  2.00 seconds = 7136.70 MB/sec
 Timing buffered disk reads: 1392 MB in  3.00 seconds = 463.45 MB/sec

which is 4x faster than my non encrypted spinning disk, as expected.
I used aes-xts-plain as recommended on
http://www.mayrhofer.eu.org/ssd-linux-benchmark

gandalfthegreat:~# cryptsetup status /dev/mapper/ssdcrypt
/dev/mapper/ssdcrypt is active.
  type:    LUKS1
  cipher:  aes-xts-plain
  keysize: 256 bits
  device:  /dev/sda4
  offset:  4096 sectors
  size:    926308752 sectors
  mode:    read/write
gandalfthegreat:~# lsmod |grep -e aes
aesni_intel            50443  66 
cryptd                 14517  18 ghash_clmulni_intel,aesni_intel
aes_x86_64             16796  1 aesni_intel

I realize this is not btrfs' fault :) although I'm a bit stuck
as to how to benchmark btrfs if the underlying encrypted device is so
slow :)
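
One thing I may try next, to separate the cipher speed from the SSD itself,
is to put the same dm-crypt setup on top of a RAM-backed loop device and dd
to/from that (a rough sketch; the device names and sizes are just examples):

# 512MB RAM-backed file and loop device
dd if=/dev/zero of=/dev/shm/cryptbench.img bs=1M count=512
losetup /dev/loop7 /dev/shm/cryptbench.img
# plain dm-crypt mapping with the same cipher as on the SSD (asks for a passphrase)
cryptsetup --cipher aes-xts-plain --key-size 256 create cryptbench /dev/loop7
# write and read through the cipher only, no SSD involved
dd if=/dev/zero of=/dev/mapper/cryptbench bs=1M count=400 conv=fsync
echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/mapper/cryptbench of=/dev/null bs=1M count=400
# clean up
cryptsetup remove cryptbench
losetup -d /dev/loop7
rm /dev/shm/cryptbench.img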

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
  2012-07-22 18:58                     ` brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance? Marc MERLIN
@ 2012-07-22 19:35                       ` Martin Steigerwald
  2012-07-22 19:43                         ` Martin Steigerwald
  2012-07-22 20:44                         ` Marc MERLIN
  0 siblings, 2 replies; 68+ messages in thread
From: Martin Steigerwald @ 2012-07-22 19:35 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

Hi Marc,

On Sunday, 22 July 2012, Marc MERLIN wrote:
> I'm still getting a bit more data before updating the btrfs wiki with
> my best recommendations for today.
> 
> First, everything I've read so far says that the ssd btrfs mount option
> makes btrfs slower in benchmarks.
> What gives?
> Anyone using it or know of a reason not to mount my ssd with nossd?
> 
> 
> Next, I got a new Samsumg 830 512GB SSD which is supposed to be very
> high performance.
> The raw device seems fast enough on a quick hdparm test:
> 
> 
> But once I encrypt it, it drops to 5 times slower than my 1TB spinning
> disk in the same laptop:
> gandalfthegreat:~# hdparm -tT /dev/mapper/ssdcrypt
> /dev/mapper/ssdcrypt:
>  Timing cached reads:   15412 MB in  2.00 seconds = 7715.37 MB/sec
>  Timing buffered disk reads:  70 MB in  3.06 seconds =  22.91 MB/sec
> <<<<
> 
> gandalfthegreat:~# hdparm -tT /dev/mapper/cryptroot (spinning disk)
> /dev/mapper/cryptroot:
>  Timing cached reads:   16222 MB in  2.00 seconds = 8121.03 MB/sec
>  Timing buffered disk reads: 308 MB in  3.01 seconds = 102.24 MB/sec
> <<<<

Have you looked whether certain kernel threads are eating CPU?

I would have a look at this.

Or use atop to have a complete system overview during the hdparm run. You 
may want to use its default 10 seconds delay.

Anyway, hdparm is only a very rough measurement. (A test time of 2-3 seconds 
is really short.)

Did you repeat the tests three or five times and look at the deviation?
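
Something like this already shows the spread (just a sketch, using your
device name):

for i in 1 2 3 4 5; do hdparm -t /dev/mapper/ssdcrypt; done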

For what it is worth, I can beat that with ecryptfs on top of Ext4 on top of 
an Intel SSD 320 (SATA 300 based):

martin@merkaba:~> su - ms
Passwort: 
ms@merkaba:~> df -hT .
Dateisystem    Typ      Größe Benutzt Verf. Verw% Eingehängt auf
/home/.ms      ecryptfs  224G    211G   11G   96% /home/ms
ms@merkaba:~> dd if=/dev/zero of=testfile bs=1M count=1000 conv=fsync
1000+0 Datensätze ein
1000+0 Datensätze aus
1048576000 Bytes (1,0 GB) kopiert, 20,1466 s, 52,0 MB/s
ms@merkaba:~> rm testfile 
ms@merkaba:~> sudo fstrim /home
[sudo] password for ms: 
ms@merkaba:~>

That's way slower than a dd without encryption, but it's way faster than 
your hdparm figures. The SSD was underutilized according to the hard disk 
LED of this ThinkPad T520 with an Intel i5 Sandy Bridge 2.5 GHz dual core. I 
started atop too late to see what was going on.

(I have not yet tested ecryptfs on top of BTRFS, but you didn't test a 
filesystem with hdparm anyway.)

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
  2012-07-22 19:35                       ` Martin Steigerwald
@ 2012-07-22 19:43                         ` Martin Steigerwald
  2012-07-22 20:44                         ` Marc MERLIN
  1 sibling, 0 replies; 68+ messages in thread
From: Martin Steigerwald @ 2012-07-22 19:43 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc MERLIN, Chris Mason, mbroz, Calvin Walton, jeff

On Sunday, 22 July 2012, Martin Steigerwald wrote:
> Hi Marc,
> 
> On Sunday, 22 July 2012, Marc MERLIN wrote:
> > I'm still getting a bit more data before updating the btrfs wiki with
> > my best recommendations for today.
> > 
> > First, everything I've read so far says that the ssd btrfs mount
> > option makes btrfs slower in benchmarks.
> > What gives?
> > Anyone using it or know of a reason not to mount my ssd with nossd?
> > 
> > 
> > Next, I got a new Samsumg 830 512GB SSD which is supposed to be very
> > high performance.
> > The raw device seems fast enough on a quick hdparm test:
> > 
> > 
> > But once I encrypt it, it drops to 5 times slower than my 1TB
> > spinning disk in the same laptop:
> > gandalfthegreat:~# hdparm -tT /dev/mapper/ssdcrypt
> > 
> > /dev/mapper/ssdcrypt:
> >  Timing cached reads:   15412 MB in  2.00 seconds = 7715.37 MB/sec
> >  Timing buffered disk reads:  70 MB in  3.06 seconds =  22.91 MB/sec
> > 
> > <<<<
> > 
> > gandalfthegreat:~# hdparm -tT /dev/mapper/cryptroot (spinning disk)
> > 
> > /dev/mapper/cryptroot:
> >  Timing cached reads:   16222 MB in  2.00 seconds = 8121.03 MB/sec
> >  Timing buffered disk reads: 308 MB in  3.01 seconds = 102.24 MB/sec
> > 
> > <<<<
> 
> Have you looked whether certain kernel threads are eating CPU?
> 
> I would have a look at this.
> 
> Or use atop to have a complete system overview during the hdparm run.
> You may want to use its default 10 seconds delay.
> 
> Anyway, hdparm is only a very rough measurement. (Test time 2 / 3
> seconds is really short.)
> 
> Did you repeat tests three or five times and looked at the deviation?
> 
> For what it is worth I can beat that with ecryptfs on top of Ext4 ontop
> of an Intel SSD 320 (SATA 300 based):
> 
> martin@merkaba:~> su - ms
> Passwort:
> ms@merkaba:~> df -hT .
> Dateisystem    Typ      Größe Benutzt Verf. Verw% Eingehängt auf
> /home/.ms      ecryptfs  224G    211G   11G   96% /home/ms
> ms@merkaba:~> dd if=/dev/zero of=testfile bs=1M count=1000 conv=fsync
> 1000+0 Datensätze ein
> 1000+0 Datensätze aus
> 1048576000 Bytes (1,0 GB) kopiert, 20,1466 s, 52,0 MB/s
> ms@merkaba:~> rm testfile
> ms@merkaba:~> sudo fstrim /home
> [sudo] password for ms:
> ms@merkaba:~>
> 
> Thats way slower than a dd without encryption, but its way faster than
> your hdparm figures. The SSD was underutilized according to the
> harddisk LED of this ThinkPad T520 with Intel i5 Sandybridge 2.5 GHz
> dualcore. Did start atop to late to see whats going on.
> 
> (I did not yet test ecryptfs on top of BTRFS, but you didn´t test a
> filesystem with hdparm anyway.)


You measured read speed, so here is read speed as well:

martin@merkaba:~#1> su - ms
Passwort: 
ms@merkaba:~> LANG=C df -hT .
Filesystem     Type      Size  Used Avail Use% Mounted on
/home/.ms      ecryptfs  224G  211G   11G  96% /home/ms
ms@merkaba:~> LANG=C df -hT /home
Filesystem               Type  Size  Used Avail Use% Mounted on
/dev/mapper/merkaba-home ext4  224G  211G   11G  96% /home


ms@merkaba:~> dd if=/dev/zero of=testfile bs=1M count=1000 conv=fsync
1000+0 Datensätze ein
1000+0 Datensätze aus
1048576000 Bytes (1,0 GB) kopiert, 19,4592 s, 53,9 MB/s


ms@merkaba:~> sync ; su -c "echo 3 > /proc/sys/vm/drop_caches" ; dd 
if=testfile of=/dev/null bs=1M count=1000
Passwort: 
1000+0 Datensätze ein
1000+0 Datensätze aus
1048576000 Bytes (1,0 GB) kopiert, 12,4633 s, 84,1 MB/s
ms@merkaba:~>

ms@merkaba:~> sync ; su -c "echo 3 > /proc/sys/vm/drop_caches" ; dd 
if=testfile of=/dev/null bs=1M count=1000
Passwort: 
1000+0 Datensätze ein
1000+0 Datensätze aus
1048576000 Bytes (1,0 GB) kopiert, 12,0747 s, 86,8 MB/s


ms@merkaba:~> sync ; su -c "echo 3 > /proc/sys/vm/drop_caches" ; dd 
if=testfile of=/dev/null bs=1M count=1000
Passwort: 
1000+0 Datensätze ein
1000+0 Datensätze aus
1048576000 Bytes (1,0 GB) kopiert, 11,7131 s, 89,5 MB/s
ms@merkaba:~>


Figures got faster with each measurement. Maybe SSD internal caching?

At least that's way faster than 22.91 MB/sec ;)

On reads, dd used up one core. On writes, dd and flush-ecryptfs used up a 
little more than one core, about 110-150%, but not both cores 
completely.

Kernel is 3.5.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance?
  2012-07-22 19:35                       ` Martin Steigerwald
  2012-07-22 19:43                         ` Martin Steigerwald
@ 2012-07-22 20:44                         ` Marc MERLIN
  2012-07-22 22:41                           ` brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk Marc MERLIN
  1 sibling, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-07-22 20:44 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

On Sun, Jul 22, 2012 at 09:35:10PM +0200, Martin Steigerwald wrote:
> Or use atop to have a complete system overview during the hdparm run. You 
> may want to use its default 10 seconds delay.
 
atop looked totally fine and showed that a long dd took 6% CPU of one core.

> Anyway, hdparm is only a very rough measurement. (Test time 2 / 3 seconds 
> is really short.)
> 
> Did you repeat tests three or five times and looked at the deviation?

Yes. I also ran dd with 1GB, and got the exact same number.

Thanks for confirming that it is indeed way faster for you, even though you're
going through the filesystem layer, which should make it slower compared to my
doing dd against the raw device.

But anyway, while my question about using 'nossd' is on topic here, the speed
of the raw device falling dramatically once encrypted isn't really relevant
here, so I'll continue on the dm-crypt mailing list.

Thanks
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk
  2012-07-22 20:44                         ` Marc MERLIN
@ 2012-07-22 22:41                           ` Marc MERLIN
  2012-07-23  6:42                             ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-07-22 22:41 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Chris Mason, mbroz, Calvin Walton, jeff

On Sun, Jul 22, 2012 at 01:44:28PM -0700, Marc MERLIN wrote:
> On Sun, Jul 22, 2012 at 09:35:10PM +0200, Martin Steigerwald wrote:
> > Or use atop to have a complete system overview during the hdparm run. You 
> > may want to use its default 10 seconds delay.
>  
> atop looked totally fine and showed that a long dd took 6% CPU of one core.
> 
> > Anyway, hdparm is only a very rough measurement. (Test time 2 / 3 seconds 
> > is really short.)
> > 
> > Did you repeat tests three or five times and looked at the deviation?
> 
> Yes. I also ran dd with 1GB, and got the exact same number.

Oh my, the plot thickens.
I'm bringing this back on this list because
1) I've confirmed that I can get 500MB/s unencrypted reads (2GB file)
   with btrfs on the SSD
2) Despite getting a measly 23MB/s reading from /dev/mapper/ssdcrypt,
   once I mount it as a btrfs drive and read a file from it, I now
   get 267MB/s
3) Stating a bunch of files on the ssd is very slow (5-6x slower than
   stating the same files from disk). Using noatime did not help.

On an _unencrypted_ partition on the SSD, running du -sh on a directory
with 15K files, takes 23 seconds on unencrypted SSD and 4 secs on
encrypted spinning drive, both with a similar btrfs filesystem, and 
the same kernel (3.4.4).

Since I'm getting some kind of "seek problem" on the ssd (yes, there are
no real seeks on SSD), it seems that I do have a problem with btrfs
that is at least not related to encryption.

Unencrypted btrfs on SSD:
gandalfthegreat:/mnt/mnt2# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m22.667s

Encrypted btrfs on spinning drive:
gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m3.881s

I've run this many times and get the same numbers.
I've tried deadline and noop on /dev/sda (the SSD) and du is just as slow.  

I also tried with:
- space_cache and nospace_cache
- ssd and nossd
- noatime didn't seem to help even though I was hopeful on this one.

In all cases, I get:
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m22.537s

That's 22 seconds for 15K files on an SSD, 5 times slower than a
spinning disk with the same data.
What's going on?




More details below for anyone interested:
----------------------------------------------------------------------------
The test below shows that I can access the encrypted SSD 10x faster by talking
to it through a btrfs filesystem than by doing a dd read from the device node.

So I suppose I could just ignore the dd-against-the-device-node oddity, except
that I can't really.
Let's take a directory with some source files inside:

Let's start with the hard drive in my laptop (dmcrypt with btrtfs)
gandalfthegreat:/var/local# find src | wc -l
15261
gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m4.068s
So on an encrypted spinning disk, it takes 4 seconds.

On my SSD in the same machine with the same encryption and the same btrfs filesystem,
I get 5-6 times slower:
gandalfthegreat:/mnt/btrfs_pool1/var/local# time du -sh src
514M	src
real	0m24.937s

Incidentally, 5x is also the speed difference between my encrypted HD and
encrypted SSD with dd.
Now, why du is 5x slower and dd of a file from the filesystem is 2.5x
faster, I have no idea :-/


See below:

1) drop caches
gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# echo 3 > /proc/sys/vm/drop_caches

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# dd if=w2k-s001.vmdk of=/dev/null
2146631680 bytes (2.1 GB) copied, 8.03898 s, 267 MB/s
-> 267MB/s reading from the file through the encrypted filesystem. That's good.

For comparison
gandalfthegreat:/mnt/mnt2# dd if=w2k-s001.vmdk of=/dev/null 
2146631680 bytes (2.1 GB) copied, 4.33393 s, 495 MB/s
-> almost 500MB/s reading through another unencrypted filesystem on the same SSD

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# dd if=/dev/mapper/ssdcrypt of=/dev/null bs=1M count=1000
1048576000 bytes (1.0 GB) copied, 45.1234 s, 23.2 MB/s

-> 23MB/s reading from the block device that my FS is mounted from. WTF?

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# echo 3 > /proc/sys/vm/drop_caches; dd if=w2k-s001.vmdk of=test
2146631680 bytes (2.1 GB) copied, 17.9129 s, 120 MB/s

-> 120MB/s copying a file from the SSD to itself. That's not bad.

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# echo 3 > /proc/sys/vm/drop_caches; dd if=test of=/dev/null
2146631680 bytes (2.1 GB) copied, 8.4907 s, 253 MB/s

-> reading the new copied file still shows 253MB/s, good.

gandalfthegreat:/mnt/btrfs_pool1/var/local/VirtualBox VMs/w2k_virtual# dd if=test of=/dev/null
2146631680 bytes (2.1 GB) copied, 2.11001 s, 1.0 GB/s

-> reading without dropping the cache shows 1GB/s 

I'm very lost now, any idea what's going on?
- big file copies from encrypted SSD are fast as expected
- raw device access is inexplicably slow (10x slower than btrfs on the
  raw device)
- stating 15K files on the encrypted ssd with btrfs is super slow

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* How can btrfs take 23sec to stat 23K files from an SSD?
  2012-07-22 22:41                           ` brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk Marc MERLIN
@ 2012-07-23  6:42                             ` Marc MERLIN
  2012-07-24  7:56                               ` Martin Steigerwald
  2012-07-27 11:08                               ` Chris Mason
  0 siblings, 2 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-07-23  6:42 UTC (permalink / raw)
  To: linux-btrfs

I just realized that the older thread got a bit confusing, so I'll keep
problems separate and make things simpler :)

On an _unencrypted_ partition on the SSD, running du -sh on a directory
with 15K files, takes 23 seconds on unencrypted SSD and 4 secs on
encrypted spinning drive, both with a similar btrfs filesystem, and 
the same kernel (3.4.4).

Unencrypted btrfs on SSD:
gandalfthegreat:~# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m22.667s

Encrypted btrfs on spinning drive of the same src directory:
gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m3.881s

I've run this many times and get the same numbers.
I've tried deadline and noop on /dev/sda (the SSD) and du is just as slow.  

I also tried with:
- space_cache and nospace_cache
- ssd and nossd
- noatime didn't seem to help even though I was hopeful on this one.

In all cases, I get:
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m22.537s


I'm having the same slow speed on 2 btrfs filesystems on the same SSD.
One is encrypted, the other one isn't:
Label: 'btrfs_pool1'  uuid: d570c40a-4a0b-4d03-b1c9-cff319fc224d
	Total devices 1 FS bytes used 144.74GB
	devid    1 size 441.70GB used 195.04GB path /dev/dm-0

Label: 'boot'  uuid: 84199644-3542-430a-8f18-a5aa58959662
	Total devices 1 FS bytes used 2.33GB
	devid    1 size 25.00GB used 5.04GB path /dev/sda2

If instead of stating a bunch of files, I try reading a big file, I do get speeds
that are quite fast (253MB/s and 423MB/s).

22 seconds for 15K files on an SSD is super slow, and 5 times
slower than a spinning disk with the same data.
What's going on?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-07-23  6:42                             ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
@ 2012-07-24  7:56                               ` Martin Steigerwald
  2012-07-27  4:40                                 ` Marc MERLIN
  2012-07-27 11:08                               ` Chris Mason
  1 sibling, 1 reply; 68+ messages in thread
From: Martin Steigerwald @ 2012-07-24  7:56 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc MERLIN

On Monday, 23 July 2012, Marc MERLIN wrote:
> I just realized that the older thread got a bit confusing, so I'll keep
> problems separate and make things simpler :)
> 
> On an _unencrypted_ partition on the SSD, running du -sh on a directory
> with 15K files, takes 23 seconds on unencrypted SSD and 4 secs on
> encrypted spinning drive, both with a similar btrfs filesystem, and
> the same kernel (3.4.4).
> 
> Unencrypted btrfs on SSD:
> gandalfthegreat:~# mount -o
> compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
> gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du
> -sh src 514M	src
> real	0m22.667s
> 
> Encrypted btrfs on spinning drive of the same src directory:
> gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du
> -sh src 514M	src
> real	0m3.881s

find is fast, du is much slower:

merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr | wc -l )
404166
( find /usr | wc -l; )  0,03s user 0,07s system 1% cpu 9,212 total
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr )      
11G     /usr
( du -sh /usr; )  1,00s user 19,07s system 41% cpu 48,886 total


Now I try to find something with fewer files.

merkaba:~> find /usr/share/doc | wc -l                                   
50715
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr/share/doc 
| wc -l )
50715
( find /usr/share/doc | wc -l; )  0,00s user 0,02s system 1% cpu 1,398 
total
merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh 
/usr/share/doc )      
606M    /usr/share/doc
( du -sh /usr/share/doc; )  0,20s user 3,63s system 35% cpu 10,691 total

merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time du -sh /usr/share/doc   
606M    /usr/share/doc
du -sh /usr/share/doc  0,19s user 3,54s system 35% cpu 10,386 total


Anyway, that's still much faster than your measurements.



merkaba:~> df -hT /usr                
Dateisystem    Typ   Größe Benutzt Verf. Verw% Eingehängt auf
/dev/dm-0      btrfs   19G     11G  5,6G   67% /
merkaba:~> btrfs fi sh                                                           
failed to read /dev/sr0
Label: 'debian'  uuid: […]
        Total devices 1 FS bytes used 10.25GB
        devid    1 size 18.62GB used 18.62GB path /dev/dm-0

Btrfs Btrfs v0.19
merkaba:~> btrfs fi df /   
Data: total=15.10GB, used=9.59GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.75GB, used=670.43MB
Metadata: total=8.00MB, used=0.00


merkaba:~> grep btrfs /proc/mounts
/dev/dm-0 / btrfs rw,noatime,compress=lzo,ssd,space_cache,inode_cache 0 0


Somewhat aged BTRFS filesystem on ThinkPad T520, Intel SSD 320, kernel 
3.5.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* du -s src is a lot slower on SSD than spinning disk in the same laptop
@ 2012-07-25 15:45 Marc MERLIN
       [not found] ` <alpine.LFD.2.00.1207252023080.4340@(none)>
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-07-25 15:45 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

First, I had the problem with btrfs (details below), and then I noticed that
while ext4 is actually twice as fast as btrfs, it's still a lot slower at
stat on my fast Samsung 830 512G SSD than my 1TB laptop hard drive (both
drives being in the same ThinkPad T530 with a 3.4.4 kernel).

How can things be so slow? 12-13 seconds to scan 15K inodes on a freshly
made filesystem (and 22 secs with btrfs, so at least ext4 wins). The same
thing on my spinning drive takes less than 4 seconds, and the SSD should be
several times faster than that.
SSD throughput was measured at over 400MB/s on the raw device and 268MB/s
through the filesystem:

gandalfthegreat:/mnt/mnt2# du -sh w2k-s002.vmdk
2.0G	w2k-s002.vmdk
gandalfthegreat:/mnt/mnt2# dd if=w2k-s002.vmdk of=/dev/null
2145320960 bytes (2.1 GB) copied, 8.01154 s, 268 MB/s

But stats are just super slow
gandalfthegreat:/mnt/mnt2# time du -sh src/
519M	src/
real	0m12.380s
gandalfthegreat:/mnt/mnt2# reset_cache 
gandalfthegreat:/mnt/mnt2# time bash -c 'find src | wc -l'
15261
real	0m11.612s

Partition was made with:
mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 /dev/sda2

Disk /dev/sda: 512.1 GB, 512110190592 bytes
255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x09aaf50a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048      502271      250112   83  Linux
/dev/sda2          502272    52930559    26214144   83  Linux << this partition

Details of my btrfs tests below, showing that it's not a problem specific
to ext4.

/dev/sda: Device Model:     SAMSUNG SSD 830 Series
/dev/sda: Firmware Version: CXM03B1Q
/dev/sda: User Capacity:    512,110,190,592 bytes [512 GB]

Any idea what could be going wrong here?

Thanks,
Marc

----- Forwarded message from Marc MERLIN <marc@merlins.org> -----

On an _unencrypted_ partition on the SSD, running du -sh on a directory
with 15K files, takes 23 seconds on unencrypted SSD and 4 secs on
encrypted spinning drive, both with a similar btrfs filesystem, and 
the same kernel (3.4.4).

Unencrypted btrfs on SSD:
gandalfthegreat:~# mount -o compress=lzo,discard,nossd,space_cache,noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m22.667s

Encrypted btrfs on spinning drive of the same src directory:
gandalfthegreat:/var/local# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m3.881s

I've run this many times and get the same numbers.
I've tried deadline and noop on /dev/sda (the SSD) and du is just as slow.  

I also tried with:
- space_cache and nospace_cache
- ssd and nossd
- noatime didn't seem to help even though I was hopeful on this one.

In all cases, I get:
gandalfthegreat:/mnt/mnt2# echo 3 > /proc/sys/vm/drop_caches; time du -sh src
514M	src
real	0m22.537s


I'm having the same slow speed on 2 btrfs filesystems on the same SSD.
One is encrypted, the other one isnt:
Label: 'btrfs_pool1'  uuid: d570c40a-4a0b-4d03-b1c9-cff319fc224d
	Total devices 1 FS bytes used 144.74GB
	devid    1 size 441.70GB used 195.04GB path /dev/dm-0

Label: 'boot'  uuid: 84199644-3542-430a-8f18-a5aa58959662
	Total devices 1 FS bytes used 2.33GB
	devid    1 size 25.00GB used 5.04GB path /dev/sda2

If instead of stating a bunch of files, I try reading a big file, I do get speeds
that are quite fast (253MB/s and 423MB/s).

22 seconds for 15K files on an SSD is super slow and being 5 times
slower than a spinning disk with the same data.
What's going on?

Thanks,
Marc

----- End forwarded message -----

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: du -s src is a lot slower on SSD than spinning disk in the same laptop
       [not found] ` <alpine.LFD.2.00.1207252023080.4340@(none)>
@ 2012-07-25 23:38   ` Marc MERLIN
  2012-07-26  3:32   ` Ted Ts'o
  1 sibling, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-07-25 23:38 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: tytso, linux-ext4, axboe, Milan Broz

On Wed, Jul 25, 2012 at 08:40:28PM +0200, Lukáš Czerner wrote:
> yesterday Milan Broz pointed me to the other problem of yours where
> reading from the dm-crypt target on the SSD was much slower than on
> the spinning disk. I am not sure if he already shared some
> information he managed to gather, but we saw that reading from the SSD
> was much slower, probably because reads were divided into tons of
> small (page size) bios as opposed to bigger chunks on the spinning
> disk.
 
Yes, indeed. To make things simpler, I'm not using dmcrypt here.

> This is probably the same reason here. The reason is most likely
> that we handle SSD differently than spinning disk (turn off
> elevator, different io scheduler and possibly other things I am not
> aware of). Also IIRC bio merging should be less "aggressive" on SSD
> than on a spinning disk (or maybe even turned off?), because an SSD should
> supposedly handle many more iops than a spinning drive, hence
> waiting for a merge might slow things down. However in this case it
> seems to have quite the opposite effect for some reason.
> 
> You may try to convince kernel to treat your SSD as rotational disk
> by:
> 
> echo 1 > /sys/block/sda/queue/rotational
> and see if it improves things.

Actually I had done that, along with making the readahead 8K change, but
that didn't help:

gandalfthegreat:/mnt/mnt2# time du -sh src/
519M	src/
real	0m12.176s
gandalfthegreat:/mnt/mnt2# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        25G  3.0G   21G  13% /mnt/mnt2
gandalfthegreat:/mnt/mnt2# blockdev --report /dev/sda2
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw  8192   512  4096     502272     26843283456   /dev/sda2
gandalfthegreat:/mnt/mnt2# cat /sys/block/sda/queue/rotational
1
gandalfthegreat:/mnt/mnt2# 
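
For reference, this is roughly how I had applied those settings, nothing
more than the obvious commands:

blockdev --setra 8192 /dev/sda2
echo 1 > /sys/block/sda/queue/rotational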

> Unfortunately I think that there is not much we can do about this
> at the file system level. Someone from the block layer side should
> definitely take a look at this issue. Jens?

Ok, I just wanted to rule out a VFS issue.
If you're confident it's at the block level, and that having storage which is
basically too fast (and indeed, I just bought one of the fastest SSDs out
there) is actually causing problems that make the entire system much slower
as a result, I'm happy to take it up on another list.

Where would you recommend I go with this?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: du -s src is a lot slower on SSD than spinning disk in the same laptop
       [not found] ` <alpine.LFD.2.00.1207252023080.4340@(none)>
  2012-07-25 23:38   ` Marc MERLIN
@ 2012-07-26  3:32   ` Ted Ts'o
  2012-07-26  3:35     ` Marc MERLIN
  2012-07-26  6:54     ` Marc MERLIN
  1 sibling, 2 replies; 68+ messages in thread
From: Ted Ts'o @ 2012-07-26  3:32 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Marc MERLIN, linux-ext4, axboe, Milan Broz

On Wed, Jul 25, 2012 at 08:40:28PM +0200, Lukáš Czerner wrote:
> > First, I had he problem with btrfs (details below), and then I noticed that
> > while ext4 is actually twice as fast as btrfs, it's still a lot slower at
> > stat on my fast Samsung 830 512G SSD than my 1TB laptop hard drive (both
> > drives being in the same thinkpad T530 with 3.4.4 kernel).
> > 
> > How can things be so slow? 12-13 seconds to scan 15K inodes on a freshly
> > made filesystem (and 22 secs with btrfs, so at least ext4 wins). The same
> > thing on my spinning drive takes fewer than 4 seconds, and SSD should be
> > several times faster than that.
> > SSD throughput was measured over 400MB/s on the raw device and 268MB/s
> > through the filesystem:

Was this an identical file system image on HDD and SSD?

The obvious thing to do is to get a blktrace of du -sh w/ a cold cache
for both the HDD and SSD.  Regardless of whether it's something we can
address at the fs level, or in the block device layer, the blktrace
should make it really clear what is going on.

       	       	      	    	    - Ted

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: du -s src is a lot slower on SSD than spinning disk in the same laptop
  2012-07-26  3:32   ` Ted Ts'o
@ 2012-07-26  3:35     ` Marc MERLIN
  2012-07-26  6:54     ` Marc MERLIN
  1 sibling, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-07-26  3:35 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Lukáš Czerner, linux-ext4, axboe, Milan Broz

On Wed, Jul 25, 2012 at 11:32:23PM -0400, Ted Ts'o wrote:
> On Wed, Jul 25, 2012 at 08:40:28PM +0200, Lukáš Czerner wrote:
> > > First, I had the problem with btrfs (details below), and then I noticed that
> > > while ext4 is actually twice as fast as btrfs, it's still a lot slower at
> > > stat on my fast Samsung 830 512G SSD than my 1TB laptop hard drive (both
> > > drives being in the same thinkpad T530 with 3.4.4 kernel).
> > > 
> > > How can things be so slow? 12-13 seconds to scan 15K inodes on a freshly
> > > made filesystem (and 22 secs with btrfs, so at least ext4 wins). The same
> > > thing on my spinning drive takes fewer than 4 seconds, and SSD should be
> > > several times faster than that.
> > > SSD throughput was measured over 400MB/s on the raw device and 268MB/s
> > > through the filesystem:
> 
> Was this an identical file system image on HDD and SSD?
> 
> The obvious thing to do is to get a blktrace of du -sh w/ a cold cache
> for both the HDD and SSD.  Regardless of whether it's something we can
> address at the fs level, or in the block device layer, the blktrace
> should make it really clear what is going on.

I'm not very familiar with blktrace, but here is one run I did when I was
trying to debug why dd of my dmcrypted FS was running at 27MB/s instead of
270MB/s.
http://marc.merlins.org/tmp/blktrace.txt
(dd shows up as 'bash' for some reason).

If that format isn't good, can you tell me which blktrace command lines I
should use to get you the exact output you'd like?
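
Otherwise, unless you want different options, I'd do a cold-cache run roughly
like this, with the trace in one shell and the du in another:

echo 3 > /proc/sys/vm/drop_caches
blktrace -d /dev/sda -o - | blkparse -i - > /tmp/blktrace_du.txt
# then, in a second shell while the trace is running:
du -sh src/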

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: du -s src is a lot slower on SSD than spinning disk in the same laptop
  2012-07-26  3:32   ` Ted Ts'o
  2012-07-26  3:35     ` Marc MERLIN
@ 2012-07-26  6:54     ` Marc MERLIN
  2012-08-01  5:30       ` Marc MERLIN
  1 sibling, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-07-26  6:54 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Lukáš Czerner, linux-ext4, axboe, Milan Broz

On Wed, Jul 25, 2012 at 11:32:23PM -0400, Ted Ts'o wrote:
> On Wed, Jul 25, 2012 at 08:40:28PM +0200, Lukáš Czerner wrote:
> > > First, I had the problem with btrfs (details below), and then I noticed that
> > > while ext4 is actually twice as fast as btrfs, it's still a lot slower at
> > > stat on my fast Samsung 830 512G SSD than my 1TB laptop hard drive (both
> > > drives being in the same thinkpad T530 with 3.4.4 kernel).
> > > 
> > > How can things be so slow? 12-13 seconds to scan 15K inodes on a freshly
> > > made filesystem (and 22 secs with btrfs, so at least ext4 wins). The same
> > > thing on my spinning drive takes fewer than 4 seconds, and SSD should be
> > > several times faster than that.
> > > SSD throughput was measured over 400MB/s on the raw device and 268MB/s
> > > through the filesystem:
> 
> Was this an identical file system image on HDD and SSD?

Not entirely, but I did another test on two similar ext4 filesystems, one on
the SSD and one on the HDD:

gandalfthegreat:/boot# find /boot/src | wc -l
6261
gandalfthegreat:/boot# reset_cache; time du -sh /boot/src ; reset_cache; time du -sh /boot2/src
169M	/boot/src
real	0m5.248s

157M	/boot2/src
real	0m0.698s

gandalfthegreat:/boot# df -h /boot /boot2
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       241M  226M  2.5M  99% /boot
/dev/sdb2       298M  273M  9.3M  97% /boot2
gandalfthegreat:/boot# grep boot /proc/mounts
/dev/sdb2 /boot2 ext4 rw,relatime,stripe=4,data=ordered 0 0
/dev/sda1 /boot ext4 rw,relatime,discard,stripe=32,data=ordered 0 0

The only sizeable difference is that sda1 uses a 4K block size, which I
forced to help with faster writes on the SSD.

In the example above, the SSD is almost 10 times slower than the HDD.
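(reset_cache is just a small local helper; assume it boils down to
"sync; echo 3 > /proc/sys/vm/drop_caches", so each du above starts with a
cold cache.)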

The output of blktrace -d /dev/sda -o - | blkparse -i - during the du run is at
http://marc.merlins.org/tmp/blktrace_du_sda1.txt

Does that give enough info on what's going on?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-07-24  7:56                               ` Martin Steigerwald
@ 2012-07-27  4:40                                 ` Marc MERLIN
  0 siblings, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-07-27  4:40 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs

On Tue, Jul 24, 2012 at 09:56:26AM +0200, Martin Steigerwald wrote:
> find is fast, du is much slower:
> 
> merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( find /usr | wc -l )
> 404166
> ( find /usr | wc -l; )  0,03s user 0,07s system 1% cpu 9,212 total
> merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; time ( du -sh /usr )      
> 11G     /usr
> ( du -sh /usr; )  1,00s user 19,07s system 41% cpu 48,886 total
 
You're right.

gandalfthegreat [mc]# time du -sh src
514M	src
real	0m25.159s
gandalfthegreat [mc]# reset_cache 
gandalfthegreat [mc]# time bash -c "find src | wc -l"
15261
real	0m7.614s

But find on the SSD is still slower than du on my spinning disk for the same tree.
 
> Anyway thats still much faster than your measurements.
 
Right. Your numbers look reasonable for an SSD.

Since then, I did some more tests and I'm also getting slower than normal
speeds with ext4, an indication that it's a problem at the block layer.
I'm working with some folks to try to pin down the core problem, but it looks
like it's not an easy one.

Thanks for your data.
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-07-23  6:42                             ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
  2012-07-24  7:56                               ` Martin Steigerwald
@ 2012-07-27 11:08                               ` Chris Mason
  2012-07-27 18:42                                 ` Marc MERLIN
  1 sibling, 1 reply; 68+ messages in thread
From: Chris Mason @ 2012-07-27 11:08 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Mon, Jul 23, 2012 at 12:42:03AM -0600, Marc MERLIN wrote:
> 
> 22 seconds for 15K files on an SSD is super slow and being 5 times
> slower than a spinning disk with the same data.
> What's going on?

Hi Marc,

The easiest way to figure this out is with latencytop.  I'd either run the
latencytop gui or use the latencytop -c patch which sends a text dump to
the console.

This is assuming that you're not pegged at 100% CPU...

https://oss.oracle.com/~mason/latencytop.patch
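
Something as simple as

latencytop -c > /tmp/latencytop.txt

in one terminal while the du runs in another should be plenty (the output
path is just an example).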

-chris


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-07-27 11:08                               ` Chris Mason
@ 2012-07-27 18:42                                 ` Marc MERLIN
  0 siblings, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-07-27 18:42 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

On Fri, Jul 27, 2012 at 07:08:35AM -0400, Chris Mason wrote:
> On Mon, Jul 23, 2012 at 12:42:03AM -0600, Marc MERLIN wrote:
> > 
> > 22 seconds for 15K files on an SSD is super slow and being 5 times
> > slower than a spinning disk with the same data.
> > What's going on?
> 
> Hi Marc,
> 
> The easiest way to figure this out is with latencytop.  I'd either run the
> latencytop gui or use the latencytop -c patch which sends a text dump to
> the console.
> 
> This is assuming that you're not pegged at 100% CPU...
> 
> https://oss.oracle.com/~mason/latencytop.patch

Thanks for the patch, and yes I can confirm I'm definitely not pegged on CPU
(not even close), and I get the same problem with an unencrypted filesystem;
du -sh is actually exactly the same speed on encrypted and unencrypted.

Here's the result I think you were looking for. I'm not good at reading this,
but hopefully it tells you something useful :)

The full run is here if that helps:
http://marc.merlins.org/tmp/latencytop.txt

Process du (6748) Total: 4280.5 msec
	Reading directory content	 15.2 msec         11.0 %
		sleep_on_page wait_on_page_bit read_extent_buffer_pages 
		btree_read_extent_buffer_pages.constprop.110 read_tree_block 
		read_block_for_search.isra.32 btrfs_next_leaf 
		btrfs_real_readdir vfs_readdir sys_getdents system_call_fastpath 
	[sleep_on_page]	 13.5 msec         88.2 %
		sleep_on_page wait_on_page_bit read_extent_buffer_pages 
		btree_read_extent_buffer_pages.constprop.110 read_tree_block 
		read_block_for_search.isra.32 btrfs_search_slot 
		btrfs_lookup_csum __btrfs_lookup_bio_sums 
		btrfs_lookup_bio_sums btrfs_submit_compressed_read btrfs_submit_bio_hook 
	Page fault	 12.9 msec          0.6 %
		sleep_on_page_killable wait_on_page_bit_killable 
		__lock_page_or_retry filemap_fault __do_fault handle_pte_fault 
		handle_mm_fault do_page_fault page_fault 
	Executing a program	  7.1 msec          0.2 %
		sleep_on_page_killable __lock_page_killable 
		generic_file_aio_read do_sync_read vfs_read kernel_read 
		prepare_binprm do_execve_common.isra.27 do_execve sys_execve 
		stub_execve 

Process du (6748) Total: 9517.4 msec
	[sleep_on_page]	 23.0 msec         82.8 %
		sleep_on_page wait_on_page_bit read_extent_buffer_pages 
		btree_read_extent_buffer_pages.constprop.110 read_tree_block 
		read_block_for_search.isra.32 btrfs_search_slot 
		btrfs_lookup_inode btrfs_iget btrfs_lookup_dentry btrfs_lookup 
		__lookup_hash 
	Reading directory content	 13.2 msec         17.2 %
		sleep_on_page wait_on_page_bit read_extent_buffer_pages 
		btree_read_extent_buffer_pages.constprop.110 read_tree_block 
		read_block_for_search.isra.32 btrfs_search_slot 
		btrfs_real_readdir vfs_readdir sys_getdents system_call_fastpath 

Process du (6748) Total: 9524.0 msec
	[sleep_on_page]	 17.1 msec         88.5 %
		sleep_on_page wait_on_page_bit read_extent_buffer_pages 
		btree_read_extent_buffer_pages.constprop.110 read_tree_block 
		read_block_for_search.isra.32 btrfs_search_slot 
		btrfs_lookup_inode btrfs_iget btrfs_lookup_dentry btrfs_lookup 
		__lookup_hash 
	Reading directory content	 16.0 msec         11.5 %
		sleep_on_page wait_on_page_bit read_extent_buffer_pages 
		btree_read_extent_buffer_pages.constprop.110 read_tree_block 
		read_block_for_search.isra.32 btrfs_search_slot 
		btrfs_real_readdir vfs_readdir sys_getdents system_call_fastpath 

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: du -s src is a lot slower on SSD than spinning disk in the same laptop
  2012-07-26  6:54     ` Marc MERLIN
@ 2012-08-01  5:30       ` Marc MERLIN
  2012-08-01  6:01         ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
                           ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-08-01  5:30 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Lukáš Czerner, linux-ext4, axboe, Milan Broz

On Wed, Jul 25, 2012 at 11:54:12PM -0700, Marc MERLIN wrote:
> and one on th HDD:
> 
> gandalfthegreat:/boot# find /boot/src | wc -l
> 6261
> gandalfthegreat:/boot# reset_cache; time du -sh /boot/src ; reset_cache; time du -sh /boot2/src
> 169M	/boot/src
> real	0m5.248s
> 
> 157M	/boot2/src
> real	0m0.698s
> 
> gandalfthegreat:/boot# df -h /boot /boot2
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda1       241M  226M  2.5M  99% /boot
> /dev/sdb2       298M  273M  9.3M  97% /boot2
> gandalfthegreat:/boot# grep boot /proc/mounts
> /dev/sdb2 /boot2 ext4 rw,relatime,stripe=4,data=ordered 0 0
> /dev/sda1 /boot ext4 rw,relatime,discard,stripe=32,data=ordered 0 0
> 
> The only sizeable difference is that sda1 is a 4K block size I forced to
> help with faster writes on the ssd.
> 
> In the example above, the SSD is almost 10 times slower than the HDD.
> 
> blktrace -d /dev/sda -o - | blkparse -i - during du is 
> http://marc.merlins.org/tmp/blktrace_du_sda1.txt
> 
> Does that give enough info on what's going on?

(TL;DR: ntfs on linux via fuse is 33% faster than ext4, which is 2x faster
than btrfs, but 3x slower than the same filesystem on spinning disk :( )


Ok, just to help with debugging this,
1)  I put my samsung 830 SSD into another thinkpad and it wasn't faster or
slower.

2) Then I put a crucial 256 C300 SSD (the replacement for the one I had that
just died and killed all my data), and du took 0.3 seconds on both my old
and new thinkpads.
The old thinkpad is running ubuntu 32bit, the new one debian testing 64bit,
both with kernel 3.4.4.

So, clearly, there is something wrong with the samsung 830 SSD with linux
but I have no clue what :(
In raw speed (dd) the samsung is faster than the crucial (500MB/s vs
350MB/s).
If it were a random crappy SSD from a random vendor, I'd blame the SSD, but
I have a hard time believing that samsung is selling SSDs that are slower
than hard drives at random IO and 'seeks'.

Mmmh, and to do more tests, I eventually got a 2nd ssd from samsung (same
kind), just to make sure the one I had wasn't bad.

Unfortunately the results are similar.
I upgraded to 3.5.0 in the meantime:

First: btrfs is the slowest:
gandalfthegreat:/mnt/ssd/var/local# time du -sh src/
514M	src/
real	0m25.741s

Second: ext4 with mkfs.ext4 -O extent -b 4096 /dev/sda3
gandalfthegreat:/mnt/mnt3# reset_cache
gandalfthegreat:/mnt/mnt3# time du -sh src/
519M	src/
real	0m12.459s
gandalfthegreat:~# grep mnt3 /proc/mounts
/dev/sda3 /mnt/mnt3 ext4 rw,noatime,discard,data=ordered 0 0

A freshly made ntfs filesystem through fuse is actually FASTER!
gandalfthegreat:/mnt/mnt2# reset_cache 
gandalfthegreat:/mnt/mnt2# time du -sh src/
506M	src/
real	0m8.928s
gandalfthegreat:/mnt/mnt2# grep mnt2 /proc/mounts
/dev/sda2 /mnt/mnt2 fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0

How can ntfs via fuse be the fastest?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  5:30       ` Marc MERLIN
@ 2012-08-01  6:01         ` Marc MERLIN
  2012-08-01  6:08           ` Fajar A. Nugraha
  2012-08-01  6:36           ` Chris Samuel
  2012-08-01  8:18         ` du -s src is a lot slower on SSD than spinning disk in the same laptop Spelic
  2012-08-16  7:50         ` Marc MERLIN
  2 siblings, 2 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-08-01  6:01 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs, Ted Ts'o
  Cc: Lukáš Czerner, linux-ext4, axboe, Milan Broz

On Fri, Jul 27, 2012 at 11:42:39AM -0700, Marc MERLIN wrote:
> > https://oss.oracle.com/~mason/latencytop.patch
> 
> Thanks for the patch, and yes I can confirm I'm definitely not pegged on CPU 
> (not even close and I get the same problem with unencrypted filesystem, actually
> du -sh is exactly the same speed on encrypted and unencrypted).
> 
> Here's the result I think you were looking for. I'm not good at reading this,
> but hopefully it tells you something useful :)
> 
> The full run is here if that helps:
> http://marc.merlins.org/tmp/latencytop.txt
 
I did some more tests over the last week, since my laptop is hard to use
considering how slow the SSD is.

(TL;DR: ntfs on linux via fuse is 33% faster than ext4, which is 2x faster
than btrfs, but 3x slower than the same filesystem on spinning disk :( )


Ok, just to help with debugging this,
1)  I put my samsung 830 SSD into another thinkpad and it wasn't faster or
slower.

2) Then I put a crucial 256 C300 SSD (the replacement for the one I had that
just died and killed all my data), and du took 0.3 seconds on both my old
and new thinkpads.
The old thinkpad is running ubuntu 32bit, the new one debian testing 64bit,
both with kernel 3.4.4.

So, clearly, there is something wrong with the samsung 830 SSD with linux
but I have no clue what :(
In raw speed (dd) the samsung is faster than the crucial (500MB/s vs
350MB/s).
If it were a random crappy SSD from a random vendor, I'd blame the SSD, but
I have a hard time believing that samsung is selling SSDs that are slower
than hard drives at random IO and 'seeks'.

3) I just got a 2nd ssd from samsung (same kind), just to make sure the one
I had wasn't bad. It's brand new, and I formatted it carefully on 512
boundaries:
/dev/sda1            2048      502271      250112   83  Linux
/dev/sda2          502272    52930559    26214144    7  HPFS/NTFS/exFAT
/dev/sda3        52930560    73902079    10485760   82  Linux swap / Solaris
/dev/sda4        73902080  1000215215   463156568   83  Linux

I also upgraded to 3.5.0 in the meantime but unfortunately the results are
similar.

First: btrfs is the slowest:
gandalfthegreat:/mnt/ssd/var/local# time du -sh src/
514M	src/
real	0m25.741s
gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts 
/dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0


Second: ext4 is 2x faster than btrfs with mkfs.ext4 -O extent -b 4096 /dev/sda3
gandalfthegreat:/mnt/mnt3# reset_cache
gandalfthegreat:/mnt/mnt3# time du -sh src/
519M	src/
real	0m12.459s
gandalfthegreat:~# grep mnt3 /proc/mounts
/dev/sda3 /mnt/mnt3 ext4 rw,noatime,discard,data=ordered 0 0

Third, a freshly made ntfs filesystem through fuse is actually FASTER!
gandalfthegreat:/mnt/mnt2# reset_cache 
gandalfthegreat:/mnt/mnt2# time du -sh src/
506M	src/
real	0m8.928s
gandalfthegreat:/mnt/mnt2# grep mnt2 /proc/mounts
/dev/sda2 /mnt/mnt2 fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0

How can ntfs via fuse be the fastest and btrfs so slow?
Of course, all 3 are slower than the same filesystems on a spinning disk too, but
I'm wondering if there is a scheduling issue that is somehow causing the
extreme slowness I'm seeing.

Did the latencytop trace I got help in any way?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  6:01         ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
@ 2012-08-01  6:08           ` Fajar A. Nugraha
  2012-08-01  6:21             ` Marc MERLIN
  2012-08-01  6:36           ` Chris Samuel
  1 sibling, 1 reply; 68+ messages in thread
From: Fajar A. Nugraha @ 2012-08-01  6:08 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs

On Wed, Aug 1, 2012 at 1:01 PM, Marc MERLIN <marc@merlins.org> wrote:

> So, clearly, there is something wrong with the samsung 830 SSD with linux


> If it were a random crappy SSD from a random vendor, I'd blame the SSD, but
> I have a hard time believing that samsung is selling SSDs that are slower
> than hard drives at random IO and 'seeks'.

You'd be surprised at how badly some vendors can screw up :)


> First: btrfs is the slowest:

> gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
> /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0

Just checking, did you explicitly activate "discard"? Because on my
setup (with a corsair SSD) it made things MUCH slower. Also, try adding
"noatime" (just in case the slowdown was because "du" caused many
access time updates).
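E.g. something like "mount -o remount,noatime /mnt/ssd/var" (substitute your
actual mountpoint) should be enough to test the noatime part.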

-- 
Fajar

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  6:08           ` Fajar A. Nugraha
@ 2012-08-01  6:21             ` Marc MERLIN
  2012-08-01 21:57               ` Martin Steigerwald
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-08-01  6:21 UTC (permalink / raw)
  To: Fajar A. Nugraha; +Cc: linux-btrfs

On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote:
> > If it were a random crappy SSD from a random vendor, I'd blame the SSD, but
> > I have a hard time believing that samsung is selling SSDs that are slower
> > than hard drives at random IO and 'seeks'.
> 
> You'd be surprised at how badly some vendors can screw up :)
 
At some point, it may come down to that indeed :-/
I'm still hopeful that Samsung didn't, but we'll see.
 
> > First: btrfs is the slowest:
> 
> > gandalfthegreat:/mnt/ssd/var/local# grep /mnt/ssd/var /proc/mounts
> > /dev/mapper/ssd /mnt/ssd/var btrfs rw,noatime,compress=lzo,ssd,discard,space_cache 0 0
> 
> Just checking, did you explicitly activate "discard"? Cause on my

Yes. Note that it should be a no-op when all you're doing is stat'ing inodes
and not writing (I'm using noatime).

> setup (with corsair SSD) it made things MUCH slower. Also, try adding
> "noatime" (just in case the slow down was because "du" cause many
> access time updates)

I have noatime in there already :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  6:01         ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
  2012-08-01  6:08           ` Fajar A. Nugraha
@ 2012-08-01  6:36           ` Chris Samuel
  2012-08-01  6:40             ` Marc MERLIN
  1 sibling, 1 reply; 68+ messages in thread
From: Chris Samuel @ 2012-08-01  6:36 UTC (permalink / raw)
  To: Marc MERLIN
  Cc: Chris Mason, linux-btrfs, Ted Ts'o, Lukáš Czerner,
	linux-ext4, axboe, Milan Broz

On 01/08/12 16:01, Marc MERLIN wrote:

> Third, A freshly made ntfs filesystem through fuse is actually FASTER!

Could it be that Samsung's FTL has optimisations in it for NTFS?

A horrible thought, but not impossible..

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  6:36           ` Chris Samuel
@ 2012-08-01  6:40             ` Marc MERLIN
  0 siblings, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-08-01  6:40 UTC (permalink / raw)
  To: Chris Samuel
  Cc: Chris Mason, linux-btrfs, Ted Ts'o, Lukáš Czerner,
	linux-ext4, axboe, Milan Broz

On Wed, Aug 01, 2012 at 04:36:22PM +1000, Chris Samuel wrote:
> On 01/08/12 16:01, Marc MERLIN wrote:
> 
> > Third, A freshly made ntfs filesystem through fuse is actually FASTER!
> 
> Could it be that Samsung's FTL has optimisations in it for NTFS?
> 
> A horrible thought, but not impossible..

Not impossible, but it can't be the main reason, since the SSD still clocks 2x
slower with ntfs than a spinning disk with encrypted btrfs.
Since SSDs should "seek" 10-100x faster than spinning disks, that can't be
the only reason.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: du -s src is a lot slower on SSD than spinning disk in the same laptop
  2012-08-01  5:30       ` Marc MERLIN
  2012-08-01  6:01         ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
@ 2012-08-01  8:18         ` Spelic
  2012-08-16  7:50         ` Marc MERLIN
  2 siblings, 0 replies; 68+ messages in thread
From: Spelic @ 2012-08-01  8:18 UTC (permalink / raw)
  To: linux-ext4

On 08/01/12 07:30, Marc MERLIN wrote:
> 2) Then I put a crucial 256 C300 SSD (the replacement for the one I 
> had that just died and killed all my data), and du took 0.3 seconds on 
> both my old and new thinkpads. 


What does FIO say on your Samsung 830 SSD for 4k random reads?
Last time I checked online benchmarks, the Samsung 830 had very good 4k
random reads, but it's difficult to find online benchmarks for
incompressible data (which fio rightly uses, btw).

multithreded test:
vi fiotest_mt.fio
         [random-read]
         rw=randread
         numjobs=20
         thread
         iodepth=10
         group_reporting
         blocksize=4k
         size=512m
         directory=/mnt/<yourmountpoint>
         end_fsync=1
         runtime=10

and single threaded (more important here):
vi fiotest_st.fio
         [random-read]
         rw=randread
         numjobs=1
         thread
         iodepth=1
         group_reporting
         blocksize=4k
         size=4g
         directory=/mnt/<yourmountpoint>
         end_fsync=1
         runtime=10
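
Run either one with plain fio, e.g.:

fio fiotest_st.fio

and remember to delete the data files fio leaves under that directory
afterwards.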



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01  6:21             ` Marc MERLIN
@ 2012-08-01 21:57               ` Martin Steigerwald
  2012-08-02  5:07                 ` Marc MERLIN
  0 siblings, 1 reply; 68+ messages in thread
From: Martin Steigerwald @ 2012-08-01 21:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc MERLIN, Fajar A. Nugraha

Hi Marc,

Am Mittwoch, 1. August 2012 schrieb Marc MERLIN:
> On Wed, Aug 01, 2012 at 01:08:46PM +0700, Fajar A. Nugraha wrote:
> > > If it were a random crappy SSD from a random vendor, I'd blame the
> > > SSD, but I have a hard time believing that samsung is selling SSDs
> > > that are slower than hard drives at random IO and 'seeks'.
> > 
> > You'd be surprised at how badly some vendors can screw up :)
> 
> At some point, it may come down to that indeed :-/
> I'm still hopefully that Samsung didn't, but we'll see.

Its getting quite strange.

I lost track of whether you did that already or not, but if you didn't
please post some

vmstat 1

iostat -xd 1

on the device while it is being slow.

I am interested in wait I/O and latencies and disk utilization.


Comparison data of Intel SSD 320 in ThinkPad T520 during

merkaba:~> echo 3 > /proc/sys/vm/drop_caches ; du -sch /usr

on BTRFS with Kernel 3.5:


martin@merkaba:~> vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  1  21556 4442668   2056 502352    0    0   194    85  247  120 11  2 87  0
 1  2  21556 4408888   2448 514884    0    0 11684   328 4975 24585  5 16 65 14
 1  0  21556 4389880   2448 528060    0    0 13400     0 4574 23452  2 16 68 14
 3  1  21556 4370068   2448 545052    0    0 18132     0 5499 27220  1 18 64 16
 1  0  21556 4350228   2448 555580    0    0 10856     0 4122 25339  3 16 67 14
 1  1  21556 4315604   2448 569756    0    0 12648     0 4647 31153  5 14 66 15
 0  1  21556 4295652   2456 581480    0    0 11548    56 4093 24618  2 13 69 16
 0  1  21556 4286720   2456 591580    0    0 10824     0 3750 21445  1 12 71 16
 0  1  21556 4266308   2456 603620    0    0 12932     0 4841 26447  4 12 68 17
 1  0  21556 4248228   2456 613808    0    0 10264     4 3703 22108  1 13 71 15
 5  1  21556 4231976   2456 624356    0    0 10540     0 3581 20436  1 10 72 17
 0  1  21556 4197168   2456 639108    0    0 12952     0 4738 28223  4 15 66 15
 4  1  21556 4178456   2456 650552    0    0 11656     0 4234 23480  2 14 68 16
 0  1  21556 4163616   2456 662992    0    0 13652     0 4619 26580  1 16 70 13
 4  1  21556 4138288   2456 675696    0    0 13352     0 4422 22254  1 16 70 13
 1  0  21556 4113204   2456 689060    0    0 13232     0 4312 21936  1 15 70 14
 0  1  21556 4085532   2456 704160    0    0 14972     0 4820 24238  1 16 69 14
 2  0  21556 4055740   2456 719644    0    0 15736     0 5099 25513  3 17 66 14
 0  1  21556 4028612   2456 734380    0    0 14504     0 4795 25052  3 15 68 14
 2  0  21556 3999108   2456 749040    0    0 14656    16 4672 21878  1 17 69 13
 1  1  21556 3972732   2456 762108    0    0 12972     0 4717 22411  1 17 70 13
 5  0  21556 3949684   2584 773484    0    0 11528    52 4837 24107  3 15 67 15
 1  0  21556 3912504   2584 787420    0    0 12156     0 4883 25201  4 15 67 14


martin@merkaba:~> iostat -xd 1 /dev/sda
Linux 3.5.0-tp520 (merkaba)     01.08.2012      _x86_64_        (4 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               1,29     1,44   11,58   12,78   684,74   299,75    80,81     0,24    9,86    0,95   17,93   0,29   0,71

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 2808,00    0,00 11232,00     0,00     8,00     0,57    0,21    0,21    0,00   0,19  54,50

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 2967,00    0,00 11868,00     0,00     8,00     0,63    0,21    0,21    0,00   0,21  60,90

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00    11,00 2992,00    4,00 11968,00    56,00     8,03     0,64    0,22    0,22    0,25   0,21  62,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 2680,00    0,00 10720,00     0,00     8,00     0,70    0,26    0,26    0,00   0,25  66,70

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3153,00    0,00 12612,00     0,00     8,00     0,72    0,23    0,23    0,00   0,22  69,30

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 2769,00    0,00 11076,00     0,00     8,00     0,63    0,23    0,23    0,00   0,21  58,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 2523,00    1,00 10092,00     4,00     8,00     0,74    0,29    0,29    0,00   0,28  71,30

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3026,00    0,00 12104,00     0,00     8,00     0,73    0,24    0,24    0,00   0,21  64,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3069,00    0,00 12276,00     0,00     8,00     0,67    0,22    0,22    0,00   0,20  62,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3346,00    0,00 13384,00     0,00     8,00     0,64    0,19    0,19    0,00   0,18  59,90

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3188,00    0,00 12752,00     0,00     8,00     0,80    0,25    0,25    0,00   0,17  54,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3433,00    0,00 13732,00     0,00     8,00     1,03    0,30    0,30    0,00   0,17  57,00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3565,00    0,00 14260,00     0,00     8,00     0,92    0,26    0,26    0,00   0,16  57,30

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3972,00    0,00 15888,00     0,00     8,00     1,13    0,29    0,29    0,00   0,16  62,90

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3743,00    0,00 14972,00     0,00     8,00     1,03    0,28    0,28    0,00   0,16  59,40

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3408,00    0,00 13632,00     0,00     8,00     1,08    0,32    0,32    0,00   0,17  56,70

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00 3730,00    3,00 14920,00    16,00     8,00     1,14    0,31    0,31    0,00   0,15  56,30




I also suggest using fio with the ssd-test example on the SSD. I
have some comparison data available for my setup. Heck, it should be
publicly available in my ADMIN magazine article about fio. I used slightly
different fio jobs with block sizes of 2k to 16k, but it's similar enough,
and I might even have some 4k examples at hand or can easily create one.
I also raised size and duration a bit.

An example based on whats in my article:

[global]
ioengine=libaio
direct=1
iodepth=64
runtime=60

filename=testfile
size=2G
bsrange=2k-16k

refill_buffers=1

[randomwrite]
stonewall
rw=randwrite

[sequentialwrite]
stonewall
rw=write

[randomread]
stonewall
rw=randread

[sequentialread]
stonewall
rw=read

Remove bsrange if you want 4k blocks only. I put the writes above the
reads because the writes with refill_buffers=1 initialize the testfile
with random data; otherwise it would contain only zeros, which modern
SandForce controllers compress nicely. The above job is untested,
but it should do.

Please remove the test file and fstrim your disk after having run any write
tests, unless you have the discard mount option on. (Not necessarily after
each one ;).
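
For example, something along these lines after a run (adjust the paths to
wherever you put the testfile and where the SSD is mounted):

rm testfile
fstrim -v /mnt/<yourmountpoint>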


Also I am interested in

merkaba:~> hdparm -I /dev/sda | grep -i queue
        Queue depth: 32
           *    Native Command Queueing (NCQ)

output for your SSD.

Try to load the ssd with different iodepths up to twice the amount
displayed by hdparm.
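
A quick way to sweep that is a one-off fio invocation where you only vary
--iodepth, for example (file name and sizes are just placeholders):

fio --name=randread --filename=testfile --size=2g --runtime=30 \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=32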

But note: a du -sch will not reach that high iodepth. I expect it to
get as low as an iodepth of one in that case. You see in my comparison
output that a single du -sch is not able to load the Intel SSD 320 fully,
but load is just about 50-70%. And iowait is just about 10-20%.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-01 21:57               ` Martin Steigerwald
@ 2012-08-02  5:07                 ` Marc MERLIN
  2012-08-02 11:18                   ` Martin Steigerwald
  2012-08-02 11:25                   ` Martin Steigerwald
  0 siblings, 2 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-08-02  5:07 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha

On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote:
> Its getting quite strange.
 
I would agree :)

Before I paste a bunch of thing, I wanted to thank you for not giving up on me 
and offering your time to help me figure this out :)

> I lost track of whether you did that already or not, but if you didn´t
> please post some
> 
> vmstat 1
> iostat -xd 1
> on the device while it is being slow.
 
Sure thing, here's the 24 second du -s run:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  1      0 2747264     44 348388    0    0    28    50  242  184 19  6 74  1
 1  0      0 2744128     44 351700    0    0   144     0 2758 32115 30  5 61  4
 2  1      0 2743100     44 351992    0    0   792     0 2616 30613 28  4 50 18
 1  1      0 2741592     44 352668    0    0   776     0 2574 31551 29  4 45 21
 1  1      0 2740720     44 353432    0    0   692     0 2734 32891 30  4 45 22
 1  1      0 2740104     44 354284    0    0   460     0 2639 31585 30  4 45 21
 3  1      0 2738520     44 354692    0    0   544   264 2834 30302 32  5 42 21
 1  1      0 2736936     44 355476    0    0  1064  2012 2867 31172 28  4 45 23
 3  1      0 2734892     44 356840    0    0  1236     0 2865 31100 30  3 46 21
 3  1      0 2733840     44 357432    0    0   636    12 2930 31362 30  4 45 21
 3  1      0 2735152     44 357892    0    0   612     0 2546 31641 28  4 46 22
 3  1      0 2733608     44 358680    0    0   668     0 2846 29747 30  3 46 22
 2  1      0 2731480     44 359556    0    0   712   248 2744 27939 31  3 44 22
 2  1      0 2732688     44 360736    0    0  1204     0 2774 30979 30  3 46 21
 4  1      0 2730276     44 361872    0    0   872   256 3174 31657 30  5 45 20
 1  1      0 2728372     44 363008    0    0   912   132 2752 31811 30  3 46 22
 1  1      0 2727060     44 363240    0    0   476     0 2595 30313 33  4 43 21
 1  1      0 2725856     44 364000    0    0   580     0 2547 30803 29  4 46 21
 1  1      0 2723628     44 364724    0    0   756   272 2766 30652 31  4 44 22
 2  1      0 2722128     44 365616    0    0   744     8 2771 31760 30  4 45 21
 2  1      0 2721024     44 366284    0    0   788   196 2695 31819 31  4 44 22
 3  1      0 2719312     44 367084    0    0   732     0 2737 32038 30  5 45 20
 1  1      0 2717436     44 367880    0    0   728     0 2570 30810 29  3 46 22
 3  1      0 2716820     44 368500    0    0   608     0 2614 30949 29  4 45 22
 2  1      0 2715576     44 369240    0    0   628   264 2715 31180 34  4 40 22
 1  1      0 2714308     44 369952    0    0   624     0 2533 31155 28  4 46 22

Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat) 	08/01/2012 	_x86_64_	(4 CPU)
rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
  2.18     0.68    6.45   17.77    78.12   153.39    19.12     0.18    7.51    8.52    7.15   4.46  10.81
  0.00     0.00  118.00    0.00   540.00     0.00     9.15     1.18    9.93    9.93    0.00   4.98  58.80
  0.00     0.00  217.00    0.00   868.00     0.00     8.00     1.90    8.77    8.77    0.00   4.44  96.40
  0.00     0.00  192.00    0.00   768.00     0.00     8.00     1.63    8.44    8.44    0.00   5.10  98.00
  0.00     0.00  119.00    0.00   476.00     0.00     8.00     1.06    9.01    9.01    0.00   8.20  97.60
  0.00     0.00  125.00    0.00   500.00     0.00     8.00     1.08    8.67    8.67    0.00   7.55  94.40
  0.00     0.00  165.00    0.00   660.00     0.00     8.00     1.50    9.12    9.12    0.00   5.87  96.80
  0.00     0.00  195.00   13.00   780.00   272.00    10.12     1.68    8.10    7.94   10.46   4.65  96.80
  0.00     0.00  173.00    0.00   692.00     0.00     8.00     1.72    9.87    9.87    0.00   5.71  98.80
  0.00     0.00  171.00    0.00   684.00     0.00     8.00     1.62    9.33    9.33    0.00   5.75  98.40
  0.00     0.00  161.00    0.00   644.00     0.00     8.00     1.52    9.57    9.57    0.00   6.14  98.80
  0.00     0.00  136.00    0.00   544.00     0.00     8.00     1.26    9.29    9.29    0.00   7.24  98.40
  0.00     0.00  199.00    0.00   796.00     0.00     8.00     1.94    9.73    9.73    0.00   4.94  98.40
  0.00     0.00  201.00    0.00   804.00     0.00     8.00     1.70    8.54    8.54    0.00   4.80  96.40
  0.00     0.00  272.00   15.00  1088.00   272.00     9.48     2.35    8.21    8.46    3.73   3.39  97.20
  0.00     0.00  208.00    1.00   832.00     4.00     8.00     1.90    9.01    9.04    4.00   4.67  97.60
  0.00     0.00  228.00    0.00   912.00     0.00     8.00     1.94    8.56    8.56    0.00   4.23  96.40
  0.00     0.00  107.00    0.00   428.00     0.00     8.00     0.98    9.05    9.05    0.00   9.12  97.60
  0.00     0.00  163.00    0.00   652.00     0.00     8.00     1.60    9.89    9.89    0.00   6.01  98.00
  0.00     0.00  162.00    0.00   648.00     0.00     8.00     1.83   10.20   10.20    0.00   6.07  98.40
  1.00    10.00  224.00   71.00   900.00  2157.50    20.73     2.82   10.14    9.50   12.17   3.35  98.80
 41.00     8.00  166.00  176.00   832.00   333.50     6.82     2.48    7.24    8.48    6.07   2.92 100.00
 18.00     0.00   78.00   72.00   388.00    36.00     5.65     2.30   15.28   16.56   13.89   6.67 100.00
 13.00     0.00  190.00   25.00   824.00    12.50     7.78     1.95    9.17    9.43    7.20   4.58  98.40
  0.00     0.00  156.00    0.00   624.00     0.00     8.00     1.36    8.72    8.72    0.00   6.26  97.60
  0.00     0.00  158.00   19.00   632.00    64.00     7.86     1.50    8.45    8.61    7.16   5.60  99.20
  0.00     0.00  176.00    0.00   704.00     0.00     8.00     1.65    9.39    9.39    0.00   5.61  98.80
  0.00     0.00   44.00    0.00   176.00     0.00     8.00     0.38    8.55    8.55    0.00   8.27  36.40

> I am interested in wait I/O and latencies and disk utilization.
 
Cool tool, I didn't know about iostat.
My r_await numbers don't look good obviously and yet %util is pretty much
100% the entire time.

Does that show that it's indeed the device that is unable to deliver the requests any quicker, despite
being an ssd, or are you reading this differently?

> Also I am interested in
> merkaba:~> hdparm -I /dev/sda | grep -i queue
>         Queue depth: 32
>            *    Native Command Queueing (NCQ)
> output for your SSD.

gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue   
	Queue depth: 32
	   *	Native Command Queueing (NCQ)
gandalfthegreat:/var/local# 

I've run the fio tests in:
/dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0

(discard is there, so fstrim shouldn't be needed)
 
> > I also suggest to use fio with the ssd-test example on the SSD. I
> have some comparison data available for my setup. Heck it should be
> publicly available in my ADMIN magazine article about fio. I used a bit
> different fio jobs with block sizes of 2k to 16k, but its similar enough
> and I might even have some 4k examples at hand or can easily create one.
> I also raised size and duration a bit.
> 
> An example based on whats in my article:

Thanks, here's the output.
I see bw jumping from bw=1700.8KB/s all the way to bw=474684KB/s depending
on the workload.  That's "interesting" to say the least.

So, doctor, is it bad? :)

randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
2.0.8
Starting 4 processes
randomwrite: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0  iops] [eta 00m:00s]  
randomwrite: (groupid=0, jobs=1): err= 0: pid=7193
  write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec
    slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55
    clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63
     lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57
    clat percentiles (msec):
     |  1.00th=[  225],  5.00th=[  241], 10.00th=[  247], 20.00th=[  260],
     | 30.00th=[  269], 40.00th=[  281], 50.00th=[  293], 60.00th=[  306],
     | 70.00th=[  322], 80.00th=[  351], 90.00th=[  545], 95.00th=[  570],
     | 99.00th=[  627], 99.50th=[  644], 99.90th=[  709], 99.95th=[  725],
     | 99.99th=[  742]
    bw (KB/s)  : min=   11, max= 2591, per=99.83%, avg=1697.13, stdev=491.48
    lat (usec) : 50=0.01%
    lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
    lat (msec) : 500=72.44%, 750=14.43%
  cpu          : usr=0.30%, sys=2.65%, ctx=9699, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=11396/d=0, short=r=0/w=0/d=0
sequentialwrite: (groupid=1, jobs=1): err= 0: pid=7218
  write: io=80994KB, bw=1349.7KB/s, iops=149 , runt= 60011msec
    slat (usec): min=15 , max=296798 , avg=6658.06, stdev=7414.10
    clat (usec): min=15 , max=962319 , avg=418984.76, stdev=128626.12
     lat (msec): min=12 , max=967 , avg=425.64, stdev=130.36
    clat percentiles (msec):
     |  1.00th=[  269],  5.00th=[  306], 10.00th=[  322], 20.00th=[  338],
     | 30.00th=[  355], 40.00th=[  371], 50.00th=[  388], 60.00th=[  404],
     | 70.00th=[  420], 80.00th=[  445], 90.00th=[  603], 95.00th=[  766],
     | 99.00th=[  881], 99.50th=[  906], 99.90th=[  938], 99.95th=[  947],
     | 99.99th=[  963]
    bw (KB/s)  : min=  418, max= 1952, per=99.31%, avg=1339.72, stdev=354.39
    lat (usec) : 20=0.01%
    lat (msec) : 20=0.01%, 50=0.11%, 100=0.27%, 250=0.29%, 500=86.93%
    lat (msec) : 750=6.87%, 1000=5.51%
  cpu          : usr=0.25%, sys=2.32%, ctx=13426, majf=0, minf=21
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=8991/d=0, short=r=0/w=0/d=0
randomread: (groupid=2, jobs=1): err= 0: pid=7234
  read : io=473982KB, bw=7899.5KB/s, iops=934 , runt= 60005msec
    slat (usec): min=2 , max=31048 , avg=1065.57, stdev=3519.63
    clat (usec): min=21 , max=246981 , avg=67367.84, stdev=33459.21
     lat (msec): min=1 , max=263 , avg=68.43, stdev=33.81
    clat percentiles (msec):
     |  1.00th=[   15],  5.00th=[   25], 10.00th=[   33], 20.00th=[   39],
     | 30.00th=[   50], 40.00th=[   57], 50.00th=[   61], 60.00th=[   70],
     | 70.00th=[   76], 80.00th=[   89], 90.00th=[  111], 95.00th=[  139],
     | 99.00th=[  176], 99.50th=[  190], 99.90th=[  217], 99.95th=[  229],
     | 99.99th=[  247]
    bw (KB/s)  : min= 2912, max=11909, per=100.00%, avg=7900.46, stdev=2255.68
    lat (usec) : 50=0.01%
    lat (msec) : 2=0.17%, 4=0.16%, 10=0.35%, 20=2.49%, 50=27.44%
    lat (msec) : 100=55.05%, 250=14.34%
  cpu          : usr=0.38%, sys=2.89%, ctx=8425, majf=0, minf=276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=56073/w=0/d=0, short=r=0/w=0/d=0
sequentialread: (groupid=3, jobs=1): err= 0: pid=7249
  read : io=2048.0MB, bw=474684KB/s, iops=52754 , runt=  4418msec
    slat (usec): min=1 , max=44869 , avg=15.35, stdev=320.96
    clat (usec): min=0 , max=67787 , avg=1145.60, stdev=3487.25
     lat (usec): min=2 , max=67801 , avg=1161.14, stdev=3508.06
    clat percentiles (usec):
     |  1.00th=[   10],  5.00th=[  187], 10.00th=[  217], 20.00th=[  249],
     | 30.00th=[  278], 40.00th=[  302], 50.00th=[  330], 60.00th=[  362],
     | 70.00th=[  398], 80.00th=[  450], 90.00th=[  716], 95.00th=[ 6624],
     | 99.00th=[16064], 99.50th=[20096], 99.90th=[44800], 99.95th=[61696],
     | 99.99th=[67072]
    bw (KB/s)  : min=50063, max=635019, per=97.34%, avg=462072.50, stdev=202894.18
    lat (usec) : 2=0.65%, 4=0.02%, 10=0.30%, 20=0.24%, 50=0.36%
    lat (usec) : 100=0.52%, 250=18.08%, 500=65.03%, 750=4.88%, 1000=0.52%
    lat (msec) : 2=0.81%, 4=0.95%, 10=4.72%, 20=2.43%, 50=0.45%
    lat (msec) : 100=0.05%
  cpu          : usr=7.52%, sys=67.56%, ctx=1494, majf=0, minf=277
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=233071/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102048KB, aggrb=1700KB/s, minb=1700KB/s, maxb=1700KB/s, mint=60003msec, maxt=60003msec

Run status group 1 (all jobs):
  WRITE: io=80994KB, aggrb=1349KB/s, minb=1349KB/s, maxb=1349KB/s, mint=60011msec, maxt=60011msec

Run status group 2 (all jobs):
   READ: io=473982KB, aggrb=7899KB/s, minb=7899KB/s, maxb=7899KB/s, mint=60005msec, maxt=60005msec

Run status group 3 (all jobs):
   READ: io=2048.0MB, aggrb=474683KB/s, minb=474683KB/s, maxb=474683KB/s, mint=4418msec, maxt=4418msec

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02  5:07                 ` Marc MERLIN
@ 2012-08-02 11:18                   ` Martin Steigerwald
  2012-08-02 17:39                     ` Marc MERLIN
  2012-08-02 11:25                   ` Martin Steigerwald
  1 sibling, 1 reply; 68+ messages in thread
From: Martin Steigerwald @ 2012-08-02 11:18 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha

Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote:
> > Its getting quite strange.
>  
> I would agree :)
> 
> Before I paste a bunch of thing, I wanted to thank you for not giving up on me 
> and offering your time to help me figure this out :)

You are welcome.

Well I am holding Linux performance analysis & tuning trainings and I am
really interested into issues like this ;)

I will take care of myself and I take my time to respond or even do not
respond at all anymore if I run out of ideas ;).

> > I lost track of whether you did that already or not, but if you didn´t
> > please post some
> > 
> > vmstat 1
> > iostat -xd 1
> > on the device while it is being slow.
>  
> Sure thing, here's the 24 second du -s run:
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  2  1      0 2747264     44 348388    0    0    28    50  242  184 19  6 74  1
>  1  0      0 2744128     44 351700    0    0   144     0 2758 32115 30  5 61  4
>  2  1      0 2743100     44 351992    0    0   792     0 2616 30613 28  4 50 18
>  1  1      0 2741592     44 352668    0    0   776     0 2574 31551 29  4 45 21
>  1  1      0 2740720     44 353432    0    0   692     0 2734 32891 30  4 45 22
>  1  1      0 2740104     44 354284    0    0   460     0 2639 31585 30  4 45 21
>  3  1      0 2738520     44 354692    0    0   544   264 2834 30302 32  5 42 21
>  1  1      0 2736936     44 355476    0    0  1064  2012 2867 31172 28  4 45 23

A bit more wait I/O with not even 10% of the throughput as compared to
the Intel SSD 320 figures. Seems that the Intel SSD is running circles around
your Samsung SSD while not – as expected for that use case – being fully
utilized.

> Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat) 	08/01/2012 	_x86_64_	(4 CPU)
> rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>   2.18     0.68    6.45   17.77    78.12   153.39    19.12     0.18    7.51    8.52    7.15   4.46  10.81
>   0.00     0.00  118.00    0.00   540.00     0.00     9.15     1.18    9.93    9.93    0.00   4.98  58.80
>   0.00     0.00  217.00    0.00   868.00     0.00     8.00     1.90    8.77    8.77    0.00   4.44  96.40
>   0.00     0.00  192.00    0.00   768.00     0.00     8.00     1.63    8.44    8.44    0.00   5.10  98.00
>   0.00     0.00  119.00    0.00   476.00     0.00     8.00     1.06    9.01    9.01    0.00   8.20  97.60
>   0.00     0.00  125.00    0.00   500.00     0.00     8.00     1.08    8.67    8.67    0.00   7.55  94.40
>   0.00     0.00  165.00    0.00   660.00     0.00     8.00     1.50    9.12    9.12    0.00   5.87  96.80
>   0.00     0.00  195.00   13.00   780.00   272.00    10.12     1.68    8.10    7.94   10.46   4.65  96.80
>   0.00     0.00  173.00    0.00   692.00     0.00     8.00     1.72    9.87    9.87    0.00   5.71  98.80
>   0.00     0.00  171.00    0.00   684.00     0.00     8.00     1.62    9.33    9.33    0.00   5.75  98.40
>   0.00     0.00  161.00    0.00   644.00     0.00     8.00     1.52    9.57    9.57    0.00   6.14  98.80
>   0.00     0.00  136.00    0.00   544.00     0.00     8.00     1.26    9.29    9.29    0.00   7.24  98.40
>   0.00     0.00  199.00    0.00   796.00     0.00     8.00     1.94    9.73    9.73    0.00   4.94  98.40
>   0.00     0.00  201.00    0.00   804.00     0.00     8.00     1.70    8.54    8.54    0.00   4.80  96.40
>   0.00     0.00  272.00   15.00  1088.00   272.00     9.48     2.35    8.21    8.46    3.73   3.39  97.20
[…]
> > I am interested in wait I/O and latencies and disk utilization.
>  
> Cool tool, I didn't know about iostat.
> My r_await numbers don't look good obviously and yet %util is pretty much
> 100% the entire time.
> 
> Does that show that it's indeed the device that is unable to deliver the requests any quicker, despite
> being an ssd, or are you reading this differently?

That, or…

> > Also I am interested in
> > merkaba:~> hdparm -I /dev/sda | grep -i queue
> >         Queue depth: 32
> >            *    Native Command Queueing (NCQ)
> > output for your SSD.
> 
> gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue   
> 	Queue depth: 32
> 	   *	Native Command Queueing (NCQ)
> gandalfthegreat:/var/local# 
> 
> I've run the fio tests in:
> /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0

… you are still using dm_crypt?

Please test without dm_crypt. My figures are from within LVM, but no
dm_crypt. It's good to have a comparable base for the measurements.

> (discard is there, so fstrim shouldn't be needed)

I can't imagine why it should matter, but maybe it's worth having some
tests without "discard".

> > I also suggest to use fio with the ssd-test example on the SSD. I
> > have some comparison data available for my setup. Heck it should be
> > publicly available in my ADMIN magazine article about fio. I used a bit
> > different fio jobs with block sizes of 2k to 16k, but its similar enough
> > and I might even have some 4k examples at hand or can easily create one.
> > I also raised size and duration a bit.
> > 
> > An example based on whats in my article:
> 
> Thanks, here's the output.
> I see bw jumping from bw=1700.8KB/s all the way to bw=474684KB/s depending
> on the io size.  That's "interesting" to say the least.
> 
> So, doctor, is it bad? :)

Well I do not engineer those SSDs, but to me it looks like either dm_crypt
is wildly hogging performance or the SSD is way slower than what I would
expect. You did some tests without dm_crypt AFAIR, but it would still be
interesting to repeat these fio tests without dm_crypt as well.

> randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8
> Starting 4 processes
> randomwrite: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0  iops] [eta 00m:00s]  
> randomwrite: (groupid=0, jobs=1): err= 0: pid=7193
>   write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec
>     slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55
>     clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63
>      lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57
>     clat percentiles (msec):
>      |  1.00th=[  225],  5.00th=[  241], 10.00th=[  247], 20.00th=[  260],
>      | 30.00th=[  269], 40.00th=[  281], 50.00th=[  293], 60.00th=[  306],
>      | 70.00th=[  322], 80.00th=[  351], 90.00th=[  545], 95.00th=[  570],
>      | 99.00th=[  627], 99.50th=[  644], 99.90th=[  709], 99.95th=[  725],
>      | 99.99th=[  742]
>     bw (KB/s)  : min=   11, max= 2591, per=99.83%, avg=1697.13, stdev=491.48
>     lat (usec) : 50=0.01%
>     lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
>     lat (msec) : 500=72.44%, 750=14.43%

Gosh, look at these latencies!

72.44% of all requests above 500 (in words: five hundred) milliseconds!
And 14.43% above 750 msecs. The percentage of requests served at 100 msecs
or less was below one percent! Hey, is this an SSD or what?

Please really test without dm_crypt so that we can see whether it
introduces any kind of latency and if so in what amount.

But then let me compare. You are using iodepth=64, which fills the
device with requests at maximum speed, so this will likely increase
latencies.

martin@merkaba:~[…]> fio iops-iodepth64.job 
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
fio 1.57
Starting 4 processes
[…]
zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=13103
  write: io=2048.0MB, bw=76940KB/s, iops=14032 , runt= 27257msec
    slat (usec): min=3 , max=6426 , avg=14.35, stdev=22.33
    clat (usec): min=82 , max=414190 , avg=4540.81, stdev=4996.47
     lat (usec): min=97 , max=414205 , avg=4555.53, stdev=4996.42
    bw (KB/s) : min=12890, max=112728, per=100.39%, avg=77240.48, stdev=20310.79
  cpu          : usr=9.67%, sys=29.47%, ctx=72053, majf=0, minf=532
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/382480/0, short=0/0/0
     lat (usec): 100=0.01%, 250=0.14%, 500=0.66%, 750=0.82%, 1000=1.52%
     lat (msec): 2=10.86%, 4=27.88%, 10=57.18%, 20=0.79%, 50=0.03%
     lat (msec): 100=0.07%, 250=0.04%, 500=0.01%

Still even with iodepth 64 totally different picture. And look at the IOPS
and throughput.

For reference, this refers to

[global]
ioengine=libaio
direct=1
iodepth=64
# To get random data over the complete length of the file,
# run the sequential job beforehand
filename=testdatei
size=2G
bsrange=2k-16k

refill_buffers=1

[zufälliglesen]
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60

[zufälligschreiben]
stonewall
rw=randwrite
runtime=60

[sequentiellschreiben]
stonewall
rw=write
runtime=60

but on Ext4 instead of BTRFS.

This could be another good test: Ext4 on a plain logical volume
without dm_crypt.
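
Roughly, and only as a sketch – the volume group name, size and mount
point are made up, and a spare plain partition works just as well:

  lvcreate -L 20G -n fiotest vg0
  mkfs.ext4 /dev/vg0/fiotest
  mount -o noatime /dev/vg0/fiotest /mnt/test
  cd /mnt/test && fio iops-iodepth64.job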

>   cpu          : usr=0.30%, sys=2.65%, ctx=9699, majf=0, minf=19
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=11396/d=0, short=r=0/w=0/d=0
> sequentialwrite: (groupid=1, jobs=1): err= 0: pid=7218
>   write: io=80994KB, bw=1349.7KB/s, iops=149 , runt= 60011msec
>     slat (usec): min=15 , max=296798 , avg=6658.06, stdev=7414.10
>     clat (usec): min=15 , max=962319 , avg=418984.76, stdev=128626.12
>      lat (msec): min=12 , max=967 , avg=425.64, stdev=130.36
>     clat percentiles (msec):
>      |  1.00th=[  269],  5.00th=[  306], 10.00th=[  322], 20.00th=[  338],
>      | 30.00th=[  355], 40.00th=[  371], 50.00th=[  388], 60.00th=[  404],
>      | 70.00th=[  420], 80.00th=[  445], 90.00th=[  603], 95.00th=[  766],
>      | 99.00th=[  881], 99.50th=[  906], 99.90th=[  938], 99.95th=[  947],
>      | 99.99th=[  963]
>     bw (KB/s)  : min=  418, max= 1952, per=99.31%, avg=1339.72, stdev=354.39
>     lat (usec) : 20=0.01%
>     lat (msec) : 20=0.01%, 50=0.11%, 100=0.27%, 250=0.29%, 500=86.93%
>     lat (msec) : 750=6.87%, 1000=5.51%

Sequential write latencies are abysmal as well.

sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=13105
  write: io=2048.0MB, bw=155333KB/s, iops=17290 , runt= 13501msec
    slat (usec): min=2 , max=4299 , avg=13.51, stdev=18.91
    clat (usec): min=334 , max=201706 , avg=3680.94, stdev=2088.53
     lat (usec): min=340 , max=201718 , avg=3694.83, stdev=2088.28
    bw (KB/s) : min=144952, max=162856, per=100.04%, avg=155394.92, stdev=5720.43
  cpu          : usr=16.77%, sys=31.17%, ctx=47790, majf=0, minf=535
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=0/233444/0, short=0/0/0
     lat (usec): 500=0.01%, 750=0.02%, 1000=0.03%
     lat (msec): 2=0.55%, 4=77.52%, 10=21.86%, 20=0.01%, 100=0.01%
     lat (msec): 250=0.01%

>   cpu          : usr=0.25%, sys=2.32%, ctx=13426, majf=0, minf=21
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=0/w=8991/d=0, short=r=0/w=0/d=0
> randomread: (groupid=2, jobs=1): err= 0: pid=7234
>   read : io=473982KB, bw=7899.5KB/s, iops=934 , runt= 60005msec
>     slat (usec): min=2 , max=31048 , avg=1065.57, stdev=3519.63
>     clat (usec): min=21 , max=246981 , avg=67367.84, stdev=33459.21
>      lat (msec): min=1 , max=263 , avg=68.43, stdev=33.81
>     clat percentiles (msec):
>      |  1.00th=[   15],  5.00th=[   25], 10.00th=[   33], 20.00th=[   39],
>      | 30.00th=[   50], 40.00th=[   57], 50.00th=[   61], 60.00th=[   70],
>      | 70.00th=[   76], 80.00th=[   89], 90.00th=[  111], 95.00th=[  139],
>      | 99.00th=[  176], 99.50th=[  190], 99.90th=[  217], 99.95th=[  229],
>      | 99.99th=[  247]
>     bw (KB/s)  : min= 2912, max=11909, per=100.00%, avg=7900.46, stdev=2255.68
>     lat (usec) : 50=0.01%
>     lat (msec) : 2=0.17%, 4=0.16%, 10=0.35%, 20=2.49%, 50=27.44%
>     lat (msec) : 100=55.05%, 250=14.34%

Even read latencies are really high.

zufälliglesen: (groupid=0, jobs=1): err= 0: pid=13101
  read : io=2048.0MB, bw=162545KB/s, iops=29719 , runt= 12902msec
    slat (usec): min=2 , max=1485 , avg=10.58, stdev=10.81
    clat (usec): min=246 , max=18706 , avg=2140.05, stdev=1073.28
     lat (usec): min=267 , max=18714 , avg=2150.97, stdev=1074.42
    bw (KB/s) : min=108000, max=205060, per=101.08%, avg=164305.12, stdev=32494.26
  cpu          : usr=12.74%, sys=49.83%, ctx=82018, majf=0, minf=276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=383439/0/0, short=0/0/0
     lat (usec): 250=0.01%, 500=0.03%, 750=0.30%, 1000=3.21%
     lat (msec): 2=48.26%, 4=44.68%, 10=3.35%, 20=0.18%


Here, for comparison, is that 500 GB 2.5 inch eSATA Hitachi disk (5400 rpm):

merkaba:[…]> fio --readonly iops-gerät-lesend-iodepth64.job
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
fio 1.57
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [79411K/0K /s] [8617 /0  iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=32578
  read : io=58290KB, bw=984592 B/s, iops=106 , runt= 60623msec
    slat (usec): min=3 , max=82 , avg=24.63, stdev= 6.64
    clat (msec): min=40 , max=2825 , avg=602.78, stdev=374.10
     lat (msec): min=40 , max=2825 , avg=602.80, stdev=374.10
    bw (KB/s) : min=    0, max= 1172, per=65.52%, avg=629.66, stdev=466.50
  cpu          : usr=0.24%, sys=0.64%, ctx=6443, majf=0, minf=275
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.5%, >=64=99.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=6435/0/0, short=0/0/0
     lat (msec): 50=0.20%, 100=2.61%, 250=15.37%, 500=26.64%, 750=25.07%
     lat (msec): 1000=16.13%, 2000=13.44%, >=2000=0.54%

Okay, so at least your SSD has shorter random read latencies than this
slow harddisk. ;)

Has a queue depth of 32 as well.

>   cpu          : usr=0.38%, sys=2.89%, ctx=8425, majf=0, minf=276
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=56073/w=0/d=0, short=r=0/w=0/d=0
> sequentialread: (groupid=3, jobs=1): err= 0: pid=7249
>   read : io=2048.0MB, bw=474684KB/s, iops=52754 , runt=  4418msec
>     slat (usec): min=1 , max=44869 , avg=15.35, stdev=320.96
>     clat (usec): min=0 , max=67787 , avg=1145.60, stdev=3487.25
>      lat (usec): min=2 , max=67801 , avg=1161.14, stdev=3508.06
>     clat percentiles (usec):
>      |  1.00th=[   10],  5.00th=[  187], 10.00th=[  217], 20.00th=[  249],
>      | 30.00th=[  278], 40.00th=[  302], 50.00th=[  330], 60.00th=[  362],
>      | 70.00th=[  398], 80.00th=[  450], 90.00th=[  716], 95.00th=[ 6624],
>      | 99.00th=[16064], 99.50th=[20096], 99.90th=[44800], 99.95th=[61696],
>      | 99.99th=[67072]
>     bw (KB/s)  : min=50063, max=635019, per=97.34%, avg=462072.50, stdev=202894.18
>     lat (usec) : 2=0.65%, 4=0.02%, 10=0.30%, 20=0.24%, 50=0.36%
>     lat (usec) : 100=0.52%, 250=18.08%, 500=65.03%, 750=4.88%, 1000=0.52%
>     lat (msec) : 2=0.81%, 4=0.95%, 10=4.72%, 20=2.43%, 50=0.45%
>     lat (msec) : 100=0.05%
>   cpu          : usr=7.52%, sys=67.56%, ctx=1494, majf=0, minf=277
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=233071/w=0/d=0, short=r=0/w=0/d=0

Intel SSD 320:

sequentielllesen: (groupid=1, jobs=1): err= 0: pid=13102
  read : io=2048.0MB, bw=252699KB/s, iops=28043 , runt=  8299msec
    slat (usec): min=2 , max=1416 , avg=10.75, stdev=10.95
    clat (usec): min=305 , max=201105 , avg=2268.82, stdev=2066.69
     lat (usec): min=319 , max=201114 , avg=2279.94, stdev=2066.61
    bw (KB/s) : min=249424, max=254472, per=100.03%, avg=252776.50, stdev=1416.63
  cpu          : usr=12.29%, sys=43.62%, ctx=43913, majf=0, minf=278
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=232729/0/0, short=0/0/0
     lat (usec): 500=0.01%, 750=0.06%, 1000=0.10%
     lat (msec): 2=11.00%, 4=88.32%, 10=0.50%, 20=0.01%, 50=0.01%
     lat (msec): 100=0.01%, 250=0.01%


Harddisk again:

sequentielllesen: (groupid=1, jobs=1): err= 0: pid=32580
  read : io=4510.3MB, bw=76964KB/s, iops=8551 , runt= 60008msec
    slat (usec): min=1 , max=3373 , avg=10.03, stdev= 8.96
    clat (msec): min=1 , max=49 , avg= 7.47, stdev= 1.14
     lat (msec): min=1 , max=49 , avg= 7.48, stdev= 1.14
    bw (KB/s) : min=72608, max=79588, per=100.05%, avg=77003.96, stdev=1626.66
  cpu          : usr=6.89%, sys=16.96%, ctx=229486, majf=0, minf=277
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued r/w/d: total=513184/0/0, short=0/0/0
     lat (msec): 2=0.01%, 4=0.02%, 10=99.67%, 20=0.29%, 50=0.01%

Maximum latencies on our SSD are longer. Minimum latencies shorter.

Well, sequential data might not be located sequentially on the SSD; the SSD
controller might put the contents of a big file into totally different places.

> Run status group 0 (all jobs):
>   WRITE: io=102048KB, aggrb=1700KB/s, minb=1700KB/s, maxb=1700KB/s, mint=60003msec, maxt=60003msec
> 
> Run status group 1 (all jobs):
>   WRITE: io=80994KB, aggrb=1349KB/s, minb=1349KB/s, maxb=1349KB/s, mint=60011msec, maxt=60011msec
> 
> Run status group 2 (all jobs):
>    READ: io=473982KB, aggrb=7899KB/s, minb=7899KB/s, maxb=7899KB/s, mint=60005msec, maxt=60005msec
> 
> Run status group 3 (all jobs):
>    READ: io=2048.0MB, aggrb=474683KB/s, minb=474683KB/s, maxb=474683KB/s, mint=4418msec, maxt=4418msec

That's pathetic except for the last job group (each group has one job here
because of the stonewall commands), the sequential read. Strangely,
sequential write is also abysmally slow.

This is how I would expect it to look with iodepth 64:

Run status group 0 (all jobs):
   READ: io=2048.0MB, aggrb=162544KB/s, minb=166445KB/s, maxb=166445KB/s, mint=12902msec, maxt=12902msec

Run status group 1 (all jobs):
   READ: io=2048.0MB, aggrb=252699KB/s, minb=258764KB/s, maxb=258764KB/s, mint=8299msec, maxt=8299msec

Run status group 2 (all jobs):
  WRITE: io=2048.0MB, aggrb=76939KB/s, minb=78786KB/s, maxb=78786KB/s, mint=27257msec, maxt=27257msec

Run status group 3 (all jobs):
  WRITE: io=2048.0MB, aggrb=155333KB/s, minb=159061KB/s, maxb=159061KB/s, mint=13501msec, maxt=13501msec


Can you also post the last lines:

Disk stats (read/write):
  dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78%
    sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78%
martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>


It gives information on how well the I/O scheduler was able to merge requests.

I didn't see much of a difference between CFQ and noop, so it may not
matter much, but since it also gives a number for total disk utilization
it's still quite nice to have.
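
For reference, switching the scheduler for such a comparison is just a
sysfs write, no reboot needed:

  echo noop > /sys/block/sda/queue/scheduler

and echo cfq into the same file to switch back.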


So my recommendation as of now:

Remove as many factors as possible and, in order to compare results with
what I posted, try a plain logical volume with Ext4.

If the values are still quite slow, I think it's good to ask Linux
block layer experts – for example by posting on the fio mailing list, where
there are people subscribed that may be able to provide other test
results – and SSD experts. There might be a Linux block layer mailing
list, or use libata or fsdevel, I don't know.

If your Samsung SSD turns out to be this slow or almost this slow on
such a basic level, I am out of ideas.

Well, except for this:

Have the IOPS test run on the device itself. That will remove any filesystem
layer. But only the read-only tests; to make sure, I suggest using fio
with the --readonly option as a safety guard. Unless you have a spare SSD
that you can afford to use for write testing, which will likely destroy
every filesystem on it. Or let it run on just one logical volume.
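
Something along these lines should do for the read-only device test –
untested, and the device name is just an example, adjust as needed:

  fio --readonly --name=rawrandread --filename=/dev/sda --direct=1 \
      --ioengine=libaio --iodepth=64 --rw=randread --bsrange=2k-16k \
      --runtime=60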

If it's still slow then, I'd tend to use a different SSD and test from
there. If it's faster then, I'd tend to believe that it's either a hardware
or, probably more likely, a compatibility issue.

What does your SSD report as discard alignment – well, even that should
not matter for reads:

merkaba:/sys/block/sda> cat discard_alignment 
0

Seems Intel tells us to not care at all. Or the value for some reason
cannot be read.

merkaba:/sys/block/sda> cd queue
merkaba:/sys/block/sda/queue> grep . *
add_random:1
discard_granularity:512
discard_max_bytes:2147450880
discard_zeroes_data:1
hw_sector_size:512

I would be interested whether these above differ.

grep: iosched: Ist ein Verzeichnis
iostats:1
logical_block_size:512
max_hw_sectors_kb:32767
max_integrity_segments:0
max_sectors_kb:512
max_segments:168
max_segment_size:65536
minimum_io_size:512
nomerges:0
nr_requests:128
optimal_io_size:0
physical_block_size:512
read_ahead_kb:128
rotational:0
rq_affinity:1
scheduler:noop deadline [cfq]

And whether there is any optimal io size.

One note regarding this: back then (last year, when I ran the tests above)
I made that Ext4 filesystem aligned to what I thought could be good for
the SSD. I do not know whether this matters much.

merkaba:~> tune2fs -l /dev/merkaba/home 
RAID stripe width:        128

Which should be in blocks and thus align to 512 KiB. Hmmm, I am a bit puzzled
about this value at the moment. Either 1 MiB or 128 KiB would make sense
to me. But then I do not know the erase block size of that SSD.
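
(The arithmetic: 128 blocks × 4 KiB block size = 512 KiB.) If you want to
try the same alignment, the mkfs call would be along these lines – stride
and stripe-width are in filesystem blocks, the partition name is just a
placeholder, and whether it helps on an SSD at all is exactly the open
question:

  mkfs.ext4 -b 4096 -E stride=128,stripe-width=128 /dev/<test-partition>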


Then add BTRFS and then dm_crypt and look at how the numbers change.
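
For the dm_crypt step that would be roughly the following – destructive,
so only on a scratch partition, and the names are made up:

  cryptsetup luksFormat /dev/<scratch-partition>
  cryptsetup luksOpen /dev/<scratch-partition> testcrypt
  mkfs.btrfs -L crypttest /dev/mapper/testcrypt
  mount -o noatime /dev/mapper/testcrypt /mnt/test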

I think this is a plan to find out whether it's really the hardware
or something weird happening in the low-level parts of Linux, e.g. the
block layer, dm_crypt or the filesystem.

Reduce it to the most basic level and then work from there.

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02  5:07                 ` Marc MERLIN
  2012-08-02 11:18                   ` Martin Steigerwald
@ 2012-08-02 11:25                   ` Martin Steigerwald
  1 sibling, 0 replies; 68+ messages in thread
From: Martin Steigerwald @ 2012-08-02 11:25 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Marc MERLIN, Fajar A. Nugraha

Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> So, doctor, is it bad? :)
> 
> randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8
> Starting 4 processes
> randomwrite: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0  iops] [eta 00m:00s]  
> randomwrite: (groupid=0, jobs=1): err= 0: pid=7193
>   write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec
>     slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55
>     clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63
>      lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57
>     clat percentiles (msec):

Heck, I didn't look at the IOPS figure!

189 IOPS for a SATA-600 SSD. That's pathetic.

So again, please test this without dm_crypt. I can't believe that this
is the maximum the hardware is able to achieve.

A really fast 15000 rpm SAS harddisk might top that.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02 11:18                   ` Martin Steigerwald
@ 2012-08-02 17:39                     ` Marc MERLIN
  2012-08-02 20:20                       ` Martin Steigerwald
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-08-02 17:39 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha

On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote:
> > I've run the fio tests in:
> > /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
> 
> … you are still using dm_crypt?
 
That was my biggest partition and so far I've found no performance impact
on file access between unencrypted and dm_crypt.
I just took out my swap partition and made a smaller btrfs there:
/dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0

I mounted without discard.

> >     lat (usec) : 50=0.01%
> >     lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
> >     lat (msec) : 500=72.44%, 750=14.43%
> 
> Gosh, look at these latencies!
> 
> 72.44% of all requests above 500 (in words: five hundred) milliseconds!
> And 14.43% above 750 msecs. The percentage of requests served at 100 msecs
> or less was below one percent! Hey, is this an SSD or what?
 
Yeah, that's kind of what I've been complaining about since the beginning :)
Once I'm reading sequentially, it goes fast, but random access/latency is
indeed abysmal.

> Still even with iodepth 64 totally different picture. And look at the IOPS
> and throughput.
 
Yep. I know mine are bad :(

> For reference, this refers to
> 
> [global]
> ioengine=libaio
> direct=1
> iodepth=64

Since it's slightly different than the first job file you gave me, I re-ran
with this one this time.

gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2
gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:~# grep sda2 /proc/mounts
/dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0

here's the btrfs test (ext4 is lower down):
gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
2.0.8
Starting 4 processes
zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193  iops] [eta 02m:10s]     
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318
  read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec
    slat (usec): min=3 , max=37432 , avg=7252.52, stdev=5717.70
    clat (usec): min=13 , max=981927 , avg=454046.13, stdev=110527.92
     lat (msec): min=5 , max=999 , avg=461.30, stdev=112.00
    clat percentiles (msec):
     |  1.00th=[  145],  5.00th=[  269], 10.00th=[  371], 20.00th=[  408],
     | 30.00th=[  424], 40.00th=[  437], 50.00th=[  449], 60.00th=[  457],
     | 70.00th=[  474], 80.00th=[  490], 90.00th=[  570], 95.00th=[  644],
     | 99.00th=[  865], 99.50th=[  922], 99.90th=[  963], 99.95th=[  979],
     | 99.99th=[  979]
    bw (KB/s)  : min=    8, max= 2807, per=100.00%, avg=1227.75, stdev=317.57
    lat (usec) : 20=0.01%
    lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82%
    lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58%
  cpu          : usr=0.12%, sys=1.13%, ctx=12186, majf=0, minf=276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=8262/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30340
  read : io=2048.0MB, bw=211257KB/s, iops=23473 , runt=  9927msec
    slat (usec): min=1 , max=56321 , avg=20.51, stdev=424.44
    clat (usec): min=0 , max=57987 , avg=2695.98, stdev=6624.00
     lat (usec): min=1 , max=58015 , avg=2716.75, stdev=6642.09
    clat percentiles (usec):
     |  1.00th=[    1],  5.00th=[   10], 10.00th=[   30], 20.00th=[  100],
     | 30.00th=[  217], 40.00th=[  362], 50.00th=[  494], 60.00th=[  636],
     | 70.00th=[  892], 80.00th=[ 1656], 90.00th=[ 7392], 95.00th=[21632],
     | 99.00th=[29056], 99.50th=[29568], 99.90th=[43776], 99.95th=[46848],
     | 99.99th=[57600]
    bw (KB/s)  : min=166675, max=260984, per=99.83%, avg=210892.26, stdev=22433.65
    lat (usec) : 2=2.16%, 4=0.43%, 10=2.33%, 20=2.80%, 50=5.72%
    lat (usec) : 100=6.47%, 250=12.35%, 500=18.29%, 750=15.52%, 1000=5.44%
    lat (msec) : 2=13.04%, 4=3.59%, 10=3.83%, 20=1.88%, 50=6.11%
    lat (msec) : 100=0.04%
  cpu          : usr=4.51%, sys=35.70%, ctx=11480, majf=0, minf=278
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=233025/w=0/d=0, short=r=0/w=0/d=0
zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30348
  write: io=110768KB, bw=1845.1KB/s, iops=208 , runt= 60007msec
    slat (usec): min=26 , max=50160 , avg=4789.12, stdev=4968.75
    clat (usec): min=29 , max=752494 , avg=301858.86, stdev=80422.24
     lat (msec): min=12 , max=757 , avg=306.65, stdev=81.52
    clat percentiles (msec):
     |  1.00th=[  208],  5.00th=[  241], 10.00th=[  245], 20.00th=[  255],
     | 30.00th=[  262], 40.00th=[  269], 50.00th=[  277], 60.00th=[  289],
     | 70.00th=[  306], 80.00th=[  326], 90.00th=[  363], 95.00th=[  519],
     | 99.00th=[  627], 99.50th=[  652], 99.90th=[  734], 99.95th=[  742],
     | 99.99th=[  750]
    bw (KB/s)  : min=  616, max= 2688, per=99.72%, avg=1839.80, stdev=399.60
    lat (usec) : 50=0.01%
    lat (msec) : 20=0.02%, 50=0.04%, 100=0.07%, 250=14.31%, 500=79.62%
    lat (msec) : 750=5.91%, 1000=0.01%
  cpu          : usr=0.28%, sys=2.60%, ctx=10167, majf=0, minf=20
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.5%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=12495/d=0, short=r=0/w=0/d=0
sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30364
  write: io=84902KB, bw=1414.1KB/s, iops=156 , runt= 60005msec
    slat (usec): min=18 , max=40825 , avg=6389.99, stdev=6072.56
    clat (usec): min=22 , max=887097 , avg=401897.77, stdev=108015.23
     lat (msec): min=11 , max=899 , avg=408.29, stdev=109.47
    clat percentiles (msec):
     |  1.00th=[  262],  5.00th=[  302], 10.00th=[  318], 20.00th=[  338],
     | 30.00th=[  351], 40.00th=[  363], 50.00th=[  375], 60.00th=[  388],
     | 70.00th=[  404], 80.00th=[  433], 90.00th=[  502], 95.00th=[  693],
     | 99.00th=[  783], 99.50th=[  824], 99.90th=[  873], 99.95th=[  881],
     | 99.99th=[  889]
    bw (KB/s)  : min=  346, max= 2115, per=99.44%, avg=1406.11, stdev=333.58
    lat (usec) : 50=0.01%
    lat (msec) : 20=0.02%, 50=0.03%, 100=0.07%, 250=0.59%, 500=89.12%
    lat (msec) : 750=7.91%, 1000=2.24%
  cpu          : usr=0.28%, sys=2.21%, ctx=13224, majf=0, minf=21
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.3%, >=64=99.3%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=9369/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=73682KB, aggrb=1227KB/s, minb=1227KB/s, maxb=1227KB/s, mint=60004msec, maxt=60004msec

Run status group 1 (all jobs):
   READ: io=2048.0MB, aggrb=211257KB/s, minb=211257KB/s, maxb=211257KB/s, mint=9927msec, maxt=9927msec

Run status group 2 (all jobs):
  WRITE: io=110768KB, aggrb=1845KB/s, minb=1845KB/s, maxb=1845KB/s, mint=60007msec, maxt=60007msec

Run status group 3 (all jobs):
  WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, mint=60005msec, maxt=60005msec
gandalfthegreat:/mnt/mnt2#
(no lines beyond this, fio 2.0.8)

> This could be another good test. Text with Ext4 on plain logical volume
> without dm_crypt.
> Can you also post the last lines:
> 
> Disk stats (read/write):
>   dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78%
>     sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78%
> martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>
 
I didn't get these lines.
 
> Its gives information on who good the I/O scheduler was able to merge requests.
> 
> I didn´t see much of a difference between CFQ and noop, so it may not
> matter much, but since it gives also a number on total disk utilization
> its still quite nice to have.

I tried deadline and noop, and indeed I'm not seeing much of a difference for my basic tests.
For now I have deadline.
 
> So my recommendation of now:
> 
> Remove as much factors as possible and in order to compare results with
> what I posted try with plain logical volume with Ext4.

gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 /dev/sda2
/dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0

gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
2.0.8
Starting 4 processes
zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280  iops] [eta 01m:21s]  
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077
  read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt=  7592msec
    slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01
    clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63
     lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36
    clat percentiles (usec):
     |  1.00th=[  450],  5.00th=[  548], 10.00th=[  620], 20.00th=[  724],
     | 30.00th=[  820], 40.00th=[  908], 50.00th=[ 1004], 60.00th=[ 1096],
     | 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040],
     | 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168],
     | 99.99th=[33024]
    bw (KB/s)  : min=76463, max=385328, per=100.00%, avg=277313.20, stdev=94661.29
    lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70%
    lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17%
    lat (msec) : 100=0.01%
  cpu          : usr=11.91%, sys=51.64%, ctx=91337, majf=0, minf=277
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=383190/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30081
  read : io=2048.0MB, bw=150043KB/s, iops=16641 , runt= 13977msec
    slat (usec): min=1 , max=2134 , avg= 7.09, stdev=10.83
    clat (usec): min=298 , max=16751 , avg=3836.41, stdev=755.26
     lat (usec): min=304 , max=16771 , avg=3843.77, stdev=754.77
    clat percentiles (usec):
     |  1.00th=[ 2608],  5.00th=[ 2960], 10.00th=[ 3152], 20.00th=[ 3376],
     | 30.00th=[ 3536], 40.00th=[ 3664], 50.00th=[ 3792], 60.00th=[ 3920],
     | 70.00th=[ 4080], 80.00th=[ 4256], 90.00th=[ 4448], 95.00th=[ 4704],
     | 99.00th=[ 5216], 99.50th=[ 5984], 99.90th=[13888], 99.95th=[15296],
     | 99.99th=[16512]
    bw (KB/s)  : min=134280, max=169692, per=100.00%, avg=150111.30, stdev=7227.35
    lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.07%, 4=64.79%, 10=34.87%, 20=0.25%
  cpu          : usr=6.81%, sys=20.01%, ctx=77116, majf=0, minf=279
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=232601/w=0/d=0, short=r=0/w=0/d=0
zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30088
  write: io=111828KB, bw=1863.7KB/s, iops=209 , runt= 60006msec
    slat (usec): min=7 , max=51044 , avg=4757.60, stdev=4816.78
    clat (usec): min=958 , max=582205 , avg=299854.73, stdev=56663.66
     lat (msec): min=12 , max=588 , avg=304.61, stdev=57.22
    clat percentiles (msec):
     |  1.00th=[   73],  5.00th=[  225], 10.00th=[  251], 20.00th=[  273],
     | 30.00th=[  285], 40.00th=[  297], 50.00th=[  306], 60.00th=[  314],
     | 70.00th=[  318], 80.00th=[  330], 90.00th=[  343], 95.00th=[  359],
     | 99.00th=[  482], 99.50th=[  519], 99.90th=[  553], 99.95th=[  570],
     | 99.99th=[  570]
    bw (KB/s)  : min= 1033, max= 3430, per=99.67%, avg=1856.90, stdev=307.24
    lat (usec) : 1000=0.01%
    lat (msec) : 20=0.06%, 50=0.72%, 100=0.44%, 250=8.50%, 500=89.52%
    lat (msec) : 750=0.75%
  cpu          : usr=0.27%, sys=1.27%, ctx=9505, majf=0, minf=21
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.5%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=12580/d=0, short=r=0/w=0/d=0
sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30102
  write: io=137316KB, bw=2288.2KB/s, iops=254 , runt= 60013msec
    slat (usec): min=5 , max=40817 , avg=3911.57, stdev=4785.23
    clat (msec): min=4 , max=682 , avg=246.61, stdev=77.35
     lat (msec): min=5 , max=688 , avg=250.53, stdev=78.39
    clat percentiles (msec):
     |  1.00th=[   73],  5.00th=[  143], 10.00th=[  186], 20.00th=[  206],
     | 30.00th=[  221], 40.00th=[  229], 50.00th=[  239], 60.00th=[  245],
     | 70.00th=[  258], 80.00th=[  273], 90.00th=[  318], 95.00th=[  416],
     | 99.00th=[  529], 99.50th=[  562], 99.90th=[  660], 99.95th=[  668],
     | 99.99th=[  676]
    bw (KB/s)  : min=  811, max= 4243, per=99.65%, avg=2279.92, stdev=610.32
    lat (msec) : 10=0.26%, 20=0.03%, 50=0.22%, 100=1.97%, 250=62.01%
    lat (msec) : 500=33.43%, 750=2.07%
  cpu          : usr=0.35%, sys=1.46%, ctx=10993, majf=0, minf=21
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=0/w=15293/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=2048.0MB, aggrb=276231KB/s, minb=276231KB/s, maxb=276231KB/s, mint=7592msec, maxt=7592msec

Run status group 1 (all jobs):
   READ: io=2048.0MB, aggrb=150043KB/s, minb=150043KB/s, maxb=150043KB/s, mint=13977msec, maxt=13977msec

Run status group 2 (all jobs):
  WRITE: io=111828KB, aggrb=1863KB/s, minb=1863KB/s, maxb=1863KB/s, mint=60006msec, maxt=60006msec

Run status group 3 (all jobs):
  WRITE: io=137316KB, aggrb=2288KB/s, minb=2288KB/s, maxb=2288KB/s, mint=60013msec, maxt=60013msec

Disk stats (read/write):
  sda: ios=505649/30833, merge=110818/7835, ticks=917980/187892, in_queue=1106088, util=98.50%
gandalfthegreat:/mnt/mnt2# 
 
> If the values are still quite slow, I think its good to ask Linux
> block layer experts – for example by posting on fio mailing list where
> there are people subscribed that may be able to provide other test
> results – and SSD experts. There might be a Linux block layer mailing
> list or use libata or fsdevel, I don´t know.
 
I'll try there next. I think after this mail, and your help which is very
much appreciated, it's fair to take things out of this list to stop spamming
the wrong people :)

> Have the IOPS run on the device it self. That will remove any filesystem
> layer. But only the read only tests, to make sure I suggest to use fio
> with the --readonly option as safety guard. Unless you have a spare SSD
> that you can afford to use for write testing which will likely destroy
> every filesystem on it. Or let it run on just one logical volume.
 
Can you send me a recommended job config you'd like me to run if the runs
above haven't already answered your questions?

> What does your SSD report discard alignment – well but even that should
> not matter for reads:
> 
> merkaba:/sys/block/sda> cat discard_alignment 
> 0

gandalfthegreat:/sys/block/sda# cat discard_alignment 
0
 
> Seems Intel tells us to not care at all. Or the value for some reason
> cannot be read.
> 
> merkaba:/sys/block/sda> cd queue
> merkaba:/sys/block/sda/queue> grep . *
> add_random:1
> discard_granularity:512
> discard_max_bytes:2147450880
> discard_zeroes_data:1
> hw_sector_size:512
> 
> I would be interested whether these above differ.
 
I have the exact same.

> And whether there is any optimal io size.

I also have
optimal_io_size:0

> One note regarding this: Back then I made the Ext4 I carried above tests
> in the last year to be aligned to what I think could be good for the SSD.
> I do not know whether this matters much.
> 
> merkaba:~> tune2fs -l /dev/merkaba/home 
> RAID stripe width:        128

> I think this is a plan to find out whether its either really the hardware
> or some wierd happening in the low level parts of Linux, e.g. the block
> layer or dm_crypt or filing system.
> 
> Reduce it to the most basic level and then work from there.

So, I went back to basics.

On Thu, Aug 02, 2012 at 01:25:17PM +0200, Martin Steigerwald wrote:
> Heck, I didn´t look at the IOPS figure!
> 
> 189 IOPS for a SATA-600 SSD. Thats pathetic.
 
Yeah, and the new tests above without dmcrypt are no better.

> A really fast 15000 rpm SAS harddisk might top that.

My slow 1TB 5400rpm laptop hard drive runs du -s 4x faster than 
the SSD right now :)

I read your comments about the multiple things you had in mind. Given the
new data, should I just go to an io list now, or even save myself some time
and just return the damned SSDs and buy from another vendor? :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02 17:39                     ` Marc MERLIN
@ 2012-08-02 20:20                       ` Martin Steigerwald
  2012-08-02 20:44                         ` Marc MERLIN
  0 siblings, 1 reply; 68+ messages in thread
From: Martin Steigerwald @ 2012-08-02 20:20 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha

Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote:
> > > I've run the fio tests in:
> > > /dev/mapper/cryptroot /var btrfs rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
> > 
> > … you are still using dm_crypt?
[…]
> I just took out my swap partition and made a smaller btrfs there:
> /dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0
> 
> I mounted without discard.
[…]
> > For reference, this refers to
> > 
> > [global]
> > ioengine=libaio
> > direct=1
> > iodepth=64
> 
> Since it's slightly different than the first job file you gave me, I re-ran
> with this one this time.
> 
> gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2
> gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2
> gandalfthegreat:~# grep sda2 /proc/mounts
> /dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0
> 
> here's the btrfs test (ext4 is lower down):
> gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
> zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8

Still abysmal except for sequential reads.

> Starting 4 processes
> zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
> zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193  iops] [eta 02m:10s]     
> zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318
>   read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec
[…]
>     lat (usec) : 20=0.01%
>     lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82%
>     lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58%

1 second latency? Oh well…


> Run status group 3 (all jobs):
>   WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, mint=60005msec, maxt=60005msec
> gandalfthegreat:/mnt/mnt2#
> (no lines beyond this, fio 2.0.8)
[…]
> > Can you also post the last lines:
> > 
> > Disk stats (read/write):
> >   dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, in_queue=3867448, util=98.81%, aggrios=504829/504643, aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, aggrutil=98.78%
> >     sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, in_queue=3223048, util=98.78%
> > martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>
>  
> I didn't get these lines.

Hmmm, back then I guess this was fio 1.57 or so.

> I tried deadline and noop, and indeed I'm not seeing much of a difference for my basic tests.
> For now I have deadline.
>  
> > So my recommendation of now:
> > 
> > Remove as much factors as possible and in order to compare results with
> > what I posted try with plain logical volume with Ext4.
> 
> gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 /dev/sda2
> /dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0
> 
> gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
> zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8
> Starting 4 processes
> zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
> zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280  iops] [eta 01m:21s]  
> zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077
>   read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt=  7592msec
>     slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01
>     clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63
>      lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36
>     clat percentiles (usec):
>      |  1.00th=[  450],  5.00th=[  548], 10.00th=[  620], 20.00th=[  724],
>      | 30.00th=[  820], 40.00th=[  908], 50.00th=[ 1004], 60.00th=[ 1096],
>      | 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040],
>      | 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168],
>      | 99.99th=[33024]
>     bw (KB/s)  : min=76463, max=385328, per=100.00%, avg=277313.20, stdev=94661.29
>     lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70%
>     lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17%
>     lat (msec) : 100=0.01%
>   cpu          : usr=11.91%, sys=51.64%, ctx=91337, majf=0, minf=277
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=383190/w=0/d=0, short=r=0/w=0/d=0

Hey, what's this? With Ext4 you have really good random read performance
now! Way better than the Intel SSD 320 and…

> sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30081
>   read : io=2048.0MB, bw=150043KB/s, iops=16641 , runt= 13977msec
>     slat (usec): min=1 , max=2134 , avg= 7.09, stdev=10.83
>     clat (usec): min=298 , max=16751 , avg=3836.41, stdev=755.26
>      lat (usec): min=304 , max=16771 , avg=3843.77, stdev=754.77
>     clat percentiles (usec):
>      |  1.00th=[ 2608],  5.00th=[ 2960], 10.00th=[ 3152], 20.00th=[ 3376],
>      | 30.00th=[ 3536], 40.00th=[ 3664], 50.00th=[ 3792], 60.00th=[ 3920],
>      | 70.00th=[ 4080], 80.00th=[ 4256], 90.00th=[ 4448], 95.00th=[ 4704],
>      | 99.00th=[ 5216], 99.50th=[ 5984], 99.90th=[13888], 99.95th=[15296],
>      | 99.99th=[16512]
>     bw (KB/s)  : min=134280, max=169692, per=100.00%, avg=150111.30, stdev=7227.35
>     lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
>     lat (msec) : 2=0.07%, 4=64.79%, 10=34.87%, 20=0.25%
>   cpu          : usr=6.81%, sys=20.01%, ctx=77116, majf=0, minf=279
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
>      issued    : total=r=232601/w=0/d=0, short=r=0/w=0/d=0

… even faster than sequential reads. That just doesn't make any sense
to me.

The performance of your Samsung SSD seems to be quite erratic. It seems
that the device is capable of being fast, but only sometimes shows this
capability.

> zufälligschreiben: (groupid=2, jobs=1): err= 0: pid=30088
>   write: io=111828KB, bw=1863.7KB/s, iops=209 , runt= 60006msec
[…]
> sequentiellschreiben: (groupid=3, jobs=1): err= 0: pid=30102
>   write: io=137316KB, bw=2288.2KB/s, iops=254 , runt= 60013msec
[…]
> Run status group 0 (all jobs):
>    READ: io=2048.0MB, aggrb=276231KB/s, minb=276231KB/s, maxb=276231KB/s, mint=7592msec, maxt=7592msec
> 
> Run status group 1 (all jobs):
>    READ: io=2048.0MB, aggrb=150043KB/s, minb=150043KB/s, maxb=150043KB/s, mint=13977msec, maxt=13977msec
> 
> Run status group 2 (all jobs):
>   WRITE: io=111828KB, aggrb=1863KB/s, minb=1863KB/s, maxb=1863KB/s, mint=60006msec, maxt=60006msec
> 
> Run status group 3 (all jobs):
>   WRITE: io=137316KB, aggrb=2288KB/s, minb=2288KB/s, maxb=2288KB/s, mint=60013msec, maxt=60013msec

Writes are slow again.

> Disk stats (read/write):
>   sda: ios=505649/30833, merge=110818/7835, ticks=917980/187892, in_queue=1106088, util=98.50%

Well, there we have that line. You don't seem to have the other line,
possibly due to not using LVM.

> gandalfthegreat:/mnt/mnt2# 
>  
> > If the values are still quite slow, I think its good to ask Linux
> > block layer experts – for example by posting on fio mailing list where
> > there are people subscribed that may be able to provide other test
> > results – and SSD experts. There might be a Linux block layer mailing
> > list or use libata or fsdevel, I don´t know.
>  
> I'll try there next. I think after this mail, and your help which is very
> much appreciated, it's fair to take things out of this list to stop spamming
> the wrong people :)

Well the only thing interesting regarding BTRFS is the whopping
difference in random read performance between BTRFS and Ext4 on this
SSD.

> > Have the IOPS run on the device it self. That will remove any filesystem
> > layer. But only the read only tests, to make sure I suggest to use fio
> > with the --readonly option as safety guard. Unless you have a spare SSD
> > that you can afford to use for write testing which will likely destroy
> > every filesystem on it. Or let it run on just one logical volume.
>  
> Can you send me a recommended job config you'd like me to run if the runs
> above haven't already answered your questions?

[global]
ioengine=libaio
direct=1
filename=/dev/sdb
size=476940m
bsrange=2k-16k

[zufälliglesen]
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60

I wouldn't expect much of a difference, but then the random read performance
is quite different between Ext4 and BTRFS on this disk. That makes it
interesting to test without any filesystem in between, and over the
whole device.

It might make sense to test around with different alignment; the fio
option for that is blockalign (short form ba)…

> On Thu, Aug 02, 2012 at 01:25:17PM +0200, Martin Steigerwald wrote:
> > Heck, I didn´t look at the IOPS figure!
> > 
> > 189 IOPS for a SATA-600 SSD. Thats pathetic.
>  
> Yeah, and the new tests above without dmcrypt are no better.
> 
> > A really fast 15000 rpm SAS harddisk might top that.
> 
> My slow 1TB 5400rpm laptop hard drive runs du -s 4x faster than 
> the SSD right now :)
> 
> I read your comments about the multiple things you had in mind. Given the
> new data, should I just go to an io list now, or even save myself some time
> and just return the damned SSDs and buy from another vendor? :)

… or get yourself another SSD. It's your decision.

I admire your endurance. ;)

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02 20:20                       ` Martin Steigerwald
@ 2012-08-02 20:44                         ` Marc MERLIN
  2012-08-02 21:21                           ` Martin Steigerwald
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-08-02 20:44 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha

On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote:
> Hey, whats this? With Ext4 you have really good random read performance
> now! Way better than the Intel SSD 320 and…

Yep, my du -sh tests do show that ext4 is 2x faster than btrfs.
Obviously it's sending IO in a way that either the IO subsystem, the Linux
driver, or the drive prefers.

> The performance of your Samsung SSD seems to be quite erratic. It seems
> that the device is capable of being fast, but only sometimes shows this
> capability.
 
That's indeed exactly what I'm seeing in real life :)

> > > Have the IOPS run on the device it self. That will remove any filesystem
> > > layer. But only the read only tests, to make sure I suggest to use fio
> > > with the --readonly option as safety guard. Unless you have a spare SSD
> > > that you can afford to use for write testing which will likely destroy
> > > every filesystem on it. Or let it run on just one logical volume.
> >  
> > Can you send me a recommended job config you'd like me to run if the runs
> > above haven't already answered your questions?
 
> [global]
(...)

I used this and just changed filename to /dev/sda. Since I'm reading
from the beginning of the drive, reads have to be aligned.

> I won´t expect much of a difference, but then the random read performance
> is quite different between Ext4 and BTRFS on this disk. That would make
> it interesting to test without any filesystem in between and over the
> whole device.

Here is the output:
gandalfthegreat:~# fio --readonly ./fio.job3
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0  iops] [eta 01m:00s] 
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172
  read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec
    slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64
    clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66
     lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88
    clat percentiles (usec):
     |  1.00th=[  126],  5.00th=[  235], 10.00th=[ 5216], 20.00th=[ 5920],
     | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480],
     | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768],
     | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656],
     | 99.99th=[27264]
    bw (KB/s)  : min=  423, max= 5776, per=100.00%, avg=986.48, stdev=480.68
    lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02%
    lat (usec) : 1000=0.02%
    lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18%
  cpu          : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=6532/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199
  read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec
    slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91
    clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32
     lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06
    clat percentiles (usec):
     |  1.00th=[   61],  5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920],
     | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608],
     | 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 95.00th=[17536],
     | 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568],
     | 99.99th=[34048]
    bw (KB/s)  : min=  405, max= 2680, per=100.00%, avg=912.92, stdev=261.62
    lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02%
    lat (usec) : 1000=0.03%
    lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36%
  cpu          : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=6101/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, maxt=60002msec

Run status group 1 (all jobs):
   READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, maxt=60002msec

Disk stats (read/write):
  sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, util=99.30%

> … or get yourself another SSD. Its your decision.
> 
> I admire your endurance. ;)

Since I've gotten 2 SSDs to make sure I didn't get one bad one, and the
company is getting great reviews for them, I'm now pretty sure that
it's likely a problem with a Linux driver, which is interesting for us
all to debug :)
If I go buy another brand, the next guy will have the same problems as
me.

But yes, if we figure this out, Samsung owes you and me some money :)

I'll try plugging this SSD in a totally different PC and see what happens.
This may say if it's an AHCI/intel sata driver problem.
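
In the meantime it might be worth double-checking which driver and link
speed are actually in use – something like:

  lspci -k | grep -i -A 3 sata
  dmesg | grep -i 'sata link'

should show whether ahci is bound and whether the link came up at 6.0 Gbps.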

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02 20:44                         ` Marc MERLIN
@ 2012-08-02 21:21                           ` Martin Steigerwald
  2012-08-02 21:49                             ` Marc MERLIN
  0 siblings, 1 reply; 68+ messages in thread
From: Martin Steigerwald @ 2012-08-02 21:21 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha

Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote:
> > Hey, whats this? With Ext4 you have really good random read performance
> > now! Way better than the Intel SSD 320 and…
> 
> Yep, my du -sh tests do show that ext4 is 2x faster than btrfs.
> Obviously it's sending IO in a way that either the IO subsystem, linux
> driver, or drive prefer.

But only on reads.

> > > > Have the IOPS run on the device it self. That will remove any filesystem
> > > > layer. But only the read only tests, to make sure I suggest to use fio
> > > > with the --readonly option as safety guard. Unless you have a spare SSD
> > > > that you can afford to use for write testing which will likely destroy
> > > > every filesystem on it. Or let it run on just one logical volume.
> > >  
> > > Can you send me a recommended job config you'd like me to run if the runs
> > > above haven't already answered your questions?
>  
> > [global]
> (...)
> 
> I used this and just changed filename to /dev/sda. Since I'm reading
> from the beginning of the drive, reads have to be aligned.
> 
> > I won´t expect much of a difference, but then the random read performance
> > is quite different between Ext4 and BTRFS on this disk. That would make
> > it interesting to test without any filesystem in between and over the
> > whole device.
> 
> Here is the output:
> gandalfthegreat:~# fio --readonly ./fio.job3
> zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
> sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
> 2.0.8
> Starting 2 processes
> Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0  iops] [eta 01m:00s] 
> zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172
>   read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec

WTF?

Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely
without it.

Also I noticed that I had iodepth 1 in there to circumvent any in-drive
cache / optimization.

>     slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64
>     clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66
>      lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88
>     clat percentiles (usec):
>      |  1.00th=[  126],  5.00th=[  235], 10.00th=[ 5216], 20.00th=[ 5920],
>      | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480],
>      | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768],
>      | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656],
>      | 99.99th=[27264]
>     bw (KB/s)  : min=  423, max= 5776, per=100.00%, avg=986.48, stdev=480.68
>     lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02%
>     lat (usec) : 1000=0.02%
>     lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18%
>   cpu          : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=6532/w=0/d=0, short=r=0/w=0/d=0

Latency is still way too high even with iodepth 1: 10 milliseconds for
43% of requests. And the throughput and IOPS are still abysmal even
for iodepth 1 (see below for Intel SSD 320 values).

Okay, one further idea: Remove the bsrange to just test with 4k blocks.

Additionally, test whether these are aligned, using

       blockalign=int[,int], ba=int[,int]
              At what boundary to align random IO offsets. Defaults to
              the  same  as  'blocksize'  the minimum blocksize given.
              Minimum alignment is typically 512b for using direct IO,
              though  it  usually  depends on the hardware block size.
              This option is mutually exclusive with  using  a  random
              map for files, so it will turn off that option.

I would first test with 4k blocks as is. And then do something like:

blocksize=4k
blockalign=4k

And then raise

blockalign to some values that may matter like 8k, 128k, 512k, 1m or so.

But that's just guesswork. I do not even know exactly whether it works this
way in fio.
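
Roughly, the kind of job file I have in mind would be (untested, just a
sketch of the idea above; adjust filename to your device and keep fio's
--readonly safety guard):

[global]
ioengine=libaio
direct=1
filename=/dev/sda
blocksize=4k
# raise blockalign to 8k, 128k, 512k, 1m in later runs
blockalign=4k

[zufälliglesen]
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60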

There is something pretty weird going on. But I am not sure what it is.
Maybe an alignment issue, since Ext4 with stripe alignment was so much
faster on reads.

> sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199
>   read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec

Hey, what's this?

>     slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91
>     clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32
>      lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06
>     clat percentiles (usec):
>      |  1.00th=[   61],  5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920],
>      | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608],
>      | 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 95.00th=[17536],
>      | 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568],
>      | 99.99th=[34048]
>     bw (KB/s)  : min=  405, max= 2680, per=100.00%, avg=912.92, stdev=261.62
>     lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02%
>     lat (usec) : 1000=0.03%
>     lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36%
>   cpu          : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued    : total=r=6101/w=0/d=0, short=r=0/w=0/d=0
> 
> Run status group 0 (all jobs):
>    READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, maxt=60002msec
> 
> Run status group 1 (all jobs):
>    READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, maxt=60002msec
> 
> Disk stats (read/write):
>   sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, util=99.30%

What on earth is this?

You are testing the raw device and are getting these fun values out of it?
Let me repeat that here:

merkaba:/tmp> cat iops-read.job 
[global]
ioengine=libaio
direct=1
filename=/dev/sda
bsrange=2k-16k

[zufälliglesen]
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60

merkaba:/tmp> fio iops-read.job
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [57784K/0K /s] [6471 /0  iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=31915
  read : io=1681.2MB, bw=28692KB/s, iops=3193 , runt= 60001msec
    slat (usec): min=5 , max=991 , avg=31.18, stdev=10.07
    clat (usec): min=2 , max=3312 , avg=276.67, stdev=124.41
     lat (usec): min=48 , max=3337 , avg=308.54, stdev=125.51
    clat percentiles (usec):
     |  1.00th=[   61],  5.00th=[   82], 10.00th=[  103], 20.00th=[  159],
     | 30.00th=[  201], 40.00th=[  237], 50.00th=[  270], 60.00th=[  306],
     | 70.00th=[  354], 80.00th=[  402], 90.00th=[  446], 95.00th=[  478],
     | 99.00th=[  524], 99.50th=[  548], 99.90th=[  644], 99.95th=[  716],
     | 99.99th=[ 1020]
    bw (KB/s)  : min=27760, max=31784, per=100.00%, avg=28694.49, stdev=625.77
    lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.15%, 100=8.76%
    lat (usec) : 250=34.69%, 500=54.13%, 750=2.23%, 1000=0.03%
    lat (msec) : 2=0.01%, 4=0.01%
  cpu          : usr=2.91%, sys=12.49%, ctx=199678, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=191587/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=31921
  read : io=3895.9MB, bw=66487KB/s, iops=7384 , runt= 60001msec
    slat (usec): min=5 , max=518 , avg=30.18, stdev= 8.16
    clat (usec): min=1 , max=3394 , avg=100.21, stdev=83.51
     lat (usec): min=44 , max=3429 , avg=131.69, stdev=84.44
    clat percentiles (usec):
     |  1.00th=[   54],  5.00th=[   60], 10.00th=[   61], 20.00th=[   68],
     | 30.00th=[   72], 40.00th=[   79], 50.00th=[   86], 60.00th=[   92],
     | 70.00th=[   99], 80.00th=[  105], 90.00th=[  112], 95.00th=[  163],
     | 99.00th=[  556], 99.50th=[  716], 99.90th=[  900], 99.95th=[  940],
     | 99.99th=[ 1080]
    bw (KB/s)  : min=30276, max=81176, per=99.95%, avg=66453.19, stdev=10893.90
    lat (usec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.57%
    lat (usec) : 100=69.52%, 250=26.59%, 500=2.01%, 750=0.89%, 1000=0.39%
    lat (msec) : 2=0.02%, 4=0.01%
  cpu          : usr=7.12%, sys=27.34%, ctx=461537, majf=0, minf=27
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=443106/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=1681.2MB, aggrb=28691KB/s, minb=28691KB/s, maxb=28691KB/s, mint=60001msec, maxt=60001msec

Run status group 1 (all jobs):
   READ: io=3895.9MB, aggrb=66487KB/s, minb=66487KB/s, maxb=66487KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
  sda: ios=632851/228, merge=0/140, ticks=106392/351, in_queue=105682, util=87.78%


I have seen about 4000 IOPS with pure 4k blocks from that drive, so that
seems to be fine. And look at the latencies: barely any requests even
reach the low millisecond range.

Now just for the sake of it with iodepth 64:

merkaba:/tmp> fio iops-read.job
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [217.8M/0K /s] [24.1K/0  iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=31945
  read : io=12412MB, bw=211819KB/s, iops=23795 , runt= 60003msec
    slat (usec): min=2 , max=2158 , avg=15.69, stdev= 8.96
    clat (usec): min=138 , max=19349 , avg=2670.01, stdev=861.85
     lat (usec): min=189 , max=19366 , avg=2686.29, stdev=861.67
    clat percentiles (usec):
     |  1.00th=[ 1256],  5.00th=[ 1448], 10.00th=[ 1592], 20.00th=[ 1864],
     | 30.00th=[ 2128], 40.00th=[ 2384], 50.00th=[ 2640], 60.00th=[ 2896],
     | 70.00th=[ 3152], 80.00th=[ 3408], 90.00th=[ 3728], 95.00th=[ 3952],
     | 99.00th=[ 4448], 99.50th=[ 4896], 99.90th=[ 8768], 99.95th=[10048],
     | 99.99th=[12096]
    bw (KB/s)  : min=116464, max=216400, per=100.00%, avg=211823.16, stdev=9787.89
    lat (usec) : 250=0.01%, 500=0.01%, 750=0.02%, 1000=0.05%
    lat (msec) : 2=25.18%, 4=70.40%, 10=4.28%, 20=0.05%
  cpu          : usr=14.39%, sys=44.38%, ctx=624900, majf=0, minf=278
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=1427811/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=32301
  read : io=12499MB, bw=213307KB/s, iops=23691 , runt= 60002msec
    slat (usec): min=1 , max=1509 , avg=12.42, stdev= 8.12
    clat (usec): min=306 , max=201523 , avg=2685.75, stdev=1220.14
     lat (usec): min=316 , max=201536 , avg=2698.70, stdev=1220.06
    clat percentiles (usec):
     |  1.00th=[ 1816],  5.00th=[ 2024], 10.00th=[ 2096], 20.00th=[ 2192],
     | 30.00th=[ 2256], 40.00th=[ 2352], 50.00th=[ 2416], 60.00th=[ 2544],
     | 70.00th=[ 2736], 80.00th=[ 2992], 90.00th=[ 3632], 95.00th=[ 4320],
     | 99.00th=[ 5664], 99.50th=[ 6112], 99.90th=[ 7136], 99.95th=[ 7712],
     | 99.99th=[26240]
    bw (KB/s)  : min=144720, max=256568, per=99.96%, avg=213210.08, stdev=23175.58
    lat (usec) : 500=0.01%, 750=0.02%, 1000=0.06%
    lat (msec) : 2=4.10%, 4=88.72%, 10=7.09%, 20=0.01%, 50=0.01%
    lat (msec) : 100=0.01%, 250=0.01%
  cpu          : usr=11.77%, sys=34.79%, ctx=440558, majf=0, minf=280
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=1421534/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=12412MB, aggrb=211818KB/s, minb=211818KB/s, maxb=211818KB/s, mint=60003msec, maxt=60003msec

Run status group 1 (all jobs):
   READ: io=12499MB, aggrb=213306KB/s, minb=213306KB/s, maxb=213306KB/s, mint=60002msec, maxt=60002msec

Disk stats (read/write):
  sda: ios=2151624/243, merge=692022/115, ticks=5719203/651132, in_queue=6372962, util=99.87%

(Seems that extra line is gone from fio 2; I dimly remember that Jens
Axboe removed some output he found to be superfluous, but the disk stats
are still there.)

In that case random versus sequential reads do not even seem to make much of
a difference.

> > … or get yourself another SSD. Its your decision.
> > 
> > I admire your endurance. ;)
> 
> Since I've gotten 2 SSDs to make sure I didn't get one bad one, and the
> company is getting great reviews for them, I'm now pretty sure that
> it's a problem with a Linux driver, which is interesting for us
> all to debug :)

Either that, or possibly there is some firmware update? But then those
reviews would mention it, I bet.

I´d still look whether there is some firmware update available labeled:

"This will raise performance on lots of Linux based workloads by factors
of 10x to 250x" ;)

Anyway, I think we are now getting pretty close to the hardware.

So the issue has to be somewhere in the block layer, the controller,
the SSD firmware or the SSD itself, IMHO.

> If I go buy another brand, the next guy will have the same problems as
> me.
> 
> But yes, if we figure this out, Samsung owes you and me some money :)

;)
 
> I'll try plugging this SSD in a totally different PC and see what happens.
> This may say if it's an AHCI/intel sata driver problem.

Seems we will continue until someone starts to complain here. Maybe
another list would be more appropriate? But then this thread has it all
in one place ;). Adding a CC with some introductory note might be appropriate.
It's your problem, so you decide ;). I´d suggest the fio mailing list,
there are other performance people who may want to chime in.

Another idea: Is there anything funny in SMART values (smartctl -a
and possibly even -x)? Well I gather these SSDs are new, but still.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02 21:21                           ` Martin Steigerwald
@ 2012-08-02 21:49                             ` Marc MERLIN
  2012-08-03 18:45                               ` Martin Steigerwald
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-08-02 21:49 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha

On Thu, Aug 02, 2012 at 11:21:51PM +0200, Martin Steigerwald wrote:
> > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs.
> > Obviously it's sending IO in a way that either the IO subsystem, linux
> > driver, or drive prefer.
> 
> But only on reads.

Yes. I was focussing on the simple case. SSDs can be slower than hard
drives on writes, and now you're getting into deep firmware magic for
corner cases, so I didn't want to touch that here :)
 
> WTF?
> 
> Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely
> without it.

I left it as is:
[global]
ioengine=libaio
direct=1
filename=/dev/sda
size=476940m
bsrange=2k-16k

[zufälliglesen]
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60

I just removed it now.

> Okay, one further idea: Remove the bsrange to just test with 4k blocks.

Ok, there you go:
[global]
ioengine=libaio
direct=1
filename=/dev/sda
blocksize=4k
blockalign=4k

[zufälliglesen]
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60

gandalfthegreat:~# fio --readonly ./fio.job4
zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [1476K/0K /s] [369 /0  iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3614
  read : io=60832KB, bw=1013.7KB/s, iops=253 , runt= 60012msec
    slat (usec): min=4 , max=892 , avg=24.36, stdev=16.05
    clat (usec): min=2 , max=27575 , avg=3915.39, stdev=4990.69
     lat (usec): min=44 , max=27591 , avg=3940.35, stdev=4991.99
    clat percentiles (usec):
     |  1.00th=[   41],  5.00th=[   46], 10.00th=[   51], 20.00th=[   66],
     | 30.00th=[  113], 40.00th=[  163], 50.00th=[  205], 60.00th=[ 4960],
     | 70.00th=[ 5856], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608],
     | 99.00th=[14272], 99.50th=[17280], 99.90th=[18816], 99.95th=[21120],
     | 99.99th=[22912]
    bw (KB/s)  : min=  253, max= 2520, per=100.00%, avg=1014.86, stdev=364.74
>     lat (usec) : 4=0.05%, 10=0.01%, 20=0.01%, 50=8.07%, 100=19.94%
    lat (usec) : 250=26.26%, 500=2.78%, 750=0.30%, 1000=0.09%
    lat (msec) : 2=0.12%, 4=0.07%, 10=20.75%, 20=21.51%, 50=0.05%
  cpu          : usr=0.31%, sys=0.88%, ctx=15250, majf=0, minf=23
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=15208/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3636
  read : io=88468KB, bw=1474.3KB/s, iops=368 , runt= 60009msec
    slat (usec): min=3 , max=340 , avg=16.46, stdev=10.23
    clat (usec): min=1 , max=27356 , avg=2692.99, stdev=4564.76
     lat (usec): min=30 , max=27370 , avg=2709.87, stdev=4566.31
    clat percentiles (usec):
     |  1.00th=[   27],  5.00th=[   29], 10.00th=[   31], 20.00th=[   35],
     | 30.00th=[   47], 40.00th=[   61], 50.00th=[   78], 60.00th=[  107],
     | 70.00th=[  270], 80.00th=[ 5856], 90.00th=[11712], 95.00th=[12608],
     | 99.00th=[16512], 99.50th=[17536], 99.90th=[18816], 99.95th=[19584],
     | 99.99th=[21120]
    bw (KB/s)  : min=  282, max= 5096, per=100.00%, avg=1474.40, stdev=899.36
    lat (usec) : 2=0.05%, 4=0.01%, 20=0.02%, 50=32.45%, 100=24.61%
    lat (usec) : 250=12.52%, 500=1.40%, 750=0.03%, 1000=0.03%
    lat (msec) : 2=0.02%, 4=0.03%, 10=13.95%, 20=14.85%, 50=0.03%
  cpu          : usr=0.33%, sys=0.90%, ctx=22176, majf=0, minf=25
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=22117/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=60832KB, aggrb=1013KB/s, minb=1013KB/s, maxb=1013KB/s, mint=60012msec, maxt=60012msec

Run status group 1 (all jobs):
   READ: io=88468KB, aggrb=1474KB/s, minb=1474KB/s, maxb=1474KB/s, mint=60009msec, maxt=60009msec

Disk stats (read/write):
  sda: ios=54013/1836, merge=291/17, ticks=251660/11052, in_queue=262668, util=99.51%


and the same test with blockalign=512k:
gandalfthegreat:~# fio --readonly ./fio.job4 # blockalign=512k
fio: Any use of blockalign= turns off randommap
zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [4180K/0K /s] [1045 /0  iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3679
  read : io=56096KB, bw=957355 B/s, iops=233 , runt= 60001msec
    slat (usec): min=4 , max=344 , avg=27.73, stdev=15.51
    clat (usec): min=2 , max=38954 , avg=4244.27, stdev=5153.32
     lat (usec): min=44 , max=38980 , avg=4272.70, stdev=5154.68
    clat percentiles (usec):
     |  1.00th=[   42],  5.00th=[   46], 10.00th=[   53], 20.00th=[   69],
     | 30.00th=[  120], 40.00th=[  169], 50.00th=[  211], 60.00th=[ 5024],
     | 70.00th=[ 5920], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608],
     | 99.00th=[16768], 99.50th=[17536], 99.90th=[18560], 99.95th=[20096],
     | 99.99th=[37632]
    bw (KB/s)  : min=  263, max= 3133, per=100.00%, avg=935.09, stdev=392.66
    lat (usec) : 4=0.05%, 10=0.03%, 20=0.01%, 50=7.84%, 100=19.12%
    lat (usec) : 250=25.84%, 500=1.61%, 750=0.08%, 1000=0.05%
    lat (msec) : 2=0.06%, 4=0.08%, 10=21.99%, 20=23.17%, 50=0.06%
  cpu          : usr=0.29%, sys=0.94%, ctx=14069, majf=0, minf=25
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=14024/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3703
  read : io=99592KB, bw=1659.7KB/s, iops=414 , runt= 60008msec
    slat (usec): min=4 , max=186 , avg=15.83, stdev= 9.95
    clat (usec): min=1 , max=25530 , avg=2390.79, stdev=4340.79
     lat (usec): min=28 , max=25539 , avg=2407.01, stdev=4342.90
    clat percentiles (usec):
     |  1.00th=[   26],  5.00th=[   28], 10.00th=[   30], 20.00th=[   34],
     | 30.00th=[   43], 40.00th=[   57], 50.00th=[   72], 60.00th=[   91],
     | 70.00th=[  155], 80.00th=[ 5152], 90.00th=[11712], 95.00th=[12480],
     | 99.00th=[12992], 99.50th=[16768], 99.90th=[18560], 99.95th=[18816],
     | 99.99th=[21632]
    bw (KB/s)  : min=  269, max= 8465, per=100.00%, avg=1663.84, stdev=1367.82
    lat (usec) : 2=0.01%, 4=0.02%, 20=0.01%, 50=35.42%, 100=26.00%
    lat (usec) : 250=11.29%, 500=1.31%, 750=0.05%, 1000=0.01%
    lat (msec) : 2=0.04%, 4=0.02%, 10=12.55%, 20=13.25%, 50=0.01%
  cpu          : usr=0.31%, sys=0.99%, ctx=24963, majf=0, minf=25
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=24898/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=56096KB, aggrb=934KB/s, minb=934KB/s, maxb=934KB/s, mint=60001msec, maxt=60001msec

Run status group 1 (all jobs):
   READ: io=99592KB, aggrb=1659KB/s, minb=1659KB/s, maxb=1659KB/s, mint=60008msec, maxt=60008msec

Disk stats (read/write):
  sda: ios=55021/1637, merge=72/39, ticks=240896/12080, in_queue=252912, util=99.59%


> > Since I've gotten 2 SSDs to make sure I didn't get one bad one, and the
> > company is getting great reviews for them, I'm now pretty sure that
> > it's a problem with a Linux driver, which is interesting for us
> > all to debug :)
> 
> Either that or there possible is some firmware update? But then those
> reviews would mention it I bet.
 
So far, there are no firmware updates for mine; I checked right after
unboxing it :)

> > I'll try plugging this SSD in a totally different PC and see what happens.
> > This may say if it's an AHCI/intel sata driver problem.
> 
> Seems we will continue until someone starts to complain here. Maybe
> another list would be more appropriate? But then this thread has it all
> in one place ;). Adding a CC with some introductory note might be appropriate.
> It's your problem, so you decide ;). I´d suggest the fio mailing list,
> there are other performance people who may want to chime in.
 
Actually you know the lists and people involved better than I do. I'd be
happy if you added a Cc to another list you think is best, and we can
move there.

> Another idea: Is there anything funny in SMART values (smartctl -a
> and possibly even -x)? Well I gather these SSDs are new, but still.


smartctl -a /dev/sda
(snipped)
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       36
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       7
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       9
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   058   040   000    Old_age   Always       -       42
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   253   253   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       4
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       433339112


smartctl -x /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.5.0-amd64-preempt-noide-20120410] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SSD 830 Series
Serial Number:    S0WGNEAC400484
LU WWN Device Id: 5 002538 043584d30
Firmware Version: CXM03B1Q
User Capacity:    512,110,190,592 bytes [512 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-2 revision 2
Local Time is:    Thu Aug  2 14:42:08 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x83)	Offline data collection activity
					is in a Reserved state.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		( 1980) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					No Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  33) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   099   099   000    -    36
 12 Power_Cycle_Count       -O--CK   099   099   000    -    7
177 Wear_Leveling_Count     PO--C-   099   099   000    -    9
179 Used_Rsvd_Blk_Cnt_Tot   PO--C-   100   100   010    -    0
181 Program_Fail_Cnt_Total  -O--CK   100   100   010    -    0
182 Erase_Fail_Count_Total  -O--CK   100   100   010    -    0
183 Runtime_Bad_Block       PO--C-   100   100   010    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O--CK   058   040   000    -    42
195 Hardware_ECC_Recovered  -O-RC-   200   200   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   253   253   000    -    0
235 Unknown_Attribute       -O--C-   099   099   000    -    4
241 Total_LBAs_Written      -O--CK   099   099   000    -    433352008
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
GP/S  Log at address 0x00 has    1 sectors [Log Directory]
GP/S  Log at address 0x01 has    1 sectors [Summary SMART error log]
GP/S  Log at address 0x02 has    2 sectors [Comprehensive SMART error log]
GP/S  Log at address 0x03 has    2 sectors [Ext. Comprehensive SMART error log]
GP/S  Log at address 0x06 has    1 sectors [SMART self-test log]
GP/S  Log at address 0x07 has    2 sectors [Extended self-test log]
GP/S  Log at address 0x09 has    1 sectors [Selective self-test log]
GP/S  Log at address 0x10 has    1 sectors [NCQ Command Error]
GP/S  Log at address 0x11 has    1 sectors [SATA Phy Event Counters]
GP/S  Log at address 0x80 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x81 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x82 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x83 has   16 sectors [Host vendor specific log]
(snip)

SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (2 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%        27         -
# 2  Short offline       Completed without error       00%         8         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Warning: device does not support SCT Commands
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2        49152  R_ERR response for data FIS
0x0003  2        53248  R_ERR response for device-to-host data FIS
0x0004  2        57344  R_ERR response for host-to-device data FIS
0x0005  2         7952  R_ERR response for non-data FIS
0x0006  2         2562  R_ERR response for device-to-host non-data FIS
0x0007  2         4096  R_ERR response for host-to-device non-data FIS
0x0008  2         7952  Device-to-host non-data FIS retries
0x0009  2         7953  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2           25  Device-to-host register FISes sent due to a COMRESET
0x000b  2         6145  CRC errors within host-to-device FIS
0x000d  2         7953  Non-CRC errors within host-to-device FIS
0x000f  2         8176  R_ERR response for host-to-device data FIS, CRC
0x0010  2           15  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2        49153  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2        12112  R_ERR response for host-to-device non-data FIS, non-CRC

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-02 21:49                             ` Marc MERLIN
@ 2012-08-03 18:45                               ` Martin Steigerwald
  2012-08-16  7:45                                 ` Marc MERLIN
  0 siblings, 1 reply; 68+ messages in thread
From: Martin Steigerwald @ 2012-08-03 18:45 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-btrfs, Fajar A. Nugraha

Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> > > I'll try plugging this SSD in a totally different PC and see what
> > > happens. This may say if it's an AHCI/intel sata driver problem.
> >
> > 
> >
> > Seems we will continue until someone starts to complain here. Maybe
> > another list would be more appropriate? But then this thread has it all
> > in one place ;). Adding a CC with some introductory note might be
> > appropriate. It's your problem, so you decide ;). I´d suggest the fio
> > mailing list, there are other performance people who may want to
> > chime in.
> 
>  
> Actually you know the lists and people involved more than me. I'd be
> happy if you added a Cc to another list you think is best, and we can
> move there.

I have no more ideas for the moment.

I suggest you try the fio mailing list at fio lalala vger.kernel.org.

I currently have no ideas for other lists. I bet there is some block layer
/ libata / scsi list that might match. Just browse the lists on
kernel.org and pick the one that sounds most suitable to you¹. Probably
fsdevel, but then the thing does not seem to be filesystem related, aside
from some alignment effect (stripe in Ext4). Hmmm, probably linux-scsi, but
check there first whether block layer and libata related things are
discussed there.

I would start a new thread. State the problem, summarize what you
have tried and what the outcome was, and ask for advice. Add a link to
the original thread here. (It might be a good idea to have one more post
here with a link to the new thread so that other curious people can follow
it there – in case there are any. Otherwise I would end it on this thread.
The thread view already looks ridiculous in KMail here ;-)

And put me on Cc ;). I´d really like to know what's going on here. As
written, I have no more ideas right now. It seems to be getting too
low-level for me.

[1] http://vger.kernel.org/vger-lists.html

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: How can btrfs take 23sec to stat 23K files from an SSD?
  2012-08-03 18:45                               ` Martin Steigerwald
@ 2012-08-16  7:45                                 ` Marc MERLIN
  0 siblings, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-08-16  7:45 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: linux-btrfs, Fajar A. Nugraha

To close this thread.

I gave up on the drive after it showed abysmal benchmarking results in
Windows too.

I owed everyone an update, which I just finished typing:
http://marc.merlins.org/perso/linux/post_2012-08-15_The-tale-of-SSDs_-Crucial-C300-early-Death_-Samsung-830-extreme-random-IO-slowness_-and-settling-with-OCZ-Vertex-4.html

I'm not sure how I could have gotten 2 bad drives from Samsung in 2
different shipments, so I'm afraid the entire line may be bad. At least, it
was for me after extensive benchmarking, and even using their own Windows
benchmarking tool.

In the end, I got an OCZ Vertex 4 and it's superfast as per the benchmarks I
posted in the link above.

What is interesting is that ext4 is somewhat faster than btrfs in the tests
I posted above, but not in ways that should be truly worrisome.
I'm still happy to have COW and snapshots, even if that costs me
performance.

Sorry for spamming the list with what apparently just ended up being a crappy
drive (well, 2, since I got two just to make sure) from a vendor who
shouldn't have been selling crappy SSDs.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: du -s src is a lot slower on SSD than spinning disk in the same laptop
  2012-08-01  5:30       ` Marc MERLIN
  2012-08-01  6:01         ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
  2012-08-01  8:18         ` du -s src is a lot slower on SSD than spinning disk in the same laptop Spelic
@ 2012-08-16  7:50         ` Marc MERLIN
       [not found]           ` <502CC2A2.4010506@shiftmail.org>
  2 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-08-16  7:50 UTC (permalink / raw)
  To: Ted Ts'o; +Cc: Lukáš Czerner, linux-ext4, axboe, Milan Broz

On Tue, Jul 31, 2012 at 10:30:42PM -0700, Marc MERLIN wrote:
> First: btrfs is the slowest:
> gandalfthegreat:/mnt/ssd/var/local# time du -sh src/
> 514M	src/
> real	0m25.741s
> 
> Second: ext4 with mkfs.ext4 -O extent -b 4096 /dev/sda3
> gandalfthegreat:/mnt/mnt3# reset_cache
> gandalfthegreat:/mnt/mnt3# time du -sh src/
> 519M	src/
> real	0m12.459s
> gandalfthegreat:~# grep mnt3 /proc/mounts
> /dev/sda3 /mnt/mnt3 ext4 rw,noatime,discard,data=ordered 0 0
> 
> A freshly made ntfs filesystem through fuse is actually FASTER!
> gandalfthegreat:/mnt/mnt2# reset_cache 
> gandalfthegreat:/mnt/mnt2# time du -sh src/
> 506M	src/
> real	0m8.928s
> gandalfthegreat:/mnt/mnt2# grep mnt2 /proc/mounts
> /dev/sda2 /mnt/mnt2 fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other,blksize=4096 0 0
> 
> How can ntfs via fuse be the fastest?
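
(reset_cache above is just a small local helper that drops the page cache
between runs; something along the lines of:

  sync
  echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes

so that du actually has to hit the disk again.)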

To provide closure to this thread.

I owed everyone an update, which I just finished typing:
http://marc.merlins.org/perso/linux/post_2012-08-15_The-tale-of-SSDs_-Crucial-C300-early-Death_-Samsung-830-extreme-random-IO-slowness_-and-settling-with-OCZ-Vertex-4.html

Basically, the Samsung 830 just sucks. I got 2 of them, and they both utterly
sucked. There is no excuse for an SSD being several times slower than a slow
hard drive on _READs_ (not even talking about writes).

I'm not sure how I could have gotten 2 bad drives from Samsung in 2
different shipments, so I'm afraid the entire line may be bad. At least, it
was for me after extensive benchmarking, and even using their own Windows
benchmarking tool.

In the end, I got an OCZ Vertex 4 and it's superfast as per the benchmarks I
posted in the link above.

As good news, ext4 is faster than btrfs in my (limited) tests across both
SSDs and my hard drive.

Cheers,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: du -s src is a lot slower on SSD than spinning disk in the same laptop
       [not found]           ` <502CC2A2.4010506@shiftmail.org>
@ 2012-08-16 17:55             ` Marc MERLIN
  2012-09-05 16:52               ` ext4 crash with 3.5.2 in ext4_ext_remove_space Marc MERLIN
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-08-16 17:55 UTC (permalink / raw)
  To: Spelic; +Cc: linux-ext4

On Thu, Aug 16, 2012 at 11:51:30AM +0200, Spelic wrote:
> On 08/16/12 09:50, Marc MERLIN wrote:
> >On Tue, Jul 31, 2012 at 10:30:42PM -0700, Marc MERLIN wrote:
> >>How can ntfs via fuse be the fastest?
> >To provide closure to this thread.
> >
> >I owed everyone an update, which I just finished typing:
> >http://marc.merlins.org/perso/linux/post_2012-08-15_The-tale-of-SSDs_-Crucial-C300-early-Death_-Samsung-830-extreme-random-IO-slowness_-and-settling-with-OCZ-Vertex-4.html
> 
> Hello,
> reading your article a few ideas came to my mind.
> 
> Firstly, for the Crucial, I haven't read much about that bug, but would 
> leaving the last 20% of the drive free (make an empty partition there 
> and then fstrim it and leave it empty) help with that bug? Maybe the 
> garbage collection algorithm hasn't got enough free blocks to shuffle 
> data around. I am interested because Crucial would be my prime choice
> from what I have read around. When it resuscitated, did you try to TRIM it
> wholly to try to regain performance? I don't know if you still have it around.
 
The drive was still mostly dead; it didn't work long enough for me to try to
reformat it. I RMA'ed it and haven't used the replacement yet because I
don't have data I don't care about enough :)
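
For whoever wants to try the over-provisioning trick, it boils down to
roughly this (sketch only; device and mount point are just examples, with
sdb2 being a throwaway partition covering the last ~20% of the SSD):

mkfs.ext4 /dev/sdb2
mount /dev/sdb2 /mnt/op
# discard everything in that area, then leave the partition unused
fstrim -v /mnt/op
umount /mnt/op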

> Secondly, for the Samsung 830:
> The random access slowness is reproducible also without dm-crypt it 
> seems to me. This benchmark of yours was NOT on dm-crypt, correct?
> http://www.spinics.net/lists/linux-btrfs/msg18238.html

Correct.

> In your  fio benchmarks, e.g.
> http://www.spinics.net/lists/linux-btrfs/msg18204.html
> I noticed how the iodepth is at 64
> 
> The Samsung SSD has a bug on high IO depths:
> 
> http://www.bit-tech.net/hardware/2011/09/29/samsung-ssd-830-256gb-review/3
> http://www.bit-tech.net/hardware/2011/09/29/samsung-ssd-830-256gb-review/5
> it already behaves badly at 32.
> 
> Please do
> echo 1 > /sys/block/sdX/device/queue_depth
 
Sorry, I don't have the drives anymore, so I can't, but between you and me,
if the drive fails to perform after as much work as I put into it, as well
as in its default configuration on Windows 7, that's pretty damn sad.

Point being that those things should kind of "just work", like the Crucial
(before sudden death) and the OCZ.
I spent too many hours of my life trying to get the damn drive to perform,
and if it takes even more kernel and IO experts to kind of get the drive to
work (assuming it would have, and at this point I'm not certain), then it's
crap in my book (also you can't tune half that stuff on Windows for the
primary users who would buy that drive).

You are welcome to order one though and try your own tests and post them
here. Worst case, you can return it if it sucks for you too.
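
If you do, the queue depth knob would be exercised roughly like this
(untested sketch; adjust the device name to your SSD):

# check the current depth, drop it to 1, then re-run the 4k randread job
cat /sys/block/sda/device/queue_depth
echo 1 > /sys/block/sda/device/queue_depth
fio --readonly ./fio.job4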

I just added the screenshot of the Windows benchmarks, if that helps.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* ext4 crash with 3.5.2 in ext4_ext_remove_space
  2012-08-16 17:55             ` Marc MERLIN
@ 2012-09-05 16:52               ` Marc MERLIN
  2012-09-05 17:50                 ` Lukáš Czerner
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-09-05 16:52 UTC (permalink / raw)
  To: linux-ext4

I get a crash when mounting a filesystem.
I'm making an image now with e2image -r before I run e2fsck on it.

Is there anything else you'd like me to do?

[13090.175424] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
[13090.184897] IP: [<ffffffff8119b92f>] ext4_ext_remove_space+0x83d/0xb51
[13090.192927] PGD 1120d7067 PUD 1123ac067 PMD 0 
[13090.198469] Oops: 0000 [#1] PREEMPT SMP 
[13090.203508] CPU 1 
[13090.205368] Modules linked in:[13090.209508]  ppdev lp tun autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx sata_mv kl5kusb105 ftdi_sio keyspan nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_REJECT xt_state xt_tcpudp xt_LOG iptable_mangle iptable_filter deflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_x86_64 serpent_sse2_x86_64 lrw serpent_generic xts gf128mul cast5 des_generic xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo blowfish_generic blowfish_x86_64 blowfish_common dm_crypt dm_mirror dm_region_hash dm_log aes_x86_64 fuse lm85 hwmon_vid dm_snapshot dm_mod iptable_nat ip_tables nf_conntrack_ftp ipt_MASQUERADE nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 x_tables nf_conntrack sg 
 st snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi snd_hda_codec_realtek rc_ati_x10 ati_remote rc_core pl2303 usbserial i915 snd_hda_intel snd_cmipci snd_hda_codec gameport drm_kms_helper drm snd_opl3_lib snd_mpu401_uart eeepc_wmi asus_wmi
 i2c_algo_bit[13090.315877]  sparse_keymap rfkill snd_hwdep snd_seq_midi acpi_cpufreq snd_seq_midi_event mperf kvm_intel kvm processor snd_seq snd_pcm pci_hotplug ehci_hcd wmi parport_pc xhci_hcd microcode button video sata_sil24 parport snd_rawmidi snd_timer usbcore crc32c_intel ghash_clmulni_intel i2c_i801 snd_seq_device cryptd evdev snd lpc_ich mei i2c_core pcspkr snd_page_alloc thermal_sys usb_common soundcore coretemp r8169 mii tpm_tis tpm tpm_bios [last unloaded: kl5kusb105]

[13090.370146] Pid: 9658, comm: mount Not tainted 3.5.2-amd64-preempt-noide-20120903 #1 System manufacturer System Product Name/P8H67-M PRO
[13090.387559] RIP: 0010:[<ffffffff8119b92f>]  [<ffffffff8119b92f>] ext4_ext_remove_space+0x83d/0xb51
[13090.398254] RSP: 0018:ffff88012b4c7a18  EFLAGS: 00010246
[13090.406241] RAX: 0000000000000000 RBX: ffff8800a41fb4e8 RCX: ffff8800a41fb450
[13090.416102] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 000000001b340000
[13090.426003] RBP: ffff88012b4c7af8 R08: 000000001b340000 R09: ffff88008ae6f6e0
[13090.435920] R10: ffff880000000000 R11: 0000000000000000 R12: ffff8801156e86f0
[13090.445782] R13: 0000000000000000 R14: ffff8801156e86c0 R15: 0000000000000000
[13090.455627] FS:  0000000000000000(0000) GS:ffff88013fa80000(0063) knlGS:00000000f75ae750
[13090.466451] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
[13090.475010] CR2: 0000000000000028 CR3: 0000000115614000 CR4: 00000000000407e0
[13090.484921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13090.494802] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[13090.504649] Process mount (pid: 9658, threadinfo ffff88012b4c6000, task ffff88008d91c4c0)
[13090.515567] Stack:
[13090.520209]  ffff88012b4c7a28 ffffffff81175f65 ffff88012b4c7a88 ffffffff81179c44
[13090.530426]  ffff880100001000 00000000fffffff5 ffffffffa41fb438 ffff8801156e8748
[13090.540694]  00000000000a77ff ffff8800a41fb438 0000000008007800 ffff880122e67800
[13090.551124] Call Trace:
[13090.556238]  [<ffffffff81175f65>] ? brelse+0xe/0x10
[13090.564018]  [<ffffffff81179c44>] ? ext4_mark_iloc_dirty+0x51c/0x581
[13090.573175]  [<ffffffff8119d51c>] ext4_ext_truncate+0xcd/0x179
[13090.581856]  [<ffffffff81132b1c>] ? __inode_wait_for_writeback+0x67/0xa9
[13090.591316]  [<ffffffff81177e74>] ext4_truncate+0x9c/0xdf
[13090.599434]  [<ffffffff8117b721>] ext4_evict_inode+0x1e9/0x2d6
[13090.608273]  [<ffffffff811282e3>] evict+0xa8/0x162
[13090.615846]  [<ffffffff81128588>] iput+0x1b3/0x1bb
[13090.623284]  [<ffffffff8119570d>] ext4_fill_super+0x214b/0x256b
[13090.631785]  [<ffffffff8128f835>] ? vsnprintf+0x1ce/0x421
[13090.639778]  [<ffffffff8113f9b2>] ? set_blocksize+0x36/0x86
[13090.647864]  [<ffffffff811169ae>] mount_bdev+0x14b/0x1ad
[13090.655707]  [<ffffffff811935c2>] ? ext4_calculate_overhead+0x247/0x247
[13090.664876]  [<ffffffff8112ada7>] ? alloc_vfsmnt+0xa6/0x198
[13090.672989]  [<ffffffff81184d5f>] ext4_mount+0x10/0x12
[13090.680564]  [<ffffffff81117390>] mount_fs+0x64/0x150
[13090.687944]  [<ffffffff810e5439>] ? __alloc_percpu+0xb/0xd
[13090.695702]  [<ffffffff8112b156>] vfs_kern_mount+0x64/0xde
[13090.703435]  [<ffffffff8112b544>] do_kern_mount+0x48/0xda
[13090.711023]  [<ffffffff8112ce32>] do_mount+0x6a1/0x704
[13090.718294]  [<ffffffff810e1181>] ? memdup_user+0x38/0x60
[13090.725767]  [<ffffffff810e11df>] ? strndup_user+0x36/0x4c
[13090.733260]  [<ffffffff8114ffaa>] compat_sys_mount+0x208/0x242
[13090.741047]  [<ffffffff814b0b06>] sysenter_dispatch+0x7/0x21
[13090.748612] Code: ff 4c 63 65 b8 4d 6b e4 30 4c 03 65 b0 e9 fd 00 00 00 48 63 55 b8 4c 6b e2 30 4c 03 65 b0 49 83 7c 24 20 00 75 0e 49 8b 44 24 28 <48> 8b 40 28 49 89 44 24 20 49 8b 44 24 18 48 85 c0 75 22 49 8b 
[13090.772841] RIP  [<ffffffff8119b92f>] ext4_ext_remove_space+0x83d/0xb51
[13090.781438]  RSP <ffff88012b4c7a18>
[13090.786758] CR2: 0000000000000028
[13090.804674] ---[ end trace 880c73500bb7f09f ]---
[13090.810808] Kernel panic - not syncing: Fatal exception
[13090.817346] panic occurred, switching back to text console
[13090.824556] Rebooting in 20 seconds..
[13110.758740] ACPI MEMORY or I/O RESET_REG.
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: ext4 crash with 3.5.2 in ext4_ext_remove_space
  2012-09-05 16:52               ` ext4 crash with 3.5.2 in ext4_ext_remove_space Marc MERLIN
@ 2012-09-05 17:50                 ` Lukáš Czerner
  2012-09-05 17:53                   ` Marc MERLIN
  2012-09-06  4:24                   ` Marc MERLIN
  0 siblings, 2 replies; 68+ messages in thread
From: Lukáš Czerner @ 2012-09-05 17:50 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-ext4

Hi,

I believe that this has been fixed with v3.6-rc1-5-g89a4e48 and it
was marked for stable release as well.
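
A quick way to check whether a given tree already has it (just a sketch):

git tag --contains 89a4e48    # lists released tags that already contain the fix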

-Lukas

On Wed, 5 Sep 2012, Marc MERLIN wrote:

> Date: Wed, 5 Sep 2012 09:52:10 -0700
> From: Marc MERLIN <marc@merlins.org>
> To: linux-ext4@vger.kernel.org
> Subject: ext4 crash with 3.5.2 in ext4_ext_remove_space
> 
> I get a crash when mounting a filesystem.
> I'm making an image now with e2image -r before I run e2fsck on it.
> 
> Is there anything else you'd like me to do?
> 
> [13090.175424] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
> [13090.184897] IP: [<ffffffff8119b92f>] ext4_ext_remove_space+0x83d/0xb51
> [13090.192927] PGD 1120d7067 PUD 1123ac067 PMD 0 
> [13090.198469] Oops: 0000 [#1] PREEMPT SMP 
> [13090.203508] CPU 1 
> [13090.205368] Modules linked in:[13090.209508]  ppdev lp tun autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx sata_mv kl5kusb105 ftdi_sio keyspan nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_REJECT xt_state xt_tcpudp xt_LOG iptable_mangle iptable_filter deflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_x86_64 serpent_sse2_x86_64 lrw serpent_generic xts gf128mul cast5 des_generic xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo blowfish_generic blowfish_x86_64 blowfish_common dm_crypt dm_mirror dm_region_hash dm_log aes_x86_64 fuse lm85 hwmon_vid dm_snapshot dm_mod iptable_nat ip_tables nf_conntrack_ftp ipt_MASQUERADE nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 x_tables nf_conntrack s
 g st snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi snd_hda_codec_realtek rc_ati_x10 ati_remote rc_core pl2303 usbserial i915 snd_hda_intel snd_cmipci snd_hda_codec gameport drm_kms_helper dr!
 m !
>  snd_opl3_lib snd_mpu401_uart eeepc_wmi asus_wmi
>  i2c_algo_bit[13090.315877]  sparse_keymap rfkill snd_hwdep snd_seq_midi acpi_cpufreq snd_seq_midi_event mperf kvm_intel kvm processor snd_seq snd_pcm pci_hotplug ehci_hcd wmi parport_pc xhci_hcd microcode button video sata_sil24 parport snd_rawmidi snd_timer usbcore crc32c_intel ghash_clmulni_intel i2c_i801 snd_seq_device cryptd evdev snd lpc_ich mei i2c_core pcspkr snd_page_alloc thermal_sys usb_common soundcore coretemp r8169 mii tpm_tis tpm tpm_bios [last unloaded: kl5kusb105]
> 
> [13090.370146] Pid: 9658, comm: mount Not tainted 3.5.2-amd64-preempt-noide-20120903 #1 System manufacturer System Product Name/P8H67-M PRO
> [13090.387559] RIP: 0010:[<ffffffff8119b92f>]  [<ffffffff8119b92f>] ext4_ext_remove_space+0x83d/0xb51
> [13090.398254] RSP: 0018:ffff88012b4c7a18  EFLAGS: 00010246
> [13090.406241] RAX: 0000000000000000 RBX: ffff8800a41fb4e8 RCX: ffff8800a41fb450
> [13090.416102] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 000000001b340000
> [13090.426003] RBP: ffff88012b4c7af8 R08: 000000001b340000 R09: ffff88008ae6f6e0
> [13090.435920] R10: ffff880000000000 R11: 0000000000000000 R12: ffff8801156e86f0
> [13090.445782] R13: 0000000000000000 R14: ffff8801156e86c0 R15: 0000000000000000
> [13090.455627] FS:  0000000000000000(0000) GS:ffff88013fa80000(0063) knlGS:00000000f75ae750
> [13090.466451] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> [13090.475010] CR2: 0000000000000028 CR3: 0000000115614000 CR4: 00000000000407e0
> [13090.484921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [13090.494802] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [13090.504649] Process mount (pid: 9658, threadinfo ffff88012b4c6000, task ffff88008d91c4c0)
> [13090.515567] Stack:
> [13090.520209]  ffff88012b4c7a28 ffffffff81175f65 ffff88012b4c7a88 ffffffff81179c44
> [13090.530426]  ffff880100001000 00000000fffffff5 ffffffffa41fb438 ffff8801156e8748
> [13090.540694]  00000000000a77ff ffff8800a41fb438 0000000008007800 ffff880122e67800
> [13090.551124] Call Trace:
> [13090.556238]  [<ffffffff81175f65>] ? brelse+0xe/0x10
> [13090.564018]  [<ffffffff81179c44>] ? ext4_mark_iloc_dirty+0x51c/0x581
> [13090.573175]  [<ffffffff8119d51c>] ext4_ext_truncate+0xcd/0x179
> [13090.581856]  [<ffffffff81132b1c>] ? __inode_wait_for_writeback+0x67/0xa9
> [13090.591316]  [<ffffffff81177e74>] ext4_truncate+0x9c/0xdf
> [13090.599434]  [<ffffffff8117b721>] ext4_evict_inode+0x1e9/0x2d6
> [13090.608273]  [<ffffffff811282e3>] evict+0xa8/0x162
> [13090.615846]  [<ffffffff81128588>] iput+0x1b3/0x1bb
> [13090.623284]  [<ffffffff8119570d>] ext4_fill_super+0x214b/0x256b
> [13090.631785]  [<ffffffff8128f835>] ? vsnprintf+0x1ce/0x421
> [13090.639778]  [<ffffffff8113f9b2>] ? set_blocksize+0x36/0x86
> [13090.647864]  [<ffffffff811169ae>] mount_bdev+0x14b/0x1ad
> [13090.655707]  [<ffffffff811935c2>] ? ext4_calculate_overhead+0x247/0x247
> [13090.664876]  [<ffffffff8112ada7>] ? alloc_vfsmnt+0xa6/0x198
> [13090.672989]  [<ffffffff81184d5f>] ext4_mount+0x10/0x12
> [13090.680564]  [<ffffffff81117390>] mount_fs+0x64/0x150
> [13090.687944]  [<ffffffff810e5439>] ? __alloc_percpu+0xb/0xd
> [13090.695702]  [<ffffffff8112b156>] vfs_kern_mount+0x64/0xde
> [13090.703435]  [<ffffffff8112b544>] do_kern_mount+0x48/0xda
> [13090.711023]  [<ffffffff8112ce32>] do_mount+0x6a1/0x704
> [13090.718294]  [<ffffffff810e1181>] ? memdup_user+0x38/0x60
> [13090.725767]  [<ffffffff810e11df>] ? strndup_user+0x36/0x4c
> [13090.733260]  [<ffffffff8114ffaa>] compat_sys_mount+0x208/0x242
> [13090.741047]  [<ffffffff814b0b06>] sysenter_dispatch+0x7/0x21
> [13090.748612] Code: ff 4c 63 65 b8 4d 6b e4 30 4c 03 65 b0 e9 fd 00 00 00 48 63 55 b8 4c 6b e2 30 4c 03 65 b0 49 83 7c 24 20 00 75 0e 49 8b 44 24 28 <48> 8b 40 28 49 89 44 24 20 49 8b 44 24 18 48 85 c0 75 22 49 8b 
> [13090.772841] RIP  [<ffffffff8119b92f>] ext4_ext_remove_space+0x83d/0xb51
> [13090.781438]  RSP <ffff88012b4c7a18>
> [13090.786758] CR2: 0000000000000028
> [13090.804674] ---[ end trace 880c73500bb7f09f ]---
> [13090.810808] Kernel panic - not syncing: Fatal exception
> [13090.817346] panic occurred, switching back to text console
> [13090.824556] Rebooting in 20 seconds..
> [13110.758740] ACPI MEMORY or I/O RESET_REG.
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: ext4 crash with 3.5.2 in ext4_ext_remove_space
  2012-09-05 17:50                 ` Lukáš Czerner
@ 2012-09-05 17:53                   ` Marc MERLIN
  2012-09-06  4:24                   ` Marc MERLIN
  1 sibling, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-09-05 17:53 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: linux-ext4

On Wed, Sep 05, 2012 at 01:50:54PM -0400, Lukáš Czerner wrote:
> Hi,
> 
> I believe that this has been fixed with v3.6-rc1-5-g89a4e48 and it
> was marked for stable release as well.

Cool, thanks for letting me know.

I'll keep the e2image on hand just in case, and just fsck the filesystem
before mounting it.
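
For the record, that boils down to something like this (the image path is
just an example):

# metadata-only raw image kept around just in case, then fsck before mounting
e2image -r /dev/mapper/dshelf2-space /some/other/fs/dshelf2-space.e2i
e2fsck -y /dev/mapper/dshelf2-space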

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: ext4 crash with 3.5.2 in ext4_ext_remove_space
  2012-09-05 17:50                 ` Lukáš Czerner
  2012-09-05 17:53                   ` Marc MERLIN
@ 2012-09-06  4:24                   ` Marc MERLIN
  2012-09-07 15:19                     ` Marc MERLIN
  1 sibling, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-09-06  4:24 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: linux-ext4

On Wed, Sep 05, 2012 at 01:50:54PM -0400, Lukáš Czerner wrote:
> Hi,
> 
> I believe that this has been fixed with v3.6-rc1-5-g89a4e48 and it
> was marked for stable release as well.

In case it helps: it was indeed a single orphaned inode. After clearing it,
I was able to mount the partition without problems.

gargamel:~# e2fsck -y /dev/mapper/dshelf2-space
e2fsck 1.42.5 (29-Jul-2012)
/dev/mapper/dshelf2-space: recovering journal
Clearing orphaned inode 92668757 (uid=1002, gid=8, mode=0100660, size=6574695342)
/dev/mapper/dshelf2-space: clean, 16987882/114098176 files, 426716110/456392704 blocks
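
(If one wanted to inspect such an inode before letting e2fsck clear it, debugfs
can dump it by number; a sketch, with the inode number taken from the output
above:

  debugfs -R 'stat <92668757>' /dev/mapper/dshelf2-space
)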

Cheers,
Marc

> -Lukas
> 
> On Wed, 5 Sep 2012, Marc MERLIN wrote:
> 
> > Date: Wed, 5 Sep 2012 09:52:10 -0700
> > From: Marc MERLIN <marc@merlins.org>
> > To: linux-ext4@vger.kernel.org
> > Subject: ext4 crash with 3.5.2 in ext4_ext_remove_space
> > 
> > I get a crash when mounting a filesystem.
> > I'm making an image now with e2image -r before I run e2fsck on it.
> > 
> > Is there anything else you'd like me to do?
> > 
> > [13090.175424] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
> > [13090.184897] IP: [<ffffffff8119b92f>] ext4_ext_remove_space+0x83d/0xb51
> > [13090.192927] PGD 1120d7067 PUD 1123ac067 PMD 0 
> > [13090.198469] Oops: 0000 [#1] PREEMPT SMP 
> > [13090.203508] CPU 1 
> > [13090.205368] Modules linked in:[13090.209508]  ppdev lp tun autofs4 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx sata_mv kl5kusb105 ftdi_sio keyspan nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_REJECT xt_state xt_tcpudp xt_LOG iptable_mangle iptable_filter deflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_x86_64 serpent_sse2_x86_64 lrw serpent_generic xts gf128mul cast5 des_generic xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo blowfish_generic blowfish_x86_64 blowfish_common dm_crypt dm_mirror dm_region_hash dm_log aes_x86_64 fuse lm85 hwmon_vid dm_snapshot dm_mod iptable_nat ip_tables nf_conntrack_ftp ipt_MASQUERADE nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 x_tables nf_conntrack sg st snd_pcm_oss snd_mixer_oss snd_hda_codec_hdmi snd_hda_codec_realtek rc_ati_x10 ati_remote rc_core pl2303 usbserial i915 snd_hda_intel snd_cmipci snd_hda_codec gameport drm_kms_helper dr!
>  m !
> >  snd_opl3_lib snd_mpu401_uart eeepc_wmi asus_wmi
> >  i2c_algo_bit[13090.315877]  sparse_keymap rfkill snd_hwdep snd_seq_midi acpi_cpufreq snd_seq_midi_event mperf kvm_intel kvm processor snd_seq snd_pcm pci_hotplug ehci_hcd wmi parport_pc xhci_hcd microcode button video sata_sil24 parport snd_rawmidi snd_timer usbcore crc32c_intel ghash_clmulni_intel i2c_i801 snd_seq_device cryptd evdev snd lpc_ich mei i2c_core pcspkr snd_page_alloc thermal_sys usb_common soundcore coretemp r8169 mii tpm_tis tpm tpm_bios [last unloaded: kl5kusb105]
> > 
> > [13090.370146] Pid: 9658, comm: mount Not tainted 3.5.2-amd64-preempt-noide-20120903 #1 System manufacturer System Product Name/P8H67-M PRO
> > [13090.387559] RIP: 0010:[<ffffffff8119b92f>]  [<ffffffff8119b92f>] ext4_ext_remove_space+0x83d/0xb51
> > [13090.398254] RSP: 0018:ffff88012b4c7a18  EFLAGS: 00010246
> > [13090.406241] RAX: 0000000000000000 RBX: ffff8800a41fb4e8 RCX: ffff8800a41fb450
> > [13090.416102] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 000000001b340000
> > [13090.426003] RBP: ffff88012b4c7af8 R08: 000000001b340000 R09: ffff88008ae6f6e0
> > [13090.435920] R10: ffff880000000000 R11: 0000000000000000 R12: ffff8801156e86f0
> > [13090.445782] R13: 0000000000000000 R14: ffff8801156e86c0 R15: 0000000000000000
> > [13090.455627] FS:  0000000000000000(0000) GS:ffff88013fa80000(0063) knlGS:00000000f75ae750
> > [13090.466451] CS:  0010 DS: 002b ES: 002b CR0: 000000008005003b
> > [13090.475010] CR2: 0000000000000028 CR3: 0000000115614000 CR4: 00000000000407e0
> > [13090.484921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [13090.494802] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [13090.504649] Process mount (pid: 9658, threadinfo ffff88012b4c6000, task ffff88008d91c4c0)
> > [13090.515567] Stack:
> > [13090.520209]  ffff88012b4c7a28 ffffffff81175f65 ffff88012b4c7a88 ffffffff81179c44
> > [13090.530426]  ffff880100001000 00000000fffffff5 ffffffffa41fb438 ffff8801156e8748
> > [13090.540694]  00000000000a77ff ffff8800a41fb438 0000000008007800 ffff880122e67800
> > [13090.551124] Call Trace:
> > [13090.556238]  [<ffffffff81175f65>] ? brelse+0xe/0x10
> > [13090.564018]  [<ffffffff81179c44>] ? ext4_mark_iloc_dirty+0x51c/0x581
> > [13090.573175]  [<ffffffff8119d51c>] ext4_ext_truncate+0xcd/0x179
> > [13090.581856]  [<ffffffff81132b1c>] ? __inode_wait_for_writeback+0x67/0xa9
> > [13090.591316]  [<ffffffff81177e74>] ext4_truncate+0x9c/0xdf
> > [13090.599434]  [<ffffffff8117b721>] ext4_evict_inode+0x1e9/0x2d6
> > [13090.608273]  [<ffffffff811282e3>] evict+0xa8/0x162
> > [13090.615846]  [<ffffffff81128588>] iput+0x1b3/0x1bb
> > [13090.623284]  [<ffffffff8119570d>] ext4_fill_super+0x214b/0x256b
> > [13090.631785]  [<ffffffff8128f835>] ? vsnprintf+0x1ce/0x421
> > [13090.639778]  [<ffffffff8113f9b2>] ? set_blocksize+0x36/0x86
> > [13090.647864]  [<ffffffff811169ae>] mount_bdev+0x14b/0x1ad
> > [13090.655707]  [<ffffffff811935c2>] ? ext4_calculate_overhead+0x247/0x247
> > [13090.664876]  [<ffffffff8112ada7>] ? alloc_vfsmnt+0xa6/0x198
> > [13090.672989]  [<ffffffff81184d5f>] ext4_mount+0x10/0x12
> > [13090.680564]  [<ffffffff81117390>] mount_fs+0x64/0x150
> > [13090.687944]  [<ffffffff810e5439>] ? __alloc_percpu+0xb/0xd
> > [13090.695702]  [<ffffffff8112b156>] vfs_kern_mount+0x64/0xde
> > [13090.703435]  [<ffffffff8112b544>] do_kern_mount+0x48/0xda
> > [13090.711023]  [<ffffffff8112ce32>] do_mount+0x6a1/0x704
> > [13090.718294]  [<ffffffff810e1181>] ? memdup_user+0x38/0x60
> > [13090.725767]  [<ffffffff810e11df>] ? strndup_user+0x36/0x4c
> > [13090.733260]  [<ffffffff8114ffaa>] compat_sys_mount+0x208/0x242
> > [13090.741047]  [<ffffffff814b0b06>] sysenter_dispatch+0x7/0x21
> > [13090.748612] Code: ff 4c 63 65 b8 4d 6b e4 30 4c 03 65 b0 e9 fd 00 00 00 48 63 55 b8 4c 6b e2 30 4c 03 65 b0 49 83 7c 24 20 00 75 0e 49 8b 44 24 28 <48> 8b 40 28 49 89 44 24 20 49 8b 44 24 18 48 85 c0 75 22 49 8b 
> > [13090.772841] RIP  [<ffffffff8119b92f>] ext4_ext_remove_space+0x83d/0xb51
> > [13090.781438]  RSP <ffff88012b4c7a18>
> > [13090.786758] CR2: 0000000000000028
> > [13090.804674] ---[ end trace 880c73500bb7f09f ]---
> > [13090.810808] Kernel panic - not syncing: Fatal exception
> > [13090.817346] panic occurred, switching back to text console
> > [13090.824556] Rebooting in 20 seconds..
> > [13110.758740] ACPI MEMORY or I/O RESET_REG.
> > 
> 

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: ext4 crash with 3.5.2 in ext4_ext_remove_space
  2012-09-06  4:24                   ` Marc MERLIN
@ 2012-09-07 15:19                     ` Marc MERLIN
  2012-09-07 15:39                       ` Lukáš Czerner
  0 siblings, 1 reply; 68+ messages in thread
From: Marc MERLIN @ 2012-09-07 15:19 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: linux-ext4

On Wed, Sep 05, 2012 at 09:24:26PM -0700, Marc MERLIN wrote:
> On Wed, Sep 05, 2012 at 01:50:54PM -0400, Lukáš Czerner wrote:
> > Hi,
> > 
> > I believe that this has been fixed with v3.6-rc1-5-g89a4e48 and it
> > was marked for stable release as well.
> 
> If that helps, it was indeed a single orphaned inode. After clearing it, 
> I was able to mount the partition without problems.
> 
> gargamel:~# e2fsck -y /dev/mapper/dshelf2-space
> e2fsck 1.42.5 (29-Jul-2012)
> /dev/mapper/dshelf2-space: recovering journal
> Clearing orphaned inode 92668757 (uid=1002, gid=8, mode=0100660, size=6574695342)
> /dev/mapper/dshelf2-space: clean, 16987882/114098176 files, 426716110/456392704 blocks

Would you happen to have a link to the patch? I had this crash again, and I'd
like to check whether it's in 3.5.3 or apply it myself to stop further crashes.

Thanks,
Marc
 
> Cheers,
> Marc
> 
> -- 
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems ....
>                                       .... what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: ext4 crash with 3.5.2 in ext4_ext_remove_space
  2012-09-07 15:19                     ` Marc MERLIN
@ 2012-09-07 15:39                       ` Lukáš Czerner
  2012-09-07 15:51                         ` Marc MERLIN
  0 siblings, 1 reply; 68+ messages in thread
From: Lukáš Czerner @ 2012-09-07 15:39 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Lukáš Czerner, linux-ext4

[-- Attachment #1: Type: TEXT/PLAIN, Size: 8003 bytes --]

On Fri, 7 Sep 2012, Marc MERLIN wrote:

> Date: Fri, 7 Sep 2012 08:19:42 -0700
> From: Marc MERLIN <marc@merlins.org>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org
> Subject: Re: ext4 crash with 3.5.2 in ext4_ext_remove_space
> 
> On Wed, Sep 05, 2012 at 09:24:26PM -0700, Marc MERLIN wrote:
> > On Wed, Sep 05, 2012 at 01:50:54PM -0400, Lukáš Czerner wrote:
> > > Hi,
> > > 
> > > I believe that this has been fixed with v3.6-rc1-5-g89a4e48 and it
> > > was marked for stable release as well.
> > 
> > If that helps, it was indeed a single orphaned inode. After clearing it, 
> > I was able to mount the partition without problems.
> > 
> > gargamel:~# e2fsck -y /dev/mapper/dshelf2-space
> > e2fsck 1.42.5 (29-Jul-2012)
> > /dev/mapper/dshelf2-space: recovering journal
> > Clearing orphaned inode 92668757 (uid=1002, gid=8, mode=0100660, size=6574695342)
> > /dev/mapper/dshelf2-space: clean, 16987882/114098176 files, 426716110/456392704 blocks
> 
> Would you happen to have a link to the patch, I had this crash again, I'd
> like to check if it's in 3.5.3 or apply it myself to stop further crashes.
> 
> Thanks,
> Marc

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=89a4e48f8479f8145eca9698f39fe188c982212f

-Lukas



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: ext4 crash with 3.5.2 in ext4_ext_remove_space
  2012-09-07 15:39                       ` Lukáš Czerner
@ 2012-09-07 15:51                         ` Marc MERLIN
  0 siblings, 0 replies; 68+ messages in thread
From: Marc MERLIN @ 2012-09-07 15:51 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: linux-ext4

On Fri, Sep 07, 2012 at 11:39:53AM -0400, Lukáš Czerner wrote:
> > Would you happen to have a link to the patch, I had this crash again, I'd
> > like to check if it's in 3.5.3 or apply it myself to stop further crashes.
> > 
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=89a4e48f8479f8145eca9698f39fe188c982212f

Cool, I just confirmed it's in 3.5.3 already.
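
(One way to double-check that, assuming a linux-stable checkout: the stable
backport is a cherry-pick with a different commit id, so searching the
v3.5.2..v3.5.3 range by the touched file or by subject is easier than looking
for the mainline SHA:

  git log --oneline v3.5.2..v3.5.3 -- fs/ext4/extents.c
  git log --oneline v3.5.2..v3.5.3 | grep -i ext4
)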

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2012-09-07 15:51 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-30  0:37 brtfs on top of dmcrypt with SSD. No corruption iff write cache off? Marc MERLIN
2012-02-01 17:56 ` Chris Mason
2012-02-02  3:23   ` Marc MERLIN
2012-02-02 12:42     ` Chris Mason
     [not found]       ` <20120202152722.GI12429@merlins.org>
2012-02-12 22:32         ` Marc MERLIN
2012-02-12 23:47           ` Milan Broz
2012-02-13  0:14             ` Marc MERLIN
2012-02-15 15:42               ` Calvin Walton
2012-02-15 16:55                 ` Marc MERLIN
2012-02-15 16:59                   ` Hugo Mills
2012-02-22 10:28                     ` Justin Ossevoort
2012-02-22 11:07                       ` Hugo Mills
2012-02-16  6:33               ` Chris Samuel
2012-02-18 12:33               ` Martin Steigerwald
2012-02-18 12:39               ` Martin Steigerwald
2012-02-18 12:49                 ` Martin Steigerwald
2012-07-18 18:13               ` brtfs on top of dmcrypt with SSD -> Trim or no Trim Marc MERLIN
2012-07-18 20:04                 ` Fajar A. Nugraha
2012-07-18 20:37                   ` Marc MERLIN
2012-07-18 21:34                   ` Clemens Eisserer
2012-07-18 21:48                     ` Marc MERLIN
2012-07-18 21:49                 ` Martin Steigerwald
2012-07-18 22:04                   ` Marc MERLIN
2012-07-19 10:40                     ` Martin Steigerwald
2012-07-22 18:58                     ` brtfs on top of dmcrypt with SSD -> ssd or nossd + crypt performance? Marc MERLIN
2012-07-22 19:35                       ` Martin Steigerwald
2012-07-22 19:43                         ` Martin Steigerwald
2012-07-22 20:44                         ` Marc MERLIN
2012-07-22 22:41                           ` brtfs on top of dmcrypt with SSD -> file access 5x slower than spinning disk Marc MERLIN
2012-07-23  6:42                             ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
2012-07-24  7:56                               ` Martin Steigerwald
2012-07-27  4:40                                 ` Marc MERLIN
2012-07-27 11:08                               ` Chris Mason
2012-07-27 18:42                                 ` Marc MERLIN
2012-02-18 16:07             ` brtfs on top of dmcrypt with SSD. No corruption iff write cache off? Marc MERLIN
2012-02-19  0:53               ` Clemens Eisserer
2012-07-25 15:45 du -s src is a lot slower on SSD than spinning disk in the same laptop Marc MERLIN
     [not found] ` <alpine.LFD.2.00.1207252023080.4340@(none)>
2012-07-25 23:38   ` Marc MERLIN
2012-07-26  3:32   ` Ted Ts'o
2012-07-26  3:35     ` Marc MERLIN
2012-07-26  6:54     ` Marc MERLIN
2012-08-01  5:30       ` Marc MERLIN
2012-08-01  6:01         ` How can btrfs take 23sec to stat 23K files from an SSD? Marc MERLIN
2012-08-01  6:08           ` Fajar A. Nugraha
2012-08-01  6:21             ` Marc MERLIN
2012-08-01 21:57               ` Martin Steigerwald
2012-08-02  5:07                 ` Marc MERLIN
2012-08-02 11:18                   ` Martin Steigerwald
2012-08-02 17:39                     ` Marc MERLIN
2012-08-02 20:20                       ` Martin Steigerwald
2012-08-02 20:44                         ` Marc MERLIN
2012-08-02 21:21                           ` Martin Steigerwald
2012-08-02 21:49                             ` Marc MERLIN
2012-08-03 18:45                               ` Martin Steigerwald
2012-08-16  7:45                                 ` Marc MERLIN
2012-08-02 11:25                   ` Martin Steigerwald
2012-08-01  6:36           ` Chris Samuel
2012-08-01  6:40             ` Marc MERLIN
2012-08-01  8:18         ` du -s src is a lot slower on SSD than spinning disk in the same laptop Spelic
2012-08-16  7:50         ` Marc MERLIN
     [not found]           ` <502CC2A2.4010506@shiftmail.org>
2012-08-16 17:55             ` Marc MERLIN
2012-09-05 16:52               ` ext4 crash with 3.5.2 in ext4_ext_remove_space Marc MERLIN
2012-09-05 17:50                 ` Lukáš Czerner
2012-09-05 17:53                   ` Marc MERLIN
2012-09-06  4:24                   ` Marc MERLIN
2012-09-07 15:19                     ` Marc MERLIN
2012-09-07 15:39                       ` Lukáš Czerner
2012-09-07 15:51                         ` Marc MERLIN
