* high throughput storage server?
@ 2011-02-14 23:59 Matt Garman
  2011-02-15  2:06 ` Doug Dumitru
                   ` (4 more replies)
  0 siblings, 5 replies; 116+ messages in thread
From: Matt Garman @ 2011-02-14 23:59 UTC (permalink / raw)
  To: Mdadm

For many years, I have been using Linux software RAID at home for a
simple NAS system.  Now at work, we are looking at buying a massive,
high-throughput storage system (e.g. a SAN).  I have little
familiarity with these kinds of pre-built, vendor-supplied solutions.
I just started talking to a vendor, and the prices are extremely high.

So I got to thinking, perhaps I could build an adequate device for
significantly less cost using Linux.  The problem is, the requirements
for such a system are significantly higher than my home media server,
and put me into unfamiliar territory (in terms of both hardware and
software configuration).

The requirement is basically this: around 40 to 50 compute machines
act as basically an ad-hoc scientific compute/simulation/analysis
cluster.  These machines all need access to a shared 20 TB pool of
storage.  Each compute machine has a gigabit network connection, and
it's possible that nearly every machine could simultaneously try to
access a large (100 to 1000 MB) file in the storage pool.  In other
words, a 20 TB file store with bandwidth upwards of 50 Gbps.

I was wondering if anyone on the list has built something similar to
this using off-the-shelf hardware (and Linux of course)?

My initial thoughts/questions are:

    (1) We need lots of spindles (i.e. many small disks rather than
few big disks).  How do you compute disk throughput when there are
multiple consumers?  Most manufacturers provide specs on their drives
such as sustained linear read throughput.  But how is that number
affected when there are multiple processes simultaneously trying to
access different data?  Is the sustained bulk read throughput value
inversely proportional to the number of consumers?  (E.g. a 100 MB/s
drive only does 33 MB/s with three consumers.)  Or is there a more
specific way to estimate this?  (See the rough sketch after this list.)

    (2) The big storage server(s) need to connect to the network via
multiple bonded Gigabit ethernet, or something faster like
FibreChannel or 10 GbE.  That seems pretty straightforward.

    (3) This will probably require multiple servers connected together
somehow and presented to the compute machines as one big data store.
This is where I really don't know much of anything.  I did a quick
"back of the envelope" spec for a system with 24 600 GB 15k SAS drives
(based on the observation that 24-bay rackmount enclosures seem to be
fairly common).  Such a system would only provide 7.2 TB of storage
using a scheme like RAID-10.  So how could two or three of these
servers be "chained" together and look like a single large data pool
to the analysis machines?
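
As a rough first-order answer to question (1), the sketch below
(Python) models the usual assumption that each extra concurrent
streamer costs the drive a seek between chunks, so aggregate per-drive
throughput falls toward a seek-bound plateau rather than scaling as
1/N, and each consumer then gets a share of that plateau.  The
sequential rate, seek time and chunk size in it are placeholder
assumptions, not drive specs.

    # Rough model of aggregate streaming throughput from one drive when N
    # clients read different large files concurrently.  All numbers are
    # illustrative assumptions, not measurements or vendor specs.

    def per_drive_throughput_mb_s(consumers,
                                  seq_rate_mb_s=100.0,  # assumed sustained rate
                                  avg_seek_ms=8.0,      # assumed seek + rotational delay
                                  chunk_kb=512):        # data read between seeks
        """With several interleaved streams the drive alternates between
        seeking and transferring one chunk per stream."""
        if consumers <= 1:
            return seq_rate_mb_s
        chunk_mb = chunk_kb / 1024.0
        transfer_s = chunk_mb / seq_rate_mb_s    # time spent moving data
        seek_s = avg_seek_ms / 1000.0            # time lost repositioning
        return chunk_mb / (transfer_s + seek_s)  # effective aggregate MB/s

    target_mb_s = 50e9 / 8 / 1e6                 # 50 Gbps expressed in MB/s
    for n in (1, 3, 10, 50):
        agg = per_drive_throughput_mb_s(n)
        print("%3d consumers: ~%5.1f MB/s per drive aggregate, "
              "~%d drives for 50 Gbps" % (n, agg, round(target_mb_s / agg)))

With those assumptions a nominal 100 MB/s drive delivers only ~35-40
MB/s of aggregate throughput once several readers interleave, which is
why the spindle count matters far more than the per-drive sequential
spec.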

I know this is a broad question, and not 100% about Linux software
RAID.  But I've been lurking on this list for years now, and I get the
impression there are list members who regularly work with "big iron"
systems such as what I've described.  I'm just looking for any kind of
relevant information here; any and all is appreciated!

Thank you,
Matt

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-14 23:59 high throughput storage server? Matt Garman
@ 2011-02-15  2:06 ` Doug Dumitru
  2011-02-15  4:44   ` Matt Garman
  2011-02-15 12:29 ` Stan Hoeppner
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 116+ messages in thread
From: Doug Dumitru @ 2011-02-15  2:06 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm

Matt,

You have a whole slew of questions to answer before you can decide on
a design.  This is true if you build it yourself or decide to go with
a vendor and buy a supported server.  If you do go with a vendor, the
odds are actually quite good you will end up with Linux anyway.

You state a need for 20TB of storage split among 40-50 "users".  Your
description implies that this space needs to be shared at the file
level.  This means you are building a NAS (Network Attached Storage),
not a SAN (Storage Area Network).  SANs typically export block devices
over protocols like iSCSI.  These block devices are non-sharable
(i.e., only a single client can mount them, at least read/write, at a
time).

So, 20TB of NAS.  Not really that hard to build.  Next, you need to
look at the space itself.  Is this all unique data, or is there an
opportunity for "data deduplication"?  Some filesystems (ZFS) and some
block solutions can actively spot blocks that are duplicates and only
store a single copy.  With some applications (like virtual servers all
running the same OS), this can result in de-dupe ratios of 20:1.  If
your application is like this, your 20TB might only be 1-2 TB.  I
suspect this is not the case based on your description.

Next, is the space all the same?  Perhaps some of it is "active" and
some of it is archival.  If you need 4TB of "fast" storage and 16TB of
"backup" storage, this can really impact how you build a NAS.  Space
for backup might be configured with large (> 1TB) SATA drives running
RAID-5/6.  These configurations are good at reads and linear writes,
but lousy at random writes.  Their cost is far lower than "fast"
storage.  You can buy a 12-bay 2U chassis for $300 plus power supply,
put in 12 2TB 7200 RPM SATA drives in RAID-6, and get ~20TB of usable
space.  Random
write performance will be quite bad, but for backups and "near line"
storage, it will do quite well.  You can probably build this for
around $5K (or maybe a bit less) including a 10GigE adapter and server
class components.

If you need IOPS (IO Operations Per Second), you are looking at SSDs.
You can build 20TB of pure SSD space.  If you do it yourself with
RAID-10, expect to pay around $6/GB, or $120K just for drives.  18TB
will fit in
a 4U chassis (see the 72 drive SuperMicro double-sided 4U).  72 500GB
drives later and you have 18,000 GB of space.  Not cheap, but if you
quote a system from NetApp or EMC it will seem so.

If you can cut the "fast" size down to 2-4TBs, SSDs become a lot more
realistic with commercial systems from new companies like WhipTail for
way under $100K.

If you go with hard drives, you are trading speed for space.  With
600GB 10K drives you would need 66 drives in RAID-10.  Multi-threaded,
this would read at around 10K IOPS and write at around 7K for "small"
blocks (4-8K).  Linear IO would be wicked fast, but random ops slow
you down.  Conversely, large SSD arrays can routinely hit > 400K reads
and > 200K writes if built correctly.  Just the 66 hard drives will
run you $30K.  These are SAS drives, not WD VelociRaptors, which would
save you 30%.

If you opt for "lots of small drives" (i.e., 72GB 15K SAS drives) or
worse (short-stroked small drives), SSDs are actually faster and
cheaper per GB.  20TB of RAID-10 on 72GB drives is 550 drives, or
$105K (just for the drives, not counting JBOD enclosures, racks, etc).
Short-stroking would take 1000+ drives.  I strongly suspect you do not
want to do this.
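
As a quick back-of-the-envelope comparison of the configurations
above, here is a Python sketch using the drive counts, sizes and rough
2011 prices quoted in this message; the capacity helper ignores hot
spares and enclosure overhead, and the ~$5K figure is the whole-box
estimate while the others are drives only.

    # Back-of-envelope for the configurations sketched above.  Drive
    # counts, sizes and prices are the rough figures quoted in this
    # message, not authoritative numbers.

    def usable_tb(drives, size_tb, level):
        """Very rough usable capacity for a single array (no hot spares)."""
        if level == "raid10":
            return drives * size_tb / 2.0
        if level == "raid6":
            return (drives - 2) * size_tb
        raise ValueError(level)

    configs = [
        # (label, drives, size per drive in TB, level, rough $ quoted above)
        ("12x 2TB 7.2k SATA, RAID-6",      12,  2.0,    "raid6",    5000),
        ("72x 500GB SSD, RAID-10",         72,  0.5,    "raid10", 120000),
        ("66x 600GB 10k SAS, RAID-10",     66,  0.6,    "raid10",  30000),
        ("550x 72GB 15k SAS, RAID-10",    550,  0.072,  "raid10", 105000),
    ]

    for label, n, size, level, cost in configs:
        print("%-30s ~%5.1f TB usable, ~$%dk"
              % (label, usable_tb(n, size, level), cost // 1000))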

In terms of Linux, pretty much any stock distribution will work.
After all, you are just talking about SMB or NFS exports.  Not exactly
rocket science.

In terms of hardware, buy good disk controllers and good SAS
expanders.  SuperMicro is a good brand for motherboards and white box
chassis.  The LSI 8 channel 6gbit SAS PCIe card is a favorite as a
dumb disk controller.  The SuperMicro backplanes have LSI SAS expander
chips and work well.

The network is the easiest part.  Buy a decent dual-port 10GigE
adapter and two 24-port GigE switches with 10GigE uplink ports.  You
will max out at about 1.2 GBytes/sec on the network but should be able
to keep the GigE channels very busy.

Then you get to test, test, test.

Good Luck

Doug Dumitru
EasyCo LLC





On Mon, Feb 14, 2011 at 3:59 PM, Matt Garman <matthew.garman@gmail.com> wrote:
> For many years, I have been using Linux software RAID at home for a
> simple NAS system.  Now at work, we are looking at buying a massive,
> high-throughput storage system (e.g. a SAN).  I have little
> familiarity with these kinds of pre-built, vendor-supplied solutions.
> I just started talking to a vendor, and the prices are extremely high.
>
> So I got to thinking, perhaps I could build an adequate device for
> significantly less cost using Linux.  The problem is, the requirements
> for such a system are significantly higher than my home media server,
> and put me into unfamiliar territory (in terms of both hardware and
> software configuration).
>
> The requirement is basically this: around 40 to 50 compute machines
> act as basically an ad-hoc scientific compute/simulation/analysis
> cluster.  These machines all need access to a shared 20 TB pool of
> storage.  Each compute machine has a gigabit network connection, and
> it's possible that nearly every machine could simultaneously try to
> access a large (100 to 1000 MB) file in the storage pool.  In other
> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>
> I was wondering if anyone on the list has built something similar to
> this using off-the-shelf hardware (and Linux of course)?
>
> My initial thoughts/questions are:
>
>    (1) We need lots of spindles (i.e. many small disks rather than
> few big disks).  How do you compute disk throughput when there are
> multiple consumers?  Most manufacturers provide specs on their drives
> such as sustained linear read throughput.  But how is that number
> affected when there are multiple processes simultanesously trying to
> access different data?  Is the sustained bulk read throughput value
> inversely proportional to the number of consumers?  (E.g. 100 MB/s
> drive only does 33 MB/s w/three consumers.)  Or is there are more
> specific way to estimate this?
>
>    (2) The big storage server(s) need to connect to the network via
> multiple bonded Gigabit ethernet, or something faster like
> FibreChannel or 10 GbE.  That seems pretty straightforward.
>
>    (3) This will probably require multiple servers connected together
> somehow and presented to the compute machines as one big data store.
> This is where I really don't know much of anything.  I did a quick
> "back of the envelope" spec for a system with 24 600 GB 15k SAS drives
> (based on the observation that 24-bay rackmount enclosures seem to be
> fairly common).  Such a system would only provide 7.2 TB of storage
> using a scheme like RAID-10.  So how could two or three of these
> servers be "chained" together and look like a single large data pool
> to the analysis machines?
>
> I know this is a broad question, and not 100% about Linux software
> RAID.  But I've been lurking on this list for years now, and I get the
> impression there are list members who regularly work with "big iron"
> systems such as what I've described.  I'm just looking for any kind of
> relevant information here; any and all is appreciated!
>
> Thank you,
> Matt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Doug Dumitru
EasyCo LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15  2:06 ` Doug Dumitru
@ 2011-02-15  4:44   ` Matt Garman
  2011-02-15  5:49     ` hansbkk
                       ` (3 more replies)
  0 siblings, 4 replies; 116+ messages in thread
From: Matt Garman @ 2011-02-15  4:44 UTC (permalink / raw)
  To: Doug Dumitru; +Cc: Mdadm

On Mon, Feb 14, 2011 at 06:06:43PM -0800, Doug Dumitru wrote:
> You have a whole slew of questions to answer before you can decide
> on a design.  This is true if you build it yourself or decide to
> go with a vendor and buy a supported server.  If you do go with a
> vendor, the odds are actually quite good you will end up with
> Linux anyway.

I kind of assumed/wondered if the vendor-supplied systems didn't run
Linux behind the scenes anyway. 

> You state a need for 20TB of storage split among 40-50 "users".
> Your description implies that this space needs to be shared at the
> file level.  This means you are building a NAS (Network Attached
> Storage), not a SAN (Storage Area Network).  SANs typically export
> block devices over protocols like iSCSI.  These block devices are
> non-sharable (ie, only a single client can mount them (at least
> read/write) at a time.

Is that the only distinction between SAN and NAS?  (Honest
question, not rhetorical.)

> So, 20TB of NAS.  Not really that hard to build.  Next, you need
> to look at the space itself.  Is this all unique data, or is there
> an opportunity for "data deduplication".  Some filesystems (ZFS)
> and some block solutions can actively spot blocks that are
> duplicates and only store a single copy.  With some applications
> (like virtual servers all running the same OS), this can result in
> de-dupe ratios of 20:1.  If your application is like this, your
> 20TB might only be 1-2 TB.  I suspect this is not the case based
> on your description.

Unfortunately, no, there is no duplication.  Basically, we have a
bunch of files that are generated via another big collection of
servers scattered throughout different data centers.  These files
are "harvested" daily (i.e. copied back to the big store in our
office for the analysis I've mentioned).

> Next, is the space all the same.  Perhaps some of it is "active"
> and some of it is archival.  If you need 4TB of "fast" storage and
> ...
> well.  You can probably build this for around $5K (or maybe a bit
> less) including a 10GigE adapter and server class components.

The whole system needs to be "fast".

Actually, to give more detail, we currently have a simple system I
built for backup/slow access.  This is exactly what you described: a
bunch of big, slow disks.  Lots of space, lousy I/O performance, but
plenty adequate for backup purposes.

As of right now, we actually have about a dozen "users", i.e.
compute servers.  The collection is basically a home-grown compute
farm.  Each server has a gigabit ethernet connection and 1 TB of
RAID-1 spinning disk storage.  Each server mounts every other
server via NFS, and the current data is distributed evenly across
all systems.

So, loosely speaking, right now we have roughly 10 TB of
"live"/"fast" data available at 1 to 10 Gbps, depending on how you
look at it.

While we only have about a dozen servers now, we have definitely
identified growing this compute farm about 4x (to 40--50 servers)
within the next year.  But the storage capacity requirements
shouldn't change too terribly much.  The 20 TB number was basically
thrown out there as an "it would be nice to have 2x the live
storage".

I'll also add that this NAS needs to be optimized for *read*
throughput.  As I mentioned, the only real write process is the
daily "harvesting" of the data files.  Those are copied across
long-haul leased lines, and the copy process isn't really
performance sensitive.  In other words, in day-to-day use, those
40--50 client machines will do 100% reading from the NAS.

> If you need IOPS (IO Operations Per Second), you are looking at
> SSDs.  You can build 20TB of pure SSD space.  If you do it
> yourself raid-10, expect to pay around $6/GB or $120K just for
> drives.  18TB will fit in a 4U chassis (see the 72 drive
> SuperMicro double-sided 4U).  72 500GB drives later and you have
> 18,000 GB of space.  Not cheap, but if you quote a system from
> NetApp or EMC it will seem so.

Hmm.  That does seem high, but that would be a beast of a system.
And I have to add, I'd love to build something like that!

> If you can cut the "fast" size down to 2-4TBs, SSDs become a lot
> more realistic with commercial systems from new companies like
> WhipTail for way under $100K.
> 
> If you go with hard drives, you are trading speed for space.  With
> 600GB 10K drives would need 66 drives raid-10.  Multi-threaded, this
> would read at around 10K IOPS and write at around 7K for "small"
> blocks (4-8K).  Linear IO would be wicked fast but random OPs slow you
> down.  Conversly, large SSDs arrays can routinely hit > 400K reads and
> > 200K writes if built correctly.  Just the 66 hard drives will run
> you $30K.  These are SAS drives, not WD Velociraptors which would save
> you 30%.
> 
> If you opt for "lots of small drives" (ie, 72GB 15K SAS drives) or
> worse (short seek small drives), the SSDs are actually faster and
> cheaper per GB.  20TB of raid-10 72GB drives is 550 drives or $105K
> (just for the drives, not counting jbod enclosures, racks, etc).
> Short seeking would be 1000+ drives.  I highly expect you do not want
> to do this.

No.  :)  72 SSDs sound like fun; 550 spinning disks sound dreadful.
I have a feeling I'd probably have to keep a significant number
on-hand as spares, as I predict drive failures would probably be a
weekly occurrence.

Thank you for the detailed and thoughtful answers!  Definitely very
helpful.

Take care,
Matt


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15  4:44   ` Matt Garman
@ 2011-02-15  5:49     ` hansbkk
  2011-02-15  9:43     ` David Brown
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 116+ messages in thread
From: hansbkk @ 2011-02-15  5:49 UTC (permalink / raw)
  To: Matt Garman; +Cc: Doug Dumitru, linux-raid

I highly recommend taking a look at Openfiler, pretty simple to set up
and very flexible, really just a stabilized/tested "appliance" built
on Linux/FOSS tools. Then your choices come down to what
top-of-the-line hardware you'd like to buy...

With the money you'd save from not going COTS, you could build two of
them and create high-availability mirrored servers with DRBD/heartbeat
for extra redundancy/fault-tolerance. And pre-pay for a full lifetime
of support, if that gives you and the company an extra level of
comfort. And still have a nice chunk of budget left over for UPSs,
backup hardware, network capacity expansion etc.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15  4:44   ` Matt Garman
  2011-02-15  5:49     ` hansbkk
@ 2011-02-15  9:43     ` David Brown
  2011-02-24 20:28       ` Matt Garman
  2011-02-15 15:16     ` Joe Landman
  2011-02-27 21:30     ` high throughput storage server? Ed W
  3 siblings, 1 reply; 116+ messages in thread
From: David Brown @ 2011-02-15  9:43 UTC (permalink / raw)
  To: linux-raid

On 15/02/2011 05:44, Matt Garman wrote:
> On Mon, Feb 14, 2011 at 06:06:43PM -0800, Doug Dumitru wrote:
>
> I'll also add that this NAS needs to be optimized for *read*
> throughput.  As I mentioned, the only real write process is the
> daily "harvesting" of the data files.  Those are copied across
> long-haul leased lines, and the copy process isn't really
> performance sensitive.  In other words, in day-to-day use, those
> 40--50 client machines will do 100% reading from the NAS.
>

If you are not too bothered about write performance, I'd put a fair 
amount of the budget into ram rather than just disk performance.  When 
you've got the ram space to make sure small reads are mostly cached, the 
main bottleneck will be sequential reads - and big hard disks handle 
sequential reads as fast as expensive SSDs.

>
> No.  :)  72 SSDs sounds like fun; 550 spinning disks sound dreadful.
> I have a feeling I'd probably have to keep a significant number
> on-hand as spares, as I predict drive failures would probably be a
> weekly occurance.
>

Don't forget to include running costs in this - 72 SSDs use a lot less 
power than 550 hard disks.
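
As a rough illustration of that point, here is a small sketch; the
per-drive wattages and electricity price are assumed ballpark values,
not measurements:

    # Rough running-cost comparison (all wattages and the electricity
    # price are assumed ballpark figures).
    hdd_count, hdd_watts = 550, 8.0      # assumed ~8 W per spinning disk
    ssd_count, ssd_watts = 72, 2.0       # assumed ~2 W per SSD
    usd_per_kwh = 0.10
    hours_per_year = 24 * 365

    for label, n, w in [("550 HDDs", hdd_count, hdd_watts),
                        ("72 SSDs",  ssd_count, ssd_watts)]:
        kwh = n * w * hours_per_year / 1000.0
        print("%-9s ~%5.0f W, ~%6.0f kWh/yr, ~$%5.0f/yr at $0.10/kWh"
              % (label, n * w, kwh, kwh * usd_per_kwh))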


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-14 23:59 high throughput storage server? Matt Garman
  2011-02-15  2:06 ` Doug Dumitru
@ 2011-02-15 12:29 ` Stan Hoeppner
  2011-02-15 12:45   ` Roberto Spadim
  2011-02-15 13:39   ` David Brown
  2011-02-15 13:48 ` Zdenek Kaspar
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-15 12:29 UTC (permalink / raw)
  To: Mdadm, Matt Garman

Matt Garman put forth on 2/14/2011 5:59 PM:

> The requirement is basically this: around 40 to 50 compute machines
> act as basically an ad-hoc scientific compute/simulation/analysis
> cluster.  These machines all need access to a shared 20 TB pool of
> storage.  Each compute machine has a gigabit network connection, and
> it's possible that nearly every machine could simultaneously try to
> access a large (100 to 1000 MB) file in the storage pool.  In other
> words, a 20 TB file store with bandwidth upwards of 50 Gbps.

If your description of the requirement is accurate, then what you need is a
_reliable_ high performance NFS server backed by many large/fast spindles.

> I was wondering if anyone on the list has built something similar to
> this using off-the-shelf hardware (and Linux of course)?

My thoughtful, considered, recommendation would be to stay away from a DIY build
for the requirement you describe, and stay away from mdraid as well, but not
because mdraid isn't up to the task.  I get the feeling you don't fully grasp
some of the consequences of a less than expert level mdraid admin being
responsible for such a system after it's in production.  If multiple drives are
kicked off line simultaneously (posts of such seem to occur multiple times/week
here), downing the array, are you capable of bringing it back online intact,
successfully, without outside assistance, in a short period of time?  If you
lose the entire array due to a typo'd mdadm parm, then what?

You haven't described a hobby level system here, one which you can fix at your
leisure.  You've described a large, expensive, production caliber storage
resource used for scientific discovery.  You need to perform one very serious
gut check, and be damn sure you're prepared to successfully manage such a large,
apparently important, mdraid array when things go to the South pole in a
heartbeat.  Do the NFS server yourself, as mistakes there are more forgiving
than mistakes at the array level.  Thus, I'd recommend the following.  And as
you can tell from the length of it, I put some careful consideration (and time)
into whipping this up.

Get one HP ProLiant DL 385 G7 eight core AMD Magny Cours server for deployment
as your 64 bit Linux NFS server ($2500):
http://www.newegg.com/Product/Product.aspx?Item=N82E16859105806

Eight 2.3GHz cores is actually overkill for this NFS server, but this box has
the right combination of price and other features you need.  The standard box
comes with 2x2GB DIMMs, using only 2 of the 4 channels of the occupied G34
socket, and 4GB is a tad short of what you'll need.  So toss the installed DIMMs
and buy this HP certified 4 channel 16GB kit directly from Kingston ($400):
http://www.ec.kingston.com/ecom/configurator_new/partsinfo.asp?root=us&LinkBack=http://www.kingston.com&ktcpartno=KTH-PL313K4/16G

This box has 4 GbE ports, which will give you max NFS throughput of ~600-800
MB/s bidirectional, roughly 1/3rd to half the storage system bandwidth (see
below).  Link aggregation with the switch will help with efficiency.  Set jumbo
frames across all the systems and switches obviously, MTU of 9000, or the lowest
common denominator, regardless of which NIC solution you end up with.  If that's
not enough b/w...

Add one of these PCI Express 2.0 x8 10 GbE copper Intel NICs ($600) to bump max
NFS throughput to ~1.5-2 GB/s bidirectional (assuming your switch has a copper
10 GbE port):
http://www.newegg.com/Product/Product.aspx?Item=N82E16833106043
Due to using NFS+TCP/UDP as your protocols, this NIC can't outrun the FC back
end, even though the raw signaling rate is slightly higher.  However, if you
fired up 10-12 simultaneous FTP gets you'd come really close.
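
To put those two options in perspective, here is a tiny sketch
dividing the approximate server-side throughput figures above across
50 simultaneous readers; the MB/s values are the rough numbers from
this message, not measurements.

    # Per-client share of server NFS bandwidth under full load, using the
    # approximate server-side figures above (assumptions, not measurements).

    clients = 50                      # all compute nodes reading at once
    options = [("4x bonded GbE, ~700 MB/s",  700.0),
               ("10 GbE,       ~1750 MB/s", 1750.0)]

    for label, server_mb_s in options:
        share = server_mb_s / clients
        # Each client NIC is 1 GbE (~110 MB/s usable), so the server
        # uplink, not the client link, limits the fan-out here.
        print("%-28s -> ~%2.0f MB/s per client" % (label, share))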

Two of these for boot drives ($600):
http://www.newegg.com/Product/Product.aspx?Item=N82E16822332060
Mirror them with the onboard 256MB SmartArray BBWC RAID controller

Qlogic PCIe 2.0 x4/x8 8Gbit FC HBA ($1100):
http://www.newegg.com/Product/Product.aspx?Item=N82E16833380014&cm_re=qlogic-_-33-380-014-_-Product

for connecting to the important part ($20-40K USD):
http://www.nexsan.com/satabeast.php

42 drives in a single 4U chassis, one RAID group or many, up to 254 LUNs or just
one, awesome capacity and performance for the price.  To keep costs down yet
performance high, you'll want the 8Gbit FC single controller model with 2GB
cache (standard) and with qty 42* 1TB 7.2K rpm SATA drives.  All drives use a
firmware revision tested and certified by Nexsan for use with their controllers
so you won't have problems with drives being randomly kicked offline, etc.  This
is an enterprise class SAN controller.  (Do some research and look at Nexsan's
customers and what they're using these things for.  Caltech dumps the data from
the Spitzer space telescope to a group of 50-60 of these SATABeasts).

A SATABeast with 42 * 1TB drives should run in the ballpark of $20-40K USD
depending on the reseller and your organization status (EDU, non profit,
government, etc).  Nexsan has resellers covering the entire Americas and Europe.
 If you need to expand in the future, Nexsan offers the NXS-B60E expansion
chassis
(http://www.nexstor.co.uk/prod_pdfs/NXS-B60E_Nexsan%20NXS-B60E%20Datasheet.pdf)
which holds 60 disks and plugs into the SATABeast with redundant multilane SAS
cables, allowing up to 102 drives in 8U of rack space, 204TB total using 2TB
drives, or any combination in between.  The NXS-B60E adds no additional
bandwidth to the system.  Thus, if you need more speed and space, buy a second
SATABeast and another FC card, or replace the single port FC card with a dual
port model (or buy the dual port up front)

With the full 42 drive chassis configured as a 40 disk RAID 10 (2 spares) you'll
get 20TB usable space and you'll easily peak the 8GBit FC interface in both
directions simultaneously.  Aggregate random non-cached IOPS will peak at around
3000, cached at 50,000.  The bandwidth figures may seem low to people used to
"testing" md arrays with hdparm or dd and seeing figures of 500MB/s to 1GB/s
with only a handful of disks, however these are usually sequential _read_
figures only, on RAID 6 arrays, which have write performance often 2-3 times
lower.  In the real world, 1.6GB/s of sustained bidirectional random I/O
throughput while servicing dozens or hundreds of hosts is pretty phenomenal
performance, especially in this price range.  The NFS server will most likely be
the bottleneck though, not this storage, definitely so if 4 bonded GbE
interfaces are used for NFS serving instead of the 10 GbE NIC.
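
A quick sanity check of those capacity and IOPS figures (a sketch; the
per-drive IOPS value is an assumed typical number for 7.2K SATA, not a
Nexsan spec):

    # Sanity check on the 40-drive RAID-10 figures above.

    drives            = 40
    drive_tb          = 1.0
    iops_per_disk     = 75                          # assumed random IOPS per drive

    usable_tb         = drives * drive_tb / 2       # RAID-10 mirrors everything
    random_read_iops  = drives * iops_per_disk      # every spindle can serve reads
    random_write_iops = drives // 2 * iops_per_disk # each write hits both mirrors

    print("usable: %.0f TB, ~%d random read IOPS, ~%d random write IOPS"
          % (usable_tb, random_read_iops, random_write_iops))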

The hardware for this should run you well less than $50K USD for everything.
I'd highly recommend you create a single 40 drive RAID 10 array, as I mentioned
above with 2 spares, if you need performance as much as, if not more than,
capacity--especially write performance.  A 40 drive RAID 10 on this SATABeast
will give you performance almost identical to a 20 disk RAID 0 stripe.  If you
need additional capacity more than speed, configure 40 drives as a RAID 6.  The
read performance will be similar, although the write performance will take a big
dive with 40 drives and dual parity.

Configure 90-95% of the array as one logical drive and save the other 5-10% for
a rainy day--you'll be glad you did.  Export the logical drive as a single LUN.
 Format that LUN as XFS.  Visit the XFS mailing list and ask for instructions on
how best to format and mount it.  Use the most recent Linux kernel available,
2.6.37 or later, depending on when you actually build the NFS server--2.6.38/39
if they're stable.  If you get Linux+XFS+NFS configured and running optimally,
you should be more than impressed and satisfied with the performance and
reliability of this combined system.

I don't work for any of the companies whose products are mentioned above.  I'm
merely a satisfied customer of all of them.  The Nexsan products have the lowest
price/TB of any SAN storage products on the market, and the highest
performance/dollar, and lowest price per watt of power consumption.  They're
easy as cake to setup and manage with a nice GUI web interface over an ethernet
management port.

Hope you find this information useful.  Feel free to contact me directly if I
can be of further assistance.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 12:29 ` Stan Hoeppner
@ 2011-02-15 12:45   ` Roberto Spadim
  2011-02-15 13:03     ` Roberto Spadim
  2011-02-15 13:39   ` David Brown
  1 sibling, 1 reply; 116+ messages in thread
From: Roberto Spadim @ 2011-02-15 12:45 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Mdadm, Matt Garman

If you want a hobby server, an old computer with many PCI Express
slots, several SATA2 controllers, and mdadm will do the job; speed is
not a problem.  The common bottlenecks are:

1) disk speed for sequential read/write
2) disk speed for non-sequential read/write
3) disk channel (SATA/SAS/other)
4) PCI Express/PCI/ISA/other channel speed
5) RAM speed
6) CPU use

Note that the buffer on disk controllers only helps with read speed;
if you want more read speed, add more RAM (filesystem cache) or
controller cache.  Another option for high speed is SSD (its rate is
nearly the same for sequential and non-sequential access).  Use RAID-0
when possible, and RAID-1 just for mirroring (it is not a speed
improvement for writes, since a write completes at the rate of the
slowest disk; reads can get close to RAID-0 speed).
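
To see the gap between bottlenecks 1 and 2 on your own disks, a
minimal micro-benchmark is enough.  The sketch below is illustrative
only: the path is a placeholder, it goes through the page cache (drop
caches first or use a file much larger than RAM), and the test file
must be at least COUNT * BLOCK bytes.

    # Minimal sequential-vs-random read micro-benchmark (sketch only).
    import os, random, time

    PATH  = "/data/testfile"    # placeholder test file
    BLOCK = 1024 * 1024         # 1 MiB per read
    COUNT = 1024                # reads per pass

    def bench(random_offsets):
        size = os.path.getsize(PATH)
        fd = os.open(PATH, os.O_RDONLY)
        t0 = time.time()
        for i in range(COUNT):
            off = (random.randrange(size // BLOCK) if random_offsets else i) * BLOCK
            os.lseek(fd, off, os.SEEK_SET)
            os.read(fd, BLOCK)
        os.close(fd)
        return COUNT * BLOCK / 1e6 / (time.time() - t0)

    print("sequential: ~%.0f MB/s" % bench(False))
    print("random:     ~%.0f MB/s" % bench(True))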

2011/2/15 Stan Hoeppner <stan@hardwarefreak.com>:
> Matt Garman put forth on 2/14/2011 5:59 PM:
>
>> The requirement is basically this: around 40 to 50 compute machines
>> act as basically an ad-hoc scientific compute/simulation/analysis
>> cluster.  These machines all need access to a shared 20 TB pool of
>> storage.  Each compute machine has a gigabit network connection, and
>> it's possible that nearly every machine could simultaneously try to
>> access a large (100 to 1000 MB) file in the storage pool.  In other
>> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>
> If your description of the requirement is accurate, then what you need is a
> _reliable_ high performance NFS server backed by many large/fast spindles.
>
>> I was wondering if anyone on the list has built something similar to
>> this using off-the-shelf hardware (and Linux of course)?
>
> My thoughtful, considered, recommendation would be to stay away from a DIY build
> for the requirement you describe, and stay away from mdraid as well, but not
> because mdraid isn't up to the task.  I get the feeling you don't fully grasp
> some of the consequences of a less than expert level mdraid admin being
> responsible for such a system after it's in production.  If multiple drives are
> kicked off line simultaneously (posts of such seem to occur multiple times/week
> here), downing the array, are you capable of bringing it back online intact,
> successfully, without outside assistance, in a short period of time?  If you
> lose the entire array due to a typo'd mdadm parm, then what?
>
> You haven't described a hobby level system here, one which you can fix at your
> leisure.  You've described a large, expensive, production caliber storage
> resource used for scientific discovery.  You need to perform one very serious
> gut check, and be damn sure you're prepared to successfully manage such a large,
> apparently important, mdraid array when things go to the South pole in a
> heartbeat.  Do the NFS server yourself, as mistakes there are more forgiving
> than mistakes at the array level.  Thus, I'd recommend the following.  And as
> you can tell from the length of it, I put some careful consideration (and time)
> into whipping this up.
>
> Get one HP ProLiant DL 385 G7 eight core AMD Magny Cours server for deployment
> as your 64 bit Linux NFS server ($2500):
> http://www.newegg.com/Product/Product.aspx?Item=N82E16859105806
>
> Eight 2.3GHz cores is actually overkill for this NFS server, but this box has
> the right combination of price and other features you need.  The standard box
> comes with 2x2GB DIMMs, using only 2 of the 4 channels of the occupied G34
> socket, and 4GB is a tad short of what you'll need.  So toss the installed DIMMs
> and buy this HP certified 4 channel 16GB kit directly from Kingston ($400):
> http://www.ec.kingston.com/ecom/configurator_new/partsinfo.asp?root=us&LinkBack=http://www.kingston.com&ktcpartno=KTH-PL313K4/16G
>
> This box has 4 GbE ports, which will give you max NFS throughput of ~600-800
> MB/s bidirectional, roughly 1/3rd to half the storage system bandwidth (see
> below).  Link aggregation with the switch will help with efficiency.  Set jumbo
> frames across all the systems and switches obviously, MTU of 9000, or the lowest
> common denominator, regardless of which NIC solution you end up with.  If that's
> not enough b/w...
>
> Add one of these PCI Express 2.0 x8 10 GbE copper Intel NICs ($600) to bump max
> NFS throughput to ~1.5-2 GB/s bidirectional (assuming your switch has a copper
> 10 GbE port):
> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106043
> Due to using NFS+TCP/UDP as your protocols, this NIC can't outrun the FC back
> end, even though the raw signaling rate is slightly higher.  However, if you
> fired up 10-12 simultaneous FTP gets you'd come really close.
>
> Two of these for boot drives ($600):
> http://www.newegg.com/Product/Product.aspx?Item=N82E16822332060
> Mirror them with the onboard 256MB SmartArray BBWC RAID controller
>
> Qlogic PCIe 2.0 x4/x8 8Gbit FC HBA ($1100):
> http://www.newegg.com/Product/Product.aspx?Item=N82E16833380014&cm_re=qlogic-_-33-380-014-_-Product
>
> for connecting to the important part ($20-40K USD):
> http://www.nexsan.com/satabeast.php
>
> 42 drives in a single 4U chassis, one RAID group or many, up to 254 LUNs or just
> one, awesome capacity and performance for the price.  To keep costs down yet
> performance high, you'll want the 8Gbit FC single controller model with 2GB
> cache (standard) and with qty 42* 1TB 7.2K rpm SATA drives.  All drives use a
> firmware revision tested and certified by Nexsan for use with their controllers
> so you won't have problems with drives being randomly kicked offline, etc.  This
> is an enterprise class SAN controller.  (Do some research and look at Nexsan's
> customers and what they're using these things for.  Caltech dumps the data from
> the Spitzer space telescope to a group of 50-60 of these SATABeasts).
>
> A SATABeast with 42 * 1TB drives should run in the ballpark of $20-40K USD
> depending on the reseller and your organization status (EDU, non profit,
> government, etc).  Nexsan has resellers covering the entire Americas and Europe.
>  If you need to expand in the future, Nexsan offers the NXS-B60E expansion
> chassis
> (http://www.nexstor.co.uk/prod_pdfs/NXS-B60E_Nexsan%20NXS-B60E%20Datasheet.pdf)
> which holds 60 disks and plugs into the SATABeast with redundant multilane SAS
> cables, allowing up to 102 drives in 8U of rack space, 204TB total using 2TB
> drives, or any combination in between.  The NXS-B60E adds no additional
> bandwidth to the system.  Thus, if you need more speed and space, buy a second
> SATABeast and another FC card, or replace the single port FC card with a dual
> port model (or buy the dual port up front)
>
> With the full 42 drive chassis configured as a 40 disk RAID 10 (2 spares) you'll
> get 20TB usable space and you'll easily peak the 8GBit FC interface in both
> directions simultaneously.  Aggregate random non-cached IOPS will peak at around
> 3000, cached at 50,000.  The bandwidth figures may seem low to people used to
> "testing" md arrays with hdparm or dd and seeing figures of 500MB/s to 1GB/s
> with only a handful of disks, however these are usually sequential _read_
> figures only, on RAID 6 arrays, which have write performance often 2-3 times
> lower.  In the real world, 1.6GB/s of sustained bidirectional random I/O
> throughput while servicing dozens or hundreds of hosts is pretty phenomenal
> performance, especially in this price range.  The NFS server will most likely be
> the bottleneck though, not this storage, definitely so if 4 bonded GbE
> interfaces are used for NFS serving instead of the 10 GbE NIC.
>
> The hardware for this should run you well less than $50K USD for everything.
> I'd highly recommend you create a single 40 drive RAID 10 array, as I mentioned
> above with 2 spares, if you need performance as much as, if not more than,
> capacity--especially write performance.  A 40 drive RAID 10 on this SATABeast
> will give you performance almost identical to a 20 disk RAID 0 stripe.  If you
> need additional capacity more than speed, configure 40 drives as a RAID 6.  The
> read performance will be similar, although the write performance will take a big
> dive with 40 drives and dual parity.
>
> Configure 90-95% of the array as one logical drive and save the other 5-10% for
> a rainy day--you'll be glad you did.  Export the logical drive as a single LUN.
>  Format that LUN as XFS.  Visit the XFS mailing list and ask for instructions on
> how best to format and mount it.  Use the most recent Linux kernel available,
> 2.6.37 or later, depending on when you actually build the NFS server--2.6.38/39
> if they're stable.  If you get Linux+XFS+NFS configured and running optimally,
> you should be more than impressed and satisfied with the performance and
> reliability of this combined system.
>
> I don't work for any of the companies whose products are mentioned above.  I'm
> merely a satisfied customer of all of them.  The Nexsan products have the lowest
> price/TB of any SAN storage products on the market, and the highest
> performance/dollar, and lowest price per watt of power consumption.  They're
> easy as cake to setup and manage with a nice GUI web interface over an ethernet
> management port.
>
> Hope you find this information useful.  Feel free to contact me directly if I
> can be of further assistance.
>
> --
> Stan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 12:45   ` Roberto Spadim
@ 2011-02-15 13:03     ` Roberto Spadim
  2011-02-24 20:43       ` Matt Garman
  0 siblings, 1 reply; 116+ messages in thread
From: Roberto Spadim @ 2011-02-15 13:03 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Mdadm, Matt Garman

Disks are good for sequential access; for non-sequential access SSDs
are better (an SSD's sequential and non-sequential access rates are
the same).

In my tests, the best disk I have used (15000 RPM SAS 6Gb 146GB) gets
a sequential read of 160MB/s (random is slower).  An OCZ Vertex 2
SATA2 SSD (near USD 200 for 128GB) gets a minimum of 190MB/s and a
maximum of 270MB/s for random or sequential reads.  Maybe a disk isn't
a good option today; the cost of SSDs isn't a problem anymore.  I'm
using a Vertex 2 on one production server and the speed is really
good.

The solution to get more speed today is RAID-0 (or another striped
RAID solution).  Why?  Consider this example:

Reading sectors 1 to 10 on a RAID-0 of two hard disks, striping per
sector, a sequential read proceeds like this (both disk heads start at
position 0):

read sector 1 -> disk 1 reads at position 0, head moves to position 1
read sector 2 -> disk 2 reads at position 0, head moves to position 1
read sector 3 -> disk 1 reads at position 1, head moves to position 2
read sector 4 -> disk 2 reads at position 1, head moves to position 2
...

Each request lands exactly where that disk's head already is, so there
is essentially no seek time.  That is why you get about 2x the read
speed from a RAID-0 of hard disks on sequential reads: the access time
is very small.  With random access you get a much bigger access time,
since the disk must reposition its head; with a sequential read the
position barely changes.
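
The same layout in a few lines of Python, purely as an illustration of
the mapping described above (not md's actual code):

    # Which disk serves each logical chunk under sector-per-chunk striping
    # across two disks.

    def raid0_location(chunk, disks=2):
        """Return (disk index, offset on that disk) for a logical chunk."""
        return chunk % disks, chunk // disks

    for chunk in range(6):
        disk, off = raid0_location(chunk)
        print("logical chunk %d -> RAID-0: disk %d, offset %d | "
              "RAID-1: offset %d on every disk" % (chunk, disk, off, chunk))

Each RAID-0 disk steps through only half the offsets, so its head
stays nearly sequential while transferring half the data; with RAID-1,
a single sequential stream has to walk every offset on whichever disk
serves it.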

With RAID-1 on hard disks you can't get the same speed as RAID-0
striping, since sector 2 sits at position 2 on every disk.  That's why
today's RAID-1 read_balance uses a nearest-head algorithm, and if it
can serve the stream from a single disk it will use just that one
disk.

If you want to try another read_balance for RAID-1, I'm testing
(benchmarking) one at:
www.spadim.com.br/raid1/

When I get good benchmarks I will send it to Neil to test and try to
get it adopted in the next md version.  If you could help with
benchmarks, you are welcome. =)  There are many scenarios where a
different read_balance is better than near_head, and any array can use
any read_balance.  The time-based policy is good for everyone, but the
problem is the number of parameters needed to configure it.  Round
robin is good for SSDs, since their access time is the same for random
and sequential reads.  The stripe policy is a round-robin variant, but
I didn't see any performance improvement with it.  Near-head is good
with hard disks.


2011/2/15 Roberto Spadim <roberto@spadim.com.br>:
> if you want a hobby server
> a old computer with many pci-express slots
> many sata2 boards
> and a mdadm work
> no problem on speed
> the common bottle neck:
>
> 1)disk speed for sequencial read/write
> 2)disk speed for non-sequencial read/write
> 3)disk channel (SATA/SAS/other)
> 4)pci-express/pci/isa/other channel speed
> 5)ram memory speed
> 6)cpu use
>
> check that buffer on disk controllers just help with read speed, if
> you want more speed for read put more ram (file system cache) or
> controller cache
> another solution for big speed is ssd (for read/write it´s near a
> fixed speed rate), use raid0 when possible, raid1 just for mirroring
> (it´s not a speed improvement for writes, the write is done with the
> rate of slowest disk, read can work near raid0 if using harddisk,
> better if use raid1)
>
> 2011/2/15 Stan Hoeppner <stan@hardwarefreak.com>:
>> Matt Garman put forth on 2/14/2011 5:59 PM:
>>
>>> The requirement is basically this: around 40 to 50 compute machines
>>> act as basically an ad-hoc scientific compute/simulation/analysis
>>> cluster.  These machines all need access to a shared 20 TB pool of
>>> storage.  Each compute machine has a gigabit network connection, and
>>> it's possible that nearly every machine could simultaneously try to
>>> access a large (100 to 1000 MB) file in the storage pool.  In other
>>> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>>
>> If your description of the requirement is accurate, then what you need is a
>> _reliable_ high performance NFS server backed by many large/fast spindles.
>>
>>> I was wondering if anyone on the list has built something similar to
>>> this using off-the-shelf hardware (and Linux of course)?
>>
>> My thoughtful, considered, recommendation would be to stay away from a DIY build
>> for the requirement you describe, and stay away from mdraid as well, but not
>> because mdraid isn't up to the task.  I get the feeling you don't fully grasp
>> some of the consequences of a less than expert level mdraid admin being
>> responsible for such a system after it's in production.  If multiple drives are
>> kicked off line simultaneously (posts of such seem to occur multiple times/week
>> here), downing the array, are you capable of bringing it back online intact,
>> successfully, without outside assistance, in a short period of time?  If you
>> lose the entire array due to a typo'd mdadm parm, then what?
>>
>> You haven't described a hobby level system here, one which you can fix at your
>> leisure.  You've described a large, expensive, production caliber storage
>> resource used for scientific discovery.  You need to perform one very serious
>> gut check, and be damn sure you're prepared to successfully manage such a large,
>> apparently important, mdraid array when things go to the South pole in a
>> heartbeat.  Do the NFS server yourself, as mistakes there are more forgiving
>> than mistakes at the array level.  Thus, I'd recommend the following.  And as
>> you can tell from the length of it, I put some careful consideration (and time)
>> into whipping this up.
>>
>> Get one HP ProLiant DL 385 G7 eight core AMD Magny Cours server for deployment
>> as your 64 bit Linux NFS server ($2500):
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16859105806
>>
>> Eight 2.3GHz cores is actually overkill for this NFS server, but this box has
>> the right combination of price and other features you need.  The standard box
>> comes with 2x2GB DIMMs, using only 2 of the 4 channels of the occupied G34
>> socket, and 4GB is a tad short of what you'll need.  So toss the installed DIMMs
>> and buy this HP certified 4 channel 16GB kit directly from Kingston ($400):
>> http://www.ec.kingston.com/ecom/configurator_new/partsinfo.asp?root=us&LinkBack=http://www.kingston.com&ktcpartno=KTH-PL313K4/16G
>>
>> This box has 4 GbE ports, which will give you max NFS throughput of ~600-800
>> MB/s bidirectional, roughly 1/3rd to half the storage system bandwidth (see
>> below).  Link aggregation with the switch will help with efficiency.  Set jumbo
>> frames across all the systems and switches obviously, MTU of 9000, or the lowest
>> common denominator, regardless of which NIC solution you end up with.  If that's
>> not enough b/w...
>>
>> Add one of these PCI Express 2.0 x8 10 GbE copper Intel NICs ($600) to bump max
>> NFS throughput to ~1.5-2 GB/s bidirectional (assuming your switch has a copper
>> 10 GbE port):
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106043
>> Due to using NFS+TCP/UDP as your protocols, this NIC can't outrun the FC back
>> end, even though the raw signaling rate is slightly higher.  However, if you
>> fired up 10-12 simultaneous FTP gets you'd come really close.
>>
>> Two of these for boot drives ($600):
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16822332060
>> Mirror them with the onboard 256MB SmartArray BBWC RAID controller
>>
>> Qlogic PCIe 2.0 x4/x8 8Gbit FC HBA ($1100):
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16833380014&cm_re=qlogic-_-33-380-014-_-Product
>>
>> for connecting to the important part ($20-40K USD):
>> http://www.nexsan.com/satabeast.php
>>
>> 42 drives in a single 4U chassis, one RAID group or many, up to 254 LUNs or just
>> one, awesome capacity and performance for the price.  To keep costs down yet
>> performance high, you'll want the 8Gbit FC single controller model with 2GB
>> cache (standard) and with qty 42* 1TB 7.2K rpm SATA drives.  All drives use a
>> firmware revision tested and certified by Nexsan for use with their controllers
>> so you won't have problems with drives being randomly kicked offline, etc.  This
>> is an enterprise class SAN controller.  (Do some research and look at Nexsan's
>> customers and what they're using these things for.  Caltech dumps the data from
>> the Spitzer space telescope to a group of 50-60 of these SATABeasts).
>>
>> A SATABeast with 42 * 1TB drives should run in the ballpark of $20-40K USD
>> depending on the reseller and your organization status (EDU, non profit,
>> government, etc).  Nexsan has resellers covering the entire Americas and Europe.
>>  If you need to expand in the future, Nexsan offers the NXS-B60E expansion
>> chassis
>> (http://www.nexstor.co.uk/prod_pdfs/NXS-B60E_Nexsan%20NXS-B60E%20Datasheet.pdf)
>> which holds 60 disks and plugs into the SATABeast with redundant multilane SAS
>> cables, allowing up to 102 drives in 8U of rack space, 204TB total using 2TB
>> drives, or any combination in between.  The NXS-B60E adds no additional
>> bandwidth to the system.  Thus, if you need more speed and space, buy a second
>> SATABeast and another FC card, or replace the single port FC card with a dual
>> port model (or buy the dual port up front)
>>
>> With the full 42 drive chassis configured as a 40 disk RAID 10 (2 spares) you'll
>> get 20TB usable space and you'll easily peak the 8GBit FC interface in both
>> directions simultaneously.  Aggregate random non-cached IOPS will peak at around
>> 3000, cached at 50,000.  The bandwidth figures may seem low to people used to
>> "testing" md arrays with hdparm or dd and seeing figures of 500MB/s to 1GB/s
>> with only a handful of disks, however these are usually sequential _read_
>> figures only, on RAID 6 arrays, which have write performance often 2-3 times
>> lower.  In the real world, 1.6GB/s of sustained bidirectional random I/O
>> throughput while servicing dozens or hundreds of hosts is pretty phenomenal
>> performance, especially in this price range.  The NFS server will most likely be
>> the bottleneck though, not this storage, definitely so if 4 bonded GbE
>> interfaces are used for NFS serving instead of the 10 GbE NIC.
>>
>> The hardware for this should run you well less than $50K USD for everything.
>> I'd highly recommend you create a single 40 drive RAID 10 array, as I mentioned
>> above with 2 spares, if you need performance as much as, if not more than,
>> capacity--especially write performance.  A 40 drive RAID 10 on this SATABeast
>> will give you performance almost identical to a 20 disk RAID 0 stripe.  If you
>> need additional capacity more than speed, configure 40 drives as a RAID 6.  The
>> read performance will be similar, although the write performance will take a big
>> dive with 40 drives and dual parity.
>>
>> Configure 90-95% of the array as one logical drive and save the other 5-10% for
>> a rainy day--you'll be glad you did.  Export the logical drive as a single LUN.
>>  Format that LUN as XFS.  Visit the XFS mailing list and ask for instructions on
>> how best to format and mount it.  Use the most recent Linux kernel available,
>> 2.6.37 or later, depending on when you actually build the NFS server--2.6.38/39
>> if they're stable.  If you get Linux+XFS+NFS configured and running optimally,
>> you should be more than impressed and satisfied with the performance and
>> reliability of this combined system.
>>
>> I don't work for any of the companies whose products are mentioned above.  I'm
>> merely a satisfied customer of all of them.  The Nexsan products have the lowest
>> price/TB of any SAN storage products on the market, and the highest
>> performance/dollar, and lowest price per watt of power consumption.  They're
>> easy as cake to setup and manage with a nice GUI web interface over an ethernet
>> management port.
>>
>> Hope you find this information useful.  Feel free to contact me directly if I
>> can be of further assistance.
>>
>> --
>> Stan
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
>
>
> --
> Roberto Spadim
> Spadim Technology / SPAEmpresarial
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 12:29 ` Stan Hoeppner
  2011-02-15 12:45   ` Roberto Spadim
@ 2011-02-15 13:39   ` David Brown
  2011-02-16 23:32     ` Stan Hoeppner
  2011-02-24 20:49     ` Matt Garman
  1 sibling, 2 replies; 116+ messages in thread
From: David Brown @ 2011-02-15 13:39 UTC (permalink / raw)
  To: linux-raid

On 15/02/2011 13:29, Stan Hoeppner wrote:
> Matt Garman put forth on 2/14/2011 5:59 PM:
>
>> The requirement is basically this: around 40 to 50 compute machines
>> act as basically an ad-hoc scientific compute/simulation/analysis
>> cluster.  These machines all need access to a shared 20 TB pool of
>> storage.  Each compute machine has a gigabit network connection, and
>> it's possible that nearly every machine could simultaneously try to
>> access a large (100 to 1000 MB) file in the storage pool.  In other
>> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>
> If your description of the requirement is accurate, then what you need is a
> _reliable_ high performance NFS server backed by many large/fast spindles.
>
>> I was wondering if anyone on the list has built something similar to
>> this using off-the-shelf hardware (and Linux of course)?
>
> My thoughtful, considered, recommendation would be to stay away from a DIY build
> for the requirement you describe, and stay away from mdraid as well, but not
> because mdraid isn't up to the task.  I get the feeling you don't fully grasp
> some of the consequences of a less than expert level mdraid admin being
> responsible for such a system after it's in production.  If multiple drives are
> kicked off line simultaneously (posts of such seem to occur multiple times/week
> here), downing the array, are you capable of bringing it back online intact,
> successfully, without outside assistance, in a short period of time?  If you
> lose the entire array due to a typo'd mdadm parm, then what?
>

This brings up an important point - no matter what sort of system you 
get (home made, mdadm raid, or whatever) you will want to do some tests 
and drills at replacing failed drives.  Also make sure everything is 
well documented, and well labelled.  When mdadm sends you an email 
telling you drive sdx has failed, you want to be /very/ sure you know 
which drive is sdx before you take it out!



You also want to consider your raid setup carefully.  RAID 10 has been 
mentioned here several times - it is often a good choice, but not 
necessarily.  RAID 10 gives you fast recovery, and can at best survive a 
loss of half your disks - but at worst a loss of two disks will bring 
down the whole set.  It is also very inefficient in space.  If you use 
SSDs, it may not be worth double the price to have RAID 10.  If you use 
hard disks, it may not be sufficient safety.

I haven't built a RAID of anything like this size, so my comments here 
are only based on my imperfect understanding of the theory - I'm 
learning too.

RAID 10 has the advantage of good speed at reading (close to RAID 0 
speeds), at the cost of poorer write speed and poor space efficiency. 
RAID 5 and RAID 6 are space efficient, and fast for most purposes, but 
slow for rebuilds and slow for small writes.

You are not much bothered about write performance, and most of your 
writes are large anyway.

How about building the array as a two-tier RAID 6+5 setup?  Take 7 x 1TB 
disks as a RAID 6 for 5 TB space.  Five sets of these as RAID 5 gives 
you your 20 TB in 35 drives.  This will survive any four failed disks, 
or more depending on the combinations.  If you are careful how it is 
arranged, it will also survive a failing controller card.
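
A quick check of that arithmetic, using the drive size and set counts
from the paragraph above (illustrative only):

    # Quick check of the nested RAID 6+5 arithmetic above.

    drive_tb      = 1.0
    disks_per_set = 7                    # each inner RAID-6 set
    sets          = 5                    # outer RAID-5 across the sets

    inner_tb    = (disks_per_set - 2) * drive_tb   # RAID-6 gives up 2 disks
    total_tb    = (sets - 1) * inner_tb            # RAID-5 gives up 1 set
    total_disks = disks_per_set * sets

    print("%d disks -> %.0f TB usable" % (total_disks, total_tb))
    # Data is lost only if two inner sets fail, and each set survives two
    # failures, so at least 3 + 3 = 6 well-placed disk failures are needed.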

If a disk fails, you could remove that whole set from the outer array 
(which should have a write intent bitmap) - then the rebuild will go at 
maximal speed, while the outer array's speed will not be so badly 
affected.  Once the rebuild is complete, put it back in the outer array. 
  Since you are not doing many writes, it will not take long to catch up.

It is probably worth having a small array of SSDs (RAID1 or RAID10) to 
hold the write intent bitmap, the journal for your main file system, and 
of course your OS.  Maybe one of these absurdly fast PCI Express flash 
disks would be a good choice.




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-14 23:59 high throughput storage server? Matt Garman
  2011-02-15  2:06 ` Doug Dumitru
  2011-02-15 12:29 ` Stan Hoeppner
@ 2011-02-15 13:48 ` Zdenek Kaspar
  2011-02-15 14:29   ` Roberto Spadim
  2011-02-17 11:07 ` John Robinson
  2011-02-18 13:49 ` Mattias Wadenstein
  4 siblings, 1 reply; 116+ messages in thread
From: Zdenek Kaspar @ 2011-02-15 13:48 UTC (permalink / raw)
  To: linux-raid

Dne 15.2.2011 0:59, Matt Garman napsal(a):
> For many years, I have been using Linux software RAID at home for a
> simple NAS system.  Now at work, we are looking at buying a massive,
> high-throughput storage system (e.g. a SAN).  I have little
> familiarity with these kinds of pre-built, vendor-supplied solutions.
> I just started talking to a vendor, and the prices are extremely high.
> 
> So I got to thinking, perhaps I could build an adequate device for
> significantly less cost using Linux.  The problem is, the requirements
> for such a system are significantly higher than my home media server,
> and put me into unfamiliar territory (in terms of both hardware and
> software configuration).
> 
> The requirement is basically this: around 40 to 50 compute machines
> act as basically an ad-hoc scientific compute/simulation/analysis
> cluster.  These machines all need access to a shared 20 TB pool of
> storage.  Each compute machine has a gigabit network connection, and
> it's possible that nearly every machine could simultaneously try to
> access a large (100 to 1000 MB) file in the storage pool.  In other
> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
> 
> I was wondering if anyone on the list has built something similar to
> this using off-the-shelf hardware (and Linux of course)?
> 
> My initial thoughts/questions are:
> 
>     (1) We need lots of spindles (i.e. many small disks rather than
> few big disks).  How do you compute disk throughput when there are
> multiple consumers?  Most manufacturers provide specs on their drives
> such as sustained linear read throughput.  But how is that number
> affected when there are multiple processes simultanesously trying to
> access different data?  Is the sustained bulk read throughput value
> inversely proportional to the number of consumers?  (E.g. 100 MB/s
> drive only does 33 MB/s w/three consumers.)  Or is there are more
> specific way to estimate this?
> 
>     (2) The big storage server(s) need to connect to the network via
> multiple bonded Gigabit ethernet, or something faster like
> FibreChannel or 10 GbE.  That seems pretty straightforward.
> 
>     (3) This will probably require multiple servers connected together
> somehow and presented to the compute machines as one big data store.
> This is where I really don't know much of anything.  I did a quick
> "back of the envelope" spec for a system with 24 600 GB 15k SAS drives
> (based on the observation that 24-bay rackmount enclosures seem to be
> fairly common).  Such a system would only provide 7.2 TB of storage
> using a scheme like RAID-10.  So how could two or three of these
> servers be "chained" together and look like a single large data pool
> to the analysis machines?
> 
> I know this is a broad question, and not 100% about Linux software
> RAID.  But I've been lurking on this list for years now, and I get the
> impression there are list members who regularly work with "big iron"
> systems such as what I've described.  I'm just looking for any kind of
> relevant information here; any and all is appreciated!
> 
> Thank you,
> Matt
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

If you really need to handle 50 Gbit/s of storage traffic, then it's not
an easy hobby project. For a good price you probably want multiple
machines with lots of hard drives and fast interconnects.

Might be worth to ask here:
Newsgroups: gmane.comp.clustering.beowulf.general

HTH, Z.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 13:48 ` Zdenek Kaspar
@ 2011-02-15 14:29   ` Roberto Spadim
  2011-02-15 14:51     ` A. Krijgsman
  2011-02-15 14:56     ` Zdenek Kaspar
  0 siblings, 2 replies; 116+ messages in thread
From: Roberto Spadim @ 2011-02-15 14:29 UTC (permalink / raw)
  To: Zdenek Kaspar; +Cc: linux-raid

First, run memtest86 (if you use an x86 CPU) and check your RAM speed.
My HP (ML350 G5, very old: 2005) gets 2500 MB/s (~20 Gbit/s).

RAM may well be a bottleneck for 50 Gbit/s... you will need a
multi-computer RAID, or to stripe file-access operations across machines
(database on one machine, OS on another...).

For a hobby budget: SATA2 disks, 50 USD for a 1 TB disk doing ~50 MB/s.
Today's state of the art, in 'my world', is: http://www.ramsan.com/products/3
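
For a very rough single-threaded sanity check of memory copy speed (this
is only illustrative and no substitute for memtest86; the reported rate
is dominated by the kernel's zero/copy path):

  # push ~16 GB through /dev/zero -> /dev/null and note the MB/s dd reports
  dd if=/dev/zero of=/dev/null bs=1M count=16384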


2011/2/15 Zdenek Kaspar <zkaspar82@gmail.com>:
> On 15.2.2011 0:59, Matt Garman wrote:
>> [...]
>
> If you really need to handle 50 Gbit/s of storage traffic, then it's not
> an easy hobby project. For a good price you probably want multiple
> machines with lots of hard drives and fast interconnects.
>
> Might be worth to ask here:
> Newsgroups: gmane.comp.clustering.beowulf.general
>
> HTH, Z.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 14:29   ` Roberto Spadim
@ 2011-02-15 14:51     ` A. Krijgsman
  2011-02-15 16:44       ` Roberto Spadim
  2011-02-15 14:56     ` Zdenek Kaspar
  1 sibling, 1 reply; 116+ messages in thread
From: A. Krijgsman @ 2011-02-15 14:51 UTC (permalink / raw)
  To: Roberto Spadim, Zdenek Kaspar; +Cc: linux-raid

Just ran memtest two weeks ago.

If you run your memory in triple-channel you get 10 GByte (!) per second.
( This is memory from 2010 ;-) 1333 MHz )
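
For reference, the theoretical peak for that kind of memory works out
roughly as follows (assuming DDR3-1333 on a 64-bit channel; measured copy
bandwidth, e.g. what memtest reports, is normally well below this peak):

  1333\,\text{MT/s} \times 8\,\text{B} \approx 10.7\,\text{GB/s per channel},
  \qquad 3 \times 10.7\,\text{GB/s} \approx 32\,\text{GB/s (triple channel)}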


-----Original Message----- 
From: Roberto Spadim 
Sent: Tuesday, February 15, 2011 3:29 PM 
To: Zdenek Kaspar 
Cc: linux-raid@vger.kernel.org 
Subject: Re: high throughput storage server? 

[...]
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 14:29   ` Roberto Spadim
  2011-02-15 14:51     ` A. Krijgsman
@ 2011-02-15 14:56     ` Zdenek Kaspar
  2011-02-24 20:36       ` Matt Garman
  1 sibling, 1 reply; 116+ messages in thread
From: Zdenek Kaspar @ 2011-02-15 14:56 UTC (permalink / raw)
  To: linux-raid

On 15.2.2011 15:29, Roberto Spadim wrote:
> First, run memtest86 (if you use an x86 CPU) and check your RAM speed.
> My HP (ML350 G5, very old: 2005) gets 2500 MB/s (~20 Gbit/s).
> 
> RAM may well be a bottleneck for 50 Gbit/s... you will need a
> multi-computer RAID, or to stripe file-access operations across machines
> (database on one machine, OS on another...).
> 
> For a hobby budget: SATA2 disks, 50 USD for a 1 TB disk doing ~50 MB/s.
> Today's state of the art, in 'my world', is: http://www.ramsan.com/products/3

I doubt 20 TB of SLC flash which will survive huge write abuse is the
low-cost solution the OP wants to build himself..

or 20 TB of RAM, omg..

Z.



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15  4:44   ` Matt Garman
  2011-02-15  5:49     ` hansbkk
  2011-02-15  9:43     ` David Brown
@ 2011-02-15 15:16     ` Joe Landman
  2011-02-15 20:37       ` NeilBrown
  2011-02-24 20:58       ` Matt Garman
  2011-02-27 21:30     ` high throughput storage server? Ed W
  3 siblings, 2 replies; 116+ messages in thread
From: Joe Landman @ 2011-02-15 15:16 UTC (permalink / raw)
  To: Matt Garman; +Cc: Doug Dumitru, Mdadm

[disclosure: vendor posting, ignore if you wish, vendor html link at 
bottom of message]

On 02/14/2011 11:44 PM, Matt Garman wrote:
> On Mon, Feb 14, 2011 at 06:06:43PM -0800, Doug Dumitru wrote:
>> You have a whole slew of questions to answer before you can decide
>> on a design.  This is true if you build it yourself or decide to
>> go with a vendor and buy a supported server.  If you do go with a
>> vendor, the odds are actually quite good you will end up with
>> Linux anyway.
>
> I kind of assumed/wondered if the vendor-supplied systems didn't run
> Linux behind the scenes anyway.

We've been using Linux as the basis for our storage systems. 
Occasionally there are other OSes required by customers, but for the 
most part, Linux is the preferred platform.

[...]

>> Next, is the space all the same.  Perhaps some of it is "active"
>> and some of it is archival.  If you need 4TB of "fast" storage and
>> ...
>> well.  You can probably build this for around $5K (or maybe a bit
>> less) including a 10GigE adapter and server class components.
>
> The whole system needs to be "fast".

Ok ... sounds strange, but ...

Define what you mean by "fast".  Seriously ... we've had people tell us 
about their "huge" storage needs that we can easily fit onto a single 
small unit, no storage cluster needed.  We've had people say "fast" when 
they mean "able to keep 1 GbE port busy".

Fast needs to be articulated really in terms of what you will do with 
it.  As you noted in this and other messages, you are scaling up from 10 
compute nodes to 40 compute nodes.  4x change in demand, and I am 
guessing bandwidth (if these are large files you are streaming) or IOPs 
(if these are many small files you are reading).  Small and large here 
would mean less than 64kB for small, and greater than 4MB for large.


> Actually, to give more detail, we currently have a simple system I
> built for backup/slow access.  This is exactly what you described, a
> bunch of big, slow disks.  Lots of space, lousy I/O performance, but
> plenty adequate for backup purposes.

Your choice is simple.  Build or buy.  Many folks have made suggestions, 
and some are pretty reasonable, though a pure SSD or Flash based 
machine, while doable (and we sell these), is quite unlikely to be close 
to the realities of your budget.  There are use cases for which this 
does make sense, but the costs are quite prohibitive for all but a few 
users.

> As of right now, we actually have about a dozen "users", i.e.
> compute servers.  The collection is basically a home-grown compute
> farm.  Each server has a gigabit ethernet connection, and 1 TB of
> RAID-1 spinning disk storage.  Each server mounts every other
> server via NFS, and the current data is distributed evenly across
> all systems.

Ok ... this isn't something that's great to manage.  I might suggest 
looking at GlusterFS for this.  You can aggregate and distribute your 
data.  Even build in some resiliency if you wish/need.  GlusterFS 3.1.2 
is open source, so you can deploy fairly easily.
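
For illustration only (host, brick, and volume names below are made up,
and the exact syntax should be checked against the docs for whatever
3.1.x release you deploy), a simple distributed volume over the existing
servers might look like:

  # from one node, after peering the others
  gluster peer probe node2
  gluster volume create scratch transport tcp node1:/export/brick node2:/export/brick
  gluster volume start scratch
  # each client then mounts the aggregate namespace
  mount -t glusterfs node1:/scratch /mnt/scratch

Add a 'replica 2' count to the create command if you want resiliency at
the cost of half the space.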

>
> So, loosely speaking, right now we have roughly 10 TB of
> "live"/"fast" data available at 1 to 10 gbps, depending on how you
> look at it.
>
> While we only have about a dozen servers now, we have definitely
> identified growing this compute farm about 4x (to 40--50 servers)
> within the next year.  But the storage capacity requirements
> shouldn't change too terribly much.  The 20 TB number was basically
> thrown out there as a "it would be nice to have 2x the live
> storage".

Without building a storage unit, you could (in concept) use GlusterFS 
for this.  In practice, this model gets harder and harder to manage as 
you increase the number of nodes.  Adding the (N+1)th node means you have 
N+1 nodes to modify and manage storage on.  This does not scale well at all.

>
> I'll also add that this NAS needs to be optimized for *read*
> throughput.  As I mentioned, the only real write process is the
> daily "harvesting" of the data files.  Those are copied across
> long-haul leased lines, and the copy process isn't really
> performance sensitive.  In other words, in day-to-day use, those
> 40--50 client machines will do 100% reading from the NAS.

Ok.

This isn't a commercial.  I'll keep this part short.

We've built systems like this which sustain north of 10GB/s (big B not 
little b) for concurrent read and write access from thousands of cores. 
  20TB (and 40TB) are on the ... small ... side for this, but it is very 
doable.

As a tie in to the Linux RAID list, we use md raid for our OS drives 
(SSD pairs), and other utility functions within the unit, as well as 
striping over our hardware accelerated RAIDs.  We would like to use 
non-power of two chunk sizes, but haven't delved into the code as much 
as we'd like to see if we can make this work.

As a rule, we find mdadm to be an excellent tool, and the whole md RAID 
system to be quite good.  We may spend time at some point on figuring 
out what's wrong with the multi-threaded raid456 bit (it allocated 200+ 
kernel threads last I played with it), but apart from bits like that, we 
do find it very good for production use.  It isn't as fast as some 
dedicated accelerated RAID hardware (though we have our md + kernel 
stack very well tuned, so some of our software RAIDs are faster than many 
of our competitors' hardware RAIDs).

You could build a fairly competent unit using md RAID.

It all gets back to build versus buy.  In either case, I'd recommend 
grabbing a copy of dstat (http://dag.wieers.com/home-made/dstat/) and 
watching your IO/network system throughput.  I am assuming 1 GbE 
switches as the basis for your cluster.  I assume this will not change. 
  The cost of your time/effort and any opportunity cost and productivity 
loss should also be accounted for in the cost-benefit analysis.  That 
is, if it costs you less overall to buy than to build, should you build 
anyway?  Generally no, but some people simply want the experience.
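
On the dstat point, something as simple as the following (one sample
every five seconds) is usually enough to show whether the disks or the
network run out of steam first:

  # CPU, disk, and network throughput side by side
  dstat -c -d -n 5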

A big issue you need to be aware of with md RAID is the hotswap problem. 
  Your SATA link needs to allow you to pull a drive out without crashing 
the machine.  Many of the on-motherboard SATA connections we've used 
over the years don't tolerate unplug/replug events very well.  I'd recommend 
at least a reasonable HBA for this that understands hot swap and 
handles it correctly (you need hardware and driver level support to 
correctly signal the kernel of these events).
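
If you do end up on plain motherboard ports, it is safer to tell the
kernel the disk is going away before you pull it.  A rough sketch (device
names are examples only):

  # make sure md is no longer using the disk
  mdadm /dev/md0 --fail /dev/sdc --remove /dev/sdc
  # ask the SCSI/libata layer to detach it cleanly, then pull the drive
  echo 1 > /sys/block/sdc/device/delete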

If you decide to buy, have a really clear idea of your performance 
regime, and a realistic eye towards budget.  A 48 TB server with > 2GB/s 
streaming performance for TB sized files is very doable, well under $30k 
USD.  A 48 TB software RAID version would be quite a bit less than that.

Good luck with this, and let us know what you do.

vendor html link:  http://scalableinformatics.com , our storage clusters 
http://scalableinformatics.com/sicluster

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 14:51     ` A. Krijgsman
@ 2011-02-15 16:44       ` Roberto Spadim
  0 siblings, 0 replies; 116+ messages in thread
From: Roberto Spadim @ 2011-02-15 16:44 UTC (permalink / raw)
  To: A. Krijgsman; +Cc: Zdenek Kaspar, linux-raid

10 GByte/s ~ 80 Gbit/s; I don't know if 50 Gbit/s is possible.
You have OS and CPU time spent reading and writing many things, not just
memory (filesystem cache, etc. etc. etc.), so maybe you can't get this
speed with just 80 Gbit/s of memory bandwidth.

2011/2/15 A. Krijgsman <a.krijgsman@draftsman.nl>:
> Just ran memtest two weeks ago.
>
> If you run your memory in triple-channel you get 10 GByte (!) per second.
> ( This is memory from 2010 ;-) 1333 MHz )
>
> [...]



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 15:16     ` Joe Landman
@ 2011-02-15 20:37       ` NeilBrown
  2011-02-15 20:47         ` Joe Landman
  2011-02-24 20:58       ` Matt Garman
  1 sibling, 1 reply; 116+ messages in thread
From: NeilBrown @ 2011-02-15 20:37 UTC (permalink / raw)
  To: Joe Landman; +Cc: Matt Garman, Doug Dumitru, Mdadm

On Tue, 15 Feb 2011 10:16:15 -0500 Joe Landman <joe.landman@gmail.com> wrote:

> As a tie in to the Linux RAID list, we use md raid for our OS drives 
> (SSD pairs), and other utility functions within the unit, as well as 
> striping over our hardware accelerated RAIDs.  We would like to use 
> non-power of two chunk sizes, but haven't delved into the code as much 
> as we'd like to see if we can make this work.
> 

md/raid0 (striping) currently supports non-power-of-two chunk sizes, though
it is a relatively recent addition.
(raid4/5/6 doesn't).

Just FYI.

NeilBrown


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 20:37       ` NeilBrown
@ 2011-02-15 20:47         ` Joe Landman
  2011-02-15 21:41           ` NeilBrown
  0 siblings, 1 reply; 116+ messages in thread
From: Joe Landman @ 2011-02-15 20:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: Matt Garman, Doug Dumitru, Mdadm

On 02/15/2011 03:37 PM, NeilBrown wrote:
> On Tue, 15 Feb 2011 10:16:15 -0500 Joe Landman<joe.landman@gmail.com>  wrote:
>
>> As a tie in to the Linux RAID list, we use md raid for our OS drives
>> (SSD pairs), and other utility functions within the unit, as well as
>> striping over our hardware accelerated RAIDs.  We would like to use
>> non-power of two chunk sizes, but haven't delved into the code as much
>> as we'd like to see if we can make this work.
>>
>
> md/raid0 (striping) currently supports non-power-of-two chunk sizes, though
> it is a relatively recent addition.
> (raid4/5/6 doesn't).

Cool!  We need to start playing with this ...

Which kernels have the support?

--
Joe

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 20:47         ` Joe Landman
@ 2011-02-15 21:41           ` NeilBrown
  0 siblings, 0 replies; 116+ messages in thread
From: NeilBrown @ 2011-02-15 21:41 UTC (permalink / raw)
  To: Joe Landman; +Cc: Matt Garman, Doug Dumitru, Mdadm

On Tue, 15 Feb 2011 15:47:37 -0500 Joe Landman <joe.landman@gmail.com> wrote:

> On 02/15/2011 03:37 PM, NeilBrown wrote:
> > On Tue, 15 Feb 2011 10:16:15 -0500 Joe Landman<joe.landman@gmail.com>  wrote:
> >
> >> As a tie in to the Linux RAID list, we use md raid for our OS drives
> >> (SSD pairs), and other utility functions within the unit, as well as
> >> striping over our hardware accelerated RAIDs.  We would like to use
> >> non-power of two chunk sizes, but haven't delved into the code as much
> >> as we'd like to see if we can make this work.
> >>
> >
> > md/raid0 (striping) currently supports non-power-of-two chunk sizes, though
> > it is a relatively recent addition.
> > (raid4/5/6 doesn't).
> 
> Cool!  We need to start playing with this ...
> 
> Which kernels have the support?

It was enabled by commit fbb704efb784e2c8418e34dc3013af76bdd58101
so 

$ git name-rev fbb704efb784e2c8418e34dc3013af76bdd58101
fbb704efb784e2c8418e34dc3013af76bdd58101 tags/v2.6.31-rc1~143^2~18


2.6.31 has this support.

However I note that mdadm still checks that the chunk size is a power of two:
       if (chunk < 8 || ((chunk-1)&chunk)) {

I should fix that...

NeilBrown


> 
> --
> Joe
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 13:39   ` David Brown
@ 2011-02-16 23:32     ` Stan Hoeppner
  2011-02-17  0:00       ` Keld Jørn Simonsen
  2011-02-17  0:26       ` David Brown
  2011-02-24 20:49     ` Matt Garman
  1 sibling, 2 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-16 23:32 UTC (permalink / raw)
  To: Linux RAID

David Brown put forth on 2/15/2011 7:39 AM:

> This brings up an important point - no matter what sort of system you get (home
> made, mdadm raid, or whatever) you will want to do some tests and drills at
> replacing failed drives.  Also make sure everything is well documented, and well
> labelled.  When mdadm sends you an email telling you drive sdx has failed, you
> want to be /very/ sure you know which drive is sdx before you take it out!

This is one of the many reasons I recommended an enterprise class vendor
solution.  The Nexsan unit can be configured for SMTP and/or SNMP and/or pager
notification.  When a drive is taken offline the drive slot is identified in the
GUI.  Additionally, the backplane board has power and activity LEDs next to each
drive.  When you slide the chassis out of the rack (while still fully
operating), and pull the cover, you will see a distinct blink pattern of the
LEDs next to the failed drive.  This is fully described in the documentation,
but even without reading such it'll be crystal clear which drive is down.  There
is zero guess work.

The drive replacement testing scenario you describe is unnecessary with the
Nexsan products as well as any enterprise disk array.

> You also want to consider your raid setup carefully.  RAID 10 has been mentioned
> here several times - it is often a good choice, but not necessarily.  RAID 10
> gives you fast recovery, and can at best survive a loss of half your disks - but
> at worst a loss of two disks will bring down the whole set.  It is also very
> inefficient in space.  If you use SSDs, it may not be worth double the price to
> have RAID 10.  If you use hard disks, it may not be sufficient safety.

RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
today due to the low price of mech drives.  Using the SATABeast as an example,
the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
$1200/TB.  Given all the advantages of RAID 10 over RAID 6 the 33% premium is
more than worth it.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-16 23:32     ` Stan Hoeppner
@ 2011-02-17  0:00       ` Keld Jørn Simonsen
  2011-02-17  0:19         ` Stan Hoeppner
  2011-02-17  0:26       ` David Brown
  1 sibling, 1 reply; 116+ messages in thread
From: Keld Jørn Simonsen @ 2011-02-17  0:00 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Linux RAID

On Wed, Feb 16, 2011 at 05:32:58PM -0600, Stan Hoeppner wrote:
> David Brown put forth on 2/15/2011 7:39 AM:
> 
> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
> today due to the low price of mech drives.  Using the SATABeast as an example,
> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
> $1200/TB.  Given all the advantages of RAID 10 over RAID 6 the 33% premium is
> more than worth it.

I assume that by 20 TB you mean the payload space in both cases, that
is, for the Linux MD RAID10 you actually have 40 TB of raw disk space.
With the Linux MD RAID10 solution you can furthermore enjoy almost
double the read speed, since 20 * 2 TB spindles are involved compared
to 12 * 2 TB spindles.

best regards
keld

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17  0:00       ` Keld Jørn Simonsen
@ 2011-02-17  0:19         ` Stan Hoeppner
  2011-02-17  2:23           ` Roberto Spadim
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-17  0:19 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: Linux RAID

Keld Jørn Simonsen put forth on 2/16/2011 6:00 PM:
> On Wed, Feb 16, 2011 at 05:32:58PM -0600, Stan Hoeppner wrote:
>> David Brown put forth on 2/15/2011 7:39 AM:
>>
>> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
>> today due to the low price of mech drives.  Using the SATABeast as an example,
>> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
>> $1200/TB.  Given all the advantages of RAID 10 over RAID 6 the 33% premium is
>> more than worth it.
> 
> I assume that by 20 TB you mean the payload space in both cases, that
> is, for the Linux MD RAID10 you actually have 40 TB of raw disk space.
> With the Linux MD RAID10 solution you can furthermore enjoy almost
> double the read speed, since 20 * 2 TB spindles are involved compared
> to 12 * 2 TB spindles.

Enterprise solutions don't use Linux mdraid.  The RAID function is built into
the SAN controller.  My TCO figures were based on a single controller SATABeast,
42x1TB drives in the RAID 10, and 24x1TB drives in the RAID 6, each
configuration including two spares.
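
For completeness, the usable capacity works out the same either way once
the two spares are excluded from each set:

  \text{RAID 10: } \frac{(42-2)\times 1\,\text{TB}}{2} = 20\,\text{TB},
  \qquad \text{RAID 6: } (24-2-2)\times 1\,\text{TB} = 20\,\text{TB}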

-- 
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-16 23:32     ` Stan Hoeppner
  2011-02-17  0:00       ` Keld Jørn Simonsen
@ 2011-02-17  0:26       ` David Brown
  2011-02-17  0:45         ` Stan Hoeppner
  1 sibling, 1 reply; 116+ messages in thread
From: David Brown @ 2011-02-17  0:26 UTC (permalink / raw)
  To: linux-raid

(Sorry for the mixup in sending this by direct email instead of posting 
to the list.)

On 17/02/11 00:32, Stan Hoeppner wrote:
> David Brown put forth on 2/15/2011 7:39 AM:
>
>> This brings up an important point - no matter what sort of system you get (home
>> made, mdadm raid, or whatever) you will want to do some tests and drills at
>> replacing failed drives.  Also make sure everything is well documented, and well
>> labelled.  When mdadm sends you an email telling you drive sdx has failed, you
>> want to be /very/ sure you know which drive is sdx before you take it out!
>
> This is one of the many reasons I recommended an enterprise class vendor
> solution.  The Nexsan unit can be configured for SMTP and/or SNMP and/or pager
> notification.  When a drive is taken offline the drive slot is identified in the
> GUI.  Additionally, the backplane board has power and activity LEDs next to each
> drive.  When you slide the chassis out of the rack (while still fully
> operating), and pull the cover, you will see a distinct blink pattern of the
> LEDs next to the failed drive.  This is fully described in the documentation,
> but even without reading such it'll be crystal clear which drive is down.  There
> is zero guess work.
>
> The drive replacement testing scenario you describe is unnecessary with the
> Nexsan products as well as any enterprise disk array.
>

I'd still like to do a test - you don't want to be surprised at the 
wrong moment.  The test lets you know everything is working fine, and 
gives you a feel of how long it will take, and how easy or difficult it is.

But I agree there is a lot of benefit in the sort of clear indications 
of problems that you get with that sort of hardware rather than with a 
home-made system.


>> You also want to consider your raid setup carefully.  RAID 10 has been mentioned
>> here several times - it is often a good choice, but not necessarily.  RAID 10
>> gives you fast recovery, and can at best survive a loss of half your disks - but
>> at worst a loss of two disks will bring down the whole set.  It is also very
>> inefficient in space.  If you use SSDs, it may not be worth double the price to
>> have RAID 10.  If you use hard disks, it may not be sufficient safety.
>
> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
> today due to the low price of mech drives.  Using the SATABeast as an example,
> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
> $1200/TB.  Given all the advantages of RAID 10 over RAID 6 the 33% premium is
> more than worth it.
>


I don't think it is fair to give general rules like that.  In this 
particular case, that might be how the sums work out.  But in other 
cases, using RAID 10 instead of RAID 6 might mean stepping up in chassis 
or controller size and costs.  Also remember that RAID 10 is not better 
than RAID 6 in every way - a RAID 6 array will survive any two failed 
drives, while with RAID 10 an unlucky pairing of failed drives will 
bring down the whole raid.  Different applications require different 
balances here.



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17  0:26       ` David Brown
@ 2011-02-17  0:45         ` Stan Hoeppner
  2011-02-17 10:39           ` David Brown
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-17  0:45 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

David Brown put forth on 2/16/2011 6:26 PM:

> On 17/02/11 00:32, Stan Hoeppner wrote:

>> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
>> today due to the low price of mech drives.  Using the SATABeast as an example,
>> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
>> $1200/TB.  Given all the advantages of RAID 10 over RAID 6 the 33% premium is
>> more than worth it.

> I don't think it is fair to give general rules like that.  In this particular

The IT press does it every day.  CTOs read those articles.  In many cases it's
their primary source of information.  Speak in terms CTOs (i.e. those holding
the purse) understand.

> case, that might be how the sums work out.  But in other cases, using RAID 10
> instead of RAID 6 might mean stepping up in chassis or controller size and
> costs.  Also remember that RAID 10 is not better than RAID 6 in every way - a
> RAID 6 array will survive any two failed drives, while with RAID 10 an unlucky
> pairing of failed drives will bring down the whole raid.  Different applications
> require different balances here.

I'm not sure about being "fair" but it directly relates to the original question
that started this thread.  The OP wanted performance and space with a preference
for performance.  This demonstrates he can get the performance for a ~33% cost
premium.  He didn't mention a budget limit, only that most vendor figures were
too high.

Also, you're repeating points I've made in this (and other) threads back to me.
 Try to keep up David. ;)

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17  0:19         ` Stan Hoeppner
@ 2011-02-17  2:23           ` Roberto Spadim
  2011-02-17  3:05             ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Roberto Spadim @ 2011-02-17  2:23 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Keld Jørn Simonsen, Linux RAID

what does 'enterprise' mean?

2011/2/16 Stan Hoeppner <stan@hardwarefreak.com>:
> Keld Jørn Simonsen put forth on 2/16/2011 6:00 PM:
>> On Wed, Feb 16, 2011 at 05:32:58PM -0600, Stan Hoeppner wrote:
>>> David Brown put forth on 2/15/2011 7:39 AM:
>>>
>>> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
>>> today due to the low price of mech drives.  Using the SATABeast as an example,
>>> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
>>> $1200/TB.  Given all the advantages of RAID 10 over RAID 6 the 33% premium is
>>> more than worth it.
>>
>> I assume that by 20 TB you mean the payload space in both cases, that
>> is, for the Linux MD RAID10 you actually have 40 TB of raw disk space.
>> With the Linux MD RAID10 solution you can furthermore enjoy almost
>> double the read speed, since 20 * 2 TB spindles are involved compared
>> to 12 * 2 TB spindles.
>
> Enterprise solutions don't use Linux mdraid.  The RAID function is built into
> the SAN controller.  My TCO figures were based on a single controller SATABeast,
> 42x1TB drives in the RAID 10, and 24x1TB drives in the RAID 6, each
> configuration including two spares.
>
> --
> Stan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17  2:23           ` Roberto Spadim
@ 2011-02-17  3:05             ` Stan Hoeppner
  0 siblings, 0 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-17  3:05 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Keld Jørn Simonsen, Linux RAID

Roberto Spadim put forth on 2/16/2011 8:23 PM:
> what does 'enterprise' mean?

http://lmgtfy.com/?q=enterprise+storage

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17  0:45         ` Stan Hoeppner
@ 2011-02-17 10:39           ` David Brown
  0 siblings, 0 replies; 116+ messages in thread
From: David Brown @ 2011-02-17 10:39 UTC (permalink / raw)
  To: linux-raid

On 17/02/2011 01:45, Stan Hoeppner wrote:
> David Brown put forth on 2/16/2011 6:26 PM:
>
>> On 17/02/11 00:32, Stan Hoeppner wrote:
>
>>> RAID level space/cost efficiency from a TCO standpoint is largely irrelevant
>>> today due to the low price of mech drives.  Using the SATABeast as an example,
>>> the cost per TB of a 20TB RAID 10 is roughly $1600/TB and a 20TB RAID 6 is about
>>> $1200/TB.  Given all the advantages of RAID 10 over RAID 6 the 33% premium is
>>> more than worth it.
>
>> I don't think it is fair to give general rules like that.  In this particular
>
> The IT press does it every day.  CTOs read those articles.  In many cases it's
> their primary source of information.  Speak in terms CTOs (i.e. those holding
> the purse) understand.
>

I work at a small company - I get to read the articles, make the 
recommendations, and build the servers.  So I can put more emphasis on 
what I think is technically the best solution for us, rather than what 
sounds good in the press.  Of course, the other side of the coin is that 
being a small company with modest server needs, I don't get to play with 
20 TB raid systems!

>> case, that might be how the sums work out.  But in other cases, using RAID 10
>> instead of RAID 6 might mean stepping up in chassis or controller size and
>> costs.  Also remember that RAID 10 is not better than RAID 6 in every way - a
>> RAID 6 array will survive any two failed drives, while with RAID 10 an unlucky
>> pairing of failed drives will bring down the whole raid.  Different applications
>> require different balances here.
>
> I'm not sure about being "fair" but it directly relates to the original question
> that started this thread.  The OP wanted performance and space with a preference
> for performance.  This demonstrates he can get the performance for a ~33% cost
> premium.  He didn't mention a budget limit, only that most vendor figures were
> too high.
>

I agree that RAID 10 sounds like a match for the OP.  All I am saying is 
that it is not necessarily the best choice in general, and not just 
because of the initial purchase price.

> Also, you're repeating points I've made in this (and other) threads back to me.
>   Try to keep up David. ;)
>

I'm doing my best!  I believe I've got a fair understanding of various 
sorts of RAID systems, but I am totally missing real-world experience of 
anything more advanced than a four disk setup.  Bigger RAID setups are 
only a hobby interest for me at the moment, so I'm learning as I go 
here.  And you write such a lot here that it's hard for an amateur to 
take it all in :-)


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-14 23:59 high throughput storage server? Matt Garman
                   ` (2 preceding siblings ...)
  2011-02-15 13:48 ` Zdenek Kaspar
@ 2011-02-17 11:07 ` John Robinson
  2011-02-17 13:36   ` Roberto Spadim
  2011-02-17 21:47   ` Stan Hoeppner
  2011-02-18 13:49 ` Mattias Wadenstein
  4 siblings, 2 replies; 116+ messages in thread
From: John Robinson @ 2011-02-17 11:07 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm

On 14/02/2011 23:59, Matt Garman wrote:
[...]
> The requirement is basically this: around 40 to 50 compute machines
> act as basically an ad-hoc scientific compute/simulation/analysis
> cluster.  These machines all need access to a shared 20 TB pool of
> storage.  Each compute machine has a gigabit network connection, and
> it's possible that nearly every machine could simultaneously try to
> access a large (100 to 1000 MB) file in the storage pool.  In other
> words, a 20 TB file store with bandwidth upwards of 50 Gbps.

I'd recommend you analyse that requirement more closely. Yes, you have 
50 compute machines with GigE connections so it's possible they could 
all demand data from the file store at once, but in actual use, would they?

For example, if these machines were each to demand a 100MB file, how 
long would they spend computing their results from it? If it's only 1 
second, then you would indeed need an aggregate bandwidth of 50Gbps[1]. 
If it's 20 seconds processing, your filer only needs an aggregate 
bandwidth of 2.5Gbps.

So I'd recommend you work out first how much data the compute machines 
can actually chew through and work up from there, rather than what their 
network connections could stream through and work down.

Cheers,

John.

[1] I'm assuming the compute nodes are fetching the data for the next 
compute cycle while they're working on this one; if they're not you're 
likely making unnecessary demands on your filer while leaving your 
compute nodes idle.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17 11:07 ` John Robinson
@ 2011-02-17 13:36   ` Roberto Spadim
  2011-02-17 13:54     ` Roberto Spadim
  2011-02-17 21:47   ` Stan Hoeppner
  1 sibling, 1 reply; 116+ messages in thread
From: Roberto Spadim @ 2011-02-17 13:36 UTC (permalink / raw)
  To: John Robinson; +Cc: Matt Garman, Mdadm

with more network cards = more network Gbps
with better (faster) RAM = more disk reads (from cache)
with more RAID 0/4/5/6 = more speed on disk reads
with more RAID 1 mirrors = more security
with more SAS/SATA/RAID controllers = more GB/TB of storage
with more of anything ~= more money
Just decide what numbers you want and make it work.

2011/2/17 John Robinson <john.robinson@anonymous.org.uk>:
> On 14/02/2011 23:59, Matt Garman wrote:
> [...]
>>
>> The requirement is basically this: around 40 to 50 compute machines
>> act as basically an ad-hoc scientific compute/simulation/analysis
>> cluster.  These machines all need access to a shared 20 TB pool of
>> storage.  Each compute machine has a gigabit network connection, and
>> it's possible that nearly every machine could simultaneously try to
>> access a large (100 to 1000 MB) file in the storage pool.  In other
>> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>
> I'd recommend you analyse that requirement more closely. Yes, you have 50
> compute machines with GigE connections so it's possible they could all
> demand data from the file store at once, but in actual use, would they?
>
> For example, if these machines were each to demand a 100MB file, how long
> would they spend computing their results from it? If it's only 1 second,
> then you would indeed need an aggregate bandwidth of 50Gbps[1]. If it's 20
> seconds processing, your filer only needs an aggregate bandwidth of 2.5Gbps.
>
> So I'd recommend you work out first how much data the compute machines can
> actually chew through and work up from there, rather than what their network
> connections could stream through and work down.
>
> Cheers,
>
> John.
>
> [1] I'm assuming the compute nodes are fetching the data for the next
> compute cycle while they're working on this one; if they're not you're
> likely making unnecessary demands on your filer while leaving your compute
> nodes idle.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17 13:36   ` Roberto Spadim
@ 2011-02-17 13:54     ` Roberto Spadim
  0 siblings, 0 replies; 116+ messages in thread
From: Roberto Spadim @ 2011-02-17 13:54 UTC (permalink / raw)
  To: John Robinson; +Cc: Matt Garman, Mdadm

Building it on only one machine...
If you want 50 Gbps, put in six 10 GbE ports (one more than the minimum
five) for network access (you need many PCI Express slots at 4x (10 Gbps)
or 8x (20 Gbps)).
I use RAID 10 for redundancy and speed; you can also do RAID 1 for
redundancy and then RAID 0/4/5/6 over the RAID 1 devices for better speed.

SATA/SAS/RAID controllers?  SATA is very cheap and you can use SSDs with
a SATA2 interface; SAS has faster (lower access time) hard disks at
10k/15k rpm.

RAM?  More RAM = more cache/buffers, lower disk usage, more read speed.
CPU?  I don't know what to use, but it's a big machine, so maybe you need
a server motherboard (5 PCI Express slots just for network = big
motherboard, big motherboard = many CPUs).  Try with only one 6-core
hyperthreaded CPU, etc.; if that's not enough, put in a second CPU.

Operating system?  Linux with md =) - it's an md list, hehehe - but maybe
NetBSD or FreeBSD or Windows would work too.
File server?  NFS, Samba.
Filesystem?  Hmmm, a cluster FS is good here, but a single ext4, XFS or
ReiserFS could work.  Is your power reliable?  Do you want journaling?
Redundancy/cluster?  Beowulf, openMosix, others; Heartbeat, Pacemaker, others.
SQL database?  MySQL has NDB for clusters; MyISAM is fast but lacks some
features, InnoDB is slower with many features, Aria = MyISAM but slower
to write, with a crash-safe feature.  Oracle is good but MySQL is light
on resources.  Postgres is nice too; maybe your app will tell you what
to use.
Network?  Many 10 Gbit cards with bonding (Linux module) in round-robin
or another good (working) load-balance mode.
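
For the bonding piece, a minimal sketch might be (interface names,
addresses, and the choice of balance-rr are only examples; whether
round-robin actually helps depends on your switches):

  # load the bonding driver in round-robin mode with link monitoring
  modprobe bonding mode=balance-rr miimon=100
  # bring up the bond and enslave two ports
  ifconfig bond0 192.168.10.1 netmask 255.255.255.0 up
  ifenslave bond0 eth0 eth1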

2011/2/17 Roberto Spadim <roberto@spadim.com.br>:
> with more network cards = more network Gbps
> with better (faster) RAM = more disk reads (from cache)
> with more RAID 0/4/5/6 = more speed on disk reads
> with more RAID 1 mirrors = more security
> with more SAS/SATA/RAID controllers = more GB/TB of storage
> with more of anything ~= more money
> Just decide what numbers you want and make it work.
>
> 2011/2/17 John Robinson <john.robinson@anonymous.org.uk>:
>> [...]
>
>
>
> --
> Roberto Spadim
> Spadim Technology / SPAEmpresarial
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17 11:07 ` John Robinson
  2011-02-17 13:36   ` Roberto Spadim
@ 2011-02-17 21:47   ` Stan Hoeppner
  2011-02-17 22:13     ` Joe Landman
  1 sibling, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-17 21:47 UTC (permalink / raw)
  To: John Robinson; +Cc: Matt Garman, Mdadm

John Robinson put forth on 2/17/2011 5:07 AM:
> On 14/02/2011 23:59, Matt Garman wrote:
> [...]
>> The requirement is basically this: around 40 to 50 compute machines
>> act as basically an ad-hoc scientific compute/simulation/analysis
>> cluster.  These machines all need access to a shared 20 TB pool of
>> storage.  Each compute machine has a gigabit network connection, and
>> it's possible that nearly every machine could simultaneously try to
>> access a large (100 to 1000 MB) file in the storage pool.  In other
>> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
> 
> I'd recommend you analyse that requirement more closely. Yes, you have
> 50 compute machines with GigE connections so it's possible they could
> all demand data from the file store at once, but in actual use, would they?

This is a very good point and one which I somewhat ignored in my initial
response, making a silent assumption.  I did so based on personal
experience, and knowledge of what other sites are deploying.

You don't see many deployed filers on the planet with 5 * 10 GbE front
end connections.  In fact, today, you still don't see many deployed
filers with even one 10 GbE front end connection, but usually multiple
(often but not always bonded) GbE connections.

A single 10 GbE front end connection provides a truly enormous amount of
real world bandwidth, over 1 GB/s aggregate sustained.  *This is
equivalent to transferring a full length dual layer DVD in 10 seconds*

Few sites/applications actually need this kind of bandwidth, either
burst or sustained.  But, this is the system I spec'd for the OP
earlier.  Sometimes people get caught up in comparing raw bandwidth
numbers between different platforms and lose sight of the real world
performance they can get from any one of them.
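
As a quick sanity check on that claim (the 0.85 efficiency figure
below is an assumption for Ethernet/TCP/NFS overhead, not a
benchmark):

  # Back-of-the-envelope check of what one 10 GbE port can deliver
  # and how long a dual-layer DVD (8.5 GB) takes at that rate.  The
  # efficiency factor is an assumption, not a measured number.
  line_rate_gbps = 10.0
  efficiency = 0.85
  usable_gb_per_s = line_rate_gbps / 8.0 * efficiency   # ~1.06 GB/s
  dvd_gb = 8.5
  print("usable throughput: %.2f GB/s" % usable_gb_per_s)
  print("dual-layer DVD:    %.1f seconds" % (dvd_gb / usable_gb_per_s))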

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17 21:47   ` Stan Hoeppner
@ 2011-02-17 22:13     ` Joe Landman
  2011-02-17 23:49       ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Joe Landman @ 2011-02-17 22:13 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: John Robinson, Matt Garman, Mdadm

On 02/17/2011 04:47 PM, Stan Hoeppner wrote:
> John Robinson put forth on 2/17/2011 5:07 AM:
>> On 14/02/2011 23:59, Matt Garman wrote:
>> [...]
>>> The requirement is basically this: around 40 to 50 compute machines
>>> act as basically an ad-hoc scientific compute/simulation/analysis
>>> cluster.  These machines all need access to a shared 20 TB pool of
>>> storage.  Each compute machine has a gigabit network connection, and
>>> it's possible that nearly every machine could simultaneously try to
>>> access a large (100 to 1000 MB) file in the storage pool.  In other
>>> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>>
>> I'd recommend you analyse that requirement more closely. Yes, you have
>> 50 compute machines with GigE connections so it's possible they could
>> all demand data from the file store at once, but in actual use, would they?
>
> This is a very good point and one which I somewhat ignored in my initial
> response, making a silent assumption.  I did so based on personal
> experience, and knowledge of what other sites are deploying.

Well, the application area appears to be high performance cluster 
computing, and the storage behind it.  It's a somewhat more specialized 
version of storage, and not one that a typical IT person runs into 
often.  There are different, some profoundly so, demands placed upon 
such storage.

Full disclosure:  this is our major market, we make/sell products in 
this space, have for a while.  Take what we say with that in your mind 
as a caveat, as it does color our opinions.

The specs as stated, 50Gb/s ... it's rare ... exceptionally rare ... 
that you ever see cluster computing storage requirements stated in such 
terms.  Usually they are stated in the MB/s or GB/s regime.  Using a 
basic conversion of Gb/s to GB/s, the OP is looking for ~6GB/s support.

Some basic facts about this.

Fibre channel (FC-8 in particular), will give you, at best 1GB/s per 
loop, and that presumes you aren't oversubscribing the loop.  The vast 
majority of designs we see coming from IT shops, do, in fact, badly 
oversubscribe the bandwidth, which causes significant contention on the 
loops.  The Nexsan unit you indicated (they are nominally a competitor 
of ours) is an FC device, though we've heard rumblings that they may 
even allow for SAS direct connections (though that would be quite cost 
ineffective as a SAS JBOD chassis compared to other units, and you still 
have the oversubscription problem).
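
A crude way to check a design for loop oversubscription; the usable
FC-8 bandwidth and the per-drive streaming rate below are assumptions,
adjust them for the actual drives and fabric:

  # Crude oversubscription check for one FC-8 loop: compare the
  # combined streaming capability of the drives behind it with what
  # the loop can carry.  Both figures below are assumptions.
  fc8_usable_mb_s = 800.0       # ~1 GB/s raw, less after encoding/protocol
  drive_stream_mb_s = 120.0     # one drive, sustained streaming

  def oversubscription(n_drives):
      return n_drives * drive_stream_mb_s / fc8_usable_mb_s

  for n in (8, 16, 42):
      print("%2d drives on the loop -> %.1f:1 oversubscription"
            % (n, oversubscription(n)))

Anything much over 1:1 means the drives can collectively saturate the
loop under streaming load.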

As I said, high performance storage design is a very ... very ... 
different animal from standard IT storage design.  There are very 
different decision points, and design concepts.

> You don't see many deployed filers on the planet with 5 * 10 GbE front
> end connections.  In fact, today, you still don't see many deployed
> filers with even one 10 GbE front end connection, but usually multiple
> (often but not always bonded) GbE connections.

In this space, high performance cluster storage, this statement is 
incorrect.

Our units (again, not trying to be a commercial here, see .sig if you 
want to converse offline) usually ship with either 2x 10GbE, 2x QDR IB, 
or combinations of these.  QDR IB gets you 3.2 GB/s.  Per port.

In high performance computing storage (again, the focus of the OP's 
questions), this is a reasonable configuration and request.
>
> A single 10 GbE front end connection provides a truly enormous amount of
> real world bandwidth, over 1 GB/s aggregate sustained.  *This is
> equivalent to transferring a full length dual layer DVD in 10 seconds*

Trust me.  This is not *enormous*.  Well, ok ... put another way, we 
architect systems that scale well beyond 10GB/s sustained.  We have nice 
TB sprints and similar sorts of "drag racing" as I call them (c.f. 
http://scalability.org/?p=2912 http://scalability.org/?p=2356 
http://scalability.org/?p=2165  http://scalability.org/?p=1980 
http://scalability.org/?p=1756 )

1 GB/s is nothing magical.  Again, not a commercial, but our DeltaV 
units, running MD raid, achieve 850-900MB/s (0.85-0.9 GB/s) for RAID6.

To get good (great) performance you have to start out with a good 
(great) design.  One that will really optimize the performance on a per 
unit basis.

> Few sites/applications actually need this kind of bandwidth, either
> burst or sustained.  But, this is the system I spec'd for the OP
> earlier.  Sometimes people get caught up in comparing raw bandwidth
> numbers between different platforms and lose sight of the real world
> performance they can get from any one of them.

The sad part is that we often wind up fighting against others' 
"marketing numbers".  Our real benchmarks are often comparable to their 
"strong wind at the back" numbers.  Heck, our MD raid numbers are often 
better than others' hardware RAID numbers.

Theoretical bandwidth from the marketing docs doesn't matter.  The only 
thing that does matter is having a sound design and implementation at 
all levels.  This is why we do what we do, and why we do use MD raid.

Regards,

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17 22:13     ` Joe Landman
@ 2011-02-17 23:49       ` Stan Hoeppner
  2011-02-18  0:06         ` Joe Landman
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-17 23:49 UTC (permalink / raw)
  To: Joe Landman; +Cc: John Robinson, Matt Garman, Mdadm

Joe Landman put forth on 2/17/2011 4:13 PM:

> Well, the application area appears to be high performance cluster
> computing, and the storage behind it.  Its a somewhat more specialized
> version of storage, and not one that a typical IT person runs into
> often.  There are different, some profoundly so, demands placed upon
> such storage.

The OP's post described an ad hoc collection of 40-50 machines doing
various types of processing on shared data files.  This is not classical
cluster computing.  He didn't describe any kind of _parallel_
processing.  It sounded to me like staged batch processing, the
bandwidth demands of which are typically much lower than a parallel
compute cluster.

> Full disclosure:  this is our major market, we make/sell products in
> this space, have for a while.  Take what we say with that in your mind
> as a caveat, as it does color our opinions.

Thanks for the disclosure Joe.

> The spec's as stated, 50Gb/s ... its rare ... exceptionally rare ...
> that you ever see cluster computing storage requirements stated in such
> terms.  Usually they are stated in the MB/s or GB/s regime.  Using  a
> basic conversion of Gb/s to GB/s, the OP is looking for ~6GB/s support.

Indeed.  You typically don't see this kind of storage b/w need outside
the government labs and supercomputing centers (LLNL, Sandia, NCCS,
SDSC, etc).  Of course those sites' requirements are quite a bit higher
than a "puny" 6 GB/s.

> Some basic facts about this.
> 
> Fibre channel (FC-8 in particular), will give you, at best 1GB/s per
> loop, and that presumes you aren't oversubscribing the loop.  The vast
> majority of designs we see coming from IT shops, do, in fact, badly
> oversubscribe the bandwidth, which causes significant contention on the
> loops.  

Who is still doing loops on the front end?  Front end loops died many
years ago with the introduction of switches from Brocade, Qlogic,
McData, etc.  I've not heard of a front end loop being used in many many
years.  Some storage vendors still use loops on the _back_ end to
connect FC/SAS/SATA expansion chassis to the head controller, IBM and
NetApp come to mind, but it's usually dual loops per chassis, so you're
looking at ~3 GB/s per expansion chassis using 8 Gbit loops.  One would
be hard pressed to over subscribe such a system as most of these are
sold with multiple chassis.  And for systems such as the IBMs and
NetApps, you can get anywhere from 4-32 front end ports of 8 Gbit FC or
10 GbE.  In the IBM case you're limited to block access, whereas the
NetApp will do both block and file.

> The Nexsan unit you indicated (they are nominally a competitor
> of ours) is an FC device, though we've heard rumblings that they may
> even allow for SAS direct connections (though that would be quite cost
> ineffective as a SAS JBOD chassis compared to other units, and you still
> have the oversubscription problem).

Nexsan doesn't offer direct SAS connection on the big 42/102 drive Beast
units, only on the Boy units.  The Beast units all use dual or quad FC
front end ports, with a couple front end GbE iSCSI ports thrown in for
flexibility.  The SAS Boy units beat all competitors on price/TB, as do
all the Nexsan products.

I'd like to note that over subscription isn't intrinsic to a piece of
hardware.  It's indicative of an engineer or storage architect not
knowing what the blank he's doing.

> As I said, high performance storage design is a very ... very ...
> different animal from standard IT storage design.  There are very
> different decision points, and design concepts.

Depends on the segment of the HPC market.  It seems you're competing in
the low end of it.  Configurations get a bit exotic at the very high
end.  It also depends on what HPC storage tier you're looking at, and
the application area.  For pure parallel computing sites such as NCCS,
NCSA, PSSC, etc your storage infrastructure and the manner in which it
is accessed is going to be quite different than some of the NASA
sponsored projects, such as the Spitzer telescope project being handled
by Caltech.  The first will have persistent parallel data writing from
simulation runs across many hundreds or thousands of nodes.  The second
will have massive streaming writes as the telescope streams data in real
time to a ground station.  Then this data will be staged and processed
with massive streaming writes.

So, again, it really depends on the application(s), as always,
regardless of whether it's HPC or IT, although there are few purely
streaming IT workloads; ETL of decision support databases comes to mind,
but these are usually relatively short duration.  They can still put
some strain on a SAN if not architected correctly.

>> You don't see many deployed filers on the planet with 5 * 10 GbE front
>> end connections.  In fact, today, you still don't see many deployed
>> filers with even one 10 GbE front end connection, but usually multiple
>> (often but not always bonded) GbE connections.
> 
> In this space, high performance cluster storage, this statement is
> incorrect.

The OP doesn't have a high performance cluster.  HPC cluster storage by
accepted definition includes highly parallel workloads.  This is not
what the OP described.  He described ad hoc staged data analysis.

> In high performance computing storage (again, the focus of the OP's
> questions), this is a reasonable configuration and request.

Again, I disagree.  See above.

>> A single 10 GbE front end connection provides a truly enormous amount of
>> real world bandwidth, over 1 GB/s aggregate sustained.  *This is
>> equivalent to transferring a full length dual layer DVD in 10 seconds*
> 
> Trust me.  This is not *enormous*.  Well, ok ... put another way, we

Given that the OP has nothing right now, this is *enormous* bandwidth.
It would surely meet his needs.  For the vast majority of
workloads/environments, 1GB/s sustained is enormous.  Sure, there are
environments that may need more, but those folks aren't typically going
to be asking for architecture assistance on this, or any other mailing
list. ;)

> 1 GB/s is nothing magical.  Again, not a commercial, but our DeltaV
> units, running MD raid, achieve 850-900MB/s (0.85-0.9 GB/s) for RAID6.

1 GB/s sustained random I/O is a bit magical, for many many
sites/applications.  I'm betting the 850-900MB/s RAID6 you quote is a
streaming read, yes?  What does that box peak at with a mixed random I/O
workload from 40-50 clients?

> To get good (great) performance you have to start out with a good
> (great) design.  One that will really optimize the performance on a per
> unit basis.

Blah blah.  You're marketing too much at this point. :)

> The sad part is that we often wind up fighting against others "marketing
> numbers".  Our real benchmarks are often comparable to their "strong
> wind a the back" numbers.  Heck, our MD raid numbers often are better
> than others hardware RAID numbers.

And they're all on paper.  It was great back in the day when vendors
would drop off an eval unit free of charge and let you bang on it for a
month.  Today, there are too many players, and margins are too small, for
most companies to have the motivation to do this.  Today you're invited
to the vendor to watch them run the hardware through a demo, which has
little bearing on your workload.  For a small firm like yours I'm
guessing it would be impossible to deploy eval units in any numbers due
to capitalization issues.

> Theoretical bandwidth from the marketing docs doesn't matter.  The only

This is always the case.  Which is one reason why certain trade mags are
still read--almost decent product reviews.

> thing that does matter is having a sound design and implementation at
> all levels.  This is why we do what we do, and why we do use MD raid.

No argument here.  This is one reason why some quality VARs/integrators
are unsung heroes in some quarters.  There is a plethora of fantastic
gear on the market today, from servers to storage to networking gear.
One could buy the best $$ products available and still get crappy
performance if it's not integrated properly, from the cabling to the
firmware to the application.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-17 23:49       ` Stan Hoeppner
@ 2011-02-18  0:06         ` Joe Landman
  2011-02-18  3:48           ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Joe Landman @ 2011-02-18  0:06 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: John Robinson, Matt Garman, Mdadm

On 2/17/2011 6:49 PM, Stan Hoeppner wrote:
> Joe Landman put forth on 2/17/2011 4:13 PM:
>
>> Well, the application area appears to be high performance cluster
>> computing, and the storage behind it.  Its a somewhat more specialized
>> version of storage, and not one that a typical IT person runs into
>> often.  There are different, some profoundly so, demands placed upon
>> such storage.
>
> The OP's post described an ad hoc collection of 40-50 machines doing
> various types of processing on shared data files.  This is not classical
> cluster computing.  He didn't describe any kind of _parallel_
> processing.  It sounded to me like staged batch processing, the

Semantics at best.  He is doing significant processing, in parallel, 
doing data analysis, in parallel, across a cluster of machines.  Doing 
MPI-IO?  No.  Does not using MPI make this not a cluster?  No.

> bandwidth demands of which are typically much lower than a parallel
> compute cluster.

See his original post.  He posits his bandwidth demands.

>
>> Full disclosure:  this is our major market, we make/sell products in
>> this space, have for a while.  Take what we say with that in your mind
>> as a caveat, as it does color our opinions.
>
> Thanks for the disclosure Joe.
>
>> The spec's as stated, 50Gb/s ... its rare ... exceptionally rare ...
>> that you ever see cluster computing storage requirements stated in such
>> terms.  Usually they are stated in the MB/s or GB/s regime.  Using  a
>> basic conversion of Gb/s to GB/s, the OP is looking for ~6GB/s support.
>
> Indeed.  You typically don't see this kind of storage b/w need outside
> the government labs and supercomputing centers (LLNL, Sandia, NCCS,
> SDSC, etc).  Of course those sites' requirements are quite a bit higher
> than a "puny" 6 GB/s.

Heh ... we see it all the time in compute clusters, large data analysis 
farms, etc.  Not at the big labs.

[...]

> McData, etc.  I've not hard of a front end loop being used in many many
> years.  Some storage vendors still use loops on the _back_ end to
> connect FC/SAS/SATA expansion chassis to the head controller, IBM and

I am talking about the back end.

> NetApp come to mind, but it's usually dual loops per chassis, so you're
> looking at ~3 GB/s per expansion chassis using 8 Gbit loops.  One would

2 GB/s assuming FC-8, and 20 lower speed drives are sufficient to 
completely fill 2 GB/s.  So, as I was saying, the design matters.

[...]

> Nexsan doesn't offer direct SAS connection on the big 42/102 drive Beast
> units, only on the Boy units.  The Beast units all use dual or quad FC
> front end ports, with a couple front end GbE iSCSI ports thrown in for
> flexibility.  The SAS Boy units beat all competitors on price/TB, as do
> all the Nexsan products.

As I joked one time, many many years ago "broad sweeping generalizations 
tend to be incorrect".  Yes, it is a recursive joke, but there is a 
serious aspect to it.  Your proffered pricing per TB, which you claim 
Nexsan beats all ... is much higher than ours, and many others.  No, 
they don't beat all, or even many.


> I'd like to note that over subscription isn't intrinsic to a piece of
> hardware.  It's indicative of an engineer or storage architect not
> knowing what the blank he's doing.

Oversubscription and its corresponding resource contention, not to 
mention poor design of other aspects ... yeah, I agree that this is 
indicative of something.  One must question why people continue to 
deploy architectures which don't scale.

>
>> As I said, high performance storage design is a very ... very ...
>> different animal from standard IT storage design.  There are very
>> different decision points, and design concepts.
>
> Depends on the segment of the HPC market.  It seems you're competing in
> the low end of it.  Configurations get a bit exotic at the very high

I noted this about your previous responses, this particular tone you 
take.  I debated for a while responding, until I saw something I simply 
needed to correct.  I'll try not to take your bait.

[...]

> So, again, it really depends on the application(s), as always,
> regardless of whether it's HPC or IT, although there are few purely
> streaming IT workloads, EDL of decision support databases comes to mind,
> but these are usually relatively short duration.  They can still put
> some strain on a SAN if not architected correctly.
>
>>> You don't see many deployed filers on the planet with 5 * 10 GbE front
>>> end connections.  In fact, today, you still don't see many deployed
>>> filers with even one 10 GbE front end connection, but usually multiple
>>> (often but not always bonded) GbE connections.
>>
>> In this space, high performance cluster storage, this statement is
>> incorrect.
>
> The OP doesn't have a high performance cluster.  HPC cluster storage by

Again, semantics.  They are doing massive data ingestion and 
processing.  The view of this is called "big data" in HPC circles and 
it is *very much* an HPC problem.

> accepted definition includes highly parallel workloads.  This is not
> what the OP described.  He described ad hoc staged data analysis.

See above.  If you want to argue semantics, be my guest, I won't be 
party to such a waste of time.  The OP is doing analysis that requires a 
high performance architecture.  The architecture you suggested is not 
one people in the field would likely recommend.

[rest deleted]


--
joe


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-18  0:06         ` Joe Landman
@ 2011-02-18  3:48           ` Stan Hoeppner
  0 siblings, 0 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-18  3:48 UTC (permalink / raw)
  To: Joe Landman; +Cc: John Robinson, Matt Garman, Mdadm

Joe Landman put forth on 2/17/2011 6:06 PM:

> See above.  If you want to argue semantics, be my guest, I won't be
> party to such a waste of time.  The OP is doing analysis that requires a
> high performance architecture.  The architecture you suggested is not
> one people in the field would likely recommend.

We don't actually know what the OP's needs are at this point.  Any
suggestion is an educated guess.  I clearly stated mine was such.

The OP simply multiplied the quantity of his client hosts' interfaces by
their link speed and posted that as his "requirement", which is where
the 50Gb/s figure came from.  IIRC, he posted that as more of a question
than a statement.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-14 23:59 high throughput storage server? Matt Garman
                   ` (3 preceding siblings ...)
  2011-02-17 11:07 ` John Robinson
@ 2011-02-18 13:49 ` Mattias Wadenstein
  2011-02-18 23:16   ` Stan Hoeppner
  2011-02-19  0:24   ` Joe Landman
  4 siblings, 2 replies; 116+ messages in thread
From: Mattias Wadenstein @ 2011-02-18 13:49 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm

On Mon, 14 Feb 2011, Matt Garman wrote:

> For many years, I have been using Linux software RAID at home for a
> simple NAS system.  Now at work, we are looking at buying a massive,
> high-throughput storage system (e.g. a SAN).  I have little
> familiarity with these kinds of pre-built, vendor-supplied solutions.
> I just started talking to a vendor, and the prices are extremely high.
>
> So I got to thinking, perhaps I could build an adequate device for
> significantly less cost using Linux.  The problem is, the requirements
> for such a system are significantly higher than my home media server,
> and put me into unfamiliar territory (in terms of both hardware and
> software configuration).
>
> The requirement is basically this: around 40 to 50 compute machines
> act as basically an ad-hoc scientific compute/simulation/analysis
> cluster.  These machines all need access to a shared 20 TB pool of
> storage.  Each compute machine has a gigabit network connection, and
> it's possible that nearly every machine could simultaneously try to
> access a large (100 to 1000 MB) file in the storage pool.  In other
> words, a 20 TB file store with bandwidth upwards of 50 Gbps.
>
> I was wondering if anyone on the list has built something similar to
> this using off-the-shelf hardware (and Linux of course)?

Well, this seems fairly close to the LHC data analysis case, or HPC usage 
in general, both of which I'm rather familiar with.

> My initial thoughts/questions are:
>
>    (1) We need lots of spindles (i.e. many small disks rather than
> few big disks).  How do you compute disk throughput when there are
> multiple consumers?  Most manufacturers provide specs on their drives
> such as sustained linear read throughput.  But how is that number
> affected when there are multiple processes simultanesously trying to
> access different data?  Is the sustained bulk read throughput value
> inversely proportional to the number of consumers?  (E.g. 100 MB/s
> drive only does 33 MB/s w/three consumers.)  Or is there are more
> specific way to estimate this?

This is tricky. In general there isn't a good way of estimating this, 
because so much about this involves the way your load interacts with 
IO-scheduling in both Linux and (if you use them) raid controllers, etc.

The actual IO pattern of your workload is probably the biggest factor 
here, determining both if readahead will give any benefits, as well as how 
much sequential IO can be done as opposed to just seeking.
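
If you want a rough model anyway, the sketch below shows why
per-consumer throughput usually drops faster than 1/N once seeks enter
the picture.  All the numbers in it are illustrative assumptions, not
measurements:

  # Very rough model of one drive shared by N consumers doing largish
  # sequential reads.  Switching between consumers costs a seek, so
  # the aggregate drops below the streaming rate and each consumer
  # gets less than 1/N of it.
  streaming_mb_s = 100.0    # sustained streaming rate of the drive
  seek_ms = 12.0            # average seek + rotational latency
  chunk_mb = 1.0            # data read per consumer before switching

  def per_consumer_mb_s(n_consumers):
      transfer_s = chunk_mb / streaming_mb_s
      seek_s = seek_ms / 1000.0 if n_consumers > 1 else 0.0
      aggregate = chunk_mb / (transfer_s + seek_s)
      return aggregate / n_consumers

  for n in (1, 3, 10):
      print("%2d consumers: ~%.1f MB/s each" % (n, per_consumer_mb_s(n)))

Bigger readahead (larger effective chunks) pushes the aggregate back
towards the streaming rate, which is why the IO pattern and the
scheduler matter so much.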

>    (2) The big storage server(s) need to connect to the network via
> multiple bonded Gigabit ethernet, or something faster like
> FibreChannel or 10 GbE.  That seems pretty straightforward.

I'd also look at the option of many small&cheap servers, especially if the 
load is spread out fairly evenly over the filesets.

>    (3) This will probably require multiple servers connected together
> somehow and presented to the compute machines as one big data store.
> This is where I really don't know much of anything.  I did a quick
> "back of the envelope" spec for a system with 24 600 GB 15k SAS drives
> (based on the observation that 24-bay rackmount enclosures seem to be
> fairly common).  Such a system would only provide 7.2 TB of storage
> using a scheme like RAID-10.  So how could two or three of these
> servers be "chained" together and look like a single large data pool
> to the analysis machines?

Here you would either maintain a large list of nfs mounts for the read 
load, or start looking at a distributed filesystem. Sticking them all into 
one big fileserver is easier on the administration part, but quickly gets 
really expensive when you look to put multiple 10GE interfaces on it.

If the load is almost all read and seldom updated, and you can afford the 
time to manually layout data files over the servers, the nfs mounts option 
might work well for you. If the analysis cluster also creates files here 
and there you might need a parallel filesystem.

2U machines with 12 3.5" or 16-24 2.5" hdd slots can be gotten pretty 
cheaply. Add a quad-gige card if your workload gets decent sequential throughput, 
or look at fast/ssd 2.5" drives if you are mostly short random reads. Then 
add as many as you need to sustain the analysis speed you need. The 
advantage here is that this is really scalable, if you double the number 
of servers you get at least twice the IO capacity.
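
For a first pass at sizing such a scale-out setup (the per-server
sustained rate below is an assumed figure; measure one box under your
real IO pattern before multiplying):

  import math

  # First-pass sizing of a scale-out setup: how many cheap file
  # servers for a given aggregate rate.  The per-server figure is an
  # assumption, not a measurement.
  target_mb_s = 2000.0        # roughly the ~50 Gbps of client demand
  per_server_mb_s = 400.0     # one 2U box under a mixed read load (assumed)

  print("servers needed: %d" % math.ceil(target_mb_s / per_server_mb_s))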

Oh, yet another setup I've seen is adding some (2-4) fast disks to each 
of the analysis machines and then running a distributed replicated 
filesystem like hadoop over them.

/Mattias Wadenstein

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-18 13:49 ` Mattias Wadenstein
@ 2011-02-18 23:16   ` Stan Hoeppner
  2011-02-21 10:25     ` Mattias Wadenstein
  2011-02-19  0:24   ` Joe Landman
  1 sibling, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-18 23:16 UTC (permalink / raw)
  To: Linux RAID

Mattias Wadenstein put forth on 2/18/2011 7:49 AM:

> Here you would either maintain a large list of nfs mounts for the read
> load, or start looking at a distributed filesystem. Sticking them all
> into one big fileserver is easier on the administration part, but
> quickly gets really expensive when you look to put multiple 10GE
> interfaces on it.

This really depends on one's definition of "really expensive".  Taking
the total cost of such a system/infrastructure into account, these two
Intel dual port 10 GbE NICs seem rather cheap at $650-$750 USD:

http://www.newegg.com/Product/Product.aspx?Item=N82E16833106037
http://www.newegg.com/Product/Product.aspx?Item=N82E16833106075

20 Gb/s (40 both ways) raw/peak throughput at this price seems like a
bargain to me (plus the switch module cost obviously, if required,
usually not for RJ-45 or CX4, thus my motivation for mentioning these).

The storage infrastructure on the back end required to keep these pipes
full will be the "really expensive" piece.  With 40-50 NFS clients you
end up with a random read/write workload, as has been mentioned.  To
sustain 2 GB/s throughput (CRC+TCP+NFS+etc overhead limited) under such
random IO conditions is going to require something on the order of 24-30
15k SAS drives in a RAID 0 stripe, or 48-60 such drives in a RAID 10,
assuming something like 80-90% efficiency in your software or hardware
RAID engine.  To get this level of sustained random performance from the
Nexsan arrays you'd have to use 2 units as the controller hardware just
isn't fast enough.  This is also exactly why NetApp does good business
in the midrange segment--one unit does it all, including block and file.
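
The arithmetic behind those spindle counts, roughly; the per-drive
random throughput and the RAID engine efficiency are the assumptions
that matter:

  import math

  # Rough spindle count for ~2 GB/s of largely random reads.  The
  # per-drive figure and the RAID engine efficiency are assumptions;
  # real numbers depend on IO size, queue depth and the implementation.
  target_mb_s = 2000.0
  per_drive_random_mb_s = 80.0    # a 15k SAS drive on large random reads
  raid_efficiency = 0.85

  stripe_drives = math.ceil(target_mb_s /
                            (per_drive_random_mb_s * raid_efficiency))
  print("RAID 0 stripe: ~%d drives" % stripe_drives)
  print("RAID 10:       ~%d drives" % (2 * stripe_drives))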

RAID 5/6 need not apply due to the abysmal RMW partial stripe write
penalty, unless of course you're doing almost no writes.  But in that
case, how did the data get there in the first place? :)

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-18 13:49 ` Mattias Wadenstein
  2011-02-18 23:16   ` Stan Hoeppner
@ 2011-02-19  0:24   ` Joe Landman
  2011-02-21 10:04     ` Mattias Wadenstein
  1 sibling, 1 reply; 116+ messages in thread
From: Joe Landman @ 2011-02-19  0:24 UTC (permalink / raw)
  To: Mattias Wadenstein; +Cc: Matt Garman, Mdadm

On 02/18/2011 08:49 AM, Mattias Wadenstein wrote:
> On Mon, 14 Feb 2011, Matt Garman wrote:

[...]

>> I was wondering if anyone on the list has built something similar to
>> this using off-the-shelf hardware (and Linux of course)?
>
> Well, this seems fairly close to the LHC data analysis case, or HPC
> usage in general, both of which I'm rather familiar with.

It's similar to many HPC workloads dealing with large data sets.  There's 
nothing unusual about this in the HPC world.

>
>> My initial thoughts/questions are:
>>
>> (1) We need lots of spindles (i.e. many small disks rather than
>> few big disks). How do you compute disk throughput when there are
>> multiple consumers? Most manufacturers provide specs on their drives
>> such as sustained linear read throughput. But how is that number
>> affected when there are multiple processes simultanesously trying to
>> access different data? Is the sustained bulk read throughput value
>> inversely proportional to the number of consumers? (E.g. 100 MB/s
>> drive only does 33 MB/s w/three consumers.) Or is there are more
>> specific way to estimate this?
>
> This is tricky. In general there isn't a good way of estimating this,
> because so much about this involves the way your load interacts with
> IO-scheduling in both Linux and (if you use them) raid controllers, etc.
>
> The actual IO pattern of your workload is probably the biggest factor
> here, determining both if readahead will give any benefits, as well as
> how much sequential IO can be done as opposed to just seeking.

Absolutely.

Good real-time data can be had from a number of tools: collectl, 
iostat, sar, etc.  I personally like atop for the "dashboard"-like 
view.  Collectl and others can get you even more data that you can analyze.

>
>> (2) The big storage server(s) need to connect to the network via
>> multiple bonded Gigabit ethernet, or something faster like
>> FibreChannel or 10 GbE. That seems pretty straightforward.
>
> I'd also look at the option of many small&cheap servers, especially if
> the load is spread out fairly even over the filesets.

Here is where things like GlusterFS and FhGFS shine.  When Ceph firms up 
you can use this.  Happily all of these do run atop an MD raid device 
(to tie into the list).

>> (3) This will probably require multiple servers connected together
>> somehow and presented to the compute machines as one big data store.
>> This is where I really don't know much of anything. I did a quick
>> "back of the envelope" spec for a system with 24 600 GB 15k SAS drives
>> (based on the observation that 24-bay rackmount enclosures seem to be
>> fairly common). Such a system would only provide 7.2 TB of storage
>> using a scheme like RAID-10. So how could two or three of these
>> servers be "chained" together and look like a single large data pool
>> to the analysis machines?
>
> Here you would either maintain a large list of nfs mounts for the read
> load, or start looking at a distributed filesystem. Sticking them all
> into one big fileserver is easier on the administration part, but
> quickly gets really expensive when you look to put multiple 10GE
> interfaces on it.
>
> If the load is almost all read and seldom updated, and you can afford
> the time to manually layout data files over the servers, the nfs mounts
> option might work well for you. If the analysis cluster also creates
> files here and there you might need a parallel filesystem.

One of the nicer aspects of GlusterFS in this context is that it 
provides an NFS compatible server that NFS clients can connect to.  Some 
things aren't supported right now in the current release, but I 
anticipate they will be soon.

Moreover, with the distribute mode, it will do a reasonable job of 
distributing the files among the nodes.  Sort of like the nfs layout 
model, but with a "random" distribution.  This should be, on average, 
reasonably good.

>
> 2U machines with 12 3.5" or 16-24 2.5" hdd slots can be gotten pretty
> cheaply. Add a quad-gige card if your load can get decent sequential
> load, or look at fast/ssd 2.5" drives if you are mostly short random
> reads. Then add as many as you need to sustain the analysis speed you
> need. The advantage here is that this is really scalable, if you double
> the number of servers you get at least twice the IO capacity.
>
> Oh, yet another setup I've seen is adding a some (2-4) fast disks to
> each of the analysis machines and then running a distributed replicated
> filesystem like hadoop over them.

Ugh ... short-stroking drives or using SSDs?  Quite cost-inefficient for 
this work.  And given the HPC nature of the problem, it's probably a good 
idea to aim for something more cost-efficient.

This said, I'd recommend at least looking at GlusterFS.  Put it atop an 
MD raid (6 or 10), and you should be in pretty good shape with the right 
network design.  That is, as long as you don't use a bad SATA/SAS HBA.

Joe
-- 
Joe Landman
landman@scalableinformatics.com

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-19  0:24   ` Joe Landman
@ 2011-02-21 10:04     ` Mattias Wadenstein
  0 siblings, 0 replies; 116+ messages in thread
From: Mattias Wadenstein @ 2011-02-21 10:04 UTC (permalink / raw)
  To: Joe Landman; +Cc: Matt Garman, Mdadm

On Fri, 18 Feb 2011, Joe Landman wrote:

> On 02/18/2011 08:49 AM, Mattias Wadenstein wrote:
> [...]
>> 2U machines with 12 3.5" or 16-24 2.5" hdd slots can be gotten pretty
>> cheaply. Add a quad-gige card if your load can get decent sequential
>> load, or look at fast/ssd 2.5" drives if you are mostly short random
>> reads. Then add as many as you need to sustain the analysis speed you
>> need. The advantage here is that this is really scalable, if you double
>> the number of servers you get at least twice the IO capacity.
>> 
>> Oh, yet another setup I've seen is adding a some (2-4) fast disks to
>> each of the analysis machines and then running a distributed replicated
>> filesystem like hadoop over them.
>
> Ugh ... short-stroking drives or using SSDs?  Quite cost-inefficient for this 
> work.  And given the HPC nature of the problem, its probably a good idea to 
> aim for more cost-efficient.

Or just regular fairly slow sata drives. The advantage being that it is 
really cheap to get to 100-200 spindles this way, so you might not need 
very fast disks. It depends on your IO pattern, but for the LHC data 
analysis this has been shown to be surprisingly fast.

/Mattias Wadenstein

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-18 23:16   ` Stan Hoeppner
@ 2011-02-21 10:25     ` Mattias Wadenstein
  2011-02-21 21:51       ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Mattias Wadenstein @ 2011-02-21 10:25 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Linux RAID

On Fri, 18 Feb 2011, Stan Hoeppner wrote:

> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>
>> Here you would either maintain a large list of nfs mounts for the read
>> load, or start looking at a distributed filesystem. Sticking them all
>> into one big fileserver is easier on the administration part, but
>> quickly gets really expensive when you look to put multiple 10GE
>> interfaces on it.
>
> This really depends on one's definition of "really expensive".  Taking
> the total cost of such a system/infrastructure into account, these two
> Intel dual port 10 GbE NICs seem rather cheap at $650-$750 USD:
>
> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106037
> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106075
>
> 20 Gb/s (40 both ways) raw/peak throughput at this price seems like a
> bargain to me (plus the switch module cost obviously, if required,
> usually not for RJ-45 or CX4, thus my motivation for mentioning these).
>
> The storage infrastructure on the back end required to keep these pipes
> full will be the "really expensive" piece.

Exactly my point: a storage server that can sustain 20-200MB/s is rather 
cheap, but one that can sustain 2GB/s is really expensive. Possibly to the 
point where 10-100 smaller file servers are much cheaper. The worst case 
here is very small random reads, and then you're screwed cost-wise 
whatever you choose, if you want to get the 2GB/s number.

[snip]

> RAID 5/6 need not apply due the abysmal RMW partial stripe write
> penalty, unless of course you're doing almost no writes.  But in that
> case, how did the data get there in the first place? :)

Actually, that's probably the common case for data analysis load. Lots of 
random reads, but only occasional sequential writes when you add a new 
file/fileset. So raid 5/6 performance-wise works out pretty much as a 
stripe of n-[12] disks.

/Mattias Wadenstein

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-21 10:25     ` Mattias Wadenstein
@ 2011-02-21 21:51       ` Stan Hoeppner
  2011-02-22  8:57         ` David Brown
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-21 21:51 UTC (permalink / raw)
  To: Mattias Wadenstein; +Cc: Linux RAID

Mattias Wadenstein put forth on 2/21/2011 4:25 AM:
> On Fri, 18 Feb 2011, Stan Hoeppner wrote:
> 
>> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>>
>>> Here you would either maintain a large list of nfs mounts for the read
>>> load, or start looking at a distributed filesystem. Sticking them all
>>> into one big fileserver is easier on the administration part, but
>>> quickly gets really expensive when you look to put multiple 10GE
>>> interfaces on it.
>>
>> This really depends on one's definition of "really expensive".  Taking
>> the total cost of such a system/infrastructure into account, these two
>> Intel dual port 10 GbE NICs seem rather cheap at $650-$750 USD:
>>
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106037
>> http://www.newegg.com/Product/Product.aspx?Item=N82E16833106075
>>
>> 20 Gb/s (40 both ways) raw/peak throughput at this price seems like a
>> bargain to me (plus the switch module cost obviously, if required,
>> usually not for RJ-45 or CX4, thus my motivation for mentioning these).
>>
>> The storage infrastructure on the back end required to keep these pipes
>> full will be the "really expensive" piece.
> 
> Exactly my point, a storage server that can sustain 20-200MB/s is rather
> cheap, but one that can sustain 2GB/s is really expensive. Possibly to
> the point where 10-100 smaller file servers are much cheaper. The worst
> case here is very small random reads, and then you're screwed cost-wise
> whatever you choose, if you want to get the 2GB/s number.

"Screwed" may be a bit harsh, but I agree that one big fast storage
server will usually cost more than many smaller ones with equal
aggregate performance.  But looking at this from a TCO standpoint, the
administrative burden is higher for the many small case, and file layout
can be problematic, specifically in the case where all analysis nodes
need to share a file or group of files.  This can create bottlenecks at
individual storage servers.  Thus, acquisition cost must be weighed
against operational costs.  If any of the data is persistent, backing up
a single server is straightforward.  Backing up multiple servers, and
restoring them if necessary, is more complicated.

>> RAID 5/6 need not apply due the abysmal RMW partial stripe write
>> penalty, unless of course you're doing almost no writes.  But in that
>> case, how did the data get there in the first place? :)

> Actually, that's probably the common case for data analysis load. Lots
> of random reads, but only occasional sequential writes when you add a
> new file/fileset. So raid 5/6 performance-wise works out pretty much as
> a stripe of n-[12] disks.

RAID5/6 have decent single streaming read performance, but sub optimal
random read, less than sub optimal streaming write, and abysmal random
write performance.  They exhibit poor random read performance with high
client counts when compared to RAID0 or RAID10.  Additionally, with an
analysis "cluster" designed for overall high utilization (no idle
nodes), one node will be uploading data sets while others are doing
analysis.  Thus you end up with a mixed simultaneous random read and
streaming write workload on the server.  RAID10 will give many times the
throughput in this case compared to RAID5/6, which will bog down rapidly
under such a workload.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-21 21:51       ` Stan Hoeppner
@ 2011-02-22  8:57         ` David Brown
  2011-02-22  9:30           ` Mattias Wadenstein
  2011-02-22 13:38           ` Stan Hoeppner
  0 siblings, 2 replies; 116+ messages in thread
From: David Brown @ 2011-02-22  8:57 UTC (permalink / raw)
  To: linux-raid

On 21/02/2011 22:51, Stan Hoeppner wrote:
> Mattias Wadenstein put forth on 2/21/2011 4:25 AM:
>> On Fri, 18 Feb 2011, Stan Hoeppner wrote:
>>
>>> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>>>
>>> RAID 5/6 need not apply due the abysmal RMW partial stripe write
>>> penalty, unless of course you're doing almost no writes.  But in that
>>> case, how did the data get there in the first place? :)
>
>> Actually, that's probably the common case for data analysis load. Lots
>> of random reads, but only occasional sequential writes when you add a
>> new file/fileset. So raid 5/6 performance-wise works out pretty much as
>> a stripe of n-[12] disks.
>
> RAID5/6 have decent single streaming read performance, but sub optimal
> random read, less than sub optimal streaming write, and abysmal random
> write performance.  They exhibit poor random read performance with high
> client counts when compared to RAID0 or RAID10.  Additionally, with an
> analysis "cluster" designed for overall high utilization (no idle
> nodes), one node will be uploading data sets while others are doing
> analysis.  Thus you end up with a mixed simultaneous random read and
> streaming write workload on the server.  RAID10 will give many times the
> throughput in this case compared to RAID5/6, which will bog down rapidly
> under such a workload.
>

I'm a little confused here.  It's easy to see why RAID5/6 have very poor 
random write performance - you need at least two reads and two writes 
for a single write access.  It's also easy to see that streaming reads 
will be good, as you can read from most of the disks in parallel.

However, I can't see that streaming writes would be so bad - you have to 
write slightly more than for a RAID0 write, since you have the parity 
data too, but the parity is calculated in advance without the need of 
any reads, and all the writes are in parallel.  So you get the streamed 
write performance of n-[12] disks.  Contrast this with RAID10 where you 
have to write out all data twice - you get the performance of n/2 disks.
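
Put as a small sketch, assuming ideal full-stripe writes and an
arbitrary per-disk rate:

  # Ideal streaming-write throughput of an n-disk array, ignoring any
  # controller or implementation quirks: RAID5 writes n-1 disks' worth
  # of data per full stripe, RAID6 n-2, RAID10 only n/2.  The per-disk
  # rate is an assumed figure.
  def streaming_write_mb_s(n_disks, level, per_disk_mb_s=120.0):
      effective = {"raid0": n_disks,
                   "raid5": n_disks - 1,
                   "raid6": n_disks - 2,
                   "raid10": n_disks / 2.0}[level]
      return effective * per_disk_mb_s

  for level in ("raid0", "raid5", "raid6", "raid10"):
      print("%-6s, 12 disks: %4.0f MB/s"
            % (level, streaming_write_mb_s(12, level)))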

I also cannot see why random reads would be bad - I would expect that to 
be of similar speed to a RAID0 setup.  The only exception would be if 
you've got atime enabled, and each random read was also causing a small 
write - then it would be terrible.

Or am I missing something here?


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-22  8:57         ` David Brown
@ 2011-02-22  9:30           ` Mattias Wadenstein
  2011-02-22  9:49             ` David Brown
  2011-02-22 13:38           ` Stan Hoeppner
  1 sibling, 1 reply; 116+ messages in thread
From: Mattias Wadenstein @ 2011-02-22  9:30 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On Tue, 22 Feb 2011, David Brown wrote:

> On 21/02/2011 22:51, Stan Hoeppner wrote:
>> Mattias Wadenstein put forth on 2/21/2011 4:25 AM:
>>> On Fri, 18 Feb 2011, Stan Hoeppner wrote:
>>> 
>>>> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>>>> 
>>>> RAID 5/6 need not apply due the abysmal RMW partial stripe write
>>>> penalty, unless of course you're doing almost no writes.  But in that
>>>> case, how did the data get there in the first place? :)
>> 
>>> Actually, that's probably the common case for data analysis load. Lots
>>> of random reads, but only occasional sequential writes when you add a
>>> new file/fileset. So raid 5/6 performance-wise works out pretty much as
>>> a stripe of n-[12] disks.
>> 
>> RAID5/6 have decent single streaming read performance, but sub optimal
>> random read, less than sub optimal streaming write, and abysmal random
>> write performance.  They exhibit poor random read performance with high
>> client counts when compared to RAID0 or RAID10.  Additionally, with an
>> analysis "cluster" designed for overall high utilization (no idle
>> nodes), one node will be uploading data sets while others are doing
>> analysis.  Thus you end up with a mixed simultaneous random read and
>> streaming write workload on the server.  RAID10 will give many times the
>> throughput in this case compared to RAID5/6, which will bog down rapidly
>> under such a workload.
>> 
>
> I'm a little confused here.  It's easy to see why RAID5/6 have very poor 
> random write performance - you need at least two reads and two writes for a 
> single write access.  It's also easy to see that streaming reads will be 
> good, as you can read from most of the disks in parallel.
>
> However, I can't see that streaming writes would be so bad - you have to 
> write slightly more than for a RAID0 write, since you have the parity data 
> too, but the parity is calculated in advance without the need of any reads, 
> and all the writes are in parallel.  So you get the streamed write 
> performance of n-[12] disks.  Contrast this with RAID10 where you have to 
> write out all data twice - you get the performance of n/2 disks.

It's fine as long as you have only a few streaming writes; if you go 
up to many streams, things might start breaking down.

> I also cannot see why random reads would be bad - I would expect that to be 
> of similar speed to a RAID0 setup.  The only exception would be if you've got 
> atime enabled, and each random read was also causing a small write - then it 
> would be terrible.
>
> Or am I missing something here?

The thing I think you are missing is crappy implementations in several HW 
raid controllers. For Linux software raid the situation is, in my 
experience, quite sane, just as you describe.

/Mattias Wadenstein

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-22  9:30           ` Mattias Wadenstein
@ 2011-02-22  9:49             ` David Brown
  0 siblings, 0 replies; 116+ messages in thread
From: David Brown @ 2011-02-22  9:49 UTC (permalink / raw)
  To: linux-raid

On 22/02/2011 10:30, Mattias Wadenstein wrote:
> On Tue, 22 Feb 2011, David Brown wrote:
>
>> On 21/02/2011 22:51, Stan Hoeppner wrote:
>>> Mattias Wadenstein put forth on 2/21/2011 4:25 AM:
>>>> On Fri, 18 Feb 2011, Stan Hoeppner wrote:
>>>>
>>>>> Mattias Wadenstein put forth on 2/18/2011 7:49 AM:
>>>>>
>>>>> RAID 5/6 need not apply due the abysmal RMW partial stripe write
>>>>> penalty, unless of course you're doing almost no writes. But in that
>>>>> case, how did the data get there in the first place? :)
>>>
>>>> Actually, that's probably the common case for data analysis load. Lots
>>>> of random reads, but only occasional sequential writes when you add a
>>>> new file/fileset. So raid 5/6 performance-wise works out pretty much as
>>>> a stripe of n-[12] disks.
>>>
>>> RAID5/6 have decent single streaming read performance, but sub optimal
>>> random read, less than sub optimal streaming write, and abysmal random
>>> write performance. They exhibit poor random read performance with high
>>> client counts when compared to RAID0 or RAID10. Additionally, with an
>>> analysis "cluster" designed for overall high utilization (no idle
>>> nodes), one node will be uploading data sets while others are doing
>>> analysis. Thus you end up with a mixed simultaneous random read and
>>> streaming write workload on the server. RAID10 will give many times the
>>> throughput in this case compared to RAID5/6, which will bog down rapidly
>>> under such a workload.
>>>
>>
>> I'm a little confused here. It's easy to see why RAID5/6 have very
>> poor random write performance - you need at least two reads and two
>> writes for a single write access. It's also easy to see that streaming
>> reads will be good, as you can read from most of the disks in parallel.
>>
>> However, I can't see that streaming writes would be so bad - you have
>> to write slightly more than for a RAID0 write, since you have the
>> parity data too, but the parity is calculated in advance without the
>> need of any reads, and all the writes are in parallel. So you get the
>> streamed write performance of n-[12] disks. Contrast this with RAID10
>> where you have to write out all data twice - you get the performance
>> of n/2 disks.
>
> It's fine as long as you have only a few streaming writes, if you go up
> to many streams things might start breaking down.
>

That's always going to be the case when you have a lot of writes at the 
same time.  Perhaps RAID5/6 makes matters a little worse by requiring a 
certain ordering on the writes to ensure consistency (maybe you have to 
write a whole stripe before starting a new stripe?  I don't know how md 
raid balances performance and consistency here).  I think the choice of 
file system is likely to make a bigger impact in such cases.

>> I also cannot see why random reads would be bad - I would expect that
>> to be of similar speed to a RAID0 setup. The only exception would be
>> if you've got atime enabled, and each random read was also causing a
>> small write - then it would be terrible.
>>
>> Or am I missing something here?
>
> The thing I think you are missing is crappy implementations in several
> HW raid controllers. For linux software raid the situation is quite
> sanely as you describe in my experience.
>

Ah, okay.  Thanks!



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-22  8:57         ` David Brown
  2011-02-22  9:30           ` Mattias Wadenstein
@ 2011-02-22 13:38           ` Stan Hoeppner
  2011-02-22 14:18             ` David Brown
  1 sibling, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-22 13:38 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

David Brown put forth on 2/22/2011 2:57 AM:
> On 21/02/2011 22:51, Stan Hoeppner wrote:

>> RAID5/6 have decent single streaming read performance, but sub optimal
>> random read, less than sub optimal streaming write, and abysmal random
>> write performance.  They exhibit poor random read performance with high
>> client counts when compared to RAID0 or RAID10.  Additionally, with an
>> analysis "cluster" designed for overall high utilization (no idle
>> nodes), one node will be uploading data sets while others are doing
>> analysis.  Thus you end up with a mixed simultaneous random read and
>> streaming write workload on the server.  RAID10 will give many times the
>> throughput in this case compared to RAID5/6, which will bog down rapidly
>> under such a workload.
>>
> 
> I'm a little confused here.  It's easy to see why RAID5/6 have very poor
> random write performance - you need at least two reads and two writes
> for a single write access.  It's also easy to see that streaming reads
> will be good, as you can read from most of the disks in parallel.
> 
> However, I can't see that streaming writes would be so bad - you have to
> write slightly more than for a RAID0 write, since you have the parity
> data too, but the parity is calculated in advance without the need of
> any reads, and all the writes are in parallel.  So you get the streamed
> write performance of n-[12] disks.  Contrast this with RAID10 where you
> have to write out all data twice - you get the performance of n/2 disks.
> 
> I also cannot see why random reads would be bad - I would expect that to
> be of similar speed to a RAID0 setup.  The only exception would be if
> you've got atime enabled, and each random read was also causing a small
> write - then it would be terrible.
> 
> Or am I missing something here?

I misspoke.  What I meant to say is RAID5/6 have decent streaming and
random read performance, less than optimal *degraded* streaming and
random read performance.  The reason for this is that with one drive
down, each stripe for which the dead drive contained data rather than
parity must be reconstructed with a parity calculation when read.

This is another huge advantage RAID 10 has over the parity RAIDs:  zero
performance loss while degraded.  The other two big ones are vastly
lower rebuild times and still very good performance during a rebuild
operation as only two drives in the array take an extra hit from the
rebuild: the survivor of the mirror pair and the spare being written.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-22 13:38           ` Stan Hoeppner
@ 2011-02-22 14:18             ` David Brown
  2011-02-23  5:52               ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: David Brown @ 2011-02-22 14:18 UTC (permalink / raw)
  To: linux-raid

On 22/02/2011 14:38, Stan Hoeppner wrote:
> David Brown put forth on 2/22/2011 2:57 AM:
>> On 21/02/2011 22:51, Stan Hoeppner wrote:
>
>>> RAID5/6 have decent single streaming read performance, but sub optimal
>>> random read, less than sub optimal streaming write, and abysmal random
>>> write performance.  They exhibit poor random read performance with high
>>> client counts when compared to RAID0 or RAID10.  Additionally, with an
>>> analysis "cluster" designed for overall high utilization (no idle
>>> nodes), one node will be uploading data sets while others are doing
>>> analysis.  Thus you end up with a mixed simultaneous random read and
>>> streaming write workload on the server.  RAID10 will give many times the
>>> throughput in this case compared to RAID5/6, which will bog down rapidly
>>> under such a workload.
>>>
>>
>> I'm a little confused here.  It's easy to see why RAID5/6 have very poor
>> random write performance - you need at least two reads and two writes
>> for a single write access.  It's also easy to see that streaming reads
>> will be good, as you can read from most of the disks in parallel.
>>
>> However, I can't see that streaming writes would be so bad - you have to
>> write slightly more than for a RAID0 write, since you have the parity
>> data too, but the parity is calculated in advance without the need of
>> any reads, and all the writes are in parallel.  So you get the streamed
>> write performance of n-[12] disks.  Contrast this with RAID10 where you
>> have to write out all data twice - you get the performance of n/2 disks.
>>
>> I also cannot see why random reads would be bad - I would expect that to
>> be of similar speed to a RAID0 setup.  The only exception would be if
>> you've got atime enabled, and each random read was also causing a small
>> write - then it would be terrible.
>>
>> Or am I missing something here?
>
> I misspoke.  What I meant to say is RAID5/6 have decent streaming and
> random read performance, less than optimal *degraded* streaming and
> random read performance.  The reason for this is that with one drive
> down, each stripe for which that dead drive contained data and not
> parity the stripe must be reconstructed with a parity calculation when read.
>

That makes lots of sense - I was missing the missing word "degraded"!

I don't think the degraded streaming reads will be too bad - after all, 
you are reading the full stripe anyway, and the data reconstruction will 
be fast on a modern cpu.  But random reads will be very bad.  For 
example, if you have 4+1 drives in a RAID5, then one in every 5 random 
reads will land on the dead drive and will require 4 reads from the 
surviving drives to reconstruct the data.  That means the array does 
about 160% of the normal disk work for random reads, leaving you with 
roughly 60% of the normal performance.

> This is another huge advantage RAID 10 has over the parity RAIDs:  zero
> performance loss while degraded.  The other two big ones are vastly
> lower rebuild times and still very good performance during a rebuild
> operation as only two drives in the array take an extra hit from the
> rebuild: the survivor of the mirror pair and the spare being written.
>

Yes, this is definitely true - RAID10 is less affected by running 
degraded, and recovering is faster and involves less disk wear.  The 
disadvantage compared to RAID6 is, of course, if the other half of a 
disk pair dies during recovery then your raid is gone - with RAID6 you 
have better worst-case redundancy.

Once md raid has support for bad block lists, hot replace, and non-sync 
lists, then the differences will be far less clear.  If a disk in a RAID 
5/6 set has a few failures (rather than dying completely), then it will 
run as normal except when bad blocks are accessed.  This means for all 
but the few bad blocks, the degraded performance will be full speed. 
And if you use "hot replace" to replace the partially failed drive, the 
rebuild will have almost exactly the same characteristics as RAID10 
rebuilds - apart from the bad blocks, which must be recovered by parity 
calculations, you have a straight disk-to-disk copy.




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-22 14:18             ` David Brown
@ 2011-02-23  5:52               ` Stan Hoeppner
  2011-02-23 13:56                 ` David Brown
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-23  5:52 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

David Brown put forth on 2/22/2011 8:18 AM:

> Yes, this is definitely true - RAID10 is less affected by running
> degraded, and recovering is faster and involves less disk wear.  The
> disadvantage compared to RAID6 is, of course, if the other half of a
> disk pair dies during recovery then your raid is gone - with RAID6 you
> have better worst-case redundancy.

The odds of the mirror partner dying during rebuild are very, very long,
and the odds of suffering a URE are very low.  However, in the case of
RAID5/6, more so with RAID5, with modern very large drives (1/2/3TB),
quite a bit is being written these days about unrecoverable read error
rates.  Using a sufficient number of these very large disks will at some
point guarantee a URE during an array rebuild, which may very likely
cost you your entire array.  This is because every block of every
remaining disk (assuming full disk RAID, not small partitions on each
disk) must be read during a RAID5/6 rebuild.  I don't have the equation
handy, but Google should be able to fetch it for you.  IIRC this is one
of the reasons RAID6 is becoming more popular today: not just because
it can survive an additional disk failure, but because it's more
resilient to a URE during a rebuild.

With a RAID10 rebuild, as you're only reading the entire contents of a
single disk, the odds of encountering a URE are much lower than with a
RAID5 with the same number of drives, simply due to the total number of
bits read.
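
Working it out roughly from memory (illustrative numbers only, assuming
the commonly quoted 1-in-10^14 bit consumer drive URE spec and treating
read errors as independent):

  P(URE during rebuild) ~= 1 - (1 - 10^-14)^bits_read

  RAID5, seven 2TB drives, one dead: read 6 x 2TB ~= 9.6x10^13 bits -> ~62%
  RAID10, one 2TB mirror partner:    read 1 x 2TB ~= 1.6x10^13 bits -> ~15%

With an enterprise 1-in-10^15 spec those drop to roughly 9% and 1.6%.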

> Once md raid has support for bad block lists, hot replace, and non-sync
> lists, then the differences will be far less clear.  If a disk in a RAID
> 5/6 set has a few failures (rather than dying completely), then it will
> run as normal except when bad blocks are accessed.  This means for all
> but the few bad blocks, the degraded performance will be full speed. And

You're muddying the definition of a "degraded RAID".

> if you use "hot replace" to replace the partially failed drive, the
> rebuild will have almost exactly the same characteristics as RAID10
> rebuilds - apart from the bad blocks, which must be recovered by parity
> calculations, you have a straight disk-to-disk copy.

Are you saying you'd take a "partially failing" drive in a RAID5/6 and
simply do a full disk copy onto the spare, except "bad blocks",
rebuilding those in the normal fashion, just to approximate the
recovery speed of RAID10?

I think your logic is a tad flawed here.  If a drive is already failing,
why on earth would you trust it, period?  I think you'd be asking for
trouble doing this.  This is precisely one of the reasons many hardware
RAID controllers have historically kicked drives offline after the first
signs of trouble--if a drive is acting flaky we don't want to trust it,
but replace it as soon as possible.

The assumption is that the data on the array is far more valuable than
the cost of a single drive or the entire hardware for that matter.  In
most environments this is the case.  Everyone seems fond of the WD20EARS
drives (which I disdain).  I hear they're loved because Newegg has them
for less than $100.  What's your 2TB of data on that drive worth?  In
the case of a MythTV box, to the owner, that $100 is worth more than the
content.  In a business setting, I'd dare say the data on that drive is
worth far more than the $100 cost of the drive and the admin $$ time
required to replace/rebuild it.

In the MythTV case what you propose might be a worthwhile risk.  In a
business environment, definitely not.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23  5:52               ` Stan Hoeppner
@ 2011-02-23 13:56                 ` David Brown
  2011-02-23 14:25                   ` John Robinson
  2011-02-23 21:11                   ` Stan Hoeppner
  0 siblings, 2 replies; 116+ messages in thread
From: David Brown @ 2011-02-23 13:56 UTC (permalink / raw)
  To: linux-raid

On 23/02/2011 06:52, Stan Hoeppner wrote:
> David Brown put forth on 2/22/2011 8:18 AM:
>
>> Yes, this is definitely true - RAID10 is less affected by running
>> degraded, and recovering is faster and involves less disk wear.  The
>> disadvantage compared to RAID6 is, of course, if the other half of a
>> disk pair dies during recovery then your raid is gone - with RAID6 you
>> have better worst-case redundancy.
>
> The odds of the mirror partner dying during rebuild are very very long,
> and the odds of suffering a URE are very low.  However, in the case of
> RAID5/6, moreso with RAID5, with modern very large drives (1/2/3TB),
> there is being quite a bit written these days about unrecoverable read
> error rates.  Using a sufficient number of these very large disks will
> at some point guarantee a URE during an array rebuild, which may very
> likely cost you your entire array.  This is because every block of every
> remaining disk (assuming full disk RAID not small partitions on each
> disk) must be read during a RAID5/6 rebuild.  I don't have the equation
> handy but Google should be able to fetch it for you.  IIRC this is one
> of the reasons RAID6 is becoming more popular today.  Not just because
> it can survive an additional disk failure, but that it's more resilient
> to a URE during a rebuild.
>

It is certainly the case that the chance of a second failure when doing 
a RAID5/6 rebuild goes up with the number of disks (since all the disks 
are stressed during the rebuild, and any failures are relevant), while 
with RAID 10 rebuilds the chances of a second failure are restricted to 
the single surviving disk of the affected mirror pair.

However, as disks get bigger, the chance of errors on any given disk is 
increasing.  And the fact remains that if you have a failure on a RAID10 
system, you then have a single point of failure during the rebuild 
period - while with RAID6 you still have redundancy (obviously RAID5 is 
far worse here).

> With a RAID10 rebuild, as you're only reading entire contents of a
> single disk, the odds of encountering a URE are much lower than with a
> RAID5 with the same number of drives, simply due to the total number of
> bits read.
>
>> Once md raid has support for bad block lists, hot replace, and non-sync
>> lists, then the differences will be far less clear.  If a disk in a RAID
>> 5/6 set has a few failures (rather than dying completely), then it will
>> run as normal except when bad blocks are accessed.  This means for all
>> but the few bad blocks, the degraded performance will be full speed. And
>
> You're muddying the definition of a "degraded RAID".
>

That could be the case - I'll try to be clearer.  It is certainly 
possible that I'm getting terminology wrong.

>> if you use "hot replace" to replace the partially failed drive, the
>> rebuild will have almost exactly the same characteristics as RAID10
>> rebuilds - apart from the bad blocks, which must be recovered by parity
>> calculations, you have a straight disk-to-disk copy.
>
> Are you saying you'd take a "partially failing" drive in a RAID5/6 and
> simply do a full disk copy onto the spare, except "bad blocks",
> rebuilding those in the normal fashion, simply to approximate the
> recover speed of RAID10?
>
> I think your logic is a tad flawed here.  If a drive is already failing,
> why on earth would you trust it, period?  I think you'd be asking for
> trouble doing this.  This is precisely one of the reasons many hardware
> RAID controllers have historically kicked drives offline after the first
> signs of trouble--if a drive is acting flaky we don't want to trust it,
> but replace it as soon as possible.
>

I don't know if you've followed the recent "md road-map: 2011" thread (I 
can't see any replies from you in the thread), but that is my reference 
point here.

Sometimes disks die suddenly and catastrophically.  When that happens, 
the disk is gone and needs to be kicked offline.

Other times, you have a single-event corruption - for some reason, a 
particular block got corrupted.  And sometimes the disk is wearing out - 
disks have a set of replacement blocks for re-locating known bad blocks, 
and in the end these will run out.  Either you get an URE, or a write 
failure.

(I don't have any idea what the ratio of these sorts of failure modes is.)

If you have a drive with a few failures, then the rest of the data is 
still correct.  You can expect that if the drive returns data 
successfully for a read, then the data is valid - that's what the 
drive's ECC is for.  But you would not want to trust it with new data, 
and you would want to replace it as soon as possible.

The point of md raid's planned "bad block list" is to track which areas 
of the drive should not be used.  And the "hot replace" feature is aimed 
at making a direct copy of a disk - excluding the bad blocks - to make 
replacement of failed drives faster and safer.  Since the failing drive 
is not removed from the array until the hot replace takes over, you 
still have full redundancy for most of the array - just not for stripes 
that contain a bad block.

I can well imagine that hardware RAID controllers don't have this sort 
of flexibility.

> The assumption is that the data on the array is far more valuable than
> the cost of a single drive or the entire hardware for that matter.  In
> most environments this is the case.  Everyone seems fond of the WD20EARS
> drives (which I disdain).  I hear they're loved because Newegg has them
> for less than $100.  What's your 2TB of data on that drive worth?  In
> the case of a MythTV box, to the owner, that $100 is worth more than the
> content.  In a business setting, I'd dare say the data on that drive is
> worth far more than the $100 cost of the drive and the admin $$ time
> required to replace/rebuild it.
>
> In the MythTV case what you propose might be a worthwhile risk.  In a
> business environment, definitely not.
>

I believe it is the value of the data - and the value of keeping as much 
redundancy as you can while minimising the risky rebuild period - that is 
Neil Brown's motivation behind the bad block list and hot replace.  It 
could well be that I'm not explaining it very well, but this is /not/ 
about saving money by continuing to use a dodgy disk even though you 
know it is failing.  It is about the fact that a dodgy disk holding most 
of a data set is a lot better than no disk at all when it comes to 
rebuild speed and data redundancy.


Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where 
you have a RAID5 or RAID6 built from RAID1 pairs?  You get all the 
rebuild benefits of RAID1 or RAID10, such as simple and fast direct 
copies for rebuilds, and little performance degradation.  But you also 
get multiple failure redundancy from the RAID5 or RAID6.  It could be 
that it is excessive - that the extra redundancy is not worth the 
performance cost (you still have poor small write performance).


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23 13:56                 ` David Brown
@ 2011-02-23 14:25                   ` John Robinson
  2011-02-23 15:15                     ` David Brown
  2011-02-23 21:59                     ` Stan Hoeppner
  2011-02-23 21:11                   ` Stan Hoeppner
  1 sibling, 2 replies; 116+ messages in thread
From: John Robinson @ 2011-02-23 14:25 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On 23/02/2011 13:56, David Brown wrote:
[...]
> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
> copies for rebuilds, and little performance degradation. But you also
> get multiple failure redundancy from the RAID5 or RAID6. It could be
> that it is excessive - that the extra redundancy is not worth the
> performance cost (you still have poor small write performance).

I'd also be interested to hear what Stan and other experienced 
large-array people think of RAID60. For example, elsewhere in this 
thread Stan suggested using a 40-drive RAID-10 (i.e. a 20-way RAID-0 
stripe over RAID-1 pairs), and I wondered how a 40-drive RAID-60 (i.e. a 
10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform, both in 
normal and degraded situations, and whether it might be preferable since 
it would avoid the single-disk-failure issue that the RAID-1 mirrors 
potentially expose. My guess is that it ought to have similar random 
read performance and about half the random write performance, which 
might be a trade-off worth making.

Cheers,

John.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23 14:25                   ` John Robinson
@ 2011-02-23 15:15                     ` David Brown
  2011-02-23 23:14                       ` Stan Hoeppner
  2011-02-23 21:59                     ` Stan Hoeppner
  1 sibling, 1 reply; 116+ messages in thread
From: David Brown @ 2011-02-23 15:15 UTC (permalink / raw)
  To: linux-raid

On 23/02/2011 15:25, John Robinson wrote:
> On 23/02/2011 13:56, David Brown wrote:
> [...]
>> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
>> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
>> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
>> copies for rebuilds, and little performance degradation. But you also
>> get multiple failure redundancy from the RAID5 or RAID6. It could be
>> that it is excessive - that the extra redundancy is not worth the
>> performance cost (you still have poor small write performance).
>
> I'd also be interested to hear what Stan and other experienced
> large-array people think of RAID60. For example, elsewhere in this
> thread Stan suggested using a 40-drive RAID-10 (i.e. a 20-way RAID-0
> stripe over RAID-1 pairs), and I wondered how a 40-drive RAID-60 (i.e. a
> 10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform, both in
> normal and degraded situations, and whether it might be preferable since
> it would avoid the single-disk-failure issue that the RAID-1 mirrors
> potentially expose. My guess is that it ought to have similar random
> read performance and about half the random write performance, which
> might be a trade-off worth making.
>

Basically you are comparing a 4-drive RAID-6 to a 4-drive RAID-10.  I 
think the RAID-10 will be faster for streamed reads, and a lot faster 
for small writes.  With the RAID-6 you get improved safety in that you 
still have one-drive redundancy after a drive has failed, but you pay 
for it in 
longer and more demanding rebuilds.  But certainly RAID60 (or at least 
RAID50) seems to be a choice many raid controllers support, so it must 
be popular.



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23 13:56                 ` David Brown
  2011-02-23 14:25                   ` John Robinson
@ 2011-02-23 21:11                   ` Stan Hoeppner
  2011-02-24 11:24                     ` David Brown
  1 sibling, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-23 21:11 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

David Brown put forth on 2/23/2011 7:56 AM:

> However, as disks get bigger, the chance of errors on any given disk is
> increasing.  And the fact remains that if you have a failure on a RAID10
> system, you then have a single point of failure during the rebuild
> period - while with RAID6 you still have redundancy (obviously RAID5 is
> far worse here).

The problem isn't a 2nd whole drive failure during the rebuild, but a
URE during rebuild:

http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162

> I don't know if you've followed the recent "md road-map: 2011" thread (I
> can't see any replies from you in the thread), but that is my reference
> point here.

Actually I haven't.  Is Neil's motivation with this RAID5/6 "mirror
rebuild" to avoid the URE problem?

> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
> you have a RAID5 or RAID6 build from RAID1 pairs?  You get all the
> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
> copies for rebuilds, and little performance degradation.  But you also
> get multiple failure redundancy from the RAID5 or RAID6.  It could be
> that it is excessive - that the extra redundancy is not worth the
> performance cost (you still have poor small write performance).

I don't care for and don't use parity RAID levels.  Simple mirroring and
RAID10 have served me well for a very long time.  They have many
advantages over parity RAID and few, if any, disadvantages.  I've
mentioned all of these in previous posts.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23 14:25                   ` John Robinson
  2011-02-23 15:15                     ` David Brown
@ 2011-02-23 21:59                     ` Stan Hoeppner
  2011-02-23 23:43                       ` John Robinson
  1 sibling, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-23 21:59 UTC (permalink / raw)
  To: John Robinson; +Cc: David Brown, linux-raid

John Robinson put forth on 2/23/2011 8:25 AM:
> On 23/02/2011 13:56, David Brown wrote:
> [...]
>> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
>> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
>> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
>> copies for rebuilds, and little performance degradation. But you also
>> get multiple failure redundancy from the RAID5 or RAID6. It could be
>> that it is excessive - that the extra redundancy is not worth the
>> performance cost (you still have poor small write performance).
> 
> I'd also be interested to hear what Stan and other experienced
> large-array people think of RAID60. For example, elsewhere in this
> thread Stan suggested using a 40-drive RAID-10 (i.e. a 20-way RAID-0
> stripe over RAID-1 pairs), 

Actually, that's not what I mentioned.  What I described was a 48 drive
storage system consisting of qty 6 RAID10 arrays of 8 drives each.
These could be 6 mdraid10 8 drive arrays using LVM to concatenate them
into a single volume, or they could be 6 HBA hardware RAID10 8 drive
arrays stitched together with mdraid linear into a single logical device.

Then you would use XFS as your filesystem, and its allocation group
architecture to achieve your multi user workload parallelism.  This
works well for a lot of workloads.  Coincidentally, because we have 6
arrays of 8 drives each, instead of one large 48 drive RAID10, the
probability of the "dreaded" 2nd drive failure during rebuild drops
dramatically.  Additionally, the amount of data exposed to loss due
to this architecture decreases to 1/6th of that of a single large RAID10
of 48 drives.  If you were to lose both drives during the rebuild, as
long as this 8 drive array is not the first array in the stitched
logical device, it won't contain XFS metadata, and you can recover.
Thus, it's possible to xfs_repair the filesystem, only losing the data
contents of the 8 disk array that failed, or 1/6th of your data.  This
failure/recovery scenario is a wild edge case so I wouldn't _rely_ on
it, but it's interesting that it works, and is worth mentioning.
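
If you did the whole thing in mdadm (using an md linear array instead of
LVM for the concatenation), the skeleton would look something like this
(device names and the agcount are illustrative only, not a tested recipe):

  # one of six 8-drive RAID10 arrays; repeat for md1..md5
  mdadm --create /dev/md0 --level=10 --raid-devices=8 /dev/sd[b-i]
  # stitch the six arrays into one logical device
  mdadm --create /dev/md10 --level=linear --raid-devices=6 /dev/md[0-5]
  # one XFS filesystem across the whole thing; AG count illustrative
  mkfs.xfs -d agcount=24 /dev/md10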

> and I wondered how a 40-drive RAID-60 (i.e. a
> 10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform, both in
> normal and degraded situations, and whether it might be preferable since
> it would avoid the single-disk-failure issue that the RAID-1 mirrors
> potentially expose. My guess is that it ought to have similar random
> read performance and about half the random write performance, which
> might be a trade-off worth making.

First off what you describe here is not a RAID60.  RAID60 is defined as
a stripe across _two_ RAID6 arrays--not 10 arrays.  RAID50 is the same
but with RAID5 arrays.  What you're describing is simply a custom nested
RAID, much like what I mentioned above.  Let's call it RAID J-60.

Anyway, you'd be better off striping 13 three-disk mirror sets with a
spare drive making up the 40.  This covers the double drive failure
during rebuild (a non issue in my book for RAID1/10), and suffers zero
read or write performance penalty, except possibly LVM striping overhead in the
event you have to use LVM to create the stripe.  I'm not familiar enough
with mdadm to know if you can do this nested setup all in mdadm.

The big problem I see is stripe size.  How the !@#$ would you calculate
the proper stripe size for this type of nested RAID and actually get
decent performance from your filesystem sitting on top?

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23 15:15                     ` David Brown
@ 2011-02-23 23:14                       ` Stan Hoeppner
  2011-02-24 10:19                         ` David Brown
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-23 23:14 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

David Brown put forth on 2/23/2011 9:15 AM:

> Basically you are comparing a 4-drive RAID-6 to a 4-drive RAID-10.  I
> think the RAID-10 will be faster for streamed reads, and a lot faster

In this 4 drive configuration, RAID6 might be ever so slightly faster in
read performance, but RAID10 will very likely be faster in every other
category, to include degraded performance and rebuild time.  I can't say
definitively as I've not actually tested these setups head to head.

> for small writes.  You get improved safety in that you still have a
> one-drive redundancy after a drive has failed, but you pay for it in
> longer and more demanding rebuilds.

Just to be clear, you're saying the RAID6 rebuilds are longer and more
demanding than RAID10.  To state the opposite would be incorrect.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23 21:59                     ` Stan Hoeppner
@ 2011-02-23 23:43                       ` John Robinson
  2011-02-24 15:53                         ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: John Robinson @ 2011-02-23 23:43 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Linux RAID

On 23/02/2011 21:59, Stan Hoeppner wrote:
> John Robinson put forth on 2/23/2011 8:25 AM:
>> On 23/02/2011 13:56, David Brown wrote:
>> [...]
>>> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
>>> you have a RAID5 or RAID6 build from RAID1 pairs? You get all the
>>> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
>>> copies for rebuilds, and little performance degradation. But you also
>>> get multiple failure redundancy from the RAID5 or RAID6. It could be
>>> that it is excessive - that the extra redundancy is not worth the
>>> performance cost (you still have poor small write performance).
>>
>> I'd also be interested to hear what Stan and other experienced
>> large-array people think of RAID60. For example, elsewhere in this
>> thread Stan suggested using a 40-drive RAID-10 (i.e. a 20-way RAID-0
>> stripe over RAID-1 pairs),
>
> Actually, that's not what I mentioned.

Yes, it's precisely what you mentioned in this post: 
http://marc.info/?l=linux-raid&m=129777295601681&w=2

[...]
>> and I wondered how a 40-drive RAID-60 (i.e. a
>> 10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform
[...]
> First off what you describe here is not a RAID60.  RAID60 is defined as
> a stripe across _two_ RAID6 arrays--not 10 arrays.  RAID50 is the same
> but with RAID5 arrays.  What you're describing is simply a custom nested
> RAID, much like what I mentioned above.

In the same way that RAID10 is not specified as a stripe across two 
RAID1 arrays, RAID60 is not specified as a stripe across two arrays. But 
yes, it's a nested RAID, in the same way that you have repeatedly 
insisted that RAID10 is nested RAID0 over RAID1.

> Anyway, you'd be better off striping 13 three-disk mirror sets with a
> spare drive making up the 40.  This covers the double drive failure
> during rebuild (a non issue in my book for RAID1/10), and suffers zero
> read or write performance, except possibly LVM striping overhead in the
> event you have to use LVM to create the stripe.  I'm not familiar enough
> with mdadm to know if you can do this nested setup all in mdadm.

Yes of course you can. (You can use md RAID10 with layout n3 or do it 
the long way round with multiple RAID1s and a RAID0.) But in order to 
get the 20TB of storage you'd need 60 drives. That's why for the sake of 
slightly better storage and energy efficiency I'd be interested in how a 
RAID 6+0 (if you prefer) in the arrangement I suggested would perform 
compared to a RAID 10.
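
To spell out the n3 idea, something like this (device list purely
illustrative) gives you 13 three-way mirrors striped together, with the
40th disk as a hot spare:

  mdadm --create /dev/md0 --level=10 --layout=n3 \
        --raid-devices=39 --spare-devices=1 /dev/sd[b-z] /dev/sda[a-o]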

I'm positing this arrangement specifically to cope with the almost 
inevitable URE when trying to recover an array. You dismissed it above 
as a non-issue but in another post you linked to the zdnet article on 
"why RAID5 stops working in 2009", and as far as I'm concerned much the 
same applies to RAID1 pairs. UREs are now a fact of life. When they do 
occur the drives aren't necessarily even operating outside their specs: 
it's 1 in 10^14 or 10^15 bits, so read a lot more than that (as you will 
on a busy drive) and they're going to happen.

Cheers,

John.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23 23:14                       ` Stan Hoeppner
@ 2011-02-24 10:19                         ` David Brown
  0 siblings, 0 replies; 116+ messages in thread
From: David Brown @ 2011-02-24 10:19 UTC (permalink / raw)
  To: linux-raid

On 24/02/2011 00:14, Stan Hoeppner wrote:
> David Brown put forth on 2/23/2011 9:15 AM:
>
>> Basically you are comparing a 4-drive RAID-6 to a 4-drive RAID-10.  I
>> think the RAID-10 will be faster for streamed reads, and a lot faster
>
> In this 4 drive configuration, RAID6 might be ever so slightly faster in
> read performance, but RAID10 will very likely be faster in every other
> category, to include degraded performance and rebuild time.  I can't say
> definitively as I've not actually tested these setups head to head.
>
>> for small writes.  You get improved safety in that you still have a
>> one-drive redundancy after a drive has failed, but you pay for it in
>> longer and more demanding rebuilds.
>
> Just to be clear, you're saying the RAID6 rebuilds are longer and more
> demanding than RAID10.  To state the opposite would be incorrect.
>

Yes, that is exactly what I am saying.



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23 21:11                   ` Stan Hoeppner
@ 2011-02-24 11:24                     ` David Brown
  2011-02-24 23:30                       ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: David Brown @ 2011-02-24 11:24 UTC (permalink / raw)
  To: linux-raid

On 23/02/2011 22:11, Stan Hoeppner wrote:
> David Brown put forth on 2/23/2011 7:56 AM:
>
>> However, as disks get bigger, the chance of errors on any given disk is
>> increasing.  And the fact remains that if you have a failure on a RAID10
>> system, you then have a single point of failure during the rebuild
>> period - while with RAID6 you still have redundancy (obviously RAID5 is
>> far worse here).
>
> The problem isn't a 2nd whole drive failure during the rebuild, but a
> URE during rebuild:
>
> http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162
>

Yes, I've read that article - it's one of the reasons for always 
preferring RAID6 to RAID5.

My understanding of RAID controllers (software or hardware) is that they 
consider a drive to be either "good" or "bad".  So if you get an URE, 
the controller considers the drive "bad" and ejects it from the array. 
It doesn't matter if it is an URE or a total disk death.

Maybe hardware RAID controllers do something else here - you know far 
more about them than I do.

The idea of the md raid "bad block list" is that there is a medium 
ground - you can have disks that are "mostly good".

Supposing you have a RAID6 array, and one disk has died completely.  It 
gets replaced by a hot spare, and rebuild begins.  As the rebuild 
progresses, disk 1 gets an URE.  Traditional handling would mean disk 1 
is ejected, and now you have a double-degraded RAID6 to rebuild.  When 
you later get an URE on disk 2, you have lost data for that stripe - and 
the whole raid is gone.

But with bad block lists, the URE on disk 1 leads to a bad block entry 
on disk 1, and the rebuild continues.  When you later get an URE on disk 
2, it's no problem - you use data from disk 1 and the other disks. 
UREs are no longer a killer unless your set has no redundancy.


UREs are also what I worry about with RAID1 (including RAID10) 
rebuilds.  If a disk has failed, you are right in saying that the 
chances of the second disk in the pair failing completely are tiny.  But 
the chances of getting an URE on the second disk during the rebuild are 
not negligible - they are small, but growing with each new jump in disk 
size.

With md raid's future bad block lists and hot replace features, then an 
URE on the second disk during rebuilds is only a problem if the first 
disk has died completely - if it only had a small problem, then the "hot 
replace" rebuild will be able to use both disks to find the data.

>> I don't know if you've followed the recent "md road-map: 2011" thread (I
>> can't see any replies from you in the thread), but that is my reference
>> point here.
>
> Actually I haven't.  Is Neil's motivation with this RAID5/6 "mirror
> rebuild" to avoid the URE problem?
>

I know you are more interested in hardware raid than software raid, but 
I'm sure you'll find some interesting points in Neil's writings.  If you 
don't want to read through the thread, at least read his blog post.

<http://neil.brown.name/blog/20110216044002>

>> Incidentally, what's your opinion on a RAID1+5 or RAID1+6 setup, where
>> you have a RAID5 or RAID6 build from RAID1 pairs?  You get all the
>> rebuild benefits of RAID1 or RAID10, such as simple and fast direct
>> copies for rebuilds, and little performance degradation.  But you also
>> get multiple failure redundancy from the RAID5 or RAID6.  It could be
>> that it is excessive - that the extra redundancy is not worth the
>> performance cost (you still have poor small write performance).
>
> I don't care for and don't use parity RAID levels.  Simple mirroring and
> RAID10 have served me well for a very long time.  They have many
> advantages over parity RAID and few, if any, disadvantages.  I've
> mentioned all of these in previous posts.
>



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-23 23:43                       ` John Robinson
@ 2011-02-24 15:53                         ` Stan Hoeppner
  0 siblings, 0 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-24 15:53 UTC (permalink / raw)
  To: John Robinson; +Cc: Linux RAID

John Robinson put forth on 2/23/2011 5:43 PM:
> On 23/02/2011 21:59, Stan Hoeppner wrote:

>> Actually, that's not what I mentioned.
> 
> Yes, it's precisely what you mentioned in this post:
> http://marc.info/?l=linux-raid&m=129777295601681&w=2

Sorry John.  I thought you were referring to my recent post regarding 48
drives.  I usually don't remember my own posts very long, especially
those over a week old.  Heck, I'm lucky to remember a post I made 2-3
days ago.  ;)

> [...]
>>> and I wondered how a 40-drive RAID-60 (i.e. a
>>> 10-way RAID-0 stripe over 4-way RAID-6 arrays) would perform
> [...]
>> First off what you describe here is not a RAID60.  RAID60 is defined as
>> a stripe across _two_ RAID6 arrays--not 10 arrays.  RAID50 is the same
>> but with RAID5 arrays.  What you're describing is simply a custom nested
>> RAID, much like what I mentioned above.
> 
> In the same way that RAID10 is not specified as a stripe across two
> RAID1 arrays, RAID60 is not specified as a stripe across two arrays. But
> yes, it's a nested RAID, in the same way that you have repeatedly
> insisted that RAID10 is nested RAID0 over RAID1.

"RAID 10" is used to describe striped mirrors regardless of the number
of mirror sets used, simply specifying the number of drives in the
description, i.e. "20 drive RAID 10" or "8 drive RAID 10".  As I just
learned from doing some research, apparently when one stripes more than
2 RAID6s one would then describe the array as an "n leg RAID 60", or "n
element RAID 60".  In your example this would be a "10 leg RAID 60".
I'd only seen the term "RAID 60" used to describe the 2 leg case.  My
apologies for straying out here and wasting time on a non-issue.

>> Anyway, you'd be better off striping 13 three-disk mirror sets with a
>> spare drive making up the 40.  This covers the double drive failure
>> during rebuild (a non issue in my book for RAID1/10), and suffers zero
>> read or write performance, except possibly LVM striping overhead in the
>> event you have to use LVM to create the stripe.  I'm not familiar enough
>> with mdadm to know if you can do this nested setup all in mdadm.
> 
> Yes of course you can. (You can use md RAID10 with layout n3 or do it
> the long way round with multiple RAID1s and a RAID0.) But in order to
> get the 20TB of storage you'd need 60 drives. That's why for the sake of
> slightly better storage and energy efficiency I'd be interested in how a
> RAID 6+0 (if you prefer) in the arrangement I suggested would perform
> compared to a RAID 10.

For the definitive answer to this you'd have to test each RAID level
with your target workload.  In general, I'd say, other than the problems
with parity performance, the possible gotcha is being able to come up
with a workable stripe block/width with such a setup.  Wide arrays
typically don't work well for general use filesystems as most files are
much smaller than the typical stripe block required to get decent
performance from such a wide stripe.  The situation is even worse with
nested stripes.

Your example uses a top level stripe width of 10 with a nested stripe
width of 2.  Getting any filesystem to work efficiently with such a
nested RAID, from both an overall performance and space efficiency
standpoint, may prove to be very difficult.  If you can't find a magic
formula for this, you could very well end up with worse actual space
efficiency in the FS than if you used a straight RAID10.

If you prefer RAID6 legs, what I'd recommend is simply concatenating the
legs instead of striping them.  Using your 40 drive example, I'd
recommend using 4 RAID6 legs of 10 drives each, so you get an 8 drive
stripe width per array and thus better performance than the 4 drive
case.  Use a stripe block size of 64KB on each array as this should
yield a good mix of space efficiency for average size files/extents and
performance for random IO with such size files.  Concatenating in this
manner will avoid the difficult to solve multiple layered stripe
block/width to filesystem harmony problem.

Using XFS atop this concatenated RAID6 setup with an allocation group
count of 32 (4 arrays x 8 stripe spindles/array) will give you good
parallelism across the 4 arrays with a multiuser workload.  AFAIK,
EXT3/4, ReiserFS, JFS, don't use allocation groups or anything like
them, and thus can't get parallelism from such a concatenated setup.
This is one of the many reasons why XFS is the only suitable Linux FS
for large/complex arrays.  I haven't paid any attention to BTRFS, so I
don't know if it would be suitable for scenarios like this.  It's so far
from production quality at this point it's not really even worth
mentioning, but I did so for the sake of being complete.
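
A sketch of that layout, just to make it concrete (device names are
illustrative only and none of this is tested):

  # one of four 10-drive RAID6 legs with a 64KB chunk; repeat for md1..md3
  mdadm --create /dev/md0 --level=6 --raid-devices=10 --chunk=64 /dev/sd[b-k]
  # concatenate the legs rather than striping them
  mdadm --create /dev/md10 --level=linear --raid-devices=4 /dev/md[0-3]
  # XFS with 32 allocation groups for parallelism across the legs
  mkfs.xfs -d agcount=32 /dev/md10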

As always, all of this is a strictly academic guessing exercise without
testing the specific workload.  That said, for any multiuser workload
this setup should perform relatively well, for a parity based array.

The takeaway here is concatenation instead of layered striping, and
using the appropriate filesystem to take advantage of such.

> I'm positing this arrangement specifically to cope with the almost
> inevitable URE when trying to recover an array. You dismissed it above
> as a non-issue but in another post you linked to the zdnet article on
> "why RAID5 stops working in 2009", and as far as I'm concerned much the
> same applies to RAID1 pairs. UREs are now a fact of life. When they do
> occur the drives aren't necessarily even operating outside their specs:
> it's 1 in 10^14 or 10^15 bits, so read a lot more than that (as you will
> on a busy drive) and they're going to happen.

I didn't mean to discount anything.  The math shows that UREs during
rebuild aren't relevant for mirrored RAID schemes.   This is because
with current drive sizes and URE rates you have to read more than
something like 12 TB before encountering a URE.  The largest drives
available are 3TB, or ~1/4th the "URE rebuild threshold" bit count.
Probabilities inform us about the hypothetical world in general terms.
In the real world, sure, anything can happen.  Real world data of this
type isn't published, so we have to base our calculations and planning on
what the manufacturers provide.

The article makes an interesting point in that as drives continue to
increase in capacity, with their URE rates remaining basically static,
eventually every RAID6 rebuild will see a URE.  I haven't done the math
so I don't know at exactly what drive size/count this will occur.  The
obvious answer to it will be RAID7, or triple parity RAID.  At that
point, parity RAID will have, in practical $$, lost its only advantage
over mirrors, i.e. RAID10.

In the long run, if the current size:URE rate trend continues, we may
see the 3 leg RAID 10 becoming popular.  My personal hope is that the
drive makers can start producing drives with much lower URE rates.  I'd
rather never see the days of anything close to hexa parity RAID9 and
quad leg RAID10 being required simply to survive a rebuild process.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15  9:43     ` David Brown
@ 2011-02-24 20:28       ` Matt Garman
  2011-02-24 20:43         ` David Brown
  0 siblings, 1 reply; 116+ messages in thread
From: Matt Garman @ 2011-02-24 20:28 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

Wow, I can't believe the number of responses I've received to this
question.  I've been trying to digest it all.  I'm going to throw some
follow-up comments as time allows, starting here...

On Tue, Feb 15, 2011 at 3:43 AM, David Brown <david@westcontrol.com> wrote:
> If you are not too bothered about write performance, I'd put a fair amount
> of the budget into ram rather than just disk performance.  When you've got
> the ram space to make sure small reads are mostly cached, the main
> bottleneck will be sequential reads - and big hard disks handle sequential
> reads as fast as expensive SSDs.

I could be wrong, but I'm not so sure RAM would be beneficial for our
case.  Our workload is virtually all reads; however, these are huge
reads.  The analysis programs basically do a full read of data files
that are generally pretty big: roughly 100 MB to 5 GB in the worst
case.  Average file size is maybe 500 MB (rough estimate).  And there
are hundreds of these files, all of which need "immediate" access.  So
caching them all in RAM seems like it would take an awful lot of RAM.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 14:56     ` Zdenek Kaspar
@ 2011-02-24 20:36       ` Matt Garman
  0 siblings, 0 replies; 116+ messages in thread
From: Matt Garman @ 2011-02-24 20:36 UTC (permalink / raw)
  To: Zdenek Kaspar; +Cc: linux-raid

On Tue, Feb 15, 2011 at 8:56 AM, Zdenek Kaspar <zkaspar82@gmail.com> wrote:
> On 15.2.2011 15:29, Roberto Spadim wrote:
>> for hobby = SATA2 disks, 50USD disks of 1TB 50MB/s
>> the today state of art, in 'my world' is: http://www.ramsan.com/products/3
>
> I doubt 20TB SLC which will survive huge abuse (writes) is low-cost
> solution what OP wants to build himself..
>
> or 20TB RAM omg..

Just to be clear, this is *not* a hobby system.  I mentioned hobby
system in my original post just to serve as a reference for my current
knowledge level.  I've built and configured the simple linux md raid6
NAS box at home, and a similar system for backups here at work.

But now I'm looking at something that's obviously a completely
different game, with bigger and stricter requirements.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-24 20:28       ` Matt Garman
@ 2011-02-24 20:43         ` David Brown
  0 siblings, 0 replies; 116+ messages in thread
From: David Brown @ 2011-02-24 20:43 UTC (permalink / raw)
  To: linux-raid

On 24/02/11 21:28, Matt Garman wrote:
> Wow, I can't believe the number of responses I've received to this
> question.  I've been trying to digest it all.  I'm going to throw some
> follow-up comments as time allows, starting here...
>
> On Tue, Feb 15, 2011 at 3:43 AM, David Brown<david@westcontrol.com>  wrote:
>> If you are not too bothered about write performance, I'd put a fair amount
>> of the budget into ram rather than just disk performance.  When you've got
>> the ram space to make sure small reads are mostly cached, the main
>> bottleneck will be sequential reads - and big hard disks handle sequential
>> reads as fast as expensive SSDs.
>
> I could be wrong, but I'm not so sure RAM would be beneficial for our
> case.  Are workload is virtually all reads, however, these are huge
> reads.  The analysis programs basically do a full read of data files
> that are generally pretty big: roughly 100 MB to 5 GB in the worst
> case.  Average file size is maybe 500 MB (rough estimate).  And there
> are hundreds of these falls, all of which need "immediate" access.  So
> to cache these in RAM, seems like it would take an awful lot of RAM.

RAM for cache makes a difference if the same file is read more than 
once.  That applies equally to big files - but only if more than one 
machine is reading the same file.  If they are all reading different 
files, then - as you say - there won't be much to gain as each file is 
only used once.

Still, when you have so much data going from the disks and out to the 
clients, it is good to have plenty of ram for buffering, even if it is 
only used once.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 13:03     ` Roberto Spadim
@ 2011-02-24 20:43       ` Matt Garman
  2011-02-24 20:53         ` Zdenek Kaspar
  0 siblings, 1 reply; 116+ messages in thread
From: Matt Garman @ 2011-02-24 20:43 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Stan Hoeppner, Mdadm

On Tue, Feb 15, 2011 at 7:03 AM, Roberto Spadim <roberto@spadim.com.br> wrote:
> disks are good for sequencial access
> for non-sequencial ssd are better (the sequencial access rate for a
> ssd is the same for a non sequencial access rate)

I have a more general question: say I have an ultra simple NAS system,
with exactly one disk, and an infinitely fast network connection.
Now, with exactly one client, I should be able to do a sequential read
that is exactly the speed of that single drive in the NAS box (assume
network protocol overhead is negligible to keep it simple).

What happens if there are exactly two clients simultaneously
requesting different large files?  From the client's perspective, this
is a sequential read, but from the drive's perspective, it's obviously
not.

And likewise, what if there are three clients, or four clients, ...,
all requesting different but large files simultaneously?

How does one calculate the drive's throughput in these cases?  And,
clearly, there are two throughputs, one from the clients'
perspectives, and one from the drive's perspective.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 13:39   ` David Brown
  2011-02-16 23:32     ` Stan Hoeppner
@ 2011-02-24 20:49     ` Matt Garman
  1 sibling, 0 replies; 116+ messages in thread
From: Matt Garman @ 2011-02-24 20:49 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On Tue, Feb 15, 2011 at 7:39 AM, David Brown <david@westcontrol.com> wrote:
> This brings up an important point - no matter what sort of system you get
> (home made, mdadm raid, or whatever) you will want to do some tests and
> drills at replacing failed drives.  Also make sure everything is well
> documented, and well labelled.  When mdadm sends you an email telling you
> drive sdx has failed, you want to be /very/ sure you know which drive is sdx
> before you take it out!

Agreed!  This will be a learn-as-I-go project.

> You also want to consider your raid setup carefully.  RAID 10 has been
> mentioned here several times - it is often a good choice, but not
> necessarily.  RAID 10 gives you fast recovery, and can at best survive a
> loss of half your disks - but at worst a loss of two disks will bring down
> the whole set.  It is also very inefficient in space.  If you use SSDs, it
> may not be worth double the price to have RAID 10.  If you use hard disks,
> it may not be sufficient safety.

And that's what has me thinking about cluster filesystems.
Ultimately, I'd like a pool of storage "nodes".  These could live on
the same physical machine, or be spread across multiple machines.  To
the clients, this pool of nodes would look like one single collection
of storage.  The benefit of this, in my opinion, is flexibility
(mainly easy to grow/add new nodes), but also a bit more safety.  If
one node dies, it doesn't take down the whole pool, just the files on
that node become unavailable.

Even better would be a "smart" pool, that, when a new node is added,
it automatically re-distributes all the files, so that the new node
has the same kind of space utilization as all the others.

> It is probably worth having a small array of SSDs (RAID1 or RAID10) to hold
> the write intent bitmap, the journal for your main file system, and of
> course your OS.  Maybe one of these absurdly fast PCI Express flash disks
> would be a good choice.

Is that really necessary, though, when writes account for probably less than 5%
of total IO operations?  And (relatively speaking) write performance
is unimportant?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-24 20:43       ` Matt Garman
@ 2011-02-24 20:53         ` Zdenek Kaspar
  2011-02-24 21:07           ` Joe Landman
  0 siblings, 1 reply; 116+ messages in thread
From: Zdenek Kaspar @ 2011-02-24 20:53 UTC (permalink / raw)
  To: linux-raid

On 24.2.2011 21:43, Matt Garman wrote:
> On Tue, Feb 15, 2011 at 7:03 AM, Roberto Spadim <roberto@spadim.com.br> wrote:
>> disks are good for sequencial access
>> for non-sequencial ssd are better (the sequencial access rate for a
>> ssd is the same for a non sequencial access rate)
> 
> I have a more general question: say I have an ultra simple NAS system,
> with exactly one disk, and an infinitely fast network connection.
> Now, with exactly one client, I should be able to do a sequential read
> that is exactly the speed of that single drive in the NAS box (assume
> network protocol overhead is negligible to keep it simple).
> 
> What happens if there are exactly two clients simultaneously
> requesting different large files?  From the client's perspective, this
> is a sequential read, but from the drive's perspective, it's obviously
> not.
> 
> And likewise, what if there are three clients, or four clients, ...,
> all requesting different but large files simultaneously?
> 
> How does one calculate the drive's throughput in these cases?  And,
> clearly, there are two throughputs, one from the clients'
> perspectives, and one from the drive's perspective.

For a rough estimate, try to simulate your workload at a small scale, i.e.
create files on your disk (fs) and run multiple processes (dd) reading
them. To see how it adds up, watch the load, e.g. for the disk(s): iostat
-mx 1.
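
Something along these lines (file names and count purely illustrative;
iflag=direct bypasses the page cache so you measure the disks, not RAM):

  # create a handful of large test files first, then read them in parallel
  for i in $(seq 1 8); do
      dd if=/data/test$i of=/dev/null bs=1M iflag=direct &
  done
  iostat -mx 1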

HTH, Z.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15 15:16     ` Joe Landman
  2011-02-15 20:37       ` NeilBrown
@ 2011-02-24 20:58       ` Matt Garman
  2011-02-24 21:20         ` Joe Landman
  1 sibling, 1 reply; 116+ messages in thread
From: Matt Garman @ 2011-02-24 20:58 UTC (permalink / raw)
  To: Joe Landman; +Cc: Doug Dumitru, Mdadm

On Tue, Feb 15, 2011 at 9:16 AM, Joe Landman <joe.landman@gmail.com> wrote:
> [disclosure: vendor posting, ignore if you wish, vendor html link at bottom
> of message]
>
>> The whole system needs to be "fast".
>
> Define what you mean by "fast".  Seriously ... we've had people tell us
> about their "huge" storage needs that we can easily fit onto a single small
> unit, no storage cluster needed.  We've had people say "fast" when they mean
> "able to keep 1 GbE port busy".
>
> Fast needs to be articulated really in terms of what you will do with it.
>  As you noted in this and other messages, you are scaling up from 10 compute
> nodes to 40 compute nodes.  4x change in demand, and I am guessing bandwidth
> (if these are large files you are streaming) or IOPs (if these are many
> small files you are reading).  Small and large here would mean less than
> 64kB for small, and greater than 4MB for large.

These are definitely large files; maybe "huge" is a better word.  All
are over 100 MB in size, some are upwards of 5 GB, most are probably a
few hundred megs in size.

The word "streaming" may be accurate, but to me it is misleading. I
associate streaming with media, i.e. it is generally consumed much
more slowly than it can be sent (e.g. even high-def 1080p video won't
saturate a 100 mbps link).  But in our case, these files are basically
read into memory, and then computations are done from there.

So, for an upper bounds on the notion of "fast", I'll illustrate the
worst-case scenario: there are 50 analysis machines, each of which can
run up to 10 processes, making 500 total processes.  Every single
process requests a different file at the exact same time, and every
requested file is over 100 MB in size.  Ideally, each process would be
able to access the file as though it were local, and was the only
process on the machine.  In reality, it's "good enough" if each of the
50 machines' gigabit network connections are saturated.  So from the
network perspective, that's 50 gbps.

From the storage perspective, it's less clear to me.  That's 500 huge
simultaneous read requests, and I'm not clear on what it would take to
satisfy that.
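
(Back-of-envelope: 50 machines x 1 Gb/s is about 6.25 GB/s of aggregate
reads, which works out to only ~12.5 MB/s per stream if all 500 streams
share it evenly; the catch is that the storage sees 500-way concurrent
access rather than one nice sequential read.)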

> Your choice is simple.  Build or buy.  Many folks have made suggestions, and
> some are pretty reasonable, though a pure SSD or Flash based machine, while
> doable (and we sell these), is quite unlikely to be close to the realities
> of your budget.  There are use cases for which this does make sense, but the
> costs are quite prohibitive for all but a few users.

Well, I haven't decided on whether or not to build or buy, but the
thought experiment of planning a buy is very instructive.  Thanks to
everyone who has contributed to this thread, I've got more information
than I've been able to digest so far!

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-24 20:53         ` Zdenek Kaspar
@ 2011-02-24 21:07           ` Joe Landman
  0 siblings, 0 replies; 116+ messages in thread
From: Joe Landman @ 2011-02-24 21:07 UTC (permalink / raw)
  To: Zdenek Kaspar; +Cc: linux-raid

On 02/24/2011 03:53 PM, Zdenek Kaspar wrote:

>> And likewise, what if there are three clients, or four clients, ...,
>> all requesting different but large files simultaneously?
>>
>> How does one calculate the drive's throughput in these cases?  And,
>> clearly, there are two throughputs, one from the clients'
>> perspectives, and one from the drive's perspective.

We use Jens Axboe's fio code to model this.

Best-case scenario: you get 1/N of a fixed-size shared resource, 
averaged out over time, for N requestors of equal size/priority. 
Reality is often different, in that there are multiple stacks to 
traverse, potential seek time issues as well as network contention 
issues, interrupt and general OS "jitter", etc.  That is, all the 
standard HPC issues you get for compute/analysis nodes, you get for this.
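As a sketch, a load of "N processes each streaming a different large file"
can be approximated with fio along these lines (directory, file size, and
job count are placeholders):

    # 16 concurrent sequential readers, 1 MB blocks, one large file each;
    # O_DIRECT keeps the page cache out of the measurement
    fio --name=streamers --directory=/mnt/testfs --rw=read --bs=1M \
        --size=2g --numjobs=16 --direct=1 --group_reporting

Scale --numjobs up and watch where the aggregate throughput flattens out.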

Best advice is "go wide".  As many spindles as possible.  If you are 
read bound (large block streaming IO), then RAID6 is good, and many of 
them joined into a parallel file system (a la GlusterFS, FhGFS, MooseFS, 
OrangeFS, ...) is even better.  Well, as long as the baseline hardware 
is fast to begin with.  We do not recommend a single drive per server; 
it turns out to be a terrible way to aggregate bandwidth in practice.  It's 
better to build really fast units and go "wide" with them.  Which is, 
curiously, what we do with our siCluster boxen.

MD raid should be fine for you.

Regards,

Joe



-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-24 20:58       ` Matt Garman
@ 2011-02-24 21:20         ` Joe Landman
  2011-02-26 23:54           ` high throughput storage server? GPFS w/ 10GB/s throughput to the rescue Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Joe Landman @ 2011-02-24 21:20 UTC (permalink / raw)
  To: Matt Garman; +Cc: Doug Dumitru, Mdadm

On 02/24/2011 03:58 PM, Matt Garman wrote:

> These are definitely large files; maybe "huge" is a better word.  All
> are over 100 MB in size, some are upwards of 5 GB, most are probably a
> few hundred megs in size.

Heh ... the "huge" storage I alluded to above is also quite ... er ... 
context sensitive.

>
> The word "streaming" may be accurate, but to me it is misleading. I

Actually not at all.  We have quite a few customers that consume files 
by slurping them into RAM before processing.  So the file system streams 
(e.g. sends data as fast as the remote process can consume it, modulo 
network and other inefficiencies).

> associate streaming with media, i.e. it is generally consumed much
> more slowly than it can be sent (e.g. even high-def 1080p video won't
> saturate a 100 mbps link).  But in our case, these files are basically
> read into memory, and then computations are done from there.

Same use case.  dd is an example of a "trivial" streaming app, though we 
prefer to generate load with fio.

>
> So, for an upper bounds on the notion of "fast", I'll illustrate the
> worst-case scenario: there are 50 analysis machines, each of which can
> run up to 10 processes, making 500 total processes.  Every single
> process requests a different file at the exact same time, and every
> requested file is over 100 MB in size.  Ideally, each process would be
> able to access the file as though it were local, and was the only
> process on the machine.  In reality, it's "good enough" if each of the
> 50 machines' gigabit network connections are saturated.  So from the
> network perspective, that's 50 gbps.

Ok, so if we divide these 50 Gbps across say ... 10 storage nodes ... 
then we need only sustain, on average, 5 Gbps/storage node.  This makes 
a number of assumptions, some of which are valid (e.g. file distribution 
across nodes is effectively random, and can be accomplished via parallel 
file system). 5 Gbps/storage node sounds like a node with 6x GbE ports, 
or 1x 10GbE port.  Run one of the parallel file systems across it and 
make sure the interior RAID can handle this sort of bandwidth (you'd 
need at least 700 MB/s on the interior RAID, which eliminates many/most 
of the units on the market, and you'd need pretty high efficiencies in 
the stack, which also have a tendency to reduce your choices ... better 
to build the interior RAIDs as fast as possible, deal with the network 
efficiency losses, and call it a day).

All this said, it's better to express your IO bandwidth needs in MB/s, 
preferably in terms of sustained bandwidth needs, as this is language 
that you'd be talking to vendors in.  So on 50 machines, assume each 
machine can saturate its 1GbE port (these aren't Broadcom NICs, right?), 
that gets you 50x 117 MB/s or about 5.9 GB/s sustained bandwidth for 
your IO.  10 machines running at a sustainable 600 MB/s delivered over 
the network, and a parallel file system atop this, solves this problem.
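The arithmetic, as a quick sketch (same ~117 MB/s GbE payload figure, and
the 10 storage nodes used in the example above):

    awk 'BEGIN { t = 50 * 117; printf "%.2f GB/s total, %.0f MB/s per storage node\n", t/1000, t/10 }'

which is where the ~600 MB/s-per-box figure comes from.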

Single centralized resources (FC heads, filers, etc.) won't scale to 
this.  Then again, this isn't their use case.

Regards,

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-24 11:24                     ` David Brown
@ 2011-02-24 23:30                       ` Stan Hoeppner
  2011-02-25  8:20                         ` David Brown
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-24 23:30 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

David Brown put forth on 2/24/2011 5:24 AM:

> My understanding of RAID controllers (software or hardware) is that they
> consider a drive to be either "good" or "bad".  So if you get an URE,
> the controller considers the drive "bad" and ejects it from the array.
> It doesn't matter if it is an URE or a total disk death.
> 
> Maybe hardware RAID controllers do something else here - you know far
> more about them than I do.

Most HBA and SAN RAID firmware I've dealt with kicks drives offline
pretty quickly at any sign of an unrecoverable error.  I've also seen
drives kicked simply because the RAID firmware didn't like the drive
firmware.  I have a fond (sarcasm) memory of DAC960s kicking ST118202
18GB Cheetahs offline left and right in the late 90s.  The fact I still
recall that Seagate drive# after 10+ years should be informative
regarding the severity of that issue.  :(

> The idea of the md raid "bad block list" is that there is a medium
> ground - you can have disks that are "mostly good".

Everything I've read and seen in the last few years regarding hard disk
technology says that platter manufacturing quality and tolerance are so
high on modern drives that media defects are rarely, if ever, seen by
the customer, as they're mapped out at the factory.  The platters don't
suffer wear effects, but the rest of the moving parts do.  From what
I've read/seen, "media" errors observed in the wild today are actually
caused by mechanical failures due to physical wear on various moving
parts:  VC actuator pivot bearing/race, spindle bearings, etc.
Mechanical failures tend to show mild "media errors" in the beginning
and get worse with time as moving parts go further out of alignment.
Thus, as I see it, any UREs on a modern drive represent a "Don't trust
me--Replace me NOW" flag.  I could be all wrong here, but this is what
I've read, and seen in manufacturer videos from WD and Seagate.

> Supposing you have a RAID6 array, and one disk has died completely.  It
> gets replaced by a hot spare, and rebuild begins.  As the rebuild
> progresses, disk 1 gets an URE.  Traditional handling would mean disk 1
> is ejected, and now you have a double-degraded RAID6 to rebuilt.  When
> you later get an URE on disk 2, you have lost data for that stripe - and
> the whole raid is gone.
> 
> But with bad block lists, the URE on disk 1 leads to a bad block entry
> on disk 1, and the rebuild continues.  When you later get an URE on disk
> 2, it's no problem - you use data from disk 1 and the other disks. URE's
> are no longer a killer unless your set has no redundancy.

They're not a killer with RAID 6 anyway, are they?  You can be
rebuilding one failed drive and suffer UREs left and right, as long as
you don't get two of them on two drives simultaneously in the same
stripe block read.  I think that's right.  Please correct me if not.

> URE's are also what I worry about with RAID1 (including RAID10)
> rebuilds.  If a disk has failed, you are right in saying that the
> chances of the second disk in the pair failing completely are tiny.  But
> the chances of getting an URE on the second disk during the rebuild are
> not negligible - they are small, but growing with each new jump in disk
> size.

I touched on this in my other reply, somewhat tongue-in-cheek mentioning
3 leg and 4 leg RAID10.  At current capacities and URE ratings I'm not
worried about it with mirror pairs.  If URE ratings haven't increased
substantially by the time our avg drive capacity hits 10TB I'll start to
worry.
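As a crude sketch of the exposure (taking the spec-sheet URE rate at face
value and treating errors as independent, which real drives don't strictly
honor):

    # P(at least one URE) when a whole drive is read during a rebuild
    awk -v tb=2 -v spec=1e14 'BEGIN {
        bits = tb * 8e12                  # capacity read, in bits
        p = 1 - exp(-bits / spec)         # Poisson approximation
        printf "%.0f TB read at 1 URE per %g bits: ~%.0f%% chance of hitting one\n", tb, spec, 100 * p
    }'

A 2 TB mirror partner at the common 1-in-10^14 rating comes out around 15%;
at 1-in-10^15 it drops to about 2%.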

Somewhat related to this, does anyone else here build their arrays from the
smallest cap drives they can get away with, preferably single platter
models when possible?  I adopted this strategy quite some time ago,
mostly to keep rebuild times to a minimum, keep rotational mass low to
consume the least energy since using more drives, but also with the URE
issue in the back of my mind.  Anecdotal evidence tends to point to the
trend of OPs going with fewer gargantuan drives instead of many smaller
ones.  Maybe that's just members of this list, whose criteria may be
quite different from the typical enterprise data center.

> With md raid's future bad block lists and hot replace features, then an
> URE on the second disk during rebuilds is only a problem if the first
> disk has died completely - if it only had a small problem, then the "hot
> replace" rebuild will be able to use both disks to find the data.

What happens when you have multiple drives at the same or similar bad
block count?

> I know you are more interested in hardware raid than software raid, but
> I'm sure you'll find some interesting points in Neil's writings.  If you
> don't want to read through the thread, at least read his blog post.
> 
> <http://neil.brown.name/blog/20110216044002>

Will catch up.  Thanks for the blog link.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-24 23:30                       ` Stan Hoeppner
@ 2011-02-25  8:20                         ` David Brown
  0 siblings, 0 replies; 116+ messages in thread
From: David Brown @ 2011-02-25  8:20 UTC (permalink / raw)
  To: linux-raid

On 25/02/2011 00:30, Stan Hoeppner wrote:
> David Brown put forth on 2/24/2011 5:24 AM:
>
>> My understanding of RAID controllers (software or hardware) is that they
>> consider a drive to be either "good" or "bad".  So if you get an URE,
>> the controller considers the drive "bad" and ejects it from the array.
>> It doesn't matter if it is an URE or a total disk death.
>>
>> Maybe hardware RAID controllers do something else here - you know far
>> more about them than I do.
>
> Most HBA and SAN RAID firmware I've dealt with kicks drives offline
> pretty quickly at any sign of an unrecoverable error.  I've also seen
> drives kicked simply because the RAID firmware didn't like the drive
> firmware.  I have a fond (sarcasm) memory of DAC960s kicking ST118202
> 18GB Cheetahs offline left and right in the late 90s.  The fact I still
> recall that Seagate drive# after 10+ years should be informative
> regarding the severity of that issue.  :(
>
>> The idea of the md raid "bad block list" is that there is a medium
>> ground - you can have disks that are "mostly good".
>
> Everything I've read and seen in the last few years regarding hard disk
> technology says that platter manufacturing quality and tolerance are so
> high on modern drives that media defects are rarely, if ever, seen by
> the customer, as they're mapped out at the factory.  The platters don't
> suffer wear effects, but the rest of the moving parts do.  From what
> I've read/seen, "media" errors observed in the wild today are actually
> caused by mechanical failures due to physical wear on various moving
> parts:  VC actuator pivot bearing/race, spindle bearings, etc.
> Mechanical failures tend to show mild "media errors" in the beginning
> and get worse with time as moving parts go further out of alignment.
> Thus, as I see it, any UREs on a modern drive represent a "Don't trust
> me--Replace me NOW" flag.  I could be all wrong here, but this is what
> I've read, and seen in manufacturer videos from WD and Seagate.
>

That's very useful information to know - I don't go through nearly 
enough disks myself to be able to judge these things (and while I read 
lots of stuff on the web, I don't see /everything/ !).  Thanks.

However, this still sounds to me like a drive with UREs is dying but not 
dead yet.  Assuming you are correct here (and I've no reason to doubt 
that - unless someone else disagrees), it means that a disk with UREs 
will be dying quickly rather than dying slowly.  But if the non-URE data 
on the disk can be used to make a rebuild faster and safer, then surely 
that is worth doing?

It may be that when a disk has had an URE and therefore an entry in the 
bad block list, then it should be marked read-only and only used for 
data recovery and "hot replace" rebuilds.  But until it completely 
croaks, it is still better than no disk at all while the rebuild is in 
progress.


>> Supposing you have a RAID6 array, and one disk has died completely.  It
>> gets replaced by a hot spare, and rebuild begins.  As the rebuild
>> progresses, disk 1 gets an URE.  Traditional handling would mean disk 1
>> is ejected, and now you have a double-degraded RAID6 to rebuilt.  When
>> you later get an URE on disk 2, you have lost data for that stripe - and
>> the whole raid is gone.
>>
>> But with bad block lists, the URE on disk 1 leads to a bad block entry
>> on disk 1, and the rebuild continues.  When you later get an URE on disk
>> 2, it's no problem - you use data from disk 1 and the other disks. URE's
>> are no longer a killer unless your set has no redundancy.
>
> They're not a killer with RAID 6 anyway, are they?.  You can be
> rebuilding one failed drive and suffer UREs left and right, as long as
> you don't get two of them on two drives simultaneously in the same
> stripe block read.  I think that's right.  Please correct me if not.
>

That's true as long as UREs do not cause that disk to be kicked out of 
the array.  With bad block support in md raid, a disk suffering an URE 
will /not/ be kicked out.  But my understanding (from what you wrote 
above) was that with hardware raid controllers, an URE /would/ cause a 
disk to be kicked out.  Or am I mixing something up again?

>> URE's are also what I worry about with RAID1 (including RAID10)
>> rebuilds.  If a disk has failed, you are right in saying that the
>> chances of the second disk in the pair failing completely are tiny.  But
>> the chances of getting an URE on the second disk during the rebuild are
>> not negligible - they are small, but growing with each new jump in disk
>> size.
>
> I touched on this in my other reply, somewhat tongue-in-cheek mentioning
> 3 leg and 4 leg RAID10.  At current capacities and URE ratings I'm not
> worried about it with mirror pairs.  If URE ratings haven't increased
> substantially by the time our avg drive capacity hits 10GB I'll start to
> worry.
>
> Somewhat related to this, does any else here build their arrays from the
> smallest cap drives they can get away with, preferably single platter
> models when possible?  I adopted this strategy quite some time ago,
> mostly to keep rebuild times to a minimum, keep rotational mass low to
> consume the least energy since using more drives, but also with the URE
> issue in the back of my mind.  Anecdotal evidence tends to point to the
> trend of OPs going with fewer gargantuan drives instead of many smaller
> ones.  Maybe that's just members of this list, whose criteria may be
> quite different from the typical enterprise data center.
>
>> With md raid's future bad block lists and hot replace features, then an
>> URE on the second disk during rebuilds is only a problem if the first
>> disk has died completely - if it only had a small problem, then the "hot
>> replace" rebuild will be able to use both disks to find the data.
>
> What happens when you have multiple drives at the same or similar bad
> block count?
>

You replace them all.  Once a drive reaches a certain number of bad 
blocks (and that threshold may be just 1, or it may be more), you should 
replace it.  There isn't any reason not to do hot replace rebuilds on 
multiple drives simultaneously, if you've got the drives and drive bays 
on hand - apart from at the bad blocks, the replacement is just a 
straight disk to disk copy.

>> I know you are more interested in hardware raid than software raid, but
>> I'm sure you'll find some interesting points in Neil's writings.  If you
>> don't want to read through the thread, at least read his blog post.
>>
>> <http://neil.brown.name/blog/20110216044002>
>
> Will catch up.  Thanks for the blog link.
>



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?  GPFS w/ 10GB/s throughput to the rescue
  2011-02-24 21:20         ` Joe Landman
@ 2011-02-26 23:54           ` Stan Hoeppner
  2011-02-27  0:56             ` Joe Landman
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-26 23:54 UTC (permalink / raw)
  To: Joe Landman; +Cc: Matt Garman, Doug Dumitru, Mdadm

Joe Landman put forth on 2/24/2011 3:20 PM:

> All this said, its better to express your IO bandwidth needs in MB/s,
> preferably in terms of sustained bandwidth needs, as this is language
> that you'd be talking to vendors in.  

Heartily agree.

> that gets you 50x 117 MB/s or about 5.9 GB/s sustained bandwidth for
> your IO.  10 machines running at a sustainable 600 MB/s delivered over
> the network, and a parallel file system atop this, solves this problem.

That's 1 file server for each 5 compute nodes Joe.  That is excessive.
Your business is selling these storage servers, so I can understand this
recommendation.  What cost is Matt looking at for these 10 storage
servers?  $8-15k apiece?  $80-150K total, not including installation,
maintenance, service contract, or administration training?  And these
require a cluster file system.  I'm guessing that's in the territory of
quotes he's already received from NetApp et al.

In that case it makes more sense to simply use direct attached storage
in each compute node at marginal additional cost, and a truly scalable
parallel filesystem across the compute nodes, IBM's GPFS.  This will
give better aggregate performance at substantially lower cost, and
likely with much easier filesystem administration.

Matt, if a parallel cluster file system is in your cards, and it very
well may be, the very best way to achieve your storage bandwidth goal
would be leveraging direct attached disks in each compute node, your
existing GbE network, and using IBM GPFS as your parallel cluster
filesystem.  I'd recommend using IBM 1U servers with 4 disk bays of
146GB 10k SAS drives in hardware RAID 10 (it's built in--free).  With 50
compute nodes, this will give you over 10GB/s aggregate disk bandwidth,
over 200MB/s per node.  Using these 146GB 2.5" drives you'd have ~14TB
of GPFS storage and can push/pull over 5GB/s of GPFS throughput over
TCP/IP.  Throughput will be likely be limited by the network, not the disks.

Each 1U server has dual GbE ports, allowing each node's application to
read 100MB/s from the GPFS while the node is simultaneously serving
100MB/s to all the other nodes, with full network redundancy in the
event a single NIC or switch should fail in one of your redundant
ethernet segments.  Or, you could bond the NICs, without fail over, for
over 200MB/s full duplex, giving you aggregate GPFS throughput of
between 6-10GB/s depending on actual workload access patterns.

Your only additional cost here over the base compute node is 4 drives at
~$1000, the GPFS licensing, and consulting fees to IBM Global Services
for setup and training, and maybe another GbE switch or two.  This
system is completely scalable.  Each time you add a compute node you add
another 100-200MB/s+ of GPFS bandwidth to the cluster, at minimal cost.
 I have no idea what IBM GPFS licensing costs are.  My wild ass guess
would be a couple hundred dollars per node, which is pretty reasonable
considering the capability it gives you, and the cost savings over other
solutions.

You should make an appointment with IBM Global Services to visit your
site, go over your needs and budget, and make a recommendation or two.
Request they send a GPFS educated engineer along on the call.  Express
that you're looking at the architecture I've described.  They may have a
better solution given your workload and cost criteria.  The key thing is
that you need to get as much information as possible at this point so
have the best options going forward.

Here's an appropriate IBM compute cluster node:
http://www-304.ibm.com/shop/americas/webapp/wcs/stores/servlet/default/ProductDisplay?productId=4611686018425930325&storeId=1&langId=-1&categoryId=4611686018425272306&dualCurrId=73&catalogId=-840

1U rack chassis
Xeon X3430 - 2.4 GHz, 4 core, 8MB cache
8GB DDR3
dual 10/100/1000 Ethernet
4 x 146GB 10k rpm SAS hot swap, RAID10

IBM web price per single unit:  ~$3,100
If buying volume in one PO:     ~$2,500 or less through a wholesaler

Hope this information is helpful Matt.

-- 
Stan


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?  GPFS w/ 10GB/s throughput to the rescue
  2011-02-26 23:54           ` high throughput storage server? GPFS w/ 10GB/s throughput to the rescue Stan Hoeppner
@ 2011-02-27  0:56             ` Joe Landman
  2011-02-27 14:55               ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Joe Landman @ 2011-02-27  0:56 UTC (permalink / raw)
  To: Mdadm

On 02/26/2011 06:54 PM, Stan Hoeppner wrote:
> Joe Landman put forth on 2/24/2011 3:20 PM:

[...]

>> that gets you 50x 117 MB/s or about 5.9 GB/s sustained bandwidth for
>> your IO.  10 machines running at a sustainable 600 MB/s delivered over
>> the network, and a parallel file system atop this, solves this problem.
>
> That's 1 file server for each 5 compute nodes Joe.  That is excessive.

No Stan, it isn't.  As I said, this is our market, we know it pretty 
well.  Matt stated his needs pretty clearly.

He needs 5.9GB/s sustained bandwidth.  Local drives (as you suggested 
later on) will deliver 75-100 MB/s of bandwidth, and he'd need 2 for 
RAID1, as well as a RAID0 (e.g. RAID10) for local bandwidth (150+ MB/s). 
  4 drives per unit, 50 units.  200 drives.

Any admin want to admin 200+ drives in 50 chassis?  Admin 50 different 
file systems?

Oh, and what is the impact if some of those nodes went away?  Would they 
take down the file system?  In the cloud of microdisk model Stan 
suggested, yes they would.  Which is why you might not want to give that 
advice serious consideration.  Unless you built in replication.  Now we 
are at 400 disks in 50 chassis.

Again, this design keeps getting worse.

> Your business is selling these storage servers, so I can understand this
> recommendation.  What cost is Matt looking at for these 10 storage

Now this is sad, very sad.

Stan started out selling the Nexsan version of things (and why was he 
doing it on the MD RAID list I wonder?), which would have run into the 
same costs Stan noted later.  Now Stan is selling (actually mis-selling) 
GPFS (again, on an MD RAID list, seemingly having picked it off of a 
website), without having a clue as to the pricing, implementation, 
issues, etc.

> servers?  $8-15k apiece?  $80-150K total, not including installation,
> maintenance, service contract, or administration training?  And these
> require a cluster file system.  I'm guessing that's in the territory of
> quotes he's already received from NetApp et al.

I did suggest using GlusterFS as it will help with a number of aspects, 
has an open source version.  I did also suggest (since he seems to wish 
to build it himself) that he pursue a reasonable design to start with, 
and avoid the filer based designs Stan suggested (two Nexsan's and some 
sort of filer head to handle them), or a SAN switch of some sort. 
Neither design works well in his scenario, or for that matter, in the 
vast majority of HPC situations.

I did make a full disclosure of my interests up front, and people are 
free to take my words with a grain of salt.  Insinuating based upon my 
disclosure?  Sad.


> In that case it makes more sense to simply use direct attached storage
> in each compute node at marginal additional cost, and a truly scalable
> parallel filesystem across the compute nodes, IBM's GPFS.  This will
> give better aggregate performance at substantially lower cost, and
> likely with much easier filesystem administration.

See GlusterFS.  Open source at zero cost.  However, and this is a large 
however, this design, using local storage for a pooled "cloud" of disks, 
has some often problematic issues (resiliency, performance, hotspots). 
A truly hobby design would use this.  Local disk is fine for scratch 
space, and for a few other things.  Managing disks spread out among 50 
nodes?  Yeah, it's harder.

I'm gonna go out on a limb here and suggest Matt speak with HPC cluster 
and storage people.  He can implement things ranging from effectively 
zero cost through things which can be quite expensive.  If you are 
talking to Netapp about HPC storage, well, probably move onto a real HPC 
storage shop.  His problem is squarely in the HPC arena.

However, I would strongly advise against designs such as a single 
centralized unit, or a cloud of micro disks.  The first design is 
decidedly non-scalable, which is in part why the HPC community abandoned 
it years ago.  The second design is very hard to manage and guarantee 
any sort of resiliency.  You get all the benefits of a RAID0 in what 
Stan proposed.

Start out talking with and working with experts, and it's pretty likely 
you'll come out with a good solution.   The inverse is also true.

MD RAID, which Stan dismissed as a "hobby RAID" at first can work well 
for Matt.  GlusterFS can help with the parallel file system atop this. 
Starting with a realistic design, an MD RAID based system (self built or 
otherwise) could easily provide everything Matt needs, at the data rates 
he needs it, using entirely open source technologies.  And good designs.

You really won't get good performance out of a bad design.  The folks 
doing HPC work who've responded have largely helped frame good design 
patterns.  The folks who aren't sure what HPC really is, haven't.

Regards,

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?  GPFS w/ 10GB/s throughput to the rescue
  2011-02-27  0:56             ` Joe Landman
@ 2011-02-27 14:55               ` Stan Hoeppner
  2011-03-12 22:49                 ` Matt Garman
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-27 14:55 UTC (permalink / raw)
  To: Joe Landman; +Cc: Mdadm

Joe Landman put forth on 2/26/2011 6:56 PM:

> Local drives (as you suggested
> later on) will deliver 75-100 MB/s of bandwidth, and he'd need 2 for
> RAID1, as well as a RAID0 (e.g. RAID10) for local bandwidth (150+ MB/s).
>  4 drives per unit, 50 units.  200 drives.

Yes, this is pretty much exactly what I mentioned.  ~5GB/s aggregate.
But we've still not received an accurate detailed description from Matt
regarding his actual performance needs.  He's not posted iostat numbers
from his current filer, or any similar metrics.

> Any admin want to admin 200+ drives in 50 chassis?  Admin 50 different
> file systems?

GPFS has single point administration for all storage in all nodes.

> Oh, and what is the impact if some of those nodes went away?  Would they
> take down the file system?  In the cloud of microdisk model Stan
> suggested, yes they would.  

No, they would not.  GPFS has multiple redundancy mechanisms and can
sustain multiple node failures.  I think you should read the GPFS
introductory documentation:

http://www.ibm.com/common/ssi/fcgi-bin/ssialias?infotype=SA&subtype=WH&appname=STGE_XB_XB_USEN&htmlfid=XBW03010USEN&attachment=XBW03010USEN.PDF

> Which is why you might not want to give that
> advice serious consideration.  Unless you built in replication.  Now we
> are at 400 disks in 50 chassis.

Your numbers are wrong, by a factor of 2.  He should research GPFS and
give it serious consideration.  It may be exactly what he needs.

> Again, this design keeps getting worse.

Actually it's getting better, which you'll see after reading the docs.

> Now this is sad, very sad.
> 
> Stan started out selling the Nexsan version of things (and why was he

For the record, I'm not selling anything.  I don't have a $$ horse in
this race.  I'm simply trying to show Matt some good options.  I don't
work for any company selling anything.  I'm just an SA, giving free
advice to another SA with regard to his request for information.  I just
happen to know a lot more about high performance storage than the
average SA.  I recommend Nexsan products because I've used them, they
work very well, and are very competitive WRT price/performance/capacity.

> doing it on the MD RAID list I wonder?), 

The OP asked for possible solutions to solve for his need.  This need
may not necessarily be best met by mdraid, regardless of the fact he
asked on the Linux RAID list.  LED identification of a failed drive is
enough reason for me to not recommend mdraid in this solution, given the
fact he'll only have 4 disks per chassis w/an inbuilt hardware RAID
chip.  I'm guessing fault LED is one of the reasons why you use a
combination of PCIe RAID cards and mdraid in your JackRabbit and Delta-V
systems instead of strictly mdraid.  I'm not knocking it.  That's the
only way to do it properly on such systems.  Likewise, please don't
knock me for recommending the obvious better solution in this case.
mdraid would have no material positive impact, but would introduce
maintenance problems.

> which would have run into the
> same costs Stan noted later.  Now Stan is selling (actually mis-selling)
> GPFS (again, on an MD RAID list, seemingly having picked it off of a
> website), without having a clue as to the pricing, implementation,
> issues, etc.

I first learned of GPFS in 2001 when it was deployed on the 256 node IBM
Netfinity dual P3 933 Myrinet cluster at Maui High Performance Computing
Center.  GPFS was deployed in this cluster using what is currently
called the Network Shared Disk protocol, spanning the 512 local disks.
GPFS has grown and matured significantly in the 10 years since.  Today
it is most commonly deployed with a dedicated file server node farm
architecture, but it still works just as well using NSD.  In the
configuration I suggested, each node will be an NSD client and NSD
server.  GPFS is renowned for its reliability and performance in the
world of HPC cluster computing due to its excellent 10+ year track
record in the field.  It is years ahead of any other cluster filesystem
in capability, performance, manageability, and reliability.

> I did suggest using GlusterFS as it will help with a number of aspects,
> has an open source version.  I did also suggest (since he seems to wish
> to build it himself) that he pursue a reasonable design to start with,

I don't believe his desire is to actually DIY the compute and/or storage
nodes.  If it is, for a production system of this size/caliber, *I*
wouldn't DIY in this case, and I'm the king of DIY hardware.  Actually,
I'm TheHardwareFreak.  ;)  I guess you've missed the RHS of my email
addy. :)  I was given that nickname, flattering or not, about 15 years
ago.  Obviously it stuck.  It's been my vanity domain for quite a few years.

> and avoid the filer based designs Stan suggested (two Nexsan's and some
> sort of filer head to handle them), or a SAN switch of some sort.

There's nothing wrong with a single filer, just because it's a single
filer.  I'm sure you've sold some singles.  They can be very performant.
 I could build a single DIY 10 GbE filer today from white box parts
using JBOD enclosures that could push highly parallel NFS client reads
at ~4GB/s all day long, about double the performance of your JackRabbit
5U.  It would take me some time to tune PCIe interrupt routing, TCP, NFS
server threading, etc, but it can be done.  Basic parts list would be
something like:

1 x SuperMicro H8DG6 w/dual 8 core 2GHz Optys, 8x4GB DDR3 ECC RDIMMs
3 x LSI MegaRAID SAS 9280-4i4e PCIe x8 512MB cache
1 x NIAGARA 32714L Quad Port Fiber 10 Gigabit Ethernet NIC
1 x SUPERMICRO CSE-825TQ-R700LPB Black 2U Rackmount 700W redundant PSU
3 x NORCO DS-24E External 4U 24 Bay 6G SAS w/LSI 4x6 SAS expander
74 x Seagate ST3300657SS 15K 300GB 6Gb/s SAS, 2 boot, 72 in JBOD chassis
Configure 24 drive HW RAID6 on each LSI HBA, mdraid linear over them
Format the mdraid device with mkfs.xfs with "-d agcount=66"
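That last step would look something like this (device names are
placeholders for however the three hardware RAID6 LUNs show up):

    # concatenate the three hardware RAID6 LUNs into one linear md device
    mdadm --create /dev/md0 --level=linear --raid-devices=3 /dev/sdx /dev/sdy /dev/sdz

    # 66 allocation groups, i.e. 22 per LUN, matching the 22 data spindles in each RAID6
    mkfs.xfs -d agcount=66 /dev/md0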

With this setup the disks will saturate the 12 SAS host channels at
7.2GB/s aggregate with concurrent parallel streaming reads, as each
24-drive RAID6 (22 data spindles) will be able to push over 3GB/s with
15k drives.  This
excess of disk bandwidth, and high random IOPS of the 15k drives,
ensures that highly random read loads from many concurrent NFS clients
will still hit in the 4GB/s range, again, after the system has been
properly tuned.

> Neither design works well in his scenario, or for that matter, in the
> vast majority of HPC situations.

Why don't you ask Matt, as I have, for an actual, accurate description
of his workload.  What we've been given isn't an accurate description.
If it was, his current production systems would be so overwhelmed he'd
already be writing checks for new gear.  I've seen no iostat or other
metrics, which are standard fare when asking for this kind of advice.

> I did make a full disclosure of my interests up front, and people are
> free to take my words with a grain of salt.  Insinuating based upon my
> disclosure?  Sad.

It just seems to me you're too willing to oversell him.  He apparently
doesn't have that kind of budget anyway.  If we, you, me, anyone, really
wants to give Matt good advice, regardless of how much you might profit,
or mere satisfaction I may gain because one of my suggestions was
implemented, why don't we both agree to get as much information as
possible from Matt before making any more recommendations?

I think we've both forgotten once or twice in this thread that it's not
about us, but about Matt's requirement.

> See GlusterFS.  Open source at zero cost.  However, and this is a large
> however, this design, using local storage for a pooled "cloud" of disks,
> has some often problematic issues (resiliency, performance, hotspots). A
> truly hobby design would use this.  Local disk is fine for scratch
> space, for a few other things.  Managing the disk spread out among 50
> nodes?  Yeah, its harder.

Gluster isn't designed as a high performance parallel filesystem.  It
was never meant to be such.  There are guys on the dovecot list who have
tried it as a maildir store and it just falls over.  It simply cannot
handle random IO workloads, period.  And yes, it is difficult to design
a high performance parallel network based filesystem.  Much so.  IBM has
a massive lead on the other cluster filesystems as IBM started work back
in the mid/late 90s for their Power clusters.

> I'm gonna go out on a limb here and suggest Matt speak with HPC cluster
> and storage people.  He can implement things ranging from effectively
> zero cost through things which can be quite expensive.  If you are
> talking to Netapp about HPC storage, well, probably move onto a real HPC
> storage shop.  His problem is squarely in the HPC arena.

I'm still not convinced of that.  Simply stating "I have 50 compute
nodes each w/one GbE port, so I need 6GB/s of bandwidth" isn't actual
application workload data.  From what Matt did describe of how the
application behaves, simply time shifting the data access will likely
solve all of his problems, cheaply.  He might even be able to get by
with his current filer.  We simply need more information.  I do anyway.
 I'd hope you would as well.

> However, I would strongly advise against designs such as a single
> centralized unit, or a cloud of micro disks.  The first design is
> decidedly non-scalable, which is in part why the HPC community abandoned
> it years ago.  The second design is very hard to manage and guarantee
> any sort of resiliency.  You get all the benefits of a RAID0 in what
> Stan proposed.

A single system filer is scalable up to the point you run out of PCIe
slots.  The system I mentioned using the Nexsan array can scale 3x
before running out of slots.

I think some folks at IBM would tend to vehemently disagree with your
assertions here about GPFS. :)  It's the only filesystem used on IBM's
pSeries clusters and supercomputers.  I'd wager that IBM has shipped
more GPFS nodes into the HPC marketplace than Joe's company has shipped
nodes, total, ever, into any market, or ever will, by a factor of at
least 100.

This isn't really a fair comparison, as IBM has shipped single GPFS
supercomputers with more nodes than Joe's company will sell in its
entire lifespan.  Case in point:  ASCI Purple has 1640 GPFS client
nodes, and 134 GPFS server nodes.  This machine ships GPFS traffic over
the IBM HPS network at 4GB/s per node link, each node having two links
for 8GB/s per client node--a tad faster than GbE. ;).

For this environment, and most HPC "centers", using a few fat GPFS
storage servers with hundreds of terabytes of direct attached fiber
channel storage makes more sense than deploying every compute node as a
GPFS client *and* server using local disk.  In Matt's case it makes more
sense to do the latter, called NSD.

For the curious, here are the details of the $140 million ASCI Purple
system including the GPFS setup:
https://computing.llnl.gov/tutorials/purple/

> Start out talking with and working with experts, and its pretty likely
> you'll come out with a good solution.   The inverse is also true.

If by experts you mean those working in the HPC field, not vendors,
that's a great idea.  Matt, fire off a short polite email to Jack
Dongarra and one to Bill Camp.  Dr. Dongarra is the primary author of
the Linpack benchmark, which is used to rate the 500 fastest
supercomputers in the world twice yearly, among other things.  His name
is probably the most well known in the field of supercomputing.

Bill Camp designed the Red Storm supercomputer, which is now the
architectural basis for Cray's large MPP supercomputers.  He works for
Sandia National Laboratory, which is one of the 4 US nuclear weapons
laboratories.

If neither of these two men has an answer for you, nor can point you to
folks who do, the answer simply doesn't exist.  Out of consideration I'm
not going to post their email addresses.  You can find them at the
following locations.  While you're at it, read the Red Storm document.
It's very interesting.

http://www.netlib.org/utk/people/JackDongarra/

http://www.google.com/url?sa=t&source=web&cd=3&ved=0CCEQFjAC&url=http%3A%2F%2Fwww.lanl.gov%2Forgs%2Fhpc%2Fsalishan%2Fsalishan2003%2Fcamp.pdf&rct=j&q=bill%20camp%20asci%20red&ei=VxRqTdTuEYOClAf4xKH_AQ&usg=AFQjCNFl420n6HAwBkDs5AFBU2TKpsiHvA&cad=rja

I've not corresponded with Professor Dongarra for many years, but back
then he always answered my emails rather promptly, within a day or two.
 The key is to keep it short and sweet, as the man is pretty busy I'd
guess.  I've never corresponded with Dr. Camp, but I'm sure he'd respond
to you, one way or another.  My experience is that technical people
enjoy talking tech shop, at least to a degree.

> MD RAID, which Stan dismissed as a "hobby RAID" at first can work well

That's a mis-characterization of the statement I made.

> for Matt.  GlusterFS can help with the parallel file system atop this.
> Starting with a realistic design, an MD RAID based system (self built or
> otherwise) could easily provide everything Matt needs, at the data rates
> he needs it, using entirely open source technologies.  And good designs.

I don't recall Matt saying he needed a solution based entirely on FOSS.
 If he did I missed it.  If he can accomplish his goals with all FOSS
that's always a plus in my book.  However, I'm not averse to closed
source when it's a better fit for a requirement.

> You really won't get good performance out of a bad design.  The folks

That's brilliant insight. ;)

> doing HPC work who've responded have largely helped frame good design
> patterns.  The folks who aren't sure what HPC really is, haven't.

The folks who use the term HPC as a catch all, speaking as if there is
one workload pattern, or only one file access pattern which comprises
HPC, as Joe continues to do, and who attempt to tell others they don't
know what they're talking about, when they most certainly do, should be
viewed with some skepticism.

Just as in the business sector, there are many widely varied workloads
in the HPC space.  At opposite ends of the disk access spectrum,
analysis applications tend to read a lot and write very little.
Simulation applications, on the other hand, tend to read very little,
and generate a tremendous amount of output.  For each of these, some
benefit greatly from highly parallel communication and disk throughput,
some don't.  Some benefit from extreme parallelism, and benefit from
using message passing and Lustre file access over infiniband, some with
lots of serialization don't.  Some may benefit from openmp parallelism
but only mild amounts of disk parallelism.  In summary, there are many
shades of HPC.

For maximum performance and ROI, just as in the business or any other
computing world, one needs to optimize his compute and storage system to
meet his particular workload.  There isn't one size that fits all.
Thus, contrary to what Joe may have anyone here believe, NFS filers are
a perfect fit for some HPC workloads.  For Joe to say that any workload
that works fine with an NFS filer isn't an HPC workload is simply
rubbish.  One need look no further than a little ways back in this
thread to see this.  In one hand, Joe says Matt's workload is absolutely
an HPC workload.  Matt currently uses an NFS filer for this workload.
Thus, Joe would say this isn't an HPC workload because it's working fine
with an NFS filer.  Just a bit of self contradiction there.

Instead of arguing what is and is not HPC, and arguing that Matt's
workload is "an HPC workload", I think, again, that nailing down his
exact data access profile and making a recommendation on that, is what
he needs.  I'm betting he could care less if his workload is "an HPC
workload" or not.  I'm starting to tire of this thread.  Matt has plenty
of conflicting information to sort out.  I'll be glad to answer any
questions he may have of me.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-15  4:44   ` Matt Garman
                       ` (2 preceding siblings ...)
  2011-02-15 15:16     ` Joe Landman
@ 2011-02-27 21:30     ` Ed W
  2011-02-28 15:46       ` Joe Landman
                         ` (2 more replies)
  3 siblings, 3 replies; 116+ messages in thread
From: Ed W @ 2011-02-27 21:30 UTC (permalink / raw)
  To: Matt Garman, Mdadm

Your application appears to be an implementation of a queue processing 
system, i.e. each machine pulls a file down, processes it, gets the next 
file, and so on?

Can you share some information on
- the size of files you pull down (I saw something in another post)
- how long each machine takes to process each file
- whether there is any dependency between the processing machines? eg 
can each machine operate completely independently of the others and 
start its job when it wishes (or does it need to sync?)

Given the tentative assumption that
- processing each file takes many multiples of the time needed to 
download the file, and
- files are processed independently

It would appear that you can use a much lower-powered system to 
basically push jobs out to the processing machines in advance; this way 
your bandwidth basically only needs to be:
     size_of_job * num_machines / time_to_process_jobs

So if the time to process a job is significant, then you have quite some 
time to push the next job out to local storage before it's needed?

Firstly is this architecture workable?  If so then you have some new 
performance parameters to target for the storage architecture?

Good luck

Ed W

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-27 21:30     ` high throughput storage server? Ed W
@ 2011-02-28 15:46       ` Joe Landman
  2011-02-28 23:14         ` Stan Hoeppner
  2011-02-28 22:22       ` Stan Hoeppner
  2011-03-02  3:44       ` Matt Garman
  2 siblings, 1 reply; 116+ messages in thread
From: Joe Landman @ 2011-02-28 15:46 UTC (permalink / raw)
  To: Ed W; +Cc: Matt Garman, Mdadm

On 02/27/2011 04:30 PM, Ed W wrote:

[...]

> It would appear that you can use a much lower powered system to
> basically push jobs out to the processing machines in advance, this way
> your bandwidth basically only needs to be:
> size_of_job * num_machines / time_to_process_jobs

This would be good.  Matt's original argument suggested he needed this 
as his sustained bandwidth given the way the analysis proceeded.

If we assume that the processing time is T_p, and the communication time 
is T_c, ignoring other factors, the total time for 1 job is T_j = T_p + 
T_c.  If T_c << T_p, then you can effectively ignore bandwidth related 
issues (and use a much smaller bandwidth system).  For T_c << T_p, let's 
(for laughs) say T_c = 0.1 x T_p (e.g. communication time is 1/10th the 
processing time).  Then even if you halved your bandwidth and doubled 
T_c, you would only add about 10% to the total execution 
time of a job.

With Nmachines each with Ncores, you have Nmachines x Ncores jobs going 
on all at once. If T_c << T_p (as in the above example), then most of 
the time, on average, the machines will not be communicating.  In fact, 
if we do a very rough first pass approximation to an answer (there are 
more accurate statistical models) for this, one would expect the network 
to be used T_c/T_p fraction of the time by each process.  Then the total 
consumption of data for a run (assuming all runs are *approximately* of 
equal duration)

	D = B x T_c

D being the amount of data in MB or GB, and B being the bandwidth 
expressed in MB/s or GB/s.  Your effective bandwidth per run, Beff will be

	D = Beff x T = Beff x (T_c + T_p)

For Nmachines x Ncores jobs, Dtotal is the total data transferred

	Dtotal = Nmachines x Ncores x D = Nmachines x Ncores x Beff x (T_c + T_p)


You know Dtotal (aggregate data needed for run).  You know Nmachines and 
Ncores.  You know T_c and T_p (approximately).  From this, solve for 
Beff.  That's what you have to sustain (approximately).
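Plugging in purely illustrative numbers (not Matt's actual figures) to show
the shape of the result:

    # example only: 50 machines x 10 jobs, 500 MB per job, 60 s to process, 5 s to pull the file
    awk 'BEGIN {
        nmach = 50; ncores = 10; d_mb = 500; t_p = 60; t_c = 5
        beff  = d_mb / (t_c + t_p)              # effective per-job bandwidth
        total = nmach * ncores * beff           # aggregate the storage has to sustain
        printf "Beff per job : %.1f MB/s\n", beff
        printf "aggregate    : %.2f GB/s sustained\n", total / 1000
    }'

The longer the processing time relative to the transfer time, the smaller
that aggregate gets.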

> So if the time to process jobs is significant then you have quite some
> time to push out the next job to local storage ready?
>
> Firstly is this architecture workable? If so then you have some new
> performance parameters to target for the storage architecture?
>
> Good luck
>
> Ed W

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-27 21:30     ` high throughput storage server? Ed W
  2011-02-28 15:46       ` Joe Landman
@ 2011-02-28 22:22       ` Stan Hoeppner
  2011-03-02  3:44       ` Matt Garman
  2 siblings, 0 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-28 22:22 UTC (permalink / raw)
  To: Ed W; +Cc: Matt Garman, Mdadm

Ed W put forth on 2/27/2011 3:30 PM:
> Your application appears to be an implementation of a queue processing
> system?  ie each machine: pulls a file down, processes it, gets the next
> file, etc?
> 
> Can you share some information on
> - the size of files you pull down (I saw something in another post)
> - how long each machine takes to process each file
> - whether there is any dependency between the processing machines? eg
> can each machine operate completely independently of the others and
> start it's job when it wishes (or does it need to sync?)
> 
> Given the tentative assumption that
> - processing each file takes many multiples of the time needed to
> download the file, and
> - files are processed independently
> 
> It would appear that you can use a much lower powered system to
> basically push jobs out to the processing machines in advance, this way
> your bandwidth basically only needs to be:
>     size_of_job * num_machines / time_to_process_jobs
> 
> So if the time to process jobs is significant then you have quite some
> time to push out the next job to local storage ready?
> 
> Firstly is this architecture workable?  If so then you have some new
> performance parameters to target for the storage architecture?
> 
> Good luck

Ed, you stated this thought much more thoroughly and eloquently than I
did in my last rambling post.  Thank you.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-28 15:46       ` Joe Landman
@ 2011-02-28 23:14         ` Stan Hoeppner
  0 siblings, 0 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-02-28 23:14 UTC (permalink / raw)
  To: Joe Landman; +Cc: Ed W, Matt Garman, Mdadm

Joe Landman put forth on 2/28/2011 9:46 AM:
> On 02/27/2011 04:30 PM, Ed W wrote:
> 
> [...]
> 
>> It would appear that you can use a much lower powered system to
>> basically push jobs out to the processing machines in advance, this way
>> your bandwidth basically only needs to be:
>> size_of_job * num_machines / time_to_process_jobs
> 
> This would be good.  Matt's original argument suggested he needed this
> as his sustained bandwidth given the way the analysis proceeded.

And Joe has provided a nice mathematical model for quantifying it.

> If we assume that the processing time is T_p, and the communication time
> is T_c, ignoring other factors, the total time for 1 job is T_j = T_p +
> T_c.  If T_c << T_p, then you can effectively ignore bandwidth related
> issues (and use a much smaller bandwidth system).  For T_c << T_p, lets
> (for laughs) say T_c = 0.1 x T_p (e.g. communication time is 1/10th the
> processing time).  Then even if you halved your bandwidth, and doubled
> T_c, you are making only an about 10% increase in your total execution
> time for a job.
> 
> With Nmachines each with Ncores, you have Nmachines x Ncores jobs going
> on all at once. If T_c << T_p (as in the above example), then most of
> the time, on average, the machines will not be communicating.  In fact,
> if we do a very rough first pass approximation to an answer (there are
> more accurate statistical models) for this, one would expect the network
> to be used T_c/T_p fraction of the time by each process.  Then the total
> consumption of data for a run (assuming all runs are *approximately* of
> equal duration)
> 
>     D = B x T_c
> 
> D being the amount of data in MB or GB, and B being the bandwidth
> expressed in MB/s or GB/s.  Your effective bandwidth per run, Beff will be
> 
>     D = Beff x T = Beff x (T_c + T_p)
> 
> For Nmachines x Ncores jobs, Dtotal is the total data transfered
> 
>     Dtotal    = Nmachines x Ncores * D = Nmachines x Ncores x Beff
>           x (T_c + T_p)
> 
> 
> You know Dtotal (aggregate data needed for run).  You know Nmachines and
> Ncores.  You know T_c and T_p (approximately).  From this, solve for
> Beff.  Thats what you have to sustain (approximately).

This assumes his application is threaded and scales linearly across
multiple cores.  If not, running Ncores processes on each node should
achieve a similar result to the threaded case, assuming the application
is written such that multiple process instances don't trip over each
other by, say, all using the same scratch file path/name, and so on.
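A trivial wrapper in that spirit (the analyze binary, its flags, and the
paths are all hypothetical) would be:

    # one worker per core, each with its own scratch directory and input file
    ncores=$(nproc)
    for i in $(seq 1 "$ncores"); do
        scratch="/scratch/worker.$i"                     # hypothetical per-instance scratch path
        mkdir -p "$scratch"
        ./analyze --scratch "$scratch" "input_$i.dat" &  # placeholder analysis command
    done
    wait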

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-02-27 21:30     ` high throughput storage server? Ed W
  2011-02-28 15:46       ` Joe Landman
  2011-02-28 22:22       ` Stan Hoeppner
@ 2011-03-02  3:44       ` Matt Garman
  2011-03-02  4:20         ` Joe Landman
  2 siblings, 1 reply; 116+ messages in thread
From: Matt Garman @ 2011-03-02  3:44 UTC (permalink / raw)
  To: Ed W; +Cc: Mdadm

On Sun, Feb 27, 2011 at 3:30 PM, Ed W <lists@wildgooses.com> wrote:
> Your application appears to be an implementation of a queue processing
> system?  ie each machine: pulls a file down, processes it, gets the next
> file, etc?

Sort of.  It's not so much "each machine" as it is "each job".  A
machine can have multiple jobs.

At this point I'm not exactly sure what the jobs' specifics are; that
is, not sure if a job reads a bunch of files at once, then processes;
or, reads one file, then processes (as you described).

> Can you share some information on
> - the size of files you pull down (I saw something in another post)

They vary; they can be anywhere from about 100 MB to a few TB.
Average is probably on the order of a few hundred MB.

> - how long each machine takes to process each file

I'm not sure how long a job takes to process a file; I'm trying to get
these answers from the people who design and run the jobs.

> - whether there is any dependency between the processing machines? eg can
> each machine operate completely independently of the others and start it's
> job when it wishes (or does it need to sync?)

I'm fairly sure the jobs are independent.

> Given the tentative assumption that
> - processing each file takes many multiples of the time needed to download
> the file, and
> - files are processed independently
>
> It would appear that you can use a much lower powered system to basically
> push jobs out to the processing machines in advance, this way your bandwidth
> basically only needs to be:
>    size_of_job * num_machines / time_to_process_jobs
>
> So if the time to process jobs is significant then you have quite some time
> to push out the next job to local storage ready?
>
> Firstly is this architecture workable?  If so then you have some new
> performance parameters to target for the storage architecture?

That might be workable, but it would require me (or someone) to
develop and deploy the job dispatching system.  Which is certainly
doable, but it might meet some "political" resistance.  My boss
basically said, "find a system to buy or spec out a system to build
that meets [the requirements I've mentioned in this and other
emails]."

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-02  3:44       ` Matt Garman
@ 2011-03-02  4:20         ` Joe Landman
  2011-03-02  7:10           ` Roberto Spadim
  0 siblings, 1 reply; 116+ messages in thread
From: Joe Landman @ 2011-03-02  4:20 UTC (permalink / raw)
  To: Matt Garman; +Cc: Ed W, Mdadm

On 03/01/2011 10:44 PM, Matt Garman wrote:

[...]


> That might be workable, but it would require me (or someone) to
> develop and deploy the job dispatching system.  Which is certainly

Happily, the "develop" part of this is already done.  Have a look at 
GridEngine, Torque, slurm, and a number of others (commercial versions 
include the excellent LSF from Platform, PBSpro by Altair, and others).
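
To give a flavour, a minimal job script under one of these (Slurm
syntax here, purely as a sketch; the paths and the "analyze" program
are placeholders, not anything from your setup) might look like:

   #!/bin/bash
   #SBATCH --job-name=analysis
   #SBATCH --ntasks=1
   # stage the input file to node-local scratch, then process it locally
   cp /nfs/data/"$1" /scratch/
   /opt/analysis/analyze /scratch/"$(basename "$1")"

You'd submit it with something like "sbatch job.sh somefile.dat" and
let the scheduler handle the queueing and dispatch.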

> doable, but it might meet some "political" resistance.  My boss
> basically said, "find a system to buy or spec out a system to build
> that meets [the requirements I've mentioned in this and other
> emails]."

This is wandering outside of the MD list focus.  You might want to speak 
with other folks on the Beowulf list, among others.

I should note that nothing you've brought up isn't a solvable problem. 
You simply have some additional data to gather on the apps, some costs 
to compare against the benefits they bring, and make decisions from 
there.  Build vs buy is one of the critical ones, but as Ed, myself and 
others have noted, you do need more detail to make sure you don't 
under(over) spec the design for the near/mid/far term.

Regards,

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-02  4:20         ` Joe Landman
@ 2011-03-02  7:10           ` Roberto Spadim
  2011-03-02 19:03             ` Drew
  0 siblings, 1 reply; 116+ messages in thread
From: Roberto Spadim @ 2011-03-02  7:10 UTC (permalink / raw)
  To: Joe Landman; +Cc: Matt Garman, Ed W, Mdadm

why not use supercapacitors to keep RAM safely powered, and use RAM as ramdisks?
a backup solution could help at unmount or backup time

2011/3/2 Joe Landman <joe.landman@gmail.com>:
> On 03/01/2011 10:44 PM, Matt Garman wrote:
>
> [...]
>
>
>> That might be workable, but it would require me (or someone) to
>> develop and deploy the job dispatching system.  Which is certainly
>
> Happily, the "develop" part of this is already done.  Have a look at
> GridEngine, Torque, slurm, and a number of others (commercial versions
> include the excellent LSF from Platform, PBSpro by Altair, and others).
>
>> doable, but it might meet some "political" resistance.  My boss
>> basically said, "find a system to buy or spec out a system to build
>> that meets [the requirements I've mentioned in this and other
>> emails]."
>
> This is wandering outside of the MD list focus.  You might want to speak
> with other folks on the Beowulf list, among others.
>
> I should note that nothing you've brought up isn't a solvable problem. You
> simply have some additional data to gather on the apps, some costs to
> compare against the benefits they bring, and make decisions from there.
>  Build vs buy is one of the critical ones, but as Ed, myself and others have
> noted, you do need more detail to make sure you don't under(over) spec the
> design for the near/mid/far term.
>
> Regards,
>
> Joe
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics, Inc.
> email: landman@scalableinformatics.com
> web  : http://scalableinformatics.com
>       http://scalableinformatics.com/sicluster
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-02  7:10           ` Roberto Spadim
@ 2011-03-02 19:03             ` Drew
  2011-03-02 19:20               ` Roberto Spadim
  0 siblings, 1 reply; 116+ messages in thread
From: Drew @ 2011-03-02 19:03 UTC (permalink / raw)
  To: Mdadm

> why not use supercapacitors to keep RAM safely powered, and use RAM as ramdisks?
> a backup solution could help at unmount or backup time

Huh?

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-02 19:03             ` Drew
@ 2011-03-02 19:20               ` Roberto Spadim
  2011-03-13 20:10                 ` Christoph Hellwig
  0 siblings, 1 reply; 116+ messages in thread
From: Roberto Spadim @ 2011-03-02 19:20 UTC (permalink / raw)
  To: Drew; +Cc: Mdadm

=) high throughput
no SSD, no hard disk for the main data, only RAM, plus a good UPS
system with supercapacitors (not for the CPU, just the RAM disks);
you could use 2.5 V 2500 F capacitors
DDR3 memory does on the order of 10 GB/s and more per channel; use a
6 Gbit SAS channel for each RAM disk

then, over time, take the RAM disk contents and save them to hard
disks (backup only, not online data; some filesystems have snapshots,
you could use those)
RAM is more expensive than SSD and hard disk, but it is faster, and
with a good UPS it's less volatile (it can ride out some hours without
mains power)
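
(for what it's worth, you can already try the idea on stock linux with
tmpfs, no special hardware assumed; it just does not survive power
loss, which is where the ups/supercapacitor and the periodic backup
come in)

   # carve 64 GB of RAM out as a filesystem (size is only an example)
   mkdir -p /mnt/ramdisk
   mount -t tmpfs -o size=64g tmpfs /mnt/ramdisk
   # periodically copy it down to real disks as the backup copy
   rsync -a /mnt/ramdisk/ /backup/ramdisk/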


2011/3/2 Drew <drew.kay@gmail.com>:
>> why not use supercapacitors to keep RAM safely powered, and use RAM as ramdisks?
>> a backup solution could help at unmount or backup time
>
> Huh?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server? GPFS w/ 10GB/s throughput to the rescue
  2011-02-27 14:55               ` Stan Hoeppner
@ 2011-03-12 22:49                 ` Matt Garman
  0 siblings, 0 replies; 116+ messages in thread
From: Matt Garman @ 2011-03-12 22:49 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Joe Landman, Mdadm

Sorry again for the delayed response... it takes me a while to read
through all these and process them.  :)  I do appreciate all the
feedback though!

On Sun, Feb 27, 2011 at 8:55 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> Yes, this is pretty much exactly what I mentioned.  ~5GB/s aggregate.
> But we've still not received an accurate detailed description from Matt
> regarding his actual performance needs.  He's not posted iostat numbers
> from his current filer, or any similar metrics.

Accurate metrics are hard to determine.  I did run iostat for 24 hours
on a few servers, but I don't think the results give an accurate
picture of what we really need.  Here are the details on what we have
now:

We currently have 10 servers, each with an NFS share.  Each server
mounts every other NFS share; mountpoints are consistently named on
every server (and a server's local storage is a symlink named like its
mountpoint on other machines).  One server has a huge directory of
symbolic links that acts as the "database" or "index" to all the files
spread across all 10 servers.

We spent some time a while ago creating a semi-smart distribution of
the files.  In short, we basically round-robin'ed files in such a way
as to parallelize bulk reads across many servers.
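
A rough sketch of what that distribution amounts to (illustrative
only; the paths, mountpoint names, and the simple modulo-10
round-robin are placeholders, not our exact scheme):

   # spread new files across the 10 NFS mounts and record them in the index
   i=0
   for f in /staging/*; do
       dest=/mnt/server$(( i % 10 ))/data
       cp "$f" "$dest"/
       ln -s "$dest/$(basename "$f")" /index/
       i=$(( i + 1 ))
   done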

The current system works, but is (as others have suggested) not
particularly scalable.  When we add new servers, I have to
re-distribute those files across the new servers.

That, and these storage servers are dual-purposed; they are also used
as analysis servers---basically batch computation jobs that use this
data.  The folks who run the analysis programs look at the machine
load to determine how many analysis jobs to run.  So when all machines
are running analysis jobs, the machine load is a combination of both
the CPU load from these analysis programs AND the I/O load from
serving files.  In other words, if these machines were strictly
compute servers, they would in general show a lower load, and thus
would run even more programs.

Having said all that, I picked a few of the 10 NFS/compute servers and
ran iostat for 24 hours, reporting stats every 1 minute (FYI, this is
actually what Dell asks you to do if you inquire about their storage
solutions).  The results from all machines were (as expected)
virtually the same.  They average constant, continuous reads at about
3--4 MB/s.  You might take that info and say, 4 MB/s times 10
machines, that's only 40 MB/s... that's nothing, not even the full
bandwidth of a single gigabit ethernet connection.  But there are
several problems (1) the number of analysis jobs is currently
artificially limited; (2) the file distribution is smart enough that
NFS load is balanced across all 10 machines; and (3) there are
currently about 15 machines doing analysis jobs (10 are dual-purposed
as I already mentioned), but this number is expected to grow to 40 or
50 within the year.
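
(For reference, the collection was along these lines: per-device
extended stats in MB, 60-second intervals, 1440 samples for the
24-hour window; the exact flags may have differed slightly.)

   iostat -dxm 60 1440 > iostat-$(hostname)-$(date +%F).log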

Given all that, I have simplified the requirements as follows: I want
"something" that is capable of keeping the gigabit connections of
those 50 analysis machines saturated at all times.  There have been
several suggestions along the lines of smart job scheduling and the
like.  However, the thing is, these analysis jobs are custom---they
are constantly being modified, new ones created, and old ones retired.
 Meaning, the access patterns are somewhat dynamic, and will certainly
change over time.  Our current "smart" file distribution is just based
on the general case of maybe 50% of the analysis programs' access
patterns.  But next week someone could come up with a new analysis
program that makes our current file distribution "stupid".  The point
is, current access patterns are somewhat meaningless, because they are
all but guaranteed to change.  So what do we do?  For business
reasons, any surplus manpower needs to be focused on these analysis
jobs; we don't have the resources to constantly adjust job scheduling
and file distribution.

So I think we are truly trying to solve the most general case here,
which is that all 50 gigabit-connected servers will be continuously
requesting data in an arbitrary fashion.

This is definitely a solvable problem; and there are multiple options;
I'm in the learning stage right now, so hopefully I can make a good
decision about which solution is best for our particular case.  I
solicited the list because I had the impression that there were at
least a few people who have built and/or administer systems like this.
 And clearly there are people with exactly this experience, given the
feedback I've received!  So I've learned a lot, which is exactly what
I wanted in the first place.

> http://www.ibm.com/common/ssi/fcgi-bin/ssialias?infotype=SA&subtype=WH&appname=STGE_XB_XB_USEN&htmlfid=XBW03010USEN&attachment=XBW03010USEN.PDF
>
> Your numbers are wrong, by a factor of 2.  He should research GPFS and
> give it serious consideration.  It may be exactly what he needs.

I'll definitely look over that.

> I don't believe his desire is to actually DIY the compute and/or storage
> nodes.  If it is, for a production system of this size/caliber, *I*
> wouldn't DIY in this case, and I'm the king of DIY hardware.  Actually,
> I'm TheHardwareFreak.  ;)  I guess you've missed the RHS of my email
> addy. :)  I was given that nickname, flattering or not, about 15 years
> ago.  Obviously it stuck.  It's been my vanity domain for quite a few years.

I'm now leaning towards a purchased solution, mainly due to the fact
that it seems like a DIY solution would cost a lot more in terms of my
time.  Expensive though they are, one of the nicer things about the
vendor solutions is that they seem to provide somewhat of a "set it
and forget it" experience.  Of course, a system like this needs
routine maintenance and such, but the vendors claim their
solutions simplify that.  But maybe that's just marketspeak!  :)
Although I think there's some truth to it---I've been a Linux/DIY
enthusiast/hobbyist for years now, and my experience is that the
DIY/FOSS stuff always takes more individual effort.  It's fun to do at
home, but can be costly from a business perspective...

> Why don't you ask Matt, as I have, for an actual, accurate description
> of his workload.  What we've been given isn't an accurate description.
> If it was, his current production systems would be so overwhelmed he'd
> already be writing checks for new gear.  I've seen no iostat or other
> metrics, which are standard fare when asking for this kind of advice.

Hopefully my description above sheds a little more light on what we
need.  Ignoring smarter job scheduling and such, I want to solve the
worst-case scenario, which is 50 servers all requesting enough data to
saturate their gigabit network connections.

> I'm still not convinced of that.  Simply stating "I have 50 compute
> nodes each w/one GbE port, so I need 6GB/s of bandwidth" isn't actual
> application workload data.  From what Matt did describe of how the
> application behaves, simply time shifting the data access will likely
> solve all of his problems, cheaply.  He might even be able to get by
> with his current filer.  We simply need more information.  I do anyway.
>  I'd hope you would as well.

Hopefully I described well enough why our current application workload
data metrics aren't sufficient.  We haven't time-shifted data access,
but have somewhat space-shifted it, given the round-robin "smart" file
distribution I described above.  But it's only "smart" for today's
usage---tomorrow's usage will almost certainly be different.  50 Gbps
(roughly 6 GB/s) is the requirement.

> I don't recall Matt saying he needed a solution based entirely on FOSS.
>  If he did I missed it.  If he can accomplish his goals with all FOSS
> that's always a plus in my book.  However, I'm not averse to closed
> source when it's a better fit for a requirement.

Nope, doesn't have to be entirely FOSS.

-Matt
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-02 19:20               ` Roberto Spadim
@ 2011-03-13 20:10                 ` Christoph Hellwig
  2011-03-14 12:27                   ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Christoph Hellwig @ 2011-03-13 20:10 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: Drew, Mdadm

Btw, XFS has been used for >10GB/s throughput systems for about the last
5 years.  The big issue is getting hardware that can reliably sustain
it - if you have that, using it with Linux and XFS is not a problem at
all.  Note that with a NUMA system you also have to think about your
interconnect bandwidth as a limiting factor for buffered I/O, not just
the storage subsystem.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-13 20:10                 ` Christoph Hellwig
@ 2011-03-14 12:27                   ` Stan Hoeppner
  2011-03-14 12:47                     ` Christoph Hellwig
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-14 12:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Roberto Spadim, Drew, Mdadm

Christoph Hellwig put forth on 3/13/2011 3:10 PM:
> Btw, XFS has been used for >10GB/s throughput systems for about the last
> 5 years.  The big issue is getting hardware that can reliably sustain
> it - if you have that, using it with Linux and XFS is not a problem at

I already noted this far back in the thread Christoph, but it is worth
repeating.  And it carries more weight when you, a Linux Kernel dev,
state this, than when I do.  So thanks for adding your input. :)

> all.  Note that with a NUMA system you also have to think about your
> interconnect bandwidth as a limiting factor for buffered I/O, not just
> the storage subsystem.

Is this only an issue with multi-chassis cabled NUMA systems such as
Altix 4000/UV and the (discontinued) IBM x86 NUMA systems (x440/445)
with their relatively low direct node-node bandwidth, or is this also of
concern with single chassis systems with relatively much higher
node-node bandwidth, such as the AMD Opteron systems, specifically the
newer G34, which have node-node bandwidth of 19.2GB/s bidirectional?

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-14 12:27                   ` Stan Hoeppner
@ 2011-03-14 12:47                     ` Christoph Hellwig
  2011-03-18 13:16                       ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Christoph Hellwig @ 2011-03-14 12:47 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, Roberto Spadim, Drew, Mdadm

On Mon, Mar 14, 2011 at 07:27:00AM -0500, Stan Hoeppner wrote:
> Is this only an issue with multi-chassis cabled NUMA systems such as
> Altix 4000/UV and the (discontinued) IBM x86 NUMA systems (x440/445)
> with their relatively low direct node-node bandwidth, or is this also of
> concern with single chassis systems with relatively much higher
> node-node bandwidth, such as the AMD Opteron systems, specifically the
> newer G34, which have node-node bandwidth of 19.2GB/s bidirectional?

Just do your math.  Buffered I/O will do two memory copies - a
copy_to_user into the pagecache and DMA from the pagecache to the device
(yes, that's also a copy as far as the memory subsystem is concerned,
even if it is access from the device).

So to get 10GB/s throughput you spend 20GB/s on memcpys for the actual
data alone.  Add to that other system activity and metadata.  Whether you
hit the interconnect or not depends on your memory configuration, I/O
attachment, and process locality.  If you have all memory that the
process uses and all I/O on one node you won't hit the interconnect at
all, but depending on memory placement and storage attachment you might
hit it twice:

 - userspace memory on node A to pagecache on node B to device on node
   C (or A again for that matter).

In short you need to review your configuration pretty carefully.  With
direct I/O it's a lot easier as you save a copy.
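
To make that concrete with the numbers already in this thread: at
10GB/s of buffered I/O you are generating ~20GB/s of memory traffic
for the data alone, and with an unlucky placement (user buffer, page
cache and device each on a different node) a good chunk of that lands
on the node-node links, which is the same order of magnitude as the
19.2GB/s bidirectional node-node figure quoted earlier for the G34
parts, before counting metadata, interrupts and network traffic.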

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-14 12:47                     ` Christoph Hellwig
@ 2011-03-18 13:16                       ` Stan Hoeppner
  2011-03-18 14:05                         ` Christoph Hellwig
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-18 13:16 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Roberto Spadim, Drew, Mdadm

Christoph Hellwig put forth on 3/14/2011 7:47 AM:
> On Mon, Mar 14, 2011 at 07:27:00AM -0500, Stan Hoeppner wrote:
>> Is this only an issue with multi-chassis cabled NUMA systems such as
>> Altix 4000/UV and the (discontinued) IBM x86 NUMA systems (x440/445)
>> with their relatively low direct node-node bandwidth, or is this also of
>> concern with single chassis systems with relatively much higher
>> node-node bandwidth, such as the AMD Opteron systems, specifically the
>> newer G34, which have node-node bandwidth of 19.2GB/s bidirectional?
> 
> Just do your math.  Buffered I/O will do two memory copies - a
> copy_to_user into the pagecache and DMA from the pagecache to the device
> (yes, that's also a copy as far as the memory subsystem is concerned,
> even if it is access from the device).

The context of this thread was high throughput NFS serving.  If we
wanted to do 10 GB/s kernel NFS serving, would we still only have two
memory copies, since the NFS server runs in kernel, not user, space?
I.e. in addition to the block device DMA read into the page cache, would
we also have a memcopy into application buffers from the page cache, or
does the kernel NFS server simply work with the data directly from the
page cache without an extra memory copy being needed?  If the latter,
adding in the DMA copy to the NIC would yield two total memory copies.
Is this correct?  Or would we have 3 memcopies?

> So to get 10GB/s throughput you spends 20GB/s on memcpys for the actual
> data alone.  Add to that other system activity and metadata.  Wether you
> hit the interconnect or not depends on your memory configuration, I/O
> attachment, and process locality.  If you have all memory that the
> process uses and all I/O on one node you won't hit the interconnect at
> all, but depending on memory placement and storage attachment you might
> hit it twice:
> 
>  - userspace memory on node A to pagecache on node B to device on node
>    C (or A again for that matter).

Not to mention hardware interrupt processing load, which, in addition to
eating some interconnect bandwidth, will also take a toll on CPU cycles
given the number of RAID HBAs and NIC required to read and push 10GB/s
NFS to clients.

Will achieving 10GB/s NFS likely require intricate manual process
placement, along with spreading interrupt processing across only node
cores which are directly connected to the IO bridge chips, preventing
interrupt packets from consuming interconnect bandwidth?

> In short you need to review your configuration pretty carefully.  With
> direct I/O it's a lot easier as you save a copy.

Is O_DIRECT necessary in this scenario, or does the kernel NFS server
negate the need for direct IO since the worker threads execute in kernel
space not user space?  If not, is it possible to force the kernel NFS
server to always do O_DIRECT reads and writes, or is that the
responsibility of the application on the NFS client?

I was under the impression that the memory manager in recent 2.6
kernels, similar to IRIX on Origin, is sufficiently NUMA aware in the
default configuration to automatically take care of memory placement,
keeping all of a given process/thread's memory on the local node, and in
cases where thread memory ends up on another node for some reason, block
copying that memory to the local node and invalidating the remote CPU
caches, or in certain cases, simply moving the thread execution pointer
to a core in the remote node where the memory resides.

WRT the page cache, if the kernel doesn't automatically place page cache
data associated with a given thread in that thread's local node memory,
is it possible to force this?  It's been a while since I read the
cpumemsets and other related documentation, and I don't recall if page
cache memory is manually locatable.  That doesn't ring a bell.
Obviously it would be a big win from an interconnect utilization and
overall performance standpoint if the thread's working memory and page
cache memory were both on the local node.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-18 13:16                       ` Stan Hoeppner
@ 2011-03-18 14:05                         ` Christoph Hellwig
  2011-03-18 15:43                           ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Christoph Hellwig @ 2011-03-18 14:05 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, Roberto Spadim, Drew, Mdadm

On Fri, Mar 18, 2011 at 08:16:26AM -0500, Stan Hoeppner wrote:
> The context of this thread was high throughput NFS serving.  If we
> wanted to do 10 GB/s kernel NFS serving, would we still only have two
> memory copies, since the NFS server runs in kernel, not user, space?
> I.e. in addition to the block device DMA read into the page cache, would
> we also have a memcopy into application buffers from the page cache, or
> does the kernel NFS server simply work with the data directly from the
> page cache without an extra memory copy being needed?  If the latter,
> adding in the DMA copy to the NIC would yield two total memory copies.
> Is this correct?  Or would we have 3 memcopies?

When reading from the NFS server you get away with two memory "copies":

 1) DMA from the storage controller into the page cache
 2) DMA from the page cache into the network card

but when writing to the NFS server you usually need three:

 1) DMA from the network card into the socket buffer
 2) copy from the socket buffer into the page cache
 3) DMA from the page cache to the storage controller

That's because we can't do proper zero copy receive.  It's possible in
theory with hardware that can align payload headers on page boundaries,
and while such hardware exists on the high end I don't think we support
it yet, nor do typical setups have the network card firmware smarts for
it.

> Not to mention hardware interrupt processing load, which, in addition to
> eating some interconnect bandwidth, will also take a toll on CPU cycles
> given the number of RAID HBAs and NIC required to read and push 10GB/s
> NFS to clients.

> Will achieving 10GB/s NFS likely require intricate manual process
> placement, along with spreading interrupt processing across only node
> cores which are directly connected to the IO bridge chips, preventing
> interrupt packets from consuming interconnect bandwidth?

Note that we do have a lot of infrastructure for high end NFS serving in
the kernel, e.g. the per-node NFSD thread that Greg Banks wrote for SGI
a couple of years ago.  All this was for big SGI NAS servers running
XFS.  But as you mentioned it's not quite trivial to set up.
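
(For anyone wanting to experiment, the basic knob is simply the nfsd
thread count, e.g. via rpc.nfsd or procfs; the numbers below are
arbitrary examples, and the per-node pooling mentioned above is the
part that takes more care.)

   # bump the number of kernel nfsd threads (128 is just an example)
   rpc.nfsd 128
   # equivalently, while nfsd is running:
   echo 128 > /proc/fs/nfsd/threads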

> > In short you need to review your configuration pretty carefully.  With
> > direct I/O it's a lot easier as you save a copy.
> 
> Is O_DIRECT necessary in this scenario, or does the kernel NFS server
> negate the need for direct IO since the worker threads execute in kernel
> space not user space?  If not, is it possible to force to kernel NFS
> server to always do O_DIRECT reads and writes, or is that the
> responsibility of the application on the NFS client?

The kernel NFS server doesn't use O_DIRECT - in fact the current
O_DIRECT code can't be used on kernel pages at all.  For some NFS
workloads it would certainly be interesting to make use of it, though.
E.g. large stable writes.

> I was under the impression that the memory manager in recent 2.6
> kernels, similar to IRIX on Origin, is sufficiently NUMA aware in the
> default configuration to automatically take care of memory placement,
> keeping all of a given process/thread's memory on the local node, and in
> cases where thread memory ends up on another node for some reason, block
> copying that memory to the local node and invalidating the remote CPU
> caches, or in certain cases, simply moving the thread execution pointer
> to a core in the remote node where the memory resides.
> 
> WRT the page cache, if the kernel doesn't automatically place page cache
> data associated with a given thread in that thread's local node memory,
> is it possible to force this?  It's been a while since I read the
> cpumemsets and other related documentation, and I don't recall if page
> cache memory is manually locatable.  That doesn't ring a bell.
> Obviously it would be a big win from an interconnect utilization and
> overall performance standpoint if the thread's working memory and page
> cache memory were both on the local node.

The kernel is pretty smart in placement of user and page cache data, but
it can't really second guess your intention.  With the numactl tool you
can help it do the proper placement for your workload.  Note that the
choice isn't always trivial - a numa system tends to have memory on
multiple nodes, so you'll either have to find a good partitioning of
your workload or live with off-node references.  I don't think
partitioning NFS workloads is trivial, but then again I'm not a
networking expert.
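
As a concrete example of that kind of pinning (the node number and the
daemon name are placeholders):

   # confine a process's CPUs and memory to node 0, ideally the node
   # the HBAs and NICs hang off
   numactl --cpunodebind=0 --membind=0 /usr/sbin/mydaemon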


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-18 14:05                         ` Christoph Hellwig
@ 2011-03-18 15:43                           ` Stan Hoeppner
  2011-03-18 16:21                             ` Roberto Spadim
  2011-03-18 22:01                             ` NeilBrown
  0 siblings, 2 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-18 15:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Roberto Spadim, Drew, Mdadm

Christoph Hellwig put forth on 3/18/2011 9:05 AM:

Thanks for the confirmations and explanations.

> The kernel is pretty smart in placement of user and page cache data, but
> it can't really second guess your intention.  With the numactl tool you
> can help it do the proper placement for your workload.  Note that the
> choice isn't always trivial - a numa system tends to have memory on
> multiple nodes, so you'll either have to find a good partitioning of
> your workload or live with off-node references.  I don't think
> partitioning NFS workloads is trivial, but then again I'm not a
> networking expert.

Bringing mdraid back into the fold, I'm wondering what kind of load the
mdraid threads would place on a system of the caliber needed to push
10GB/s NFS.

Neil, I spent quite a bit of time yesterday spec'ing out what I believe
is the bare minimum AMD64 based hardware needed to push 10GB/s NFS.
This includes:

  4 LSI 9285-8e 8port SAS 800MHz dual core PCIE x8 HBAs
  3 NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapter

This gives us 32 6Gb/s SAS ports and 12 10GbE ports total, for a raw
hardware bandwidth of 20GB/s SAS and 15GB/s ethernet.

I made the assumption that RAID 10 would be the only suitable RAID level
due to a few reasons:

1.  The workload being 50+ NFS large file reads of aggregate 10GB/s,
yielding a massive random IO workload at the disk head level.

2.  We'll need 384 15k SAS drives to service a 10GB/s random IO load

3.  We'll need multiple "small" arrays enabling multiple mdraid threads,
assuming a single 2.4GHz core isn't enough to handle something like 48
or 96 mdraid disks.

4.  Rebuild times for parity raid schemes would be unacceptably high and
would eat all of the CPU the rebuild thread would run on

To get the bandwidth we need and making sure we don't run out of
controller chip IOPS, my calculations show we'd need 16 x 24 drive
mdraid 10 arrays.  Thus, ignoring all other considerations momentarily,
a dual AMD 6136 platform with 16 2.4GHz cores seems suitable, with one
mdraid thread per core, each managing a 24 drive RAID 10.  Would we then
want to layer a --linear array across the 16 RAID 10 arrays?  If we did
this, would the linear thread bottleneck instantly as it runs on only
one core?  How many additional memory copies (interconnect transfers)
are we going to be performing per mdraid thread for each block read
before the data is picked up by the nfsd kernel threads?
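
To make that layout concrete, the assembly I have in mind is roughly
the following (device names are placeholders and chunk sizes would
need tuning):

   # one of the sixteen 24-drive RAID10 arrays
   mdadm --create /dev/md0 --level=10 --raid-devices=24 /dev/sd[a-x]
   # ...repeat for /dev/md1 through /dev/md15 on the other HBAs...
   # then concatenate the sixteen RAID10 arrays into one device for XFS
   mdadm --create /dev/md16 --level=linear --raid-devices=16 \
         /dev/md[0-9] /dev/md1[0-5]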

How much of each core's cycles will we consume with normal random read
operations assuming 10GB/s of continuous aggregate throughput?  Would
the mdraid threads consume sufficient cycles that when combined with
network stack processing and interrupt processing, that 16 cores at
2.4GHz would be insufficient?  If so, would bumping the two sockets up
to 24 cores at 2.1GHz be enough for the total workload?  Or, would we
need to move to a 4 socket system with 32 or 48 cores?

Is this possibly a situation where mdraid just isn't suitable due to the
CPU, memory, and interconnect bandwidth demands, making hardware RAID
the only real option?  And if it does require hardware RAID, would it
be possible to stick 16 block devices together in a --linear mdraid
array and maintain the 10GB/s performance?  Or, would the single
--linear array be processed by a single thread?  If so, would a single
2.4GHz core be able to handle an mdraid --linear thread managing 8
devices at 10GB/s aggregate?

Unfortunately I don't currently work in a position allowing me to test
such a system, and I certainly don't have the personal financial
resources to build it.  My rough estimate on the hardware cost is
$150-200K USD.  The 384 Hitachi 15k SAS 146GB drives at $250 each
wholesale are a little over $90k.

It would be really neat to have a job that allowed me to setup and test
such things. :)

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-18 15:43                           ` Stan Hoeppner
@ 2011-03-18 16:21                             ` Roberto Spadim
  2011-03-18 22:01                             ` NeilBrown
  1 sibling, 0 replies; 116+ messages in thread
From: Roberto Spadim @ 2011-03-18 16:21 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, Drew, Mdadm

did you contact texas ssd solutions? i don't know how much $$$ you
would have to pay for this setup, but it's a nice solution...

2011/3/18 Stan Hoeppner <stan@hardwarefreak.com>:
> Christoph Hellwig put forth on 3/18/2011 9:05 AM:
>
> Thanks for the confirmations and explanations.
>
>> The kernel is pretty smart in placement of user and page cache data, but
>> it can't really second guess your intention.  With the numactl tool you
>> can help it doing the proper placement for you workload.  Note that the
>> choice isn't always trivial - a numa system tends to have memory on
>> multiple nodes, so you'll either have to find a good partitioning of
>> your workload or live with off-node references.  I don't think
>> partitioning NFS workloads is trivial, but then again I'm not a
>> networking expert.
>
> Bringing mdraid back into the fold, I'm wondering what kinda of load the
> mdraid threads would place on a system of the caliber needed to push
> 10GB/s NFS.
>
> Neil, I spent quite a bit of time yesterday spec'ing out what I believe
> is the bare minimum AMD64 based hardware needed to push 10GB/s NFS.
> This includes:
>
>  4 LSI 9285-8e 8port SAS 800MHz dual core PCIE x8 HBAs
>  3 NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapter
>
> This gives us 32 6Gb/s SAS ports and 12 10GbE ports total, for a raw
> hardware bandwidth of 20GB/s SAS and 15GB/s ethernet.
>
> I made the assumption that RAID 10 would be the only suitable RAID level
> due to a few reasons:
>
> 1.  The workload being 50+ NFS large file reads of aggregate 10GB/s,
> yielding a massive random IO workload at the disk head level.
>
> 2.  We'll need 384 15k SAS drives to service a 10GB/s random IO load
>
> 3.  We'll need multiple "small" arrays enabling multiple mdraid threads,
> assuming a single 2.4GHz core isn't enough to handle something like 48
> or 96 mdraid disks.
>
> 4.  Rebuild times for parity raid schemes would be unacceptably high and
> would eat all of the CPU the rebuild thread would run on
>
> To get the bandwidth we need and making sure we don't run out of
> controller chip IOPS, my calculations show we'd need 16 x 24 drive
> mdraid 10 arrays.  Thus, ignoring all other considerations momentarily,
> a dual AMD 6136 platform with 16 2.4GHz cores seems suitable, with one
> mdraid thread per core, each managing a 24 drive RAID 10.  Would we then
> want to layer a --linear array across the 16 RAID 10 arrays?  If we did
> this, would the linear thread bottleneck instantly as it runs on only
> one core?  How many additional memory copies (interconnect transfers)
> are we going to be performing per mdraid thread for each block read
> before the data is picked up by the nfsd kernel threads?
>
> How much of each core's cycles will we consume with normal random read
> operations assuming 10GB/s of continuous aggregate throughput?  Would
> the mdraid threads consume sufficient cycles that when combined with
> network stack processing and interrupt processing, that 16 cores at
> 2.4GHz would be insufficient?  If so, would bumping the two sockets up
> to 24 cores at 2.1GHz be enough for the total workload?  Or, would we
> need to move to a 4 socket system with 32 or 48 cores?
>
> Is this possibly a situation where mdraid just isn't suitable due to the
> CPU, memory, and interconnect bandwidth demands, making hardware RAID
> the only real option?  And if it does requires hardware RAID, would it
> be possible to stick 16 block devices together in a --linear mdraid
> array and maintain the 10GB/s performance?  Or, would the single
> --linear array be processed by a single thread?  If so, would a single
> 2.4GHz core be able to handle an mdraid --leaner thread managing 8
> devices at 10GB/s aggregate?
>
> Unfortunately I don't currently work in a position allowing me to test
> such a system, and I certainly don't have the personal financial
> resources to build it.  My rough estimate on the hardware cost is
> $150-200K USD.  The 384 Hitachi 15k SAS 146GB drives at $250 each
> wholesale are a little over $90k.
>
> It would be really neat to have a job that allowed me to setup and test
> such things. :)
>
> --
> Stan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-18 15:43                           ` Stan Hoeppner
  2011-03-18 16:21                             ` Roberto Spadim
@ 2011-03-18 22:01                             ` NeilBrown
  2011-03-18 22:23                               ` Roberto Spadim
  2011-03-20  1:34                               ` Stan Hoeppner
  1 sibling, 2 replies; 116+ messages in thread
From: NeilBrown @ 2011-03-18 22:01 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, Roberto Spadim, Drew, Mdadm

On Fri, 18 Mar 2011 10:43:43 -0500 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> Christoph Hellwig put forth on 3/18/2011 9:05 AM:
> 
> Thanks for the confirmations and explanations.
> 
> > The kernel is pretty smart in placement of user and page cache data, but
> > it can't really second guess your intention.  With the numactl tool you
> > can help it doing the proper placement for you workload.  Note that the
> > choice isn't always trivial - a numa system tends to have memory on
> > multiple nodes, so you'll either have to find a good partitioning of
> > your workload or live with off-node references.  I don't think
> > partitioning NFS workloads is trivial, but then again I'm not a
> > networking expert.
> 
> Bringing mdraid back into the fold, I'm wondering what kinda of load the
> mdraid threads would place on a system of the caliber needed to push
> 10GB/s NFS.
> 
> Neil, I spent quite a bit of time yesterday spec'ing out what I believe

Addressing me directly in an email that wasn't addressed to me directly seems
a bit ... odd.  Maybe that is just me.

> is the bare minimum AMD64 based hardware needed to push 10GB/s NFS.
> This includes:
> 
>   4 LSI 9285-8e 8port SAS 800MHz dual core PCIE x8 HBAs
>   3 NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapter
> 
> This gives us 32 6Gb/s SAS ports and 12 10GbE ports total, for a raw
> hardware bandwidth of 20GB/s SAS and 15GB/s ethernet.
> 
> I made the assumption that RAID 10 would be the only suitable RAID level
> due to a few reasons:
> 
> 1.  The workload being 50+ NFS large file reads of aggregate 10GB/s,
> yielding a massive random IO workload at the disk head level.
> 
> 2.  We'll need 384 15k SAS drives to service a 10GB/s random IO load
> 
> 3.  We'll need multiple "small" arrays enabling multiple mdraid threads,
> assuming a single 2.4GHz core isn't enough to handle something like 48
> or 96 mdraid disks.
> 
> 4.  Rebuild times for parity raid schemes would be unacceptably high and
> would eat all of the CPU the rebuild thread would run on
> 
> To get the bandwidth we need and making sure we don't run out of
> controller chip IOPS, my calculations show we'd need 16 x 24 drive
> mdraid 10 arrays.  Thus, ignoring all other considerations momentarily,
> a dual AMD 6136 platform with 16 2.4GHz cores seems suitable, with one
> mdraid thread per core, each managing a 24 drive RAID 10.  Would we then
> want to layer a --linear array across the 16 RAID 10 arrays?  If we did
> this, would the linear thread bottleneck instantly as it runs on only
> one core?  How many additional memory copies (interconnect transfers)
> are we going to be performing per mdraid thread for each block read
> before the data is picked up by the nfsd kernel threads?
> 
> How much of each core's cycles will we consume with normal random read

For RAID10, the md thread plays no part in reads.  Whichever thread
submitted the read submits it all the way down to the relevant member device.
If the read fails the thread will come in to play.

For writes, the thread is used primarily to make sure the writes are properly
ordered w.r.t. bitmap updates.  I could probably remove that requirement if a
bitmap was not in use...

> operations assuming 10GB/s of continuous aggregate throughput?  Would
> the mdraid threads consume sufficient cycles that when combined with
> network stack processing and interrupt processing, that 16 cores at
> 2.4GHz would be insufficient?  If so, would bumping the two sockets up
> to 24 cores at 2.1GHz be enough for the total workload?  Or, would we
> need to move to a 4 socket system with 32 or 48 cores?
> 
> Is this possibly a situation where mdraid just isn't suitable due to the
> CPU, memory, and interconnect bandwidth demands, making hardware RAID
> the only real option?

I'm sorry, but I don't do resource usage estimates or comparisons with
hardware raid.  I just do software design and coding.


>     And if it does requires hardware RAID, would it
> be possible to stick 16 block devices together in a --linear mdraid
> array and maintain the 10GB/s performance?  Or, would the single
> --linear array be processed by a single thread?  If so, would a single
> 2.4GHz core be able to handle an mdraid --leaner thread managing 8
> devices at 10GB/s aggregate?

There is no thread for linear or RAID0.

If you want to share load over a number of devices, you would normally use
RAID0.  However if the load had a high thread count and the filesystem
distributed IO evenly across the whole device space, then linear might work
for you.

NeilBrown


> 
> Unfortunately I don't currently work in a position allowing me to test
> such a system, and I certainly don't have the personal financial
> resources to build it.  My rough estimate on the hardware cost is
> $150-200K USD.  The 384 Hitachi 15k SAS 146GB drives at $250 each
> wholesale are a little over $90k.
> 
> It would be really neat to have a job that allowed me to setup and test
> such things. :)
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-18 22:01                             ` NeilBrown
@ 2011-03-18 22:23                               ` Roberto Spadim
  2011-03-20  1:34                               ` Stan Hoeppner
  1 sibling, 0 replies; 116+ messages in thread
From: Roberto Spadim @ 2011-03-18 22:23 UTC (permalink / raw)
  To: NeilBrown; +Cc: Stan Hoeppner, Christoph Hellwig, Drew, Mdadm

i think linux can do this job without problems; the md code is very
mature. the problem here is: what size/speed of cpu/ram/network/disk
should we use?
for slow disks, use raid0 to spread the load
for mirroring, use raid1

raid 4/5/6 are cpu intensive, which may be a problem at very high
speeds (if you have the money, buy more cpu and there is no problem)

2011/3/18 NeilBrown <neilb@suse.de>:
> On Fri, 18 Mar 2011 10:43:43 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>
>> Christoph Hellwig put forth on 3/18/2011 9:05 AM:
>>
>> Thanks for the confirmations and explanations.
>>
>> > The kernel is pretty smart in placement of user and page cache data, but
>> > it can't really second guess your intention.  With the numactl tool you
>> > can help it doing the proper placement for you workload.  Note that the
>> > choice isn't always trivial - a numa system tends to have memory on
>> > multiple nodes, so you'll either have to find a good partitioning of
>> > your workload or live with off-node references.  I don't think
>> > partitioning NFS workloads is trivial, but then again I'm not a
>> > networking expert.
>>
>> Bringing mdraid back into the fold, I'm wondering what kinda of load the
>> mdraid threads would place on a system of the caliber needed to push
>> 10GB/s NFS.
>>
>> Neil, I spent quite a bit of time yesterday spec'ing out what I believe
>
> Addressing me directly in an email that wasn't addressed to me directly seem
> a bit ... odd.  Maybe that is just me.
>
>> is the bare minimum AMD64 based hardware needed to push 10GB/s NFS.
>> This includes:
>>
>>   4 LSI 9285-8e 8port SAS 800MHz dual core PCIE x8 HBAs
>>   3 NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapter
>>
>> This gives us 32 6Gb/s SAS ports and 12 10GbE ports total, for a raw
>> hardware bandwidth of 20GB/s SAS and 15GB/s ethernet.
>>
>> I made the assumption that RAID 10 would be the only suitable RAID level
>> due to a few reasons:
>>
>> 1.  The workload being 50+ NFS large file reads of aggregate 10GB/s,
>> yielding a massive random IO workload at the disk head level.
>>
>> 2.  We'll need 384 15k SAS drives to service a 10GB/s random IO load
>>
>> 3.  We'll need multiple "small" arrays enabling multiple mdraid threads,
>> assuming a single 2.4GHz core isn't enough to handle something like 48
>> or 96 mdraid disks.
>>
>> 4.  Rebuild times for parity raid schemes would be unacceptably high and
>> would eat all of the CPU the rebuild thread would run on
>>
>> To get the bandwidth we need and making sure we don't run out of
>> controller chip IOPS, my calculations show we'd need 16 x 24 drive
>> mdraid 10 arrays.  Thus, ignoring all other considerations momentarily,
>> a dual AMD 6136 platform with 16 2.4GHz cores seems suitable, with one
>> mdraid thread per core, each managing a 24 drive RAID 10.  Would we then
>> want to layer a --linear array across the 16 RAID 10 arrays?  If we did
>> this, would the linear thread bottleneck instantly as it runs on only
>> one core?  How many additional memory copies (interconnect transfers)
>> are we going to be performing per mdraid thread for each block read
>> before the data is picked up by the nfsd kernel threads?
>>
>> How much of each core's cycles will we consume with normal random read
>
> For RAID10, the md thread plays no part in reads.  Which ever thread
> submitted the read submits it all the way down to the relevant member device.
> If the read fails the thread will come in to play.
>
> For writes, the thread is used primarily to make sure the writes are properly
> orders w.r.t. bitmap updates.  I could probably remove that requirement if a
> bitmap was not in use...
>
>> operations assuming 10GB/s of continuous aggregate throughput?  Would
>> the mdraid threads consume sufficient cycles that when combined with
>> network stack processing and interrupt processing, that 16 cores at
>> 2.4GHz would be insufficient?  If so, would bumping the two sockets up
>> to 24 cores at 2.1GHz be enough for the total workload?  Or, would we
>> need to move to a 4 socket system with 32 or 48 cores?
>>
>> Is this possibly a situation where mdraid just isn't suitable due to the
>> CPU, memory, and interconnect bandwidth demands, making hardware RAID
>> the only real option?
>
> I'm sorry, but I don't do resource usage estimates or comparisons with
> hardware raid.  I just do software design and coding.
>
>
>>     And if it does requires hardware RAID, would it
>> be possible to stick 16 block devices together in a --linear mdraid
>> array and maintain the 10GB/s performance?  Or, would the single
>> --linear array be processed by a single thread?  If so, would a single
>> 2.4GHz core be able to handle an mdraid --leaner thread managing 8
>> devices at 10GB/s aggregate?
>
> There is no thread for linear or RAID0.
>
> If you want to share load over a number of devices, you would normally use
> RAID0.  However if the load had a high thread count and the filesystem
> distributed IO evenly across the whole device space, then linear might work
> for you.
>
> NeilBrown
>
>
>>
>> Unfortunately I don't currently work in a position allowing me to test
>> such a system, and I certainly don't have the personal financial
>> resources to build it.  My rough estimate on the hardware cost is
>> $150-200K USD.  The 384 Hitachi 15k SAS 146GB drives at $250 each
>> wholesale are a little over $90k.
>>
>> It would be really neat to have a job that allowed me to setup and test
>> such things. :)
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-18 22:01                             ` NeilBrown
  2011-03-18 22:23                               ` Roberto Spadim
@ 2011-03-20  1:34                               ` Stan Hoeppner
  2011-03-20  3:41                                 ` NeilBrown
  1 sibling, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-20  1:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christoph Hellwig, Roberto Spadim, Drew, Mdadm

NeilBrown put forth on 3/18/2011 5:01 PM:
> On Fri, 18 Mar 2011 10:43:43 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
> 
>> Christoph Hellwig put forth on 3/18/2011 9:05 AM:
>>
>> Thanks for the confirmations and explanations.
>>
>>> The kernel is pretty smart in placement of user and page cache data, but
>>> it can't really second guess your intention.  With the numactl tool you
>>> can help it doing the proper placement for you workload.  Note that the
>>> choice isn't always trivial - a numa system tends to have memory on
>>> multiple nodes, so you'll either have to find a good partitioning of
>>> your workload or live with off-node references.  I don't think
>>> partitioning NFS workloads is trivial, but then again I'm not a
>>> networking expert.
>>
>> Bringing mdraid back into the fold, I'm wondering what kinda of load the
>> mdraid threads would place on a system of the caliber needed to push
>> 10GB/s NFS.
>>
>> Neil, I spent quite a bit of time yesterday spec'ing out what I believe
> 
> Addressing me directly in an email that wasn't addressed to me directly seems
> a bit ... odd.  Maybe that is just me.

I guess that depends on one's perspective.  Is it the content of email
To: and Cc: headers that matters, or the substance of the list
discussion thread?  You are the lead developer and maintainer of Linux
mdraid AFAIK.  Thus I would have assumed that directly addressing a
question to you within any given list thread was acceptable, regardless
of whose address was where in the email headers.

>> How much of each core's cycles will we consume with normal random read
> 
> For RAID10, the md thread plays no part in reads.  Whichever thread
> submitted the read submits it all the way down to the relevant member device.
> If the read fails the thread will come in to play.

So with RAID10, read scalability is in essence limited to the execution
rate of the block device layer code and the interconnect b/w required.

> For writes, the thread is used primarily to make sure the writes are properly
> ordered w.r.t. bitmap updates.  I could probably remove that requirement if a
> bitmap was not in use...

How compute intensive is this thread during writes, if at all, at
extreme IO bandwidth rates?

>> operations assuming 10GB/s of continuous aggregate throughput?  Would
>> the mdraid threads consume sufficient cycles that when combined with
>> network stack processing and interrupt processing, that 16 cores at
>> 2.4GHz would be insufficient?  If so, would bumping the two sockets up
>> to 24 cores at 2.1GHz be enough for the total workload?  Or, would we
>> need to move to a 4 socket system with 32 or 48 cores?
>>
>> Is this possibly a situation where mdraid just isn't suitable due to the
>> CPU, memory, and interconnect bandwidth demands, making hardware RAID
>> the only real option?
> 
> I'm sorry, but I don't do resource usage estimates or comparisons with
> hardware raid.  I just do software design and coding.

I probably worded this question very poorly and have possibly made
unfair assumptions about mdraid performance.

>>     And if it does requires hardware RAID, would it
>> be possible to stick 16 block devices together in a --linear mdraid
>> array and maintain the 10GB/s performance?  Or, would the single
>> --linear array be processed by a single thread?  If so, would a single
>> 2.4GHz core be able to handle an mdraid --leaner thread managing 8
>> devices at 10GB/s aggregate?
> 
> There is no thread for linear or RAID0.

What kernel code is responsible for the concatenation and striping
operations of mdraid linear and RAID0 if not an mdraid thread?

> If you want to share load over a number of devices, you would normally use
> RAID0.  However if the load had a high thread count and the filesystem
> distributed IO evenly across the whole device space, then linear might work
> for you.

In my scenario I'm thinking I'd want to stay away from RAID0 because of the
multi-level stripe width issues of double nested RAID (RAID0 over
RAID10).  I assumed linear would be the way to go, as my scenario calls
for using XFS.  Using 32 allocation groups should evenly spread the
load, which is ~50 NFS clients.
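
(i.e. something along the lines of the following, with the concat
device name assumed:)

   # 32 allocation groups spread across the 16 underlying RAID10 arrays
   mkfs.xfs -d agcount=32 /dev/md16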

What I'm trying to figure out is how much CPU time I am going to need for:

1.  Aggregate 10GB/s IO rate
2.  mdraid managing 384 drives
    A.  16 mdraid10 arrays of 24 drives each
    B.  mdraid linear concatenating the 16 arrays

Thanks for your input Neil.

-- 
Stan


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-20  1:34                               ` Stan Hoeppner
@ 2011-03-20  3:41                                 ` NeilBrown
  2011-03-20  5:32                                   ` Roberto Spadim
  0 siblings, 1 reply; 116+ messages in thread
From: NeilBrown @ 2011-03-20  3:41 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Christoph Hellwig, Roberto Spadim, Drew, Mdadm

On Sat, 19 Mar 2011 20:34:26 -0500 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> NeilBrown put forth on 3/18/2011 5:01 PM:
> > On Fri, 18 Mar 2011 10:43:43 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> > wrote:
> > 
> >> Christoph Hellwig put forth on 3/18/2011 9:05 AM:
> >>
> >> Thanks for the confirmations and explanations.
> >>
> >>> The kernel is pretty smart in placement of user and page cache data, but
> >>> it can't really second guess your intention.  With the numactl tool you
> >>> can help it doing the proper placement for you workload.  Note that the
> >>> choice isn't always trivial - a numa system tends to have memory on
> >>> multiple nodes, so you'll either have to find a good partitioning of
> >>> your workload or live with off-node references.  I don't think
> >>> partitioning NFS workloads is trivial, but then again I'm not a
> >>> networking expert.
> >>
> >> Bringing mdraid back into the fold, I'm wondering what kinda of load the
> >> mdraid threads would place on a system of the caliber needed to push
> >> 10GB/s NFS.
> >>
> >> Neil, I spent quite a bit of time yesterday spec'ing out what I believe
> > 
> > Addressing me directly in an email that wasn't addressed to me directly seem
> > a bit ... odd.  Maybe that is just me.
> 
> I guess that depends on one's perspective.  Is it the content of email
> To: and Cc: headers that matters, or the substance of the list
> discussion thread?  You are the lead developer and maintainer of Linux
> mdraid AFAIK.  Thus I would have assumed that directly addressing a
> question to you within any given list thread was acceptable, regardless
> of whose address was where in the email headers.

This assumes that I read every email on this list.  I certainly do read a lot,
but I tend to tune out of threads that don't seem particularly interesting -
and configuring hardware is only vaguely interesting to me - and I am sure
there are people on the list with more experience.

But whatever... there is certainly more chance of me missing something that
isn't directly addressed to me (such messages get filed differently).


> 
> >> How much of each core's cycles will we consume with normal random read
> > 
> > For RAID10, the md thread plays no part in reads.  Whichever thread
> > submitted the read submits it all the way down to the relevant member device.
> > If the read fails the thread will come in to play.
> 
> So with RAID10 read scalability is in essence limited to the execution
> rate of the block device layer code and the interconnect b/w required.

Correct.

> 
> > For writes, the thread is used primarily to make sure the writes are properly
> > ordered w.r.t. bitmap updates.  I could probably remove that requirement if a
> > bitmap was not in use...
> 
> How compute intensive is this thread during writes, if at all, at
> extreme IO bandwidth rates?

Not compute intensive at all - just single threaded.  So it will only
dispatch a single request at a time.  Whether single threading the writes is
good or bad is not something that I'm completely clear on.  It seems bad in
the sense that modern machines have lots of CPUs and we are forgoing any
possible benefits of parallelism.  However the current VM seems to do all
(or most) writeout from a single thread per device - the 'bdi' threads.
So maybe keeping it single threaded at the md level is perfectly natural and
avoids cache bouncing...
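
For what it's worth, those per-device writeout threads are easy to see on a
running box; the thread names below are from ~2.6.32-era kernels, so treat
this purely as an illustration:

~# ps -eo comm | egrep '^flush-|^bdi-'
# typically shows bdi-default plus one flush-<major>:<minor> entry per
# backing device - the single per-device writeout thread described above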


> 
> >> operations assuming 10GB/s of continuous aggregate throughput?  Would
> >> the mdraid threads consume sufficient cycles that when combined with
> >> network stack processing and interrupt processing, that 16 cores at
> >> 2.4GHz would be insufficient?  If so, would bumping the two sockets up
> >> to 24 cores at 2.1GHz be enough for the total workload?  Or, would we
> >> need to move to a 4 socket system with 32 or 48 cores?
> >>
> >> Is this possibly a situation where mdraid just isn't suitable due to the
> >> CPU, memory, and interconnect bandwidth demands, making hardware RAID
> >> the only real option?
> > 
> > I'm sorry, but I don't do resource usage estimates or comparisons with
> > hardware raid.  I just do software design and coding.
> 
> I probably worded this question very poorly and have possibly made
> unfair assumptions about mdraid performance.
> 
> >>     And if it does require hardware RAID, would it
> >> be possible to stick 16 block devices together in a --linear mdraid
> >> array and maintain the 10GB/s performance?  Or, would the single
> >> --linear array be processed by a single thread?  If so, would a single
> >> 2.4GHz core be able to handle an mdraid --linear thread managing 8
> >> devices at 10GB/s aggregate?
> > 
> > There is no thread for linear or RAID0.
> 
> What kernel code is responsible for the concatenation and striping
> operations of mdraid linear and RAID0 if not an mdraid thread?
> 

When the VM or filesystem or whatever wants to start an IO request, it calls
into the md code to find out how big it is allowed to make that request.  The
md code returns a number which ensures that the request will end up being
mapped onto just one drive (at least in the majority of cases).
The VM or filesystem builds up the request (a struct bio) to at most that
size and hands it to md.  md simply assigns a different target device and
offset in that device to the request, and hands it over to the target device.

So whatever thread it was that started the request carries it all the way
down to the device which is a member of the RAID array (for RAID0/linear).
Typically it then gets placed on a queue, and an interrupt handler takes it
off the queue and acts upon it.

So - no separate md thread.

RAID1 and RAID10 make only limited use of their thread, doing as much of the
work as possible in the original calling thread.
RAID4/5/6 do lots of work in the md thread.  The calling thread just finds a
place in the stripe cache to attach the request, attaches it, and signals the
thread.
(Though reads on a non-degraded array can by-pass the cache and are handled
much like reads on RAID0).
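
To make that concrete, one way to see the difference on a scratch box (the
loop devices and md device names below are purely illustrative):

~# # assumes /dev/loop0-5 are already backed by scratch files via losetup
~# mdadm --create /dev/md50 --level=10 --raid-devices=4 \
      /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
~# mdadm --create /dev/md51 --level=linear --raid-devices=2 /dev/loop4 /dev/loop5
~# ps -eo comm | grep '^md'
# a per-array kernel thread (md50_raid10) appears for the RAID10 array,
# while the linear array gets no thread at all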

> > If you want to share load over a number of devices, you would normally use
> > RAID0.  However if the load had a high thread count and the filesystem
> > distributed IO evenly across the whole device space, then linear might work
> > for you.
> 
> In my scenario I'm thinking I'd want to stay away from RAID0 because of the
> multi-level stripe width issues of double nested RAID (RAID0 over
> RAID10).  I assumed linear would be the way to go, as my scenario calls
> for using XFS.  Using 32 allocation groups should evenly spread the
> load, which is ~50 NFS clients.

You may well be right.

> 
> What I'm trying to figure out is how much CPU time I am going to need for:
> 
> 1.  Aggregate 10GB/s IO rate
> 2.  mdraid managing 384 drives
>     A.  16 mdraid10 arrays of 24 drives each
>     B.  mdraid linear concatenating the 16 arrays

I very much doubt that CPU is going to be an issue.  Memory bandwidth might -
but I'm only really guessing here, so it is probably time to stop.


> 
> Thanks for your input Neil.
> 
Pleasure.

NeilBrown

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-20  3:41                                 ` NeilBrown
@ 2011-03-20  5:32                                   ` Roberto Spadim
  2011-03-20 23:22                                     ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Roberto Spadim @ 2011-03-20  5:32 UTC (permalink / raw)
  To: NeilBrown; +Cc: Stan Hoeppner, Christoph Hellwig, Drew, Mdadm

with 2 disks in md raid0 i get 400MB/s (SAS 10krpm, 6Gb/s channel), so
you will need at least 10000/400 = 25 such pairs, i.e. 50 disks, just as
a starting number
memory/cpu/network speed?
memory must allow more than 10GB/s (ddr3 can do this, i don't know if
enabling ecc will be a problem or not, check with memtest86+)
cpu? hmmm i don't know very well how to help here, since it's just
reading and writing memory/interfaces (network/disks), maybe a 'magic'
number like: 3GHz * 64bits / 8 bits-per-byte = 24GB/s per core, but i
don't really know how to estimate it... i think you will need a
multicore cpu, maybe one core for network, one for disks, one for mdadm,
one for nfs and one for linux, so >=5 cores at least, 3GHz 64bit each
(maybe starting with a 6-core xeon with hyper-threading)
it's just an idea of how to estimate, it's not correct/true/real
i think it's better to contact ibm/dell/hp/compaq/texas/anyother and talk
about the problem, then post the results here; this is a nice hardware
question :)
don't tell them about software raid, just ask for the hardware to allow
this bandwidth (10GB/s) and share files

2011/3/20 NeilBrown <neilb@suse.de>:
> [...]



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-20  5:32                                   ` Roberto Spadim
@ 2011-03-20 23:22                                     ` Stan Hoeppner
  2011-03-21  0:52                                       ` Roberto Spadim
  2011-03-21  2:44                                       ` Keld Jørn Simonsen
  0 siblings, 2 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-20 23:22 UTC (permalink / raw)
  To: Mdadm; +Cc: Roberto Spadim, NeilBrown, Christoph Hellwig, Drew

Roberto Spadim put forth on 3/20/2011 12:32 AM:

> i think it's better to contact ibm/dell/hp/compaq/texas/anyother and talk
> about the problem, then post the results here; this is a nice hardware
> question :)

I don't need vendor assistance to design a hardware system capable of
the 10GB/s NFS throughput target.  That's relatively easy.  I've already
specified one possible hardware combination capable of this level of
performance (see below).  The configuration will handle 10GB/s using the
RAID function of the LSI SAS HBAs.  The only question is if it has
enough individual and aggregate CPU horsepower, memory, and HT
interconnect bandwidth to do the same using mdraid.  This is the reason
for my questions directed at Neil.

> don't tell them about software raid, just ask for the hardware to allow
> this bandwidth (10GB/s) and share files

I already posted some of the minimum hardware specs earlier in this
thread for the given workload I described.  Following is a description
of the workload and a complete hardware specification.

Target workload:

10GB/s continuous parallel NFS throughput serving 50+ NFS clients whose
application performs large streaming reads.  At the storage array level
the 50+ parallel streaming reads become a random IO pattern workload
requiring a huge number of spindles due to the high seek rates.

Minimum hardware requirements, based on performance and cost.  Ballpark
guess on total cost of the hardware below is $150-250k USD.  We can't
get the data to the clients without a network, so the specification
starts with the switching hardware needed.

Ethernet switches:
   One HP A5820X-24XG-SFP+ (JC102A) 24 10 GbE SFP ports
      488 Gb/s backplane switching capacity
   Five HP A5800-24G Switch (JC100A) 24 GbE ports, 4 10GbE SFP
      208 Gb/s backplane switching capacity
   Maximum common MTU enabled (jumbo frame) globally
   Connect 12 server 10 GbE ports to A5820X
   Uplink 2 10 GbE ports from each A5800 to A5820X
       2 open 10 GbE ports left on A5820X for cluster expansion
       or off cluster data transfers to the main network
   Link aggregate 12 server 10 GbE ports to A5820X (host-side bonding sketch below)
   Link aggregate each client's 2 GbE ports to A5800s
   Aggregate client->switch bandwidth = 12.5 GB/s
   Aggregate server->switch bandwidth = 15.0 GB/s
   The excess server b/w of 2.5GB/s is a result of the following:
       Allowing headroom for an additional 10 clients or out of cluster
          data transfers
       Balancing the packet load over the 3 quad port 10 GbE server NICs
          regardless of how many clients are active to prevent hot spots
          in the server memory and interconnect subsystems
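
As a sketch of the host side of that link aggregation (the NIC names, and
using the bonding driver directly rather than any distro's network scripts,
are assumptions for illustration only):

~# modprobe bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4
~# ip link set bond0 mtu 9000 up         # jumbo frames to match the switches
~# ifenslave bond0 eth2 eth3 eth4 eth5   # ...and so on for all 12 x 10 GbE ports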

Server chassis
   HP Proliant DL585 G7 with the following specifications
   Dual AMD Opteron 6136, 16 cores @2.4GHz
   20GB/s node-node HT b/w, 160GB/s aggregate
   128GB DDR3 1333, 16x8GB RDIMMS in 8 channels
   20GB/s/node memory bandwidth, 80GB/s aggregate
   7 PCIe x8 slots and 4 PCIe x16
   8GB/s/slot, 56 GB/s aggregate PCIe x8 bandwidth

IO controllers
   4 x LSI SAS 9285-8e 8 port SAS, 800MHz dual core ROC, 1GB cache
   3 x NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapter

JBOD enclosures
   16 x LSI 620J 2U 24 x 2.5" bay SAS 6Gb/s, w/SAS expander
   2 SFF 8088 host and 1 expansion port per enclosure
   384 total SAS 6Gb/s 2.5" drive bays
   Two units are daisy chained with one in each pair
      connecting to one of 8 HBA SFF8088 ports, for a total of
      32 6Gb/s SAS host connections, yielding 38.4 GB/s full duplex b/w

Disks drives
   384 HITACHI Ultrastar C15K147 147GB 15000 RPM 64MB Cache 2.5" SAS
      6Gb/s Internal Enterprise Hard Drive


Note that the HBA to disk bandwidths of 19.2GB/s one way and 38.4GB/s
full duplex are in excess of the HBA to PCIe bandwidths, 16 and 32GB/s
respectively, by approximately 20%.  Also note that each drive can
stream reads at 160MB/s peak, yielding 61GB/s aggregate streaming read
capacity for the 384 disks.  This is almost 4 times the aggregate one
way transfer rate of the 4 PCIe x8 slots, and is 6 times our target host
to parallel client data rate of 10GB/s.  There are a few reasons why
this excess of capacity is built into the system:

1.  RAID10 is the only suitable RAID level for this type of system with
this many disks, for many reasons that have been discussed before.
RAID10 instantly cuts the number of stripe spindles in two, dropping the
data rate by a factor of 2, giving us 30.5GB/s potential aggregate
throughput.  Now we're only at 3 times our target data rate.

2.  As a single disk drive's seek rate increases, its transfer rate
decreases in relation to its single streaming read performance.
Parallel streaming reads will increase seek rates as the disk head must
move between different regions of the disk platter.

3.  In relation to 2, if we assume we'll lose no more than 66% of our
single streaming performance with a multi stream workload, we're down to
10.1GB/s throughput, right at our target.

By using relatively small arrays of 24 drives each (12 stripe spindles),
concatenating (--linear) the 16 resulting arrays, and using a filesystem
such as XFS across the entire array with its intelligent load balancing
of streams using allocation groups, we minimize disk head seeking.
Doing this can in essence divide our 50 client streams across 16 arrays,
with each array seeing approximately 3 of the streaming client reads.
Each disk should be able to easily maintain 33% of its max read rate
while servicing 3 streaming reads.

I hope you found this informative or interesting.  I enjoyed the
exercise.  I'd been working on this system specification for quite a few
days now but have been hesitant to post it due to its length, and the
fact that AFAIK hardware discussion is a bit OT on this list.

I hope it may be valuable to someone Google'ing for this type of
information in the future.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-20 23:22                                     ` Stan Hoeppner
@ 2011-03-21  0:52                                       ` Roberto Spadim
  2011-03-21  2:44                                       ` Keld Jørn Simonsen
  1 sibling, 0 replies; 116+ messages in thread
From: Roberto Spadim @ 2011-03-21  0:52 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Mdadm, NeilBrown, Christoph Hellwig, Drew

> I don't need vendor assistance to design a hardware system capable of
> the 10GB/s NFS throughput target.  That's relatively easy.  I've already
does it work? has it been tested?

-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-20 23:22                                     ` Stan Hoeppner
  2011-03-21  0:52                                       ` Roberto Spadim
@ 2011-03-21  2:44                                       ` Keld Jørn Simonsen
  2011-03-21  3:13                                         ` Roberto Spadim
  2011-03-21 14:18                                         ` Stan Hoeppner
  1 sibling, 2 replies; 116+ messages in thread
From: Keld Jørn Simonsen @ 2011-03-21  2:44 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Mdadm, Roberto Spadim, NeilBrown, Christoph Hellwig, Drew

On Sun, Mar 20, 2011 at 06:22:30PM -0500, Stan Hoeppner wrote:
> [...]
> By using relatively small arrays of 24 drives each (12 stripe spindles),
> concatenating (--linear) the 16 resulting arrays, and using a filesystem
> such as XFS across the entire array with its intelligent load balancing
> of streams using allocation groups, we minimize disk head seeking.
> Doing this can in essence divide our 50 client streams across 16 arrays,
> with each array seeing approximately 3 of the streaming client reads.
> Each disk should be able to easily maintain 33% of its max read rate
> while servicing 3 streaming reads.

Are you then building the system yourself, and running Linux MD RAID?

Anyway, with 384 spindles and only 50 users, each user will have on
average 7 spindles for himself. I think much of the time this would mean 
no random IO, as most users are doing large sequential reading. 
Thus on average you can expect quite close to striping speed if you
are running RAID capable of striping. 

I am puzzled about the --linear concatenating. I think this may cause
the disks in the --linear array to be considered as one spindle, and thus
no concurrent IO will be made. I may be wrong there.

best regards
Keld

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-21  2:44                                       ` Keld Jørn Simonsen
@ 2011-03-21  3:13                                         ` Roberto Spadim
  2011-03-21  3:14                                           ` Roberto Spadim
  2011-03-21 14:18                                         ` Stan Hoeppner
  1 sibling, 1 reply; 116+ messages in thread
From: Roberto Spadim @ 2011-03-21  3:13 UTC (permalink / raw)
  To: Keld Jørn Simonsen
  Cc: Stan Hoeppner, Mdadm, NeilBrown, Christoph Hellwig, Drew

hmm, maybe linear will have less cpu use than stripe?
i have never tested an array with more than 8 disks, either with linear
or with stripe hehehe
could anyone help here?

2011/3/20 Keld Jørn Simonsen <keld@keldix.com>:
> [...]



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-21  3:13                                         ` Roberto Spadim
@ 2011-03-21  3:14                                           ` Roberto Spadim
  2011-03-21 17:07                                             ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Roberto Spadim @ 2011-03-21  3:14 UTC (permalink / raw)
  To: Keld Jørn Simonsen
  Cc: Stan Hoeppner, Mdadm, NeilBrown, Christoph Hellwig, Drew

with hardware raid we don't think about this problem, but with
software raid we should consider it, since other apps run on the same
box as the software raid

2011/3/21 Roberto Spadim <roberto@spadim.com.br>:
> [...]



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-21  2:44                                       ` Keld Jørn Simonsen
  2011-03-21  3:13                                         ` Roberto Spadim
@ 2011-03-21 14:18                                         ` Stan Hoeppner
  2011-03-21 17:08                                           ` Roberto Spadim
  2011-03-21 22:13                                           ` Keld Jørn Simonsen
  1 sibling, 2 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-21 14:18 UTC (permalink / raw)
  To: Keld Jørn Simonsen
  Cc: Mdadm, Roberto Spadim, NeilBrown, Christoph Hellwig, Drew

Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:

> Are you then building the system yourself, and running Linux MD RAID?

No.  These specifications meet the needs of Matt Garman's analysis
cluster, and extend that performance from 6GB/s to 10GB/s.  Christoph's
comments about 10GB/s throughput with XFS on large CPU count Altix 4000
series machines from a few years ago prompted me to specify a single
chassis multicore AMD Opteron based system that can achieve the same
throughput at substantially lower cost.

> Anyway, with 384 spindles and only 50 users, each user will have on
> average 7 spindles for himself. I think much of the time this would mean 
> no random IO, as most users are doing large sequential reading. 
> Thus on average you can expect quite close to striping speed if you
> are running RAID capable of striping. 

This is not how large scale shared RAID storage works under a
multi-stream workload.  I thought I explained this in sufficient detail.
 Maybe not.

> I am puzzled about the --linear concatenating. I think this may cause
> the disks in the --linear array to be considered as one spindle, and thus
> no concurrent IO will be made. I may be wrong there.

You are puzzled because you are not familiar with the large scale
performance features built into the XFS filesystem.  XFS allocation
groups automatically enable large scale parallelism on a single logical
device comprised of multiple arrays or single disks, when configured
correctly.  See:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html

The storage pool in my proposed 10GB/s NFS server system consists of 16
RAID10 arrays comprised of 24 disks of 146GB capacity, 12 stripe
spindles per array, 1.752TB per array, 28TB total usable.  Concatenating
the 16 array devices with mdadm --linear creates a 28TB logical device.
 We format it with this simple command, not having to worry about stripe
block size, stripe spindle width, stripe alignment, etc:

~# mkfs.xfs -d agcount=64 <the --linear md device>

Using this method to achieve parallel scalability is simpler and less
prone to configuration errors when compared to multi-level striping,
which often leads to poor performance and poor space utilization.  With
64 XFS allocation groups the kernel can read/write 4 concurrent streams
from/to each array of 12 spindles, which should be able to handle this
load with plenty of headroom.  This system has 32 SAS 6G channels, each
able to carry two 300MB/s streams, 19.2GB/s aggregate, substantially
more than our 10GB/s target.  I was going to state that we're limited to
10.4GB/s due to the PCIe/HT bridge to the processor.  However, I just
realized I made an error when specifying the DL585 G7 with only 2
processors.  See [1] below for details.

Using XFS in this manner allows us to avoid nested striped arrays and
the inherent problems associated with them.  For example, in absence of
using XFS allocation groups to get our parallelism, we could do the
following:

1.  Width 16 RAID0 stripe over width 12 RAID10 stripe
2.  Width 16 LVM   stripe over width 12 RAID10 stripe

In either case, what is the correct/optimum stripe block size for each
level when nesting the two?  The answer is that there really aren't
correct or optimum stripe sizes in this scenario.  Writes to the top
level stripe will be broken into 16 chunks.  Each of these 16 chunks
will then be broken into 12 more chunks.  You may be thinking, "Why
don't we just create one 384 disk RAID10?  It would SCREAM with 192
spindles!!"  There are many reasons why nobody does this, one being the
same stripe block size issue as with nested stripes.  Extremely wide
arrays have a plethora of problems associated with them.

In summary, concatenating many relatively low stripe spindle count
arrays, and using XFS allocation groups to achieve parallel scalability,
gives us the performance we want without the problems associated with
other configurations.
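
As a concrete sketch of that layout (the device names, md numbering and
mount point are illustrative assumptions, not a tested recipe):

~# mdadm --create /dev/md1 --level=10 --raid-devices=24 /dev/sd[b-y]
   ... repeat for /dev/md2 through /dev/md16, 24 drives each ...
~# mdadm --create /dev/md0 --level=linear --raid-devices=16 \
      /dev/md[1-9] /dev/md1[0-6]
~# mkfs.xfs -d agcount=64 /dev/md0      # 64 AGs = 4 per underlying RAID10
~# mount -o inode64,noatime /dev/md0 /export/data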


[1]  In order to get all 11 PCIe slots in the DL585 G7 one must use the
4 socket model, as the additional PCIe slots of the mezzanine card
connect to two additional SR5690 chips, each one connecting to an HT
port on each of the two additional CPUs.  Thus, I'm re-specifying the
DL585 G7 model to have 4 Opteron 6136 CPUs instead of two, 32 cores
total.  The 128GB in 16 RDIMMs will be spread across all 16 memory
channels.  Memory bandwidth thus doubles to 160GB/s and interconnect b/w
doubles to 320GB/s.  Thus, we now have up to 19.2 GB/s of available one
way disk bandwidth as we're no longer limited by a 10.4GB/s HT/PCIe
link.  Adding the two required CPUs may have just made this system
capable of 15GB/s NFS throughput for less than $5000 additional cost,
not due to the processors, but the extra IO bandwidth enabled as a
consequence of their inclusion.  Adding another quad port 10 GbE NIC
will take it close to 20GB/s NFS throughput.  Shame on me for not
digging far deeper into the DL585 G7 docs.

-- 
Stan


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-21  3:14                                           ` Roberto Spadim
@ 2011-03-21 17:07                                             ` Stan Hoeppner
  0 siblings, 0 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-21 17:07 UTC (permalink / raw)
  To: Roberto Spadim
  Cc: Keld Jørn Simonsen, Mdadm, NeilBrown, Christoph Hellwig, Drew

Roberto Spadim put forth on 3/20/2011 10:14 PM:
> with hardware raid we don't think about this problem, but with
> software we should consider since we run others app with software raid
> running too

Which is precisely why I asked Neil about this.  If you recall Neil
stated that CPU burn shouldn't be an issue when using mdraid linear over
16 mdraid 10 arrays in the proposed system.  As long as the kernel
somewhat evenly distributes IO streams amongst multiple cores I'm
inclined to agree with Neil.

Note that the application in this case, the NFS server, is threaded
kernel code, and thus very fast and scalable across all CPUs.  By
design, all of the performance critical code in this system runs in
kernel space.
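
One knob worth remembering on that front: the default nfsd thread count is
tiny for a box like this.  Something along these lines (128 is a guess, not
a derived figure):

~# rpc.nfsd 128                    # raise the number of kernel nfsd threads
~# cat /proc/fs/nfsd/threads       # verify the new count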

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-21 14:18                                         ` Stan Hoeppner
@ 2011-03-21 17:08                                           ` Roberto Spadim
  2011-03-21 22:13                                           ` Keld Jørn Simonsen
  1 sibling, 0 replies; 116+ messages in thread
From: Roberto Spadim @ 2011-03-21 17:08 UTC (permalink / raw)
  To: Stan Hoeppner
  Cc: Keld Jørn Simonsen, Mdadm, NeilBrown, Christoph Hellwig, Drew

hum, i think you have everything you need to work with mdraid and the
hardware, right?
xfs allocation groups are nice; i don't know what workload they can
handle.  maybe raid linear works better here than stripe (i must test)

i think you know what you are doing =)
any more doubts?


2011/3/21 Stan Hoeppner <stan@hardwarefreak.com>:
> [...]



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-21 14:18                                         ` Stan Hoeppner
  2011-03-21 17:08                                           ` Roberto Spadim
@ 2011-03-21 22:13                                           ` Keld Jørn Simonsen
  2011-03-22  9:46                                             ` Robin Hill
  2011-03-22 10:00                                             ` Stan Hoeppner
  1 sibling, 2 replies; 116+ messages in thread
From: Keld Jørn Simonsen @ 2011-03-21 22:13 UTC (permalink / raw)
  To: Stan Hoeppner
  Cc: Keld Jørn Simonsen, Mdadm, Roberto Spadim, NeilBrown,
	Christoph Hellwig, Drew

On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
> 
> > Are you then building the system yourself, and running Linux MD RAID?
> 
> No.  These specifications meet the needs of Matt Garman's analysis
> cluster, and extend that performance from 6GB/s to 10GB/s.  Christoph's
> comments about 10GB/s throughput with XFS on large CPU count Altix 4000
> series machines from a few years ago prompted me to specify a single
> chassis multicore AMD Opteron based system that can achieve the same
> throughput at substantially lower cost.

OK, but I understand that this is running Linux MD RAID, and not some
hardware RAID. True?

Or at least Linux MD RAID is used to build the --linear array.
Then why not use Linux MD to make the underlying RAID1+0 arrays as well?

> 
> > Anyway, with 384 spindles and only 50 users, each user will have in
> > average 7 spindles for himself. I think much of the time this would mean 
> > no random IO, as most users are doing large sequential reading. 
> > Thus on average you can expect quite close to striping speed if you
> > are running RAID capable of striping. 
> 
> This is not how large scale shared RAID storage works under a
> multi-stream workload.  I thought I explained this in sufficient detail.
>  Maybe not.

Given that the whole array system is only lightly loaded, this is how I
expect it to function. Maybe you can explain why it would not be so, if
you think otherwise.

> > I am puzzled about the --linear concatenating. I think this may cause
> > the disks in the --linear array to be considered as one spindle, and thus
> > no concurrent IO will be made. I may be wrong there.
> 
> You are puzzled because you are not familiar with the large scale
> performance features built into the XFS filesystem.  XFS allocation
> groups automatically enable large scale parallelism on a single logical
> device comprised of multiple arrays or single disks, when configured
> correctly.  See:
> 
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html
> 
> The storage pool in my proposed 10GB/s NFS server system consists of 16
> RAID10 arrays comprised of 24 disks of 146GB capacity, 12 stripe
> spindles per array, 1.752TB per array, 28TB total raw.  Concatenating
> the 16 array devices with mdadm --linear creates a 28TB logical device.
>  We format it with this simple command, not having to worry about stripe
> block size, stripe spindle width, stripe alignment, etc:
> 
> ~# mkfs.xfs -d agcount=64
> 
> Using this method to achieve parallel scalability is simpler and less
> prone to configuration errors when compared to multi-level striping,
> which often leads to poor performance and poor space utilization.  With
> 64 XFS allocation groups the kernel can read/write 4 concurrent streams
> from/to each array of 12 spindles, which should be able to handle this
> load with plenty of headroom.  This system has 32 SAS 6G channels, each
> able to carry two 300MB/s streams, 19.8GB/s aggregate, substantially
> more than our 10GB/s target.  I was going to state that we're limited to
> 10.4GB/s due to the PCIe/HT bridge to the processor.  However, I just
> realized I made an error when specifying the DL585 G7 with only 2
> processors.  See [1] below for details.
> 
> Using XFS in this manner allows us to avoid nested striped arrays and
> the inherent problems associated with them.  For example, in absence of
> using XFS allocation groups to get our parallelism, we could do the
> following:
> 
> 1.  Width 16 RAID0 stripe over width 12 RAID10 stripe
> 2.  Width 16 LVM   stripe over width 12 RAID10 stripe
> 
> In either case, what is the correct/optimum stripe block size for each
> level when nesting the two?  The answer is that there really aren't
> correct or optimum stripe sizes in this scenario.  Writes to the top
> level stripe will be broken into 16 chunks.  Each of these 16 chunks
> will then be broken into 12 more chunks.  You may be thinking, "Why
> don't we just create one 384 disk RAID10?  It would SCREAM with 192
> spindles!!"  There are many reasons why nobody does this, one being the
> same stripe block size issue as with nested stripes.  Extremely wide
> arrays have a plethora of problems associated with them.
> 
> In summary, concatenating many relatively low stripe spindle count
> arrays, and using XFS allocation groups to achieve parallel scalability,
> gives us the performance we want without the problems associated with
> other configurations.
> 
> 
> [1]  In order to get all 11 PCIe slots in the DL585 G7 one must use the
> 4 socket model, as the additional PCIe slots of the mezzanine card
> connect to two additional SR5690 chips, each one connecting to an HT
> port on each of the two additional CPUs.  Thus, I'm re-specifying the
> DL585 G7 model to have 4 Opteron 6136 CPUs instead of two, 32 cores
> total.  The 128GB in 16 RDIMMs will be spread across all 16 memory
> channels.  Memory bandwidth thus doubles to 160GB/s and interconnect b/w
> doubles to 320GB/s.  Thus, we now have up to 19.2 GB/s of available one
> way disk bandwidth as we're no longer limited by a 10.4GB/s HT/PCIe
> link.  Adding the two required CPUs may have just made this system
> capable of 15GB/s NFS throughput for less than $5000 additional cost,
> not due to the processors, but the extra IO bandwidth enabled as a
> consequence of their inclusion.  Adding another quad port 10 GbE NIC
> will take it close to 20GB/s NFS throughput.  Shame on me for not
> digging far deeper into the DL585 G7 docs.

It is probably not the concurrency of XFS that provides the parallelism of
the IO.  It is more likely the IO subsystem, and that would also work for
other filesystem types, like ext4.  I do not see anything in the XFS
allocation groups that carries any knowledge of the underlying disk
structure.  What the filesystem does is only administer the scheduling of
the IO, in combination with the rest of the kernel.

Anyway, thanks for the energy and expertise that you are supplying to
this thread.

Best regards
keld
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-21 22:13                                           ` Keld Jørn Simonsen
@ 2011-03-22  9:46                                             ` Robin Hill
  2011-03-22 10:14                                               ` Keld Jørn Simonsen
  2011-03-22 10:00                                             ` Stan Hoeppner
  1 sibling, 1 reply; 116+ messages in thread
From: Robin Hill @ 2011-03-22  9:46 UTC (permalink / raw)
  To: Keld Jørn Simonsen
  Cc: Stan Hoeppner, Mdadm, Roberto Spadim, NeilBrown, Christoph Hellwig, Drew

[-- Attachment #1: Type: text/plain, Size: 2673 bytes --]

On Mon Mar 21, 2011 at 11:13:04 +0100, Keld Jørn Simonsen wrote:

> On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> > 
> > > Anyway, with 384 spindles and only 50 users, each user will have in
> > > average 7 spindles for himself. I think much of the time this would mean 
> > > no random IO, as most users are doing large sequential reading. 
> > > Thus on average you can expect quite close to striping speed if you
> > > are running RAID capable of striping. 
> > 
> > This is not how large scale shared RAID storage works under a
> > multi-stream workload.  I thought I explained this in sufficient detail.
> >  Maybe not.
> 
> Given that the whole array system is only lightly loaded, this is how I
> expect it to function. Maybe you can explain why it would not be so, if
> you think otherwise.
> 
If you have more than one system accessing the array simultaneously then
your sequential IO immediately becomes random (as it'll interleave the
requests from the multiple systems). The more systems accessing
simultaneously, the more random the IO becomes. Of course, there will
still be an opportunity for some readahead, so it's not entirely random
IO.

> it is probably not the concurrency of XFS that makes the parallelism of
> the IO. It is more likely the IO system, and that would also work for
> other file system types, like ext4. I do not see anything in the XFS allocation
> blocks with any knowledge of the underlying disk structure. 
> What the file system does is only to administer the scheduling of the
> IO, in combination with the rest of the kernel.
> 
XFS allows for splitting the single filesystem into multiple allocation
groups. It can then allocate blocks from each group simultaneously
without worrying about collisions. If the allocation groups are on
separate physical spindles then (apart from the initial mapping of a
request to an allocation group, which should be a very quick operation),
the entire write process is parallelised.  Most filesystems have only a
single allocation group, so the block allocation is single threaded and
can easily become a bottleneck. It's only once the blocks are allocated
(assuming the filesystem knows about the physical layout) that the
writes can be parallelised. I've not looked into the details of ext4
though, so I don't know whether it makes any moves towards parallelising
block allocation.

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-21 22:13                                           ` Keld Jørn Simonsen
  2011-03-22  9:46                                             ` Robin Hill
@ 2011-03-22 10:00                                             ` Stan Hoeppner
  2011-03-22 11:01                                               ` Keld Jørn Simonsen
  1 sibling, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-22 10:00 UTC (permalink / raw)
  To: Keld Jørn Simonsen
  Cc: Mdadm, Roberto Spadim, NeilBrown, Christoph Hellwig, Drew

Keld Jørn Simonsen put forth on 3/21/2011 5:13 PM:
> On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
>> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
>>
>>> Are you then building the system yourself, and running Linux MD RAID?
>>
>> No.  These specifications meet the needs of Matt Garman's analysis
>> cluster, and extend that performance from 6GB/s to 10GB/s.  Christoph's
>> comments about 10GB/s throughput with XFS on large CPU count Altix 4000
>> series machines from a few years ago prompted me to specify a single
>> chassis multicore AMD Opteron based system that can achieve the same
>> throughput at substantially lower cost.
> 
> OK, But I understand that this is running Linux MD RAID, and not some
> hardware RAID. True?
> 
> Or at least Linux MD RAID is used to build a --linear FS.
> Then why not use Linux MD to make the underlying RAID1+0 arrays?

Using mdadm --linear is a requirement of this system specification.  The
underlying RAID10 arrays can be either HBA RAID or mdraid.  Note my
recent questions to Neil regarding mdraid CPU consumption across 16
cores with 16 x 24 drive mdraid 10 arrays.
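
For the mdraid route, the construction would be something along these
lines -- the device names are purely illustrative:

~# mdadm --create /dev/md0 --level=10 --raid-devices=24 /dev/sd[b-y]
   (repeat for /dev/md1 through /dev/md15, each with its own 24 drives)
~# mdadm --create /dev/md16 --level=linear --raid-devices=16 /dev/md{0..15}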

>>> Anyway, with 384 spindles and only 50 users, each user will have in
>>> average 7 spindles for himself. I think much of the time this would mean 
>>> no random IO, as most users are doing large sequential reading. 
>>> Thus on average you can expect quite close to striping speed if you
>>> are running RAID capable of striping. 
>>
>> This is not how large scale shared RAID storage works under a
>> multi-stream workload.  I thought I explained this in sufficient detail.
>>  Maybe not.
> 
> Given that the whole array system is only lightly loaded, this is how I
> expect it to function. Maybe you can explain why it would not be so, if
> you think otherwise.

Using the term "lightly loaded" to describe any system sustaining
concurrent 10GB/s block IO and NFS throughput doesn't seem to be an
accurate statement.  I think you're confusing theoretical maximum
hardware performance with real world IO performance.  The former is
always significantly higher than the latter.  With this in mind, as with
any well designed system, I specified this system to have some headroom,
as I previously stated.  Everything we've discussed so far WRT this
system has been strictly parallel reads.

Now, if 10 cluster nodes are added with an application that performs
streaming writes, occurring concurrently with the 50 streaming reads,
we've just significantly increased the amount of head seeking on our
disks.  The combined IO workload is now a mixed heavy random read/write
workload.  This is the most difficult type of workload for any RAID
subsystem.  It would bring most parity RAID arrays to their knees.  This
is one of the reasons why RAID10 is the only suitable RAID level for
this type of system.

>> In summary, concatenating many relatively low stripe spindle count
>> arrays, and using XFS allocation groups to achieve parallel scalability,
>> gives us the performance we want without the problems associated with
>> other configurations.

> it is probably not the concurrency of XFS that makes the parallelism of
> the IO. 

It most certainly is the parallelism of XFS.  There are some caveats to
the amount of XFS IO parallelism that are workload dependent.  But
generally, with multiple processes/threads reading/writing multiple
files in multiple directories, the device parallelism is very high.  For
example:

If you have 50 NFS clients all reading the same large 20GB file
concurrently, IO parallelism will be limited to the 12 stripe spindles
on the single underlying RAID array upon which the AG holding this file
resides.  If no other files in the AG are being accessed at the time,
you'll get something like 1.8GB/s throughput for this 20GB file.  Since
the bulk, if not all, of this file will get cached during the read, all
50 NFS clients will likely be served from cache at their line rate of
200MB/s, or 10GB/s aggregate.  There's that magic 10GB/s number again.
;)  As you can see I put some serious thought into this system
specification.

If you have all 50 NFS clients accessing 50 different files in 50
different directories you have no cache benefit.  But we will have files
residing in all allocation groups on all 16 arrays.  Since XFS evenly
distributes new directories across AGs when the directories are created,
we can probably assume we'll have parallel IO across all 16 arrays with
this workload.  Since each array can stream reads at 1.8GB/s, that's
potential parallel throughput of 28GB/s, saturating our PCIe bus
bandwidth of 16GB/s.

Now change this to 50 clients each doing 10,000 4KB file reads in a
directory along with 10,000 4KB file writes.  The throughput of each 12
disk array may now drop by over a factor of approximately 128, as each
disk can only sustain about 300 head seeks/second, dropping its
throughput to 300 * 4096 bytes = 1.17MB/s.  Kernel readahead may help
some, but it'll still suck.

It is the occasional workload such as that above that dictates
overbuilding the disk subsystem.  Imagine adding a high IOPS NFS client
workload to this server after it went into production to "only" serve
large streaming reads.  The random workload above would drop the
performance of this 384 disk array with 15k spindles from a peak
streaming rate of 28.4GB/s to 18MB/s--yes, that's megabytes.

With one workload the disks can saturate the PCIe bus by almost a factor
of two.  With an opposite workload the disks can only transfer one
14,000th of the PCIe bandwidth.  This is why Fortune 500 companies and
others with extremely high random IO workloads such as databases, and
plenty of cash, have farms with multiple thousands of disks attached to
database and other servers.

> It is more likely the IO system, and that would also work for
> other file system types, like ext4. 

No.  The upper kernel layers don't provide this parallelism.  This is
strictly an XFS feature, although JFS had something similar (and JFS is
now all but dead), though not as performant.  BTRFS might have something
similar but I've read nothing about BTRFS internals.  Because XFS has
simply been the king of scalable filesystems for 15 years, and has added
great new capabilities along the way, all of the other filesystem
developers have started to steal ideas from XFS.  IIRC Ted Ts'o stole
some things from XFS for use in EXT4, but allocation groups weren't one
of them.

> I do not see anything in the XFS allocation
> blocks with any knowledge of the underlying disk structure. 

The primary structure that allows for XFS parallelism is
xfs_agnumber_t    sb_agcount

Making the filesystem with
mkfs.xfs -d agcount=16

creates 16 allocation groups of 1.752TB each in our case, 1 per 12
spindle array.  XFS will read/write to all 16 AGs in parallel, under the
right circumstances, with 1 or multiple  IO streams to/from each 12
spindle array.  XFS is the only Linux filesystem with this type of
scalability, again, unless BTRFS has something similar.
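
To make that concrete, assuming the concatenated --linear device shows up
as /dev/md16 (the name is only an example):

~# mkfs.xfs -d agcount=16 /dev/md16
~# xfs_info /mount/point | grep agcount

The second command just confirms the AG count once the filesystem is mounted.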

> What the file system does is only to administer the scheduling of the
> IO, in combination with the rest of the kernel.

Given that XFS has 64xxx lines of code, BTRFS has 46xxx, and EXT4 has
29xxx, I think there's a bit more to it than that, Keld. ;)  Note that
XFS has over twice the code size of EXT4.  That's not bloat but
features, one of them being allocation groups.  If your simplistic view of
this were correct we'd have only one Linux filesystem.  Filesystem code
does much, much more than you realize.

> Anyway, thanks for the energy and expertise that you are supplying to
> this thread.

High performance systems are one of my passions.  I'm glad to
participate and share.  Speaking of sharing, after further reading on
how the parallelism of AGs is done and some other related things, I'm
changing my recommendation to using only 16 allocation groups of 1.752TB
with this system, one AG per array, instead of 64 AGs of 438GB.  Using
64 AGs could potentially hinder parallelism in some cases.

-- 
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-22  9:46                                             ` Robin Hill
@ 2011-03-22 10:14                                               ` Keld Jørn Simonsen
  2011-03-23  8:53                                                 ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Keld Jørn Simonsen @ 2011-03-22 10:14 UTC (permalink / raw)
  To: Keld Jørn Simonsen, Stan Hoeppner, Mdadm, Roberto Spadim

On Tue, Mar 22, 2011 at 09:46:58AM +0000, Robin Hill wrote:
> On Mon Mar 21, 2011 at 11:13:04 +0100, Keld Jørn Simonsen wrote:
> 
> > On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> > > 
> > > > Anyway, with 384 spindles and only 50 users, each user will have in
> > > > average 7 spindles for himself. I think much of the time this would mean 
> > > > no random IO, as most users are doing large sequential reading. 
> > > > Thus on average you can expect quite close to striping speed if you
> > > > are running RAID capable of striping. 
> > > 
> > > This is not how large scale shared RAID storage works under a
> > > multi-stream workload.  I thought I explained this in sufficient detail.
> > >  Maybe not.
> > 
> > Given that the whole array system is only lightly loaded, this is how I
> > expect it to function. Maybe you can explain why it would not be so, if
> > you think otherwise.
> > 
> If you have more than one system accessing the array simultaneously then
> your sequential IO immediately becomes random (as it'll interleave the
> requests from the multiple systems). The more systems accessing
> simultaneously, the more random the IO becomes. Of course, there will
> still be an opportunity for some readahead, so it's not entirely random
> IO.

Of course the IO will be randomized if there are more users, but the
read IO will tend to be quite sequential if the reading of each process
is sequential. So if a user reads a big file sequentially, and the
system is lightly loaded, the IO schedulers will tend to order all the IO
for that process so that it is served in one series of operations,
given that the big file is laid out contiguously on the file system.
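
For what it is worth, the elevator and readahead doing that ordering can
be inspected and tuned per device, something like this (the device names
are only examples):

~# cat /sys/block/sda/queue/scheduler
~# blockdev --getra /dev/md0
~# blockdev --setra 16384 /dev/md0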

> > it is probably not the concurrency of XFS that makes the parallelism of
> > the IO. It is more likely the IO system, and that would also work for
> > other file system types, like ext4. I do not see anything in the XFS allocation
> > blocks with any knowledge of the underlying disk structure. 
> > What the file system does is only to administer the scheduling of the
> > IO, in combination with the rest of the kernel.

> XFS allows for splitting the single filesystem into multiple allocation
> groups. It can then allocate blocks from each group simultaneously
> without worrying about collisions. If the allocation groups are on
> separate physical spindles then (apart from the initial mapping of a
> request to an allocation group, which should be a very quick operation),
> the entire write process is parallelised.  Most filesystems have only a
> single allocation group, so the block allocation is single threaded and
> can easily become a bottleneck. It's only once the blocks are allocated
> (assuming the filesystem knows about the physical layout) that the
> writes can be parallelised. I've not looked into the details of ext4
> though, so I don't know whether it makes any moves towards parallelising
> block allocation.

Block allocation is only done when writing. The system at hand was
specified as a mostly-reading system, where such a block allocation
bottleneck is not so dominant.

Best regards
keld
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-22 10:00                                             ` Stan Hoeppner
@ 2011-03-22 11:01                                               ` Keld Jørn Simonsen
  0 siblings, 0 replies; 116+ messages in thread
From: Keld Jørn Simonsen @ 2011-03-22 11:01 UTC (permalink / raw)
  To: Stan Hoeppner
  Cc: Keld Jørn Simonsen, Mdadm, Roberto Spadim, NeilBrown,
	Christoph Hellwig, Drew

On Tue, Mar 22, 2011 at 05:00:40AM -0500, Stan Hoeppner wrote:
> Keld Jørn Simonsen put forth on 3/21/2011 5:13 PM:
> > On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> >> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
> >>
> >>> Anyway, with 384 spindles and only 50 users, each user will have in
> >>> average 7 spindles for himself. I think much of the time this would mean 
> >>> no random IO, as most users are doing large sequential reading. 
> >>> Thus on average you can expect quite close to striping speed if you
> >>> are running RAID capable of striping. 
> >>
> >> This is not how large scale shared RAID storage works under a
> >> multi-stream workload.  I thought I explained this in sufficient detail.
> >>  Maybe not.
> > 
> > Given that the whole array system is only lightly loaded, this is how I
> > expect it to function. Maybe you can explain why it would not be so, if
> > you think otherwise.
> 
> Using the term "lightly loaded" to describe any system sustaining
> concurrent 10GB/s block IO and NFS throughput doesn't seem to be an
> accurate statement.  I think you're confusing theoretical maximum
> hardware performance with real world IO performance.  The former is
> always significantly higher than the latter.  With this in mind, as with
> any well designed system, I specified this system to have some headroom,
> as I previously stated.  Everything we've discussed so far WRT this
> system has been strictly parallel reads.

The disks themselves should be capable of doing about 60 GB/s, so 10 GB/s
is only about 15 % utilization of the disks. And most of the IO is concurrent
sequential reading of big files.

> Now, if 10 cluster nodes are added with an application that performs
> streaming writes, occurring concurrently with the 50 streaming reads,
> we've just significantly increased the amount of head seeking on our
> disks.  The combined IO workload is now a mixed heavy random read/write
> workload.  This is the most difficult type of workload for any RAID
> subsystem.  It would bring most parity RAID arrays to their knees.  This
> is one of the reasons why RAID10 is the only suitable RAID level for
> this type of system.

Yes, I agree. And that is why I also suggest you use a mirrored raid in
the form of Linux MD RAID 10, F2, for better striping performance and disk
access performance than traditional RAID1+0.
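
The far layout is only a matter of the --layout flag at creation time,
for example (device names are only examples):

~# mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=24 /dev/sd[b-y]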

Anyway, the system was not specified to have an additional 10 heavy writing processes.

> >> In summary, concatenating many relatively low stripe spindle count
> >> arrays, and using XFS allocation groups to achieve parallel scalability,
> >> gives us the performance we want without the problems associated with
> >> other configurations.
> 
> > it is probably not the concurrency of XFS that makes the parallelism of
> > the IO. 
> 
> It most certainly is the parallelism of XFS.  There are some caveats to
> the amount of XFS IO parallelism that are workload dependent.  But
> generally, with multiple processes/threads reading/writing multiple
> files in multiple directories, the device parallelism is very high.  For
> example:
> 
> If you have 50 NFS clients all reading the same large 20GB file
> concurrently, IO parallelism will be limited to the 12 stripe spindles
> on the single underlying RAID array upon which the AG holding this file
> resides.  If no other files in the AG are being accessed at the time,
> you'll get something like 1.8GB/s throughput for this 20GB file.  Since
> the bulk, if not all, of this file will get cached during the read, all
> 50 NFS clients will likely be served from cache at their line rate of
> 200MB/s, or 10GB/s aggregate.  There's that magic 10GB/s number again.
> ;)  As you can see I put some serious thought into this system
> specification.
> 
> If you have all 50 NFS clients accessing 50 different files in 50
> different directories you have no cache benefit.  But we will have files
> residing in all allocations groups on all 16 arrays.  Since XFS evenly
> distributes new directories across AGs when the directories are created,
> we can probably assume we'll have parallel IO across all 16 arrays with
> this workload.  Since each array can stream reads at 1.8GB/s, that's
> potential parallel throughput of 28GB/s, saturating our PCIe bus
> bandwidth of 16GB/s.

Hmm, yes, RAID1+0 can probably only stream read at 1.8 GB/s. Linux MD
RAID10,F2 can stream read at around 3.6 GB/s on an array of 24
15,000 rpm spindles, given that each spindle is capable of stream
reading at about 150 MB/s.

> Now change this to 50 clients each doing 10,000 4KB file reads in a
> directory along with 10,000 4KB file writes.  The throughput of each 12
> disk array may now drop by over a factor of approximately 128, as each
> disk can only sustain about 300 head seeks/second, dropping its
> throughput to 300 * 4096 bytes = 1.17MB/s.  Kernel readahead may help
> some, but it'll still suck.
> 
> It is the occasional workload such as that above that dictates
> overbuilding the disk subsystem.  Imagine adding a high IOPS NFS client
> workload to this server after it went into production to "only" serve
> large streaming reads.  The random workload above would drop the
> performance of this 384 disk array with 15k spindles from a peak
> streaming rate of 28.4GB/s to 18MB/s--yes, that's megabytes.

Yes, random reading can diminish performance a lot.
If the mix of random/sequential reading still has a good sequential
part, then I think the system should still perform well. I think we lack
measurements for things like that, for example incremental sequential
reading speed on a non-saturated file system. I am not sure how to
define such measures, though.

> With one workload the disks can saturate the PCIe bus by almost a factor
> of two.  With an opposite workload the disks can only transfer one
> 14,000th of the PCIe bandwidth.  This is why Fortune 500 companies and
> others with extremely high random IO workloads such as databases, and
> plenty of cash, have farms with multiple thousands of disks attached to
> database and other servers.

Or use SSD.

> > It is more likely the IO system, and that would also work for
> > other file system types, like ext4. 
> 
> No.  Upper kernel layers doesn't provide this parallelism.  This is
> strictly an XFS feature, although JFS had something similar (and JFS is
> now all but dead), though not as performant.  BTRFS might have something
> similar but I've read nothing about BTRFS internals.  Because XFS has
> simply been the king of scalable filesystems for 15 years, and added
> great new capability along the way, all of the other filesystem
> developers have started to steal ideas from XFS.   IIRC Ted T'so stole
> some things from XFS for use in EXT4, but allocation groups wasn't one
> of them.
> 
> > I do not see anything in the XFS allocation
> > blocks with any knowledge of the underlying disk structure. 
> 
> The primary structure that allows for XFS parallelism is
> xfs_agnumber_t    sb_agcount
> 
> Making the filesystem with
> mkfs.xfs -d agcount=16
> 
> creates 16 allocations groups of 1.752TB each in our case, 1 per 12
> spindle array.  XFS will read/write to all 16 AGs in parallel, under the
> right circumstances, with 1 or multiple  IO streams to/from each 12
> spindle array.  XFS is the only Linux filesystem with this type of
> scalability, again, unless BTRFS has something similar.
> 
> > What the file system does is only to administer the scheduling of the
> > IO, in combination with the rest of the kernel.
> 
> Given that XFS has 64xxx lines of code, BTRFS has 46xxx, and EXT4 has
> 29xxx, I think there's a bit more to it than that Keld. ;)  Note that
> XFS has over twice the code size of EXT4.  That's not bloat but
> features, one them being allocation groups.  If your simplistic view of
> this was correct we'd have only one Linux filesystem.  Filesystem code
> does much much more than you realize.

Oh, well, of course the file system does a lot of things. And I have done
a number of designs and patches for a number of file systems over the years.
But I was talking about the overall picture. The CPU power should not be the
bottleneck; the bottleneck is the IO. So we use the kernel code to
administer the IO in the best possible way.  I am also using XFS for
many file systems, but I am also using EXT3, and I think I get
about the same results for the systems I run, which also do mostly
sequential reading of many big files concurrently (an FTP server).

> > Anyway, thanks for the energy and expertise that you are supplying to
> > this thread.
> 
> High performance systems are one of my passions.  I'm glad to
> participate and share.  Speaking of sharing, after further reading on
> how the parallelism of AGs is done and some other related things, I'm
> changing my recommendation to using only 16 allocation groups of 1.752TB
> with this system, one AG per array, instead of 64 AGs of 438GB.  Using
> 64 AGs could potentially hinder parallelism in some cases.

Thank you again for your insights
keld
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-22 10:14                                               ` Keld Jørn Simonsen
@ 2011-03-23  8:53                                                 ` Stan Hoeppner
  2011-03-23 15:57                                                   ` Roberto Spadim
  0 siblings, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-23  8:53 UTC (permalink / raw)
  To: Keld Jørn Simonsen
  Cc: Mdadm, Roberto Spadim, NeilBrown, Christoph Hellwig, Drew

Keld Jørn Simonsen put forth on 3/22/2011 5:14 AM:

> Of course the IO will be randomized, if there is more users, but the
> read IO will tend to be quite sequential, if the reading of each process
> is sequential. So if a user reads a big file sequentially, and the
> system is lightly loaded, IO schedulers will tend to order all IO
> for the process so that it is served in one series of operations,
> given that the big file is laid out consequently on the file system.

With the way I've architected this hypothetical system, the read load on
each allocation group (each 12 spindle array) should be relatively low,
about 3 streams on 14 AGs, 4 streams on the remaining two AGs,
_assuming_ the files being read are spread out evenly across at least 16
directories.  As you all read in the docs for which I provided links,
XFS AG parallelism functions at the directory and file level.  For
example, if we create 32 directories on a virgin XFS filesystem of 16
allocation groups, the following layout would result:

AG1:  /general requirements	AG1:  /alabama
AG2:  /site construction	AG2:  /alaska
AG3:  /concrete			AG3:  /arizona
..
..
AG14: /conveying systems	AG14: /indiana
AG15: /mechanical		AG15: /iowa
AG16: /electrical		AG16: /kansas

AIUI, the first 16 directories get created in consecutive AGs until we
hit the last AG.  The 17th directory is then created in the first AG and
we start the cycle over.  This is how XFS allocation group parallelism
works.  It doesn't provide linear IO scaling for all workloads, and it's
not magic, but it works especially well for multiuser fileservers, and
typically better than multi nested stripe levels or extremely wide arrays.
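
You can see this behavior on any small test XFS filesystem by creating a
few directories, dropping a file in each, and looking at the AG column
reported by xfs_bmap -v (the paths here are just an example):

~# mkdir /mnt/test/dir{1..4}
~# for d in /mnt/test/dir{1..4}; do dd if=/dev/zero of=$d/file bs=1M count=16; done
~# xfs_bmap -v /mnt/test/dir*/file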

Imagine you have a 5000 seat company.  You'd mount this XFS filesystem in
/home.  Each user home directory created would fall in a consecutive AG,
resulting in about 312 user dirs per AG.  In this type of environment
XFS AG parallelism will work marvelously as you'll achieve fairly
balanced IO across all AGs and thus all 16 arrays.

In the case where you have many clients reading files from only one
directory, hence the same AG, IO parallelism is limited to the 12
spindles of that one array.  When this happens, we end up with a highly
random workload at the disk head, resulting in high seek rates and low
throughput.  This is one of the reasons I built some "excess" capacity
into the disk subsystem.  Using XFS AGs for parallelism doesn't
guarantee even distribution of IO across all the 192 spindles of the 16
arrays.  It gives good parallelism if clients are accessing different
files in different directories concurrently, but not in the opposite case.

> The block allocation is only done when writing. The system at hand was
> specified as a mostly reading system, where such a bottleneck of block
> allocating is not so dominant.

This system would excel at massive parallel writes as well, again, as
long as we have many writers into multiple directories concurrently,
which spreads the write load across all AGs, and thus all arrays.

XFS is legendary for multiple large file parallel write throughput,
thanks to delayed allocation, and some other tricks.

-- 
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-23  8:53                                                 ` Stan Hoeppner
@ 2011-03-23 15:57                                                   ` Roberto Spadim
  2011-03-23 16:19                                                     ` Joe Landman
  2011-03-24  5:52                                                     ` Stan Hoeppner
  0 siblings, 2 replies; 116+ messages in thread
From: Roberto Spadim @ 2011-03-23 15:57 UTC (permalink / raw)
  To: Stan Hoeppner
  Cc: Keld Jørn Simonsen, Mdadm, NeilBrown, Christoph Hellwig, Drew

It's something like 'partitioning'?  I don't know XFS very well, but...
if 99% of the use goes to AG16 and only 1% to AG1-15, you should use a
striped RAID0 (for a better write/read rate); linear wouldn't help the
way striping does, am I right?

A question: this example was with directories, but how are files
(metadata) saved?  How is file content saved?  And journaling?

I see a filesystem as something like: read/write journaling
(metadata/files), read/write metadata, read/write file content,
check/repair filesystem, plus features (backup, snapshots, garbage
collection, RAID1, increase/decrease fs size, others).

The speed of writes and reads will be a function of how you design it to
use the device layer (it's something like virtual memory utilization: a
big memory, many programs using small parts of it, and occasionally one
needing a big part).


2011/3/23 Stan Hoeppner <stan@hardwarefreak.com>:
> Keld Jørn Simonsen put forth on 3/22/2011 5:14 AM:
>
>> Of course the IO will be randomized, if there is more users, but the
>> read IO will tend to be quite sequential, if the reading of each process
>> is sequential. So if a user reads a big file sequentially, and the
>> system is lightly loaded, IO schedulers will tend to order all IO
>> for the process so that it is served in one series of operations,
>> given that the big file is laid out consequently on the file system.
>
> With the way I've architected this hypothetical system, the read load on
> each allocation group (each 12 spindle array) should be relatively low,
> about 3 streams on 14 AGs, 4 streams on the remaining two AGs,
> _assuming_ the files being read are spread out evenly across at least 16
> directories.  As you all read in the docs for which I provided links,
> XFS AG parallelism functions at the directory and file level.  For
> example, if we create 32 directories on a virgin XFS filesystem of 16
> allocation groups, the following layout would result:
>
> AG1:  /general requirements     AG1:  /alabama
> AG2:  /site construction        AG2:  /alaska
> AG3:  /concrete                 AG3:  /arizona
> ..
> ..
> AG14: /conveying systems        AG14: /indiana
> AG15: /mechanical               AG15: /iowa
> AG16: /electrical               AG16: /kansas
>
> AIUI, the first 16 directories get created in consecutive AGs until we
> hit the last AG.  The 17th directory is then created in the first AG and
> we start the cycle over.  This is how XFS allocation group parallelism
> works.  It doesn't provide linear IO scaling for all workloads, and it's
> not magic, but it works especially well for multiuser fileservers, and
> typically better than multi nested stripe levels or extremely wide arrays.
>
> Imagine you have a 5000 seat company.  You'd mount this XFS filesytem in
> /home.  Each user home directory created would fall in a consecutive AG,
> resulting in about 312 user dirs per AG.  In this type of environment
> XFS AG parallelism will work marvelously as you'll achieve fairly
> balanced IO across all AGs and thus all 16 arrays.
>
> In the case where you have many clients reading files from only one
> directory, hence the same AG, IO parallelism is limited to the 12
> spindles of that one array.  When this happens, we end up with a highly
> random workload at the disk head, resulting in high seek rates and low
> throughput.  This is one of the reasons I built some "excess" capacity
> into the disk subsystem.  Using XFS AGs for parallelism doesn't
> guarantee even distribution of IO across all the 192 spindles of the 16
> arrays.  It gives good parallelism if clients are accessing different
> files in different directories concurrently, but not in the opposite case.
>
>> The block allocation is only done when writing. The system at hand was
>> specified as a mostly reading system, where such a bottleneck of block
>> allocating is not so dominant.
>
> This system would excel at massive parallel writes as well, again, as
> long as we have many writers into multiple directories concurrently,
> which spreads the write load across all AGs, and thus all arrays.
>
> XFS is legendary for multiple large file parallel write throughput,
> thanks to delayed allocation, and some other tricks.
>
> --
> Stan
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-23 15:57                                                   ` Roberto Spadim
@ 2011-03-23 16:19                                                     ` Joe Landman
  2011-03-24  8:05                                                       ` Stan Hoeppner
  2011-03-24 17:07                                                       ` Christoph Hellwig
  2011-03-24  5:52                                                     ` Stan Hoeppner
  1 sibling, 2 replies; 116+ messages in thread
From: Joe Landman @ 2011-03-23 16:19 UTC (permalink / raw)
  To: Mdadm

On 03/23/2011 11:57 AM, Roberto Spadim wrote:
> it's something like 'partitioning'? i don't know xfs very well, but ...
> if you use 99% ag16 and 1% ag1-15
> you should use a raid0 with stripe (for better write/read rate),
> linear wouldn't help like stripe, i'm right?
>
> a question... this example was with directories, how files (metadata)
> are saved? and how file content are saved? and jornaling?

I won't comment on the hardware design or component choices.  I will briefly
touch on the file system and MD raid.

MD RAID0 or RAID10 would be the sanest approach, and xfs happily does 
talk nicely to the MD raid system, gathering the stripe information from it.

The issue though is that xfs stores journals internally by default.  You 
can change this, and in specific use cases, an external journal is 
strongly advised.  This would be one such use case.
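
Setting one up is straightforward; something like the following, where
the log device name and size are only placeholders:

	mkfs.xfs -l logdev=/dev/sdz1,size=128m /dev/md0
	mount -o logdev=/dev/sdz1 /dev/md0 /mount/point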

Though, the OP wants a very read heavy machine, and not a write heavy 
machine.  So it makes more sense to have massive amounts of RAM for the 
OP, and lots of high speed fabric (Infiniband HCA, 10-40 GbE NICs, ...). 
  However, a single system design for the OP's requirements makes very 
little economic or practical sense.  Would be very expensive to build.

And to keep this on target, MD raid could handle it.

> i see a filesystem something like: read/write
> jornaling(metadata/files), read/write metadata, read/write file
> content, check/repair filesystem, features (backup, snapshot, garbage
> collection, raid1, increase/decrease fs size, others)

Unfortunately, xfs snapshots have to be done via LVM2 right now.  My 
memory isn't clear on this, there may be an xfs_freeze requirement for 
the snapshot to be really valid.  e.g.

	xfs_freeze -f /mount/point
	# insert your lvm snapshot command
	xfs_freeze -u /mount/point

I am not sure if this is still required.
	
> speed of write and read will be a function of how you designed it to
> use device layer (it's something like a virtual memory utilization, a
> big memory, and many programs trying to use small parts and when need
> use a big part)

At the end of the day, it will be *far* more economical to build a 
distributed storage cluster with a parallel file system atop it, than 
build a single large storage unit.  We've achieved well north of 10GB/s 
sustained reads and writes from thousands of simultaneous processes 
across thousands of cores (yes, with MD backed RAIDs being part of 
this), for hundreds of GB of reads/writes (well into the TB range).

Hardware design is very important here, as are many other features.  The 
BOM posted here notwithstanding, very good performance starts with good 
selection of underlying components, and a rational design.  Not all 
designs you might see are worth the electrons used to transport them to 
your reader.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-23 15:57                                                   ` Roberto Spadim
  2011-03-23 16:19                                                     ` Joe Landman
@ 2011-03-24  5:52                                                     ` Stan Hoeppner
  2011-03-24  6:33                                                       ` NeilBrown
  1 sibling, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-24  5:52 UTC (permalink / raw)
  To: Roberto Spadim
  Cc: Keld Jørn Simonsen, Mdadm, NeilBrown, Christoph Hellwig, Drew

Roberto Spadim put forth on 3/23/2011 10:57 AM:
> it's something like 'partitioning'? i don't know xfs very well, but ...
> if you use 99% ag16 and 1% ag1-15
> you should use a raid0 with stripe (for better write/read rate),
> linear wouldn't help like stripe, i'm right?

You should really read up on XFS internals to understand exactly how
allocation groups work.

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

I've explained the basics.  What I didn't mention is that an individual
file can be written concurrently to more than one allocation group,
yielding some of the benefit of striping but without the baggage of
RAID0 over 16 RAID10 or a wide stripe RAID10.  However, I've not been
able to find documentation stating exactly how this is done and under
what circumstances, and I would really like to know.  XFS has some good
documentation, but none of it goes into this kind of low level detail
with layperson-digestible descriptions.  I'm not a dev, so I'm unable to
understand how this works by reading the code.

Note that once such a large file is written, reading that file later
puts multiple AGs into play so you have read parallelism approaching the
performance of straight disk striping.

The problems with nested RAID0 over RAID10, or simply a very wide array
(384 disks in this case) are two fold:

1.  Lower performance with files smaller than the full stripe size
2.  Poor space utilization for the same reason

Let's analyze the wide RAID10 case.  With 384 disks you get a stripe
width of 192 spindles.  A common stripe block size is 64KB, or 16
filesystem blocks, 128 disk sectors.  Taking that 64KB and multiplying
by 192 stripe spindles we get a stripe size of exactly 12MB.

If you write a file much smaller than the stripe size, say a 1MB file,
to the filesystem atop this wide RAID10, the file will only be striped
across 16 of the 192 spindles, with 64KB going to each stripe member, 16
filesystem blocks, 128 sectors.  I don't know about mdraid, but with
many hardware RAID striping implementations the remaining 176 disks in
the stripe will have zeros or nulls written for their portion of the
stripe for this file that is a tiny fraction of the stripe size.  Also,
all modern disk drives are much more efficient when doing larger
multi-sector transfers of anywhere from 512KB to 1MB or more than with
small transfers of 64KB.

By using XFS allocation groups for parallelism instead of a wide stripe
array, you don't suffer from this massive waste of disk space, and,
since each file is striped across fewer disks (12 in the case of my
example system), we end up with slightly better throughput as each
transfer is larger, 170 sectors in this case.  The extremely wide array,
or nested stripe over striped array setup, is only useful in situations
where all files being written are close to or larger than the stripe
size.  There are many application areas where this is not only plausible
but preferred.  Most HPC applications work with data sets far larger
than the 12MB in this example, usually hundreds of megs if not multiple
gigs.  In this case extremely wide arrays are the way to go, whether
using a single large file store, a cluster of fileservers, or a cluster
filesystem on SAN storage such as CXFS.

Most other environments are going to have a mix of small and large
files, and all sizes in between.  This is the case where leveraging XFS
allocation group parallelism makes far more sense than a very wide
array, and why I chose this configuration for my example system.

Do note that XFS will also outperform any other filesystem when used
directly atop this same 192 spindle wide RAID10 array.  You'll still
have 16 allocation groups, but the performance characteristics of the
AGs change when the underlying storage is a wide stripe.  In this case
the AGs become cylinder groups from the outer to inner edge of the
disks, instead of each AG occupying an entire 12 spindle disk array.

In this case the AGs do more to prevent fragmentation than increase
parallel throughput at the hardware level.  AGs do always allow more
filesystem concurrency though, regardless of the underlying hardware
storage structure, because inodes can be allocated or read in parallel.
 This is due to the fact that each XFS AG has its own set of B+ trees and
inodes.  Each AG is a "filesystem within a filesystem".

If we pretend for a moment that an EXT4 filesystem can be larger than
16TB, in this case 28TB, and we tested this 192 spindle RAID10 array
with a highly parallel workload on both EXT4 and XFS, you'd find that
EXT4 throughput is a small fraction of XFS due to the fact that so much
of EXT4 IO is serialized, precisely because it lacks XFS' allocation
group architecture.

> a question... this example was with directories, how files (metadata)
> are saved? and how file content are saved? and jornaling?

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

> speed of write and read will be a function of how you designed it to
> use device layer (it's something like a virtual memory utilization, a
> big memory, and many programs trying to use small parts and when need
> use a big part)

Not only that, but how efficiently you can walk the directory tree to
locate inodes.  XFS can walk many directory trees in parallel, partly
due to allocation groups.  This is one huge advantage it has over
EXT2/3/4, ReiserFS, JFS, etc.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-24  5:52                                                     ` Stan Hoeppner
@ 2011-03-24  6:33                                                       ` NeilBrown
  2011-03-24  8:07                                                         ` Roberto Spadim
  2011-03-24  8:31                                                         ` Stan Hoeppner
  0 siblings, 2 replies; 116+ messages in thread
From: NeilBrown @ 2011-03-24  6:33 UTC (permalink / raw)
  To: Stan Hoeppner
  Cc: Roberto Spadim, Keld Jørn Simonsen, Mdadm, Christoph Hellwig, Drew

On Thu, 24 Mar 2011 00:52:00 -0500 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> If you write a file much smaller than the stripe size, say a 1MB file,
> to the filesystem atop this wide RAID10, the file will only be striped
> across 16 of the 192 spindles, with 64KB going to each stripe member, 16
> filesystem blocks, 128 sectors.  I don't know about mdraid, but with
> many hardware RAID striping implementations the remaining 176 disks in
> the stripe will have zeros or nulls written for their portion of the
> stripe for this file that is a tiny fraction of the stripe size. 

This doesn't make any sense at all.  No RAID - hardware or otherwise - is
going to write zeros to most of the stripe like this.  The RAID doesn't even
know about the concept of a file, so it couldn't.
The filesystem places files in the virtual device that is the array, and the
RAID just spreads those blocks out across the various devices.

There will be no space wastage.

If you have a 1MB file, then there is no way you can ever get useful 192-way
parallelism across that file.  But if you have 192 1MB files, then they will
be spread evenly across your spindles somehow (depending on FS and RAID level)
and if you have multiple concurrent accessors, they could well get close to
192-way parallelism.

NeilBrown


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-23 16:19                                                     ` Joe Landman
@ 2011-03-24  8:05                                                       ` Stan Hoeppner
  2011-03-24 13:12                                                         ` Joe Landman
  2011-03-24 17:07                                                       ` Christoph Hellwig
  1 sibling, 1 reply; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-24  8:05 UTC (permalink / raw)
  To: Joe Landman; +Cc: Mdadm

Joe Landman put forth on 3/23/2011 11:19 AM:

> MD RAID0 or RAID10 would be the sanest approach, and xfs happily does
> talk nicely to the MD raid system, gathering the stripe information from
> it.

Surely you don't mean a straight mdraid0 over the 384 drives, I assume.
 You're referring to the nested case I mentioned, yes?

Yes, XFS does read the mdraid parameters and sets its block size, stripe
unit, stripe width, etc. accordingly.
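
For example (chunk size and member count below are hypothetical), you
can either let mkfs.xfs read the geometry from the md device or state it
explicitly, then verify it:

    mkfs.xfs /dev/md0                      # geometry taken from md
    mkfs.xfs -d su=64k,sw=16 /dev/md0      # or specify stripe unit/width
    xfs_info /mountpoint                   # check sunit/swidth after mount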

> The issue though is that xfs stores journals internally by default.  You
> can change this, and in specific use cases, an external journal is
> strongly advised.  This would be one such use case.

The target workload is read heavy, very few writes.  Even if we added a
write heavy workload to the system, with the journal residing on an array
that's seeing heavy utilization from the primary workload, delayed
logging makes this a non-issue.

Thus, this is not a case where an external log device is needed.  In
fact, now that we have the delayed logging feature, cases where an
external log device might be needed are very few and far between.
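
(For reference, on the kernels current as of this writing delayed
logging is still an opt-in mount option; the exact option name may vary
by kernel version, so treat this as a sketch:

    mount -o delaylog,logbsize=256k /dev/md0 /data

on later kernels it becomes the default and the knob eventually goes
away.)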

> Though, the OP wants a very read heavy machine, and not a write heavy
> machine.  So it makes more sense to have massive amounts of RAM for the

Assuming the same files aren't being re-read, how does massive RAM
quantity for buffer cache help?

> OP, and lots of high speed fabric (Infiniband HCA, 10-40 GbE NICs, ...).
>  However, a single system design for the OP's requirements makes very
> little economic or practical sense.  Would be very expensive to build.

I estimated the cost of my proposed 10GB/s NFS server at $150-250k
including all required 10GbE switches, the works.  Did you read that
post?  What is your definition of "very expensive"?  Compared to?

> And to keep this on target, MD raid could handle it.

mdraid was mentioned in my system as well.  And yes, Neil seems to think
mdraid would be fine, not a CPU hog.

> Unfortunately, xfs snapshots have to be done via LVM2 right now.  My
> memory isn't clear on this, there may be an xfs_freeze requirement for
> the snapshot to be really valid.  e.g.

Why do you say "unfortunately"?  *ALL* Linux filesystem snapshots are
performed with a filesystem freeze implemented in the VFS layer.  The
freeze capability 'was' originally specific to XFS.  It is such a
valuable, *needed* feature that it was bumped up into the VFS so all
filesystems could take advantage of it.  Are you saying freezing writes
to a filesystem before taking a snapshot is a bad thing? (/incredulous)

http://en.wikipedia.org/wiki/XFS#Snapshots

>     xfs_freeze -f /mount/point
>     # insert your lvm snapshot command
>     xfs_freeze -u /mount/point
> 
> I am not sure if this is still required.

It's been fully automatic since 2.6.29, for all Linux filesystems.
Invoking an LVM snapshot automatically freezes the filesystem.
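
So the whole procedure today is just the snapshot command itself (volume
group and LV names below are made up):

    lvcreate --snapshot --size 10G --name data-snap /dev/vg0/data

The filesystem on /dev/vg0/data is frozen and thawed around the snapshot
automatically.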

> At the end of the day, it will be *far* more economical to build a
> distributed storage cluster with a parallel file system atop it, than
> build a single large storage unit.  

I must call BS on the "far more economical" comment.  At the end of the
day, to use your phrase, the cost of any large scale high performance
storage system comes down to the quantity and price of the disk drives
needed to achieve the required spindle throughput.  Whether you use one
$20K server chassis to host the NICs, disk controllers and all the
drives, or six $3,000 server chassis, the costs come out roughly the
same.  The big advantages of a single chassis server are simplicity of
design, maintenance, and use.  The only downside is the single point of
failure, not higher cost, compared to a storage cluster.  Failures of
complete server chassis are very rare, BTW, especially with quad socket
HP servers.

If it takes 8 of your JackRabbit boxen, 384 drives, to sustain 10+GB/s
using RAID10, maintaining that rate during a rebuild with a load of 50+
concurrent 200MB/s clients, we're looking at about $200K USD, correct,
$25K per box?  Your site doesn't show any pricing that I can find, so I'm
making an educated guess.  That cost figure is not substantially
different from my hypothetical configuration, but mine includes $40K of
HP 10GbE switches to connect the clients and the server at full bandwidth.

> We've achieved well north of 10GB/s
> sustained reads and writes from thousands of simultaneous processes
> across thousands of cores (yes, with MD backed RAIDs being part of
> this), for hundreds of GB reads/writes (well into the TB range)

That's great.  Also, be honest with the fine folks on the list.  You use
mdraid0 or linear for stitching hardware RAID arrays together, similar
to what I mentioned.  You're not using mdraid across all 48 drives in
your chassis.  If you are, the information on your website is incorrect
at best, misleading at worst, as it mentions "RAID Controllers" and
quantity per system model, 1-4 in the case of the JackRabbit.

> Hardware design is very important here, as are many other features.  The
> BOM posted here notwithstanding, very good performance starts with good
> selection of underlying components, and a rational design.  Not all
> designs you might see are worth the electrons used to transport them to
> your reader.

Fortunately for the readers here, such unworthy designs you mention
aren't posted on this list.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-24  6:33                                                       ` NeilBrown
@ 2011-03-24  8:07                                                         ` Roberto Spadim
  2011-03-24  8:31                                                         ` Stan Hoeppner
  1 sibling, 0 replies; 116+ messages in thread
From: Roberto Spadim @ 2011-03-24  8:07 UTC (permalink / raw)
  To: NeilBrown
  Cc: Stan Hoeppner, Keld Jørn Simonsen, Mdadm, Christoph Hellwig, Drew

I will read up on XFS again, but check whether I'm thinking right or wrong...

I see two ideas behind RAID0 (raw I/O rate vs. many users).

First, let's think of RAID0 as something like hard disk firmware.
The problem: we have many platters/heads and just one arm.

A hard disk = many platters + many heads + only one arm to move the heads
(maybe in the future we can have many arms in a single hard disk!).
Platters = many sectors = many bits (a hard disk works like NOR memory,
on bits rather than bytes or pages like NAND memory; to get a byte it
must either read across heads (stripe) or issue many reads (time consuming)).

The firmware could use:
raid0 stripe => group bits from different platters/heads into one
'block/byte/character' unit.  With 8 heads you can read a byte with a
single 'read all heads' command and merge the bits from heads 1-8; it
can be done in parallel, like RAID0 striping does in Linux software
RAID, in only one read cycle.
raid0 linear => read many bits from one platter to build a 'sector' of
bits (a 'block unit' too).  This can only be done sequentially (many
read cycles); you wait for bit 1 before reading bits 2,3,4,5,6,7,8,9...
unlike stripe, where you issue all the reads and then merge the bits
into a byte.

-----
it's like a 3GHz CPU with 1 core vs. a 1GHz CPU with 3 cores: which is
faster?  If you only need one CPU cycle, 3GHz is faster.
The hard disk has just one real problem: random reads.
Think about mixing SSDs and hard disks (some drives have this built in -
did you try them?  they are nice - and there's bcache plus a Facebook
Linux kernel module [flashcache] to emulate it in the OS).  Then you no
longer have the random read problem, since SSDs are very good at random
reads.
-----
the only 'magic' I think a filesystem can do is:
1) online compression - think about 32MB blocks: reading 12MB of
compressed data can give you 32MB of uncompressed data; if you want
more you jump to the sector of the next 32MB block, and RAID0 striping
lets the second disk be used so you don't wait for the first disk's
access time
2) grouping files with similar access patterns (I think this is what XFS
calls allocation groups).  It could be driven by statistics: access
time, read rate, write rate, file size, create/delete rate, file type
(symbolic links, directories, regular files, devices, pipes, etc.),
metadata, journaling
3) knowing how the device behaves: good for writes, good for reads, good
for sequential reads (few arms - stripe), good for random reads (SSD),
good for multitasking (many arms - linear)
----------------

Reading about hard disks on database forums/blogs (intensive disk
users)...
Hard disks work better with big blocks, since they pay one access time
to read more information:
    read rate   = bytes read / total time
    total time  = access time + read time
    access time = arm positioning + rotational positioning
    read time   = a function of disk speed (7200rpm, 10krpm, 15krpm...)
                  and sector bits per disk revolution

Thinking about this... sequential reads are fast, random reads are slow.
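
Rough numbers for a typical 7200rpm drive (assumed, for illustration only):

    access time  ~ 8ms seek + ~4ms half rotation   = ~12ms
    random 64KB  : 64KB / (12ms + 0.6ms transfer)  = ~5MB/s effective
    random 1MB   : 1MB  / (12ms + 10ms transfer)   = ~45MB/s effective
    sequential   : ~100MB/s (no seeks)

which is why bigger blocks and fewer seeks matter so much.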

How to optimise random reads?  Read-ahead, and RAID0 (an arm for each
group of sectors).
How can a filesystem optimise random reads?  Try not to fragment the
most accessed files, keep them close together, turn random reads into
cached sequential reads, and use statistics (most read, most written,
file size, create/delete rate, etc.) to predict the best candidates for
future use.

I think that's all a filesystem and RAID0 can do.

2011/3/24 NeilBrown <neilb@suse.de>:
> On Thu, 24 Mar 2011 00:52:00 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>
>> If you write a file much smaller than the stripe size, say a 1MB file,
>> to the filesystem atop this wide RAID10, the file will only be striped
>> across 16 of the 192 spindles, with 64KB going to each stripe member, 16
>> filesystem blocks, 128 sectors.  I don't know about mdraid, but with
>> many hardware RAID striping implementations the remaining 176 disks in
>> the stripe will have zeros or nulls written for their portion of the
>> stripe for this file that is a tiny fraction of the stripe size.
>
> This doesn't make any sense at all.  No RAID - hardware or otherwise - is
> going to write zeros to most of the stripe like this.  The RAID doesn't even
> know about the concept of a file, so it couldn't.
> The filesystem places files in the virtual device that is the array, and the
> RAID just spreads those blocks out across the various devices.
>
> There will be no space wastage.
>
> If you have a 1MB file, then there is no way you can ever get useful 192-way
> parallelism across that file.  But if you have 192 1MB files, then they will
> be spread evenly across your spindles somehow (depending on FS and RAID level)
> and if you have multiple concurrent accessors, they could well get close to
> 192-way parallelism.
>
> NeilBrown
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-24  6:33                                                       ` NeilBrown
  2011-03-24  8:07                                                         ` Roberto Spadim
@ 2011-03-24  8:31                                                         ` Stan Hoeppner
  1 sibling, 0 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-24  8:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Roberto Spadim, Keld Jørn Simonsen, Mdadm, Christoph Hellwig, Drew

NeilBrown put forth on 3/24/2011 1:33 AM:
> On Thu, 24 Mar 2011 00:52:00 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
> 
>> If you write a file much smaller than the stripe size, say a 1MB file,
>> to the filesystem atop this wide RAID10, the file will only be striped
>> across 16 of the 192 spindles, with 64KB going to each stripe member, 16
>> filesystem blocks, 128 sectors.  I don't know about mdraid, but with
>> many hardware RAID striping implementations the remaining 176 disks in
>> the stripe will have zeros or nulls written for their portion of the
>> stripe for this file that is a tiny fraction of the stripe size. 
> 
> This doesn't make any sense at all.  No RAID - hardware or otherwise - is
> going to write zeros to most of the stripe like this.  The RAID doesn't even
> know about the concept of a file, so it couldn't.
> The filesystem places files in the virtual device that is the array, and the
> RAID just spreads those blocks out across the various devices.
> 
> There will be no space wastage.

Well that's good to know then.  Apparently I was confusing partial block
writes with partial stripe writes.  Thanks for clarifying this Neil.

> If you have a 1MB file, then there is no way you can ever get useful 192-way
> parallelism across that file.  

That was exactly my point.  Hence my recommendation against very wide
stripe arrays for general purpose fileservers.

> But if you have 192 1MB files, then they will
> be spread evenly across your spindles somehow (depending on FS and RAID level)
> and if you have multiple concurrent accessors, they could well get close to
> 192-way parallelism.

The key here being parallelism, to a great extent.  All 192 files would
need to be in the queue simultaneously.  This would have to be a
relatively busy file or DB server.

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-24  8:05                                                       ` Stan Hoeppner
@ 2011-03-24 13:12                                                         ` Joe Landman
  2011-03-25  7:06                                                           ` Stan Hoeppner
  0 siblings, 1 reply; 116+ messages in thread
From: Joe Landman @ 2011-03-24 13:12 UTC (permalink / raw)
  To: Stan Hoeppner; +Cc: Mdadm

On 03/24/2011 04:05 AM, Stan Hoeppner wrote:

>> At the end of the day, it will be *far* more economical to build a
>> distributed storage cluster with a parallel file system atop it, than
>> build a single large storage unit.
>
> I must call BS on the "far more economical" comment.  At the end of the

I find it funny ... really, that the person who hasn't designed and 
built the thing that we have is calling BS on us.

This is the reason why email filters were developed.

In another email, Neil corrected some of Stan's other fundamental 
misconceptions about RAID writes.  Christoph corrected others.  Free 
advice here ... proceed with caution if you are considering using *any* 
of his advice, and get it sanity checked beforehand.

[...]

>> We've achieved well north of 10GB/s

It is important to note this.  We have.  He hasn't.

One thing we deal with on a fairly regular basis is people slapping 
components together that they think will work, with expectations set 
really high on the performance side.  Expectations get moderated by 
experience.  Those who've done these things know what troubles await; 
those who don't look at specs and say "I need X of these, Y of those, 
and my performance troubles will be gone."  It doesn't work that way.  
Watching such a process unfold is akin to watching a slow motion train 
wreck in a movie ... you don't want it to happen, but it will, and it 
won't end well.

>> sustained reads and writes from thousands of simultaneous processes
>> across thousands of cores (yes, with MD backed RAIDs being part of
>> this), for hundreds of GB reads/writes (well into the TB range)
>
> That's great.  Also, be honest with the fine folks on the list.  You use
> mdraid0 or linear for stitching hardware RAID arrays together, similar
> to what I mentioned.  You're not using mdraid across all 48 drives in

Again, since we didn't talk about how we use MD RAID, he doesn't know. 
He then constructs a strawman and proceeds to knock it down.

I won't fisk the rest of this; just make sure that, before you take his 
advice, you check with someone who has done it.  He doesn't grok why one 
might need lots of RAM in a read heavy scenario, or how RAID writes 
work, or ...

Yeah, you need to be pretty careful taking advice on building RAID or 
high performance scalable file server systems like this from people who 
haven't, are guessing, and are getting their answers corrected at a deep 
fundamental level by others.

[...]

> Fortunately for the readers here, such unworthy designs you mention
> aren't posted on this list.

... says the person who hasn't designed/built/tested the configurations 
that the group he is criticizing has successfully deployed ...

As a reminder of the thread history, he started out singing the praises 
of the Nexsan FC targets and indicated MD RAID wasn't up to the task, 
that it wasn't "a professionally used solution" or some similar 
statement.  Then he attacked anyone who disagreed with him or pointed 
out flaws in his statements.  When people like me (and others) suggested 
cluster file systems, he went his single system design way and, again 
using FC/SAS, decided that a linear stripe was the right approach.

Heh!

Nothing to see here folks, adjust your filters accordingly.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman@scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/sicluster
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-23 16:19                                                     ` Joe Landman
  2011-03-24  8:05                                                       ` Stan Hoeppner
@ 2011-03-24 17:07                                                       ` Christoph Hellwig
  1 sibling, 0 replies; 116+ messages in thread
From: Christoph Hellwig @ 2011-03-24 17:07 UTC (permalink / raw)
  To: Joe Landman; +Cc: Mdadm

On Wed, Mar 23, 2011 at 12:19:39PM -0400, Joe Landman wrote:
> The issue though is that xfs stores journals internally by default.
> You can change this, and in specific use cases, an external journal
> is strongly advised.  This would be one such use case.

In general, if you have enough spindles, or an SSD holding the log for
an otherwise disk based setup, the external log will always be
a win.  For many workloads the log will be the source of the only
backwards seeks.
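
For example (device names are placeholders), the log device is chosen at
mkfs time and must be given again at mount time:

    mkfs.xfs -l logdev=/dev/ssd1,size=128m /dev/md0
    mount -o logdev=/dev/ssd1 /dev/md0 /data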

This is slightly offtopic here though, because as Joe already
said correctly, it won't matter too much for a read heavy workload.

> Unfortunately, xfs snapshots have to be done via LVM2 right now.  My
> memory isn't clear on this, there may be an xfs_freeze requirement
> for the snapshot to be really valid.  e.g.

That hasn't been needed for a long time now - device mapper
now calls the freeze_fs method to invoke exactly the same code
to freeze the filesystem.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: high throughput storage server?
  2011-03-24 13:12                                                         ` Joe Landman
@ 2011-03-25  7:06                                                           ` Stan Hoeppner
  0 siblings, 0 replies; 116+ messages in thread
From: Stan Hoeppner @ 2011-03-25  7:06 UTC (permalink / raw)
  To: Joe Landman; +Cc: Mdadm

Joe Landman put forth on 3/24/2011 8:12 AM:
> On 03/24/2011 04:05 AM, Stan Hoeppner wrote:

>> I must call BS on the "far more economical" comment.  At the end of the
> 
> I find it funny ... really, that the person whom hasn't designed and
> built the thing that we have, is calling BS on us.

Demonstrate your systems are "far more economical" than the estimate I
gave for the system I specified.  You made the claim, I challenged it.
Back your claim.

[lots of smoke/mirrors personal attacks deleted]

> Again, since we didn't talk about how we use MD RAID, he doesn't know.
> Then constructs a strawman and proceeds to knock it down.

Answer the question:  do you use/offer/sell hardware RAID controllers?

[lots more smoke/mirrors personal attacks deleted]

-- 
Stan

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2011-03-25  7:06 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-14 23:59 high throughput storage server? Matt Garman
2011-02-15  2:06 ` Doug Dumitru
2011-02-15  4:44   ` Matt Garman
2011-02-15  5:49     ` hansbkk
2011-02-15  9:43     ` David Brown
2011-02-24 20:28       ` Matt Garman
2011-02-24 20:43         ` David Brown
2011-02-15 15:16     ` Joe Landman
2011-02-15 20:37       ` NeilBrown
2011-02-15 20:47         ` Joe Landman
2011-02-15 21:41           ` NeilBrown
2011-02-24 20:58       ` Matt Garman
2011-02-24 21:20         ` Joe Landman
2011-02-26 23:54           ` high throughput storage server? GPFS w/ 10GB/s throughput to the rescue Stan Hoeppner
2011-02-27  0:56             ` Joe Landman
2011-02-27 14:55               ` Stan Hoeppner
2011-03-12 22:49                 ` Matt Garman
2011-02-27 21:30     ` high throughput storage server? Ed W
2011-02-28 15:46       ` Joe Landman
2011-02-28 23:14         ` Stan Hoeppner
2011-02-28 22:22       ` Stan Hoeppner
2011-03-02  3:44       ` Matt Garman
2011-03-02  4:20         ` Joe Landman
2011-03-02  7:10           ` Roberto Spadim
2011-03-02 19:03             ` Drew
2011-03-02 19:20               ` Roberto Spadim
2011-03-13 20:10                 ` Christoph Hellwig
2011-03-14 12:27                   ` Stan Hoeppner
2011-03-14 12:47                     ` Christoph Hellwig
2011-03-18 13:16                       ` Stan Hoeppner
2011-03-18 14:05                         ` Christoph Hellwig
2011-03-18 15:43                           ` Stan Hoeppner
2011-03-18 16:21                             ` Roberto Spadim
2011-03-18 22:01                             ` NeilBrown
2011-03-18 22:23                               ` Roberto Spadim
2011-03-20  1:34                               ` Stan Hoeppner
2011-03-20  3:41                                 ` NeilBrown
2011-03-20  5:32                                   ` Roberto Spadim
2011-03-20 23:22                                     ` Stan Hoeppner
2011-03-21  0:52                                       ` Roberto Spadim
2011-03-21  2:44                                       ` Keld Jørn Simonsen
2011-03-21  3:13                                         ` Roberto Spadim
2011-03-21  3:14                                           ` Roberto Spadim
2011-03-21 17:07                                             ` Stan Hoeppner
2011-03-21 14:18                                         ` Stan Hoeppner
2011-03-21 17:08                                           ` Roberto Spadim
2011-03-21 22:13                                           ` Keld Jørn Simonsen
2011-03-22  9:46                                             ` Robin Hill
2011-03-22 10:14                                               ` Keld Jørn Simonsen
2011-03-23  8:53                                                 ` Stan Hoeppner
2011-03-23 15:57                                                   ` Roberto Spadim
2011-03-23 16:19                                                     ` Joe Landman
2011-03-24  8:05                                                       ` Stan Hoeppner
2011-03-24 13:12                                                         ` Joe Landman
2011-03-25  7:06                                                           ` Stan Hoeppner
2011-03-24 17:07                                                       ` Christoph Hellwig
2011-03-24  5:52                                                     ` Stan Hoeppner
2011-03-24  6:33                                                       ` NeilBrown
2011-03-24  8:07                                                         ` Roberto Spadim
2011-03-24  8:31                                                         ` Stan Hoeppner
2011-03-22 10:00                                             ` Stan Hoeppner
2011-03-22 11:01                                               ` Keld Jørn Simonsen
2011-02-15 12:29 ` Stan Hoeppner
2011-02-15 12:45   ` Roberto Spadim
2011-02-15 13:03     ` Roberto Spadim
2011-02-24 20:43       ` Matt Garman
2011-02-24 20:53         ` Zdenek Kaspar
2011-02-24 21:07           ` Joe Landman
2011-02-15 13:39   ` David Brown
2011-02-16 23:32     ` Stan Hoeppner
2011-02-17  0:00       ` Keld Jørn Simonsen
2011-02-17  0:19         ` Stan Hoeppner
2011-02-17  2:23           ` Roberto Spadim
2011-02-17  3:05             ` Stan Hoeppner
2011-02-17  0:26       ` David Brown
2011-02-17  0:45         ` Stan Hoeppner
2011-02-17 10:39           ` David Brown
2011-02-24 20:49     ` Matt Garman
2011-02-15 13:48 ` Zdenek Kaspar
2011-02-15 14:29   ` Roberto Spadim
2011-02-15 14:51     ` A. Krijgsman
2011-02-15 16:44       ` Roberto Spadim
2011-02-15 14:56     ` Zdenek Kaspar
2011-02-24 20:36       ` Matt Garman
2011-02-17 11:07 ` John Robinson
2011-02-17 13:36   ` Roberto Spadim
2011-02-17 13:54     ` Roberto Spadim
2011-02-17 21:47   ` Stan Hoeppner
2011-02-17 22:13     ` Joe Landman
2011-02-17 23:49       ` Stan Hoeppner
2011-02-18  0:06         ` Joe Landman
2011-02-18  3:48           ` Stan Hoeppner
2011-02-18 13:49 ` Mattias Wadenstein
2011-02-18 23:16   ` Stan Hoeppner
2011-02-21 10:25     ` Mattias Wadenstein
2011-02-21 21:51       ` Stan Hoeppner
2011-02-22  8:57         ` David Brown
2011-02-22  9:30           ` Mattias Wadenstein
2011-02-22  9:49             ` David Brown
2011-02-22 13:38           ` Stan Hoeppner
2011-02-22 14:18             ` David Brown
2011-02-23  5:52               ` Stan Hoeppner
2011-02-23 13:56                 ` David Brown
2011-02-23 14:25                   ` John Robinson
2011-02-23 15:15                     ` David Brown
2011-02-23 23:14                       ` Stan Hoeppner
2011-02-24 10:19                         ` David Brown
2011-02-23 21:59                     ` Stan Hoeppner
2011-02-23 23:43                       ` John Robinson
2011-02-24 15:53                         ` Stan Hoeppner
2011-02-23 21:11                   ` Stan Hoeppner
2011-02-24 11:24                     ` David Brown
2011-02-24 23:30                       ` Stan Hoeppner
2011-02-25  8:20                         ` David Brown
2011-02-19  0:24   ` Joe Landman
2011-02-21 10:04     ` Mattias Wadenstein
