From mboxrd@z Thu Jan 1 00:00:00 1970
From: Roberto Spadim
Subject: Re: high throughput storage server?
Date: Mon, 21 Mar 2011 14:08:20 -0300
Message-ID:
References: <4D7E0994.3020303@hardwarefreak.com> <20110314124733.GA31377@infradead.org> <4D835B2A.1000805@hardwarefreak.com> <20110318140509.GA26226@infradead.org> <4D837DAF.6060107@hardwarefreak.com> <20110319090101.1786cc2a@notabene.brown> <4D8559A2.6080209@hardwarefreak.com> <20110320144147.29141f04@notabene.brown> <4D868C36.5050304@hardwarefreak.com> <20110321024452.GA23100@www2.open-std.org> <4D875E51.50807@hardwarefreak.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
Return-path:
In-Reply-To: <4D875E51.50807@hardwarefreak.com>
Sender: linux-raid-owner@vger.kernel.org
To: Stan Hoeppner
Cc: Keld Jørn Simonsen, Mdadm, NeilBrown, Christoph Hellwig, Drew
List-Id: linux-raid.ids

Hmm, I think you have everything you need to work with md RAID and the
hardware, right? The XFS allocation groups are nice; I don't know which
workloads they can handle. Maybe the linear concatenation works better
here than a RAID0 stripe (I must test it). I think you know what you
are doing =)  Any more doubts?
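
Just so I am sure I follow the layout you describe below (16 RAID10
arrays of 24 disks each, concatenated with --linear, then 64 allocation
groups), a rough sketch of the commands; untested, and the device names
(/dev/sd[b-y], /dev/md1../dev/md16, /dev/md100) are only placeholders,
not from your spec:

~# mdadm --create /dev/md1 --level=10 --raid-devices=24 /dev/sd[b-y]
   # ... repeat for /dev/md2 .. /dev/md16, with 24 new drives each ...
~# mdadm --create /dev/md100 --level=linear --raid-devices=16 /dev/md{1..16}
~# mkfs.xfs -d agcount=64 /dev/md100

No stripe unit/width to line up at the concat level, only the per-array
RAID10 chunk size, if I understood your mail correctly.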

2011/3/21 Stan Hoeppner:
> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
>
>> Are you then building the system yourself, and running Linux MD RAID?
>
> No.  These specifications meet the needs of Matt Garman's analysis
> cluster, and extend that performance from 6GB/s to 10GB/s.  Christoph's
> comments about 10GB/s throughput with XFS on large CPU count Altix 4000
> series machines from a few years ago prompted me to specify a single
> chassis multicore AMD Opteron based system that can achieve the same
> throughput at substantially lower cost.
>
>> Anyway, with 384 spindles and only 50 users, each user will have on
>> average 7 spindles for himself. I think much of the time this would
>> mean no random IO, as most users are doing large sequential reading.
>> Thus on average you can expect quite close to striping speed if you
>> are running RAID capable of striping.
>
> This is not how large scale shared RAID storage works under a
> multi-stream workload.  I thought I explained this in sufficient
> detail.  Maybe not.
>
>> I am puzzled about the --linear concatenating. I think this may cause
>> the disks in the --linear array to be considered as one spindle, and
>> thus no concurrent IO will be made. I may be wrong there.
>
> You are puzzled because you are not familiar with the large scale
> performance features built into the XFS filesystem.  XFS allocation
> groups automatically enable large scale parallelism on a single logical
> device comprised of multiple arrays or single disks, when configured
> correctly.  See:
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html
>
> The storage pool in my proposed 10GB/s NFS server system consists of 16
> RAID10 arrays comprised of 24 disks of 146GB capacity, 12 stripe
> spindles per array, 1.752TB per array, 28TB total raw.  Concatenating
> the 16 array devices with mdadm --linear creates a 28TB logical device.
> We format it with this simple command, not having to worry about stripe
> block size, stripe spindle width, stripe alignment, etc:
>
> ~# mkfs.xfs -d agcount=64
>
> Using this method to achieve parallel scalability is simpler and less
> prone to configuration errors when compared to multi-level striping,
> which often leads to poor performance and poor space utilization.  With
> 64 XFS allocation groups the kernel can read/write 4 concurrent streams
> from/to each array of 12 spindles, which should be able to handle this
> load with plenty of headroom.  This system has 32 SAS 6G channels, each
> able to carry two 300MB/s streams, 19.2GB/s aggregate, substantially
> more than our 10GB/s target.  I was going to state that we're limited
> to 10.4GB/s due to the PCIe/HT bridge to the processor.  However, I
> just realized I made an error when specifying the DL585 G7 with only 2
> processors.  See [1] below for details.
>
> Using XFS in this manner allows us to avoid nested striped arrays and
> the inherent problems associated with them.  For example, in the
> absence of XFS allocation groups to provide our parallelism, we could
> do the following:
>
> 1.  Width 16 RAID0 stripe over width 12 RAID10 stripe
> 2.  Width 16 LVM   stripe over width 12 RAID10 stripe
>
> In either case, what is the correct/optimum stripe block size for each
> level when nesting the two?  The answer is that there really aren't
> correct or optimum stripe sizes in this scenario.  Writes to the top
> level stripe will be broken into 16 chunks.  Each of these 16 chunks
> will then be broken into 12 more chunks.  You may be thinking, "Why
> don't we just create one 384 disk RAID10?  It would SCREAM with 192
> spindles!!"  There are many reasons why nobody does this, one being the
> same stripe block size issue as with nested stripes.  Extremely wide
> arrays have a plethora of problems associated with them.
>
> In summary, concatenating many relatively low stripe spindle count
> arrays, and using XFS allocation groups to achieve parallel
> scalability, gives us the performance we want without the problems
> associated with other configurations.
>
> [1]  In order to get all 11 PCIe slots in the DL585 G7 one must use the
> 4 socket model, as the additional PCIe slots of the mezzanine card
> connect to two additional SR5690 chips, each one connecting to an HT
> port on each of the two additional CPUs.  Thus, I'm re-specifying the
> DL585 G7 model to have 4 Opteron 6136 CPUs instead of two, 32 cores
> total.  The 128GB in 16 RDIMMs will be spread across all 16 memory
> channels.  Memory bandwidth thus doubles to 160GB/s and interconnect
> b/w doubles to 320GB/s.  Thus, we now have up to 19.2 GB/s of available
> one way disk bandwidth as we're no longer limited by a 10.4GB/s HT/PCIe
> link.  Adding the two required CPUs may have just made this system
> capable of 15GB/s NFS throughput for less than $5000 additional cost,
> not due to the processors, but the extra IO bandwidth enabled as a
> consequence of their inclusion.  Adding another quad port 10 GbE NIC
> will take it close to 20GB/s NFS throughput.  Shame on me for not
> digging far deeper into the DL585 G7 docs.
>
> --
> Stan

--
Roberto Spadim
Spadim Technology / SPAEmpresarial
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html