From: Richard Wareing
Subject: Re: [PATCH 1/3] xfs: Add rtdefault mount option
Date: Fri, 1 Sep 2017 23:37:37 +0000
Message-ID: <67F62657-D116-4B85-9452-5BAB52EC7041@fb.com>
In-Reply-To: <20170901225539.GC10621@dastard>
To: Dave Chinner
Cc: Brian Foster, linux-xfs@vger.kernel.org

> On Sep 1, 2017, at 3:55 PM, Dave Chinner wrote:
>
> [Saturday morning here, so just a quick comment]
>
> On Fri, Sep 01, 2017 at 08:36:53PM +0000, Richard Wareing wrote:
>>> On Sep 1, 2017, at 12:32 PM, Brian Foster wrote:
>>>
>>> On Fri, Sep 01, 2017 at 06:39:09PM +0000, Richard Wareing wrote:
>>>> Thanks for the quick feedback Dave! My comments are in-line below.
>>>>
>>>>> On Aug 31, 2017, at 9:31 PM, Dave Chinner wrote:
>>>>>
>>>>> Hi Richard,
>>>>>
>>>>> On Thu, Aug 31, 2017 at 06:00:21PM -0700, Richard Wareing wrote:
>>> ...
>>>>>> add support for the more sophisticated AG based block allocator
>>>>>> to RT (the bitmapped version works well for us, but
>>>>>> multi-threaded use-cases might not do as well).
>>>>>
>>>>> That's a great big can of worms - not sure we want to open it. The
>>>>> simplicity of the rt allocator is one of its major benefits to
>>>>> workloads that require deterministic allocation behaviour...
>>>>
>>>> Agreed, I took a quick look at what it might take and came to a
>>>> similar conclusion, but I can dream :).
>>>>
>>>
>>> Just a side point based on the discussion so far... I kind of get
>>> the impression that the primary reason for using realtime support
>>> here is the simple fact that it's a separate physical device. That
>>> provides a basic mechanism to split files across fast and slow
>>> physical storage based on some up-front heuristic. The fact that
>>> the realtime feature uses a separate allocation algorithm is
>>> actually irrelevant (and possibly a problem in the future).
>>>
>>> Is that an accurate assessment? If so, it makes me wonder whether
>>> it's worth thinking about ways to get the same behavior using
>>> traditional functionality. This ignores Dave's question about how
>>> much of the performance actually comes from simply separating out
>>> the log, but suppose, for example, we had a JBOD block device made
>>> up of a combination of spinning and solid state disks via
>>> device-mapper, with the requirement that a boundary from fast ->
>>> slow and vice versa was always at something like a 100GB alignment.
>>> Then if you formatted that device with XFS using 100GB AGs (or
>>> whatever makes them line up), and could somehow tag each AG as
>>> "fast" or "slow" based on the known underlying device mapping,
>
> Not a new idea. :)
>
> I've got old xfs_spaceman patches sitting around somewhere for ioctls
> to add such information to individual AGs. I think I called them
> "concat groups" to allow multiple AGs to sit inside a single
> concatenation, and they added a policy layer over the top of AGs to
> control things like metadata placement....
>
>>> could you potentially get the same results by using the same
>>> heuristics to direct files to particular sets of AGs rather than
>>> between two physical devices?
>
> That's pretty much what I was working on back at SGI in 2007, i.e.
> providing a method for configuring AGs with different characteristics
> and a userspace policy interface to configure and make use of it....
>
> http://oss.sgi.com/archives/xfs/2009-02/msg00250.html
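
For concreteness, a minimal sketch of the layout being described above,
assuming made-up /dev/ssd and /dev/hdd devices and made-up sizes. Only
the alignment and AG sizing below are plain configuration; the per-AG
"fast"/"slow" tagging and the policy that acts on it are exactly the
pieces that don't exist in mainline:

# Concatenate an SSD and an HDD with dm-linear so the fast -> slow
# boundary lands exactly on a 100GiB multiple.  Lengths are in 512-byte
# sectors: 100GiB = 209715200, 2TiB = 4294967296.
printf '%s\n' \
  "0 209715200 linear /dev/ssd 0" \
  "209715200 4294967296 linear /dev/hdd 0" | dmsetup create tiered

# Format with 100GiB allocation groups so every AG sits entirely on one
# side of the boundary: AG 0 maps to the SSD, the remaining AGs to the
# HDD.
mkfs.xfs -d agsize=100g /dev/mapper/tiered

What is still missing after this is the kernel/userspace policy piece
that knows AG 0 is fast and steers allocations accordingly - which is
what the "concat groups" patches and the policy interface above were
aimed at.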
>>> Obviously there are some differences, like metadata being spread
>>> across the fast/slow devices (though I think we had such a thing as
>>> metadata-only AGs), etc.
>
> We have "metadata preferred" AGs, and that is what the inode32 policy
> uses to place all the inodes and directory/attribute metadata in the
> 32-bit inode address space. It doesn't get used for data unless the
> rest of the filesystem is ENOSPC.
>
>>> I'm just handwaving here to try and better understand the goal.
>
> We've been down these paths many times - the problem has always been
> that the people who want complex, configurable allocation policies
> for their workload have never provided the resources needed to
> implement anything past "here's a mount option hack that works for
> us".....
>
>> Sorry, I forgot to clarify the origins of the performance wins here.
>> This is obviously very workload dependent (e.g. write/flush/inode
>> update-heavy workloads benefit the most), but for our use case about
>> ~65% of the IOP savings comes from the journal (~1/3 of the total)
>> plus the sync of metadata from the journal (slightly less than 1/3,
>> as some journal entries get cancelled); the remaining ~1/3 of the
>> win comes from reading small files from the SSD vs. HDDs (about
>> 25-30% of our file population is <=256k, depending on the cluster).
>> To be clear, we don't split files: we store all the data blocks of a
>> file either entirely on the SSD (small files <=256k) or entirely on
>> the real-time HDD device (everything else). The basic principle is
>> that larger files MIGHT see small IOs (in our use-case this happens
>> to be rare, but not impossible), but small files always do, and when
>> 25-30% of your population is small... that's a big chunk of your
>> IOPs.
>
> So here's a test for you. Make a device with an SSD as the first 1TB,
> and your HDD as the rest (use dm to do this). Then use the inode32
> allocator (mount option) to split metadata from data. The filesystem
> will keep inodes/directories on the SSD and file data on the HDD
> automatically.
>
> Better yet: have data allocations smaller than the stripe unit target
> metadata-preferred AGs (i.e. the SSD region) and allocations larger
> than the stripe unit target the data-preferred AGs. Set the stripe
> unit to match your SSD/HDD threshold....
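
To make the first half of that concrete, a rough sketch of the
experiment as I read it (the /dev/ssd and /dev/hdd names and the sizes
are placeholders; the dm device and the inode32 mount option are the
only knobs used here - the "better yet" variant that targets data
allocations by size against the stripe unit would need kernel changes):

# A 1TiB SSD followed by an 8TiB HDD, concatenated with dm-linear.
# Lengths are in 512-byte sectors: 1TiB = 2147483648, 8TiB = 17179869184.
printf '%s\n' \
  "0 2147483648 linear /dev/ssd 0" \
  "2147483648 17179869184 linear /dev/hdd 0" | dmsetup create ssd_meta

mkfs.xfs /dev/mapper/ssd_meta

# inode32 confines inode allocation to the AGs whose inode numbers fit
# in 32 bits - roughly the first 1TB of the filesystem - so inodes and
# directory/attribute metadata stay on the SSD while file data is
# placed in the AGs above it.
mount -o inode32 /dev/mapper/ssd_meta /mnt/test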
> [snip]
>
>> The AG-based approach could work, though it's going to be a very
>> hard sell to use dm mapper; this isn't code we have ever used in our
>> storage stack. At our scale, there are important operational reasons
>> we need to keep the storage stack simple (fewer bugs to hit), so
>> keeping the solution contained within XFS is a necessary requirement
>> for us.
>
> Modifying the filesystem on-disk format is far more complex than
> adding dm to your stack. Filesystem modifications are difficult and
> time consuming because if we screw up, users lose all their data.
>
> If you can solve the problem with DM and a little bit of additional
> in-memory kernel code to categorise and select which AG to use for
> what (i.e. policy stuff that can be held in userspace), then that is
> pretty much the only answer that makes sense from a filesystem
> developer's point of view....
>
> Start by thinking about exposing AG behaviour controls through sysfs
> objects and configuring them at mount time through udev event
> notifications.

Very cool idea. A detail I left out which might complicate this: we
only use 17GB of SSD for each ~8-10TB HDD (we share just a small 256G
SSD across about 15 drives), and even then we don't even use 50% of
the SSD for these partitions. We also want to be very selective about
what data we let touch the SSD: we don't want folks who write large
files via small IOs to touch the SSD, only IO to small files (which
are immutable in our use-case).

On an unrelated note, after talking to Omar Sandoval & Chris Mason
over here, I'm reworking rtdefault into "rtdisable", which gives the
same operational outcome as rtdefault without setting inheritance bits
(see prior e-mail). This way folks have a kill switch of sorts, while
otherwise maintaining the existing "persistent" behavior.

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com