* Deprecating ext4 support
@ 2016-04-11 21:39 Sage Weil
  [not found] ` <alpine.DEB.2.11.1604111632520.13448-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
  ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Sage Weil @ 2016-04-11 21:39 UTC (permalink / raw)
  To: ceph-devel, ceph-users, ceph-maintainers, ceph-announce

Hi,

ext4 has never been recommended, but we did test it. After Jewel is out,
we would like to explicitly recommend *against* ext4 and stop testing it.

Why:

Recently we discovered an issue with the long object name handling that is
not fixable without rewriting a significant chunk of FileStore's filename
handling. (There is a limit on the amount of xattr data ext4 can store in
the inode, which causes problems in LFNIndex.)

We *could* invest a ton of time rewriting this to fix it, but it only
affects ext4, which we never recommended, and we plan to deprecate
FileStore once BlueStore is stable anyway, so it seems like a waste of
time that would be better spent elsewhere.

Also, by dropping ext4 test coverage in ceph-qa-suite, we can
significantly improve time/coverage for FileStore on XFS and on BlueStore.

The long file name handling is problematic any time someone is storing
rados objects with long names. The primary user that does this is RGW,
which means any RGW cluster using ext4 should recreate their OSDs to use
XFS. Other librados users could be affected too, though, like users with
very long rbd image names (e.g., > 100 characters), or custom librados
users.

How:

To make this change as visible as possible, the plan is to make ceph-osd
refuse to start if the backend is unable to support the configured max
object name (osd_max_object_name_len). The OSD will complain that ext4
cannot store such an object and refuse to start. A user who is only using
RBD might decide they don't need long file names to work and can adjust
the osd_max_object_name_len setting to something small (say, 64) and run
successfully.
They would be taking a risk, though, because we would like to stop testing
on ext4.

Is this reasonable? If there are significant ext4 users who are unwilling
to recreate their OSDs, now would be the time to speak up.

Thanks!
sage

^ permalink raw reply	[flat|nested] 36+ messages in thread
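The proposed startup guard is simple enough to sketch. The following is a hypothetical illustration of the check Sage describes, not the actual ceph-osd code; the function name, the ext4 limit, and the default value used here are all assumptions chosen for the example.

```python
# Hypothetical sketch of the proposed startup check -- NOT the actual
# ceph-osd implementation. The constants below are assumed values
# chosen only for illustration.
EXT4_BACKEND_NAME_LIMIT = 143       # assumed max object name ext4 can store
DEFAULT_MAX_OBJECT_NAME_LEN = 2048  # assumed shipping default

def osd_can_start(backend_limit, osd_max_object_name_len):
    """Refuse to start when the backend cannot store the configured
    maximum object name length."""
    return backend_limit >= osd_max_object_name_len

# With the default config, an ext4-backed OSD would refuse to start:
assert not osd_can_start(EXT4_BACKEND_NAME_LIMIT, DEFAULT_MAX_OBJECT_NAME_LEN)

# An RBD-only user lowering osd_max_object_name_len (say, to 64) could run:
assert osd_can_start(EXT4_BACKEND_NAME_LIMIT, 64)
```

The point of the design is visibility: the failure happens loudly at startup rather than silently at write time, so an admin either migrates to XFS or knowingly opts into the lower limit.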
[parent not found: <alpine.DEB.2.11.1604111632520.13448-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>]
* Re: Deprecating ext4 support
  [not found] ` <alpine.DEB.2.11.1604111632520.13448-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
@ 2016-04-11 21:42   ` Allen Samuels
  2016-04-11 21:47     ` [ceph-users] " Jan Schermer
  2016-04-11 23:39   ` Christian Balzer
  2016-04-12  7:00   ` Michael Metz-Martini | SpeedPartner GmbH
  2 siblings, 1 reply; 36+ messages in thread
From: Allen Samuels @ 2016-04-11 21:42 UTC (permalink / raw)
  To: Sage Weil, ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ,
	ceph-maintainers-Qp0mS5GaXlQ, ceph-announce-Qp0mS5GaXlQ

RIP ext4.

Allen Samuels
Software Architect, Fellow, Systems and Software Solutions
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels-BhnZeUuC+ExBDgjK7y7TUQ@public.gmane.org

> -----Original Message-----
> From: ceph-devel-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org On Behalf Of Sage Weil
> Sent: Monday, April 11, 2016 2:40 PM
> Subject: Deprecating ext4 support
>
> [...]

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [ceph-users] Deprecating ext4 support
  2016-04-11 21:42   ` Allen Samuels
@ 2016-04-11 21:47     ` Jan Schermer
  0 siblings, 0 replies; 36+ messages in thread
From: Jan Schermer @ 2016-04-11 21:47 UTC (permalink / raw)
  To: Allen Samuels
  Cc: Sage Weil, ceph-devel, ceph-users, ceph-maintainers, ceph-announce

RIP Ceph.

> On 11 Apr 2016, at 23:42, Allen Samuels <Allen.Samuels@sandisk.com> wrote:
>
> RIP ext4.
>
> [...]
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: Deprecating ext4 support
  [not found] ` <alpine.DEB.2.11.1604111632520.13448-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
  2016-04-11 21:42   ` Allen Samuels
@ 2016-04-11 23:39   ` Christian Balzer
  2016-04-12  1:12     ` [ceph-users] " Sage Weil
  [not found]     ` <20160412083925.5106311d-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
  2016-04-12  7:00   ` Michael Metz-Martini | SpeedPartner GmbH
  2 siblings, 2 replies; 36+ messages in thread
From: Christian Balzer @ 2016-04-11 23:39 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ,
	ceph-maintainers-Qp0mS5GaXlQ, ceph-announce-Qp0mS5GaXlQ

Hello,

What a lovely missive to start off my working day...

On Mon, 11 Apr 2016 17:39:37 -0400 (EDT) Sage Weil wrote:

> Hi,
>
> ext4 has never been recommended, but we did test it.

Patently wrong, as Shinobu just pointed out.

Ext4 never was (especially recently) flogged as much as XFS, but it always
was a recommended, supported file storage filesystem, unlike the
experimental BTRFS or ZFS. And for various reasons people, including me,
deployed it instead of XFS.

> After Jewel is out, we would like to explicitly recommend *against* ext4
> and stop testing it.

Changing your recommendations is fine; stopping testing/supporting it
isn't. People deployed ext4 in good faith and can be expected to use it at
least until their HW is up for replacement (4-5 years).

> Why:
>
> Recently we discovered an issue with the long object name handling that
> is not fixable without rewriting a significant chunk of FileStore's
> filename handling. (There is a limit on the amount of xattr data ext4
> can store in the inode, which causes problems in LFNIndex.)

Is that also true if the ext4 inode size is larger than the default?

> We *could* invest a ton of time rewriting this to fix it, but it only
> affects ext4, which we never recommended, and we plan to deprecate
> FileStore once BlueStore is stable anyway, so it seems like a waste of
> time that would be better spent elsewhere.

If you (that is, RH) are going to declare BlueStore stable this year, I
would be very surprised. Either way, dropping support before the successor
is truly ready doesn't sit well with me.

Which brings me to the reasons why people would want to migrate (NOT
talking about starting freshly) to BlueStore:

1. Will it be faster (IOPS) than FileStore with SSD journals?
   I don't think so, but feel free to prove me wrong.

2. Will it be bit-rot proof? Note the deafening silence from the devs in
   this thread:
   http://www.spinics.net/lists/ceph-users/msg26510.html

> Also, by dropping ext4 test coverage in ceph-qa-suite, we can
> significantly improve time/coverage for FileStore on XFS and on
> BlueStore.

Really? Isn't that fully automated?

> [...]
>
> Is this reasonable?

About as reasonable as dropping format 1 support, that is, not at all.
https://www.mail-archive.com/ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org/msg28070.html

I'm officially only allowed to do (preventative) maintenance during
weekend nights on our main production cluster. That would mean 13 ruined
weekends at the realistic rate of 1 OSD per night, so you can see where my
lack of enthusiasm for OSD re-creation comes from.

> If there are significant ext4 users who are unwilling to recreate their
> OSDs, now would be the time to speak up.

Consider that done.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi-FW+hd8ioUD0@public.gmane.org   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/

^ permalink raw reply	[flat|nested] 36+ messages in thread
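Christian's inode-size question can be framed with a back-of-envelope calculation. The sketch below approximates the in-inode xattr space ext4 has available; the constants (128-byte legacy inode area, a typical 32-byte `i_extra_isize`, 4-byte in-inode xattr header) are assumptions drawn from the ext4 on-disk layout and ignore per-entry overhead, so treat the numbers as rough.

```python
# Rough approximation of in-inode xattr space on ext4. The constants are
# assumptions (legacy inode fields, a typical i_extra_isize, the in-inode
# xattr header) and per-entry overhead is ignored entirely.
LEGACY_INODE_BYTES = 128   # fixed ext2-era inode fields
EXTRA_ISIZE = 32           # typical ext4 i_extra_isize
XATTR_HEADER = 4           # in-inode xattr header magic

def approx_in_inode_xattr_bytes(inode_size):
    return inode_size - LEGACY_INODE_BYTES - EXTRA_ISIZE - XATTR_HEADER

for size in (256, 512, 1024):
    print(size, approx_in_inode_xattr_bytes(size))
```

Even under this generous approximation, a 1024-byte inode leaves well under 1 KB for xattrs, far below the 64 KB XFS limit quoted in the docs, which is consistent with Sage's later remark that larger inodes are somewhat academic since the size cannot be changed on an existing filesystem anyway.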
* Re: [ceph-users] Deprecating ext4 support
  2016-04-11 23:39   ` Christian Balzer
@ 2016-04-12  1:12     ` Sage Weil
  [not found]       ` <alpine.DEB.2.11.1604112046570.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
  2016-04-12  2:43       ` [ceph-users] " Christian Balzer
  [not found]     ` <20160412083925.5106311d-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
  1 sibling, 2 replies; 36+ messages in thread
From: Sage Weil @ 2016-04-12  1:12 UTC (permalink / raw)
  To: Christian Balzer; +Cc: ceph-devel, ceph-users, ceph-maintainers

On Tue, 12 Apr 2016, Christian Balzer wrote:
> Hello,
>
> What a lovely missive to start off my working day...
>
> > ext4 has never been recommended, but we did test it.
>
> Patently wrong, as Shinobu just pointed out.
>
> Ext4 never was (especially recently) flogged as much as XFS, but it
> always was a recommended, supported file storage filesystem, unlike the
> experimental BTRFS or ZFS. And for various reasons people, including me,
> deployed it instead of XFS.

Greg definitely wins the prize for raising this as a major issue, then
(and for naming you as one of the major ext4 users).

I was not aware that we were recommending ext4 anywhere. FWIW, here's
what the docs currently say:

  Ceph OSD Daemons rely heavily upon the stability and performance of the
  underlying filesystem.

  Note: We currently recommend XFS for production deployments. We
  recommend btrfs for testing, development, and any non-critical
  deployments. We believe that btrfs has the correct feature set and
  roadmap to serve Ceph in the long-term, but XFS and ext4 provide the
  necessary stability for today's deployments. btrfs development is
  proceeding rapidly: users should be comfortable installing the latest
  released upstream kernels and be able to track development activity for
  critical bug fixes.

  Ceph OSD Daemons depend on the Extended Attributes (XATTRs) of the
  underlying file system for various forms of internal object state and
  metadata. The underlying filesystem must provide sufficient capacity
  for XATTRs. btrfs does not bound the total xattr metadata stored with a
  file. XFS has a relatively large limit (64 KB) that most deployments
  won't encounter, but ext4's limit is too small to be usable.

(http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=ext4)

Unfortunately that second paragraph's second sentence indirectly says
ext4 is stable. :( :(  I'll prepare a PR tomorrow to revise this whole
section based on the new information.

If anyone knows of other docs that recommend ext4, please let me know!
They need to be updated.

> > After Jewel is out, we would like to explicitly recommend *against*
> > ext4 and stop testing it.
>
> Changing your recommendations is fine; stopping testing/supporting it
> isn't. People deployed ext4 in good faith and can be expected to use it
> at least until their HW is up for replacement (4-5 years).

I agree, which is why I asked.

And part of it depends on what it's being used for. If there are major
users using ext4 for RGW then their deployments are at risk and they
should swap it out for data safety reasons alone. (Or, we need to figure
out how to fix long object name support on ext4.) On the other hand, if
the only ext4 users are using RBD only, then they can safely continue
with lower max object names, and upstream testing is important to let
those OSDs age out naturally.

Does your cluster support RBD, RGW, or something else?

> Is that also true if the ext4 inode size is larger than the default?

I'm not sure... Sam, do you know? (It's somewhat academic, though, since
we can't change the inode size on existing file systems.)

> If you (that is, RH) are going to declare BlueStore stable this year, I
> would be very surprised.

My hope is that it can be the *default* for L (next spring). But we'll
see.

> Either way, dropping support before the successor is truly ready doesn't
> sit well with me.

Yeah, I misspoke. Once BlueStore is supported and the default, support
for FileStore won't be dropped immediately. But we'll want to communicate
that eventually it will lose support. How strongly that is messaged
probably depends on how confident we are in BlueStore at that point. And
I confess I haven't thought much about how long "long enough" is yet.

> 1. Will it be faster (IOPS) than FileStore with SSD journals?
>    I don't think so, but feel free to prove me wrong.

It will absolutely be faster on the same hardware. Whether BlueStore on
HDD only is faster than FileStore HDD + SSD journal will depend on the
workload.

> 2. Will it be bit-rot proof? Note the deafening silence from the devs in
>    this thread:
>    http://www.spinics.net/lists/ceph-users/msg26510.html

I missed that thread, sorry.

We (Mirantis, SanDisk, Red Hat) are currently working on checksum support
in BlueStore. Part of the reason why BlueStore is the preferred path is
because we will probably never see full checksumming in ext4 or XFS.

> Really? Isn't that fully automated?

It is, but hardware and time are finite. Fewer tests on FileStore+ext4
mean more tests on FileStore+XFS or BlueStore. But this is a minor point.

> About as reasonable as dropping format 1 support, that is, not at all.
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28070.html

Fortunately nobody (to my knowledge) has suggested dropping format 1
support. :)

> I'm officially only allowed to do (preventative) maintenance during
> weekend nights on our main production cluster. That would mean 13 ruined
> weekends at the realistic rate of 1 OSD per night, so you can see where
> my lack of enthusiasm for OSD re-creation comes from.

Yeah. :(

> > If there are significant ext4 users who are unwilling to recreate
> > their OSDs, now would be the time to speak up.
>
> Consider that done.

Thank you for the feedback!
sage

^ permalink raw reply	[flat|nested] 36+ messages in thread
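The bit-rot protection under discussion amounts to storing a per-block checksum at write time and verifying it on read. A minimal sketch of the idea follows; this is not BlueStore's actual on-disk format, and the block size and CRC choice here are arbitrary.

```python
# Minimal sketch of per-block checksumming for bit-rot detection.
# Not BlueStore's real format: block size and CRC32 are arbitrary choices.
import zlib

BLOCK = 4096  # arbitrary block size for the sketch

def checksum_blocks(data):
    # One CRC32 per BLOCK-sized chunk of the payload.
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, sums):
    # Recompute and compare; any corrupted block changes its CRC.
    return checksum_blocks(data) == sums

payload = bytes(8192)
sums = checksum_blocks(payload)
assert verify(payload, sums)

# Flip a single bit to simulate bit rot: verification now fails.
rotted = bytearray(payload)
rotted[100] ^= 0x01
assert not verify(bytes(rotted), sums)
```

This is also why checksumming inside the object store matters: neither ext4 nor XFS verifies data blocks on read, so silent corruption passes straight through to the client unless the store itself checks.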
[parent not found: <alpine.DEB.2.11.1604112046570.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>]
* Re: Deprecating ext4 support
  [not found]       ` <alpine.DEB.2.11.1604112046570.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
@ 2016-04-12  1:32         ` Shinobu Kinjo
  2016-04-12  2:05         ` [Ceph-maintainers] " hp cre
  1 sibling, 0 replies; 36+ messages in thread
From: Shinobu Kinjo @ 2016-04-12  1:32 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-maintainers-Qp0mS5GaXlQ,
	ceph-users-Qp0mS5GaXlQ

Hi Sage,

It may be better to mention that we only update the master documentation;
otherwise someone gets confused again [1].

[1] https://en.wikipedia.org/wiki/Ceph_%28software%29

Cheers,
Shinobu

----- Original Message -----
From: "Sage Weil" <sweil@redhat.com>
To: "Christian Balzer" <chibi@gol.com>
Cc: ceph-devel@vger.kernel.org, ceph-users@ceph.com, ceph-maintainers@ceph.com
Sent: Tuesday, April 12, 2016 10:12:14 AM
Subject: Re: [ceph-users] Deprecating ext4 support

[...]

^ permalink raw reply	[flat|nested] 36+ messages in thread
* Re: [Ceph-maintainers] Deprecating ext4 support
  [not found]       ` <alpine.DEB.2.11.1604112046570.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
  2016-04-12  1:32         ` Shinobu Kinjo
@ 2016-04-12  2:05         ` hp cre
  1 sibling, 0 replies; 36+ messages in thread
From: hp cre @ 2016-04-12  2:05 UTC (permalink / raw)
  To: Sage Weil
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ,
	ceph-maintainers-Qp0mS5GaXlQ

As far as I remember, the documentation did say that either filesystem
(ext4 or XFS) is OK, except for xattrs, which are better supported on XFS.

I would think the best move would be to make XFS the default OSD creation
method and put in a warning about ext4 being deprecated in future
releases, but leave support for it until all users are weaned off of it in
favour of XFS and, later, btrfs.

On 12 Apr 2016 03:12, "Sage Weil" <sweil-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> [...]
> > About as reasonable as dropping format 1 support, that is not at all. > > https://www.mail-archive.com/ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org/msg28070.html > > Fortunately nobody (to my knowledge) has suggested dropping format 1 > support. :) > > > I'm officially only allowed to do (preventative) maintenance during > weekend > > nights on our main production cluster. > > That would mean 13 ruined weekends at the realistic rate of 1 OSD per > > night, so you can see where my lack of enthusiasm for OSD recreation > comes > > from. > > Yeah. :( > > > > If there significant ext4 users that are unwilling > > > to recreate their OSDs, now would be the time to speak up. > > > > > Consider that done. > > Thank you for the feedback! > > sage > _______________________________________________ > Ceph-maintainers mailing list > Ceph-maintainers-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org > http://lists.ceph.com/listinfo.cgi/ceph-maintainers-ceph.com > > [-- Attachment #1.2: Type: text/html, Size: 10178 bytes --] [-- Attachment #2: Type: text/plain, Size: 178 bytes --] _______________________________________________ ceph-users mailing list ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 36+ messages in thread
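The startup guard proposed earlier in the thread — ceph-osd refusing to start when the backend cannot store names up to the configured osd_max_object_name_len — can be sketched roughly like this. The function name, the backend limit table, and the numbers are illustrative placeholders, not Ceph's actual implementation:

```python
# Hypothetical sketch of the startup check described in the thread:
# refuse to start if the backend cannot store object names as long as
# the configured osd_max_object_name_len. The limit values below are
# illustrative placeholders, not the real numbers Ceph uses.

BACKEND_MAX_NAME_LEN = {
    "xfs": 2048,   # placeholder: comfortably above typical defaults
    "ext4": 100,   # placeholder: ext4's in-inode xattr space is the real constraint
}

def check_backend(fs: str, osd_max_object_name_len: int) -> bool:
    """Return True if the OSD may start, False if it must refuse."""
    limit = BACKEND_MAX_NAME_LEN.get(fs, 0)
    if osd_max_object_name_len > limit:
        print(f"ERROR: {fs} cannot store object names of "
              f"{osd_max_object_name_len} bytes (limit ~{limit}); refusing to start")
        return False
    return True
```

An RBD-only user on ext4 could lower the setting (the thread suggests 64) and pass such a check, at the cost of the indirect cap on RBD image name length that Sage mentions later in the thread.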
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 1:12 ` [ceph-users] " Sage Weil [not found] ` <alpine.DEB.2.11.1604112046570.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org> @ 2016-04-12 2:43 ` Christian Balzer 2016-04-12 13:56 ` Sage Weil 1 sibling, 1 reply; 36+ messages in thread From: Christian Balzer @ 2016-04-12 2:43 UTC (permalink / raw) To: Sage Weil; +Cc: ceph-devel, ceph-users, ceph-maintainers Hello, On Mon, 11 Apr 2016 21:12:14 -0400 (EDT) Sage Weil wrote: > On Tue, 12 Apr 2016, Christian Balzer wrote: > > > > Hello, > > > > What a lovely missive to start off my working day... > > > > On Mon, 11 Apr 2016 17:39:37 -0400 (EDT) Sage Weil wrote: > > > > > Hi, > > > > > > ext4 has never been recommended, but we did test it. > > Patently wrong, as Shinobu just pointed. > > > > Ext4 never was (especially recently) flogged as much as XFS, but it > > always was a recommended, supported filestorage filesystem, unlike the > > experimental BTRFS of ZFS. > > And for various reasons people, including me, deployed it instead of > > XFS. > > Greg definitely wins the prize for raising this as a major issue, then > (and for naming you as one of the major ext4 users). > I'm sure there are other ones, it's often surprising how people will pipe up on this ML for the first time with really massive deployments they've been running for years w/o ever being on anybody's radar. > I was not aware that we were recommending ext4 anywhere. FWIW, here's > what the docs currently say: > > Ceph OSD Daemons rely heavily upon the stability and performance of the > underlying filesystem. > > Note: We currently recommend XFS for production deployments. We > recommend btrfs for testing, development, and any non-critical > deployments. We believe that btrfs has the correct feature set and > roadmap to serve Ceph in the long-term, but XFS and ext4 provide the > necessary stability for today’s deployments. 
btrfs development is > proceeding rapidly: users should be comfortable installing the latest > released upstream kernels and be able to track development activity for > critical bug fixes. > > Ceph OSD Daemons depend on the Extended Attributes (XATTRs) of the > underlying file system for various forms of internal object state and > metadata. The underlying filesystem must provide sufficient capacity > for XATTRs. btrfs does not bound the total xattr metadata stored with a > file. XFS has a relatively large limit (64 KB) that most deployments > won’t encounter, but the ext4 is too small to be usable. > > (http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=ext4) > > Unfortunately that second paragraph, second sentence indirectly says > ext4 is stable. :( :( I'll prepare a PR tomorrow to revise this whole > section based on the new information. > Not only that, the "filestore xattr use omap" section afterwards reinforces that by clearly suggesting that this is the official work-around for the XATTR issue. > If anyone knows of other docs that recommend ext4, please let me know! > They need to be updated. > Not going to try find any cached versions, but when I did my first deployment with Dumpling I don't think the "Note" section was there or as prominent. Not that it would have stopped me from using Ext4, mind. > > > After Jewel is out, we would like explicitly recommend *against* > > > ext4 and stop testing it. > > > > > Changing your recommendations is fine, stopping testing/supporting it > > isn't. > > People deployed Ext4 in good faith and can be expected to use it at > > least until their HW is up for replacement (4-5 years). > > I agree, which is why I asked. > > And part of it depends on what it's being used for. If there are major > users using ext4 for RGW then their deployments are at risk and they > should swap it out for data safety reasons alone. 
(Or, we need to > figure out how to fix long object name support on ext4.) On the other > hand, if the only ext4 users are using RBD only, then they can safely > continue with lower max object names, and upstream testing is important > to let those OSDs age out naturally. > > Does your cluster support RBD, RGW, or something else? > Only RBD on all clusters so far and definitely no plans to change that for the main, mission critical production cluster. I might want to add CephFS to the other production cluster at some time, though. No RGW, but if/when RGW supports "listing objects quickly" (is what I vaguely remember from my conversation with Timo Sirainen, the Dovecot author) we would be very interested in that particular piece of Ceph as well. On a completely new cluster though, so no issue. > > > Why: > > > > > > Recently we discovered an issue with the long object name handling > > > that is not fixable without rewriting a significant chunk of > > > FileStores filename handling. (There is a limit in the amount of > > > xattr data ext4 can store in the inode, which causes problems in > > > LFNIndex.) > > > > > Is that also true if the Ext4 inode size is larger than default? > > I'm not sure... Sam, do you know? (It's somewhat academic, though, > since we can't change the inode size on existing file systems.) > Yes and no. Some people (and I think not just me) were perfectly capable of reading between the lines and format their Ext4 FS accordingly: "mkfs.ext4 -J size=1024 -I 2048 -i 65536 ... " (the -I bit) > > > We *could* invest a ton of time rewriting this to fix, but it only > > > affects ext4, which we never recommended, and we plan to deprecate > > > FileStore once BlueStore is stable anyway, so it seems like a waste > > > of time that would be better spent elsewhere. > > > > > If you (that is RH) is going to declare bluestore stable this year, I > > would be very surprised. > > My hope is that it can be the *default* for L (next spring). But we'll > see. 
> Yeah, that's my most optimistic estimate as well. > > Either way, dropping support before the successor is truly ready > > doesn't sit well with me. > > Yeah, I misspoke. Once BlueStore is supported and the default, support > for FileStore won't be dropped immediately. But we'll want to > communicate that eventually it will lose support. How strongly that is > messaged probably depends on how confident we are in BlueStore at that > point. And I confess I haven't thought much about how long "long > enough" is yet. > Again, most people that deploy Ceph in a commercial environment (that is working for a company) will be under pressure by the penny-pinching department to use their HW for 4-5 years (never mind the pace of technology and Moore's law). So you will want to: a) Announce the end of FileStore ASAP, but then again you can't really do that before BlueStore is stable. b) support FileStore for 4 years at least after BlueStore is the default. This could be done by having a _real_ LTS release, instead of dragging Filestore into newer version. > > Which brings me to the reasons why people would want to migrate (NOT > > talking about starting freshly) to bluestore. > > > > 1. Will it be faster (IOPS) than filestore with SSD journals? > > Don't think so, but feel free to prove me wrong. > > It will absolutely faster on the same hardware. Whether BlueStore on > HDD only is faster than FileStore HDD + SSD journal will depend on the > workload. > Where would the Journal SSDs enter the picture with BlueStore? Not at all, AFAIK, right? I'm thinking again about people with existing HW again. What do they do with those SSDs, which aren't necessarily sized in a fashion to be sensible SSD pools/cache tiers? > > 2. Will it be bit-rot proof? Note the deafening silence from the devs > > in this thread: > > http://www.spinics.net/lists/ceph-users/msg26510.html > > I missed that thread, sorry. 
> > We (Mirantis, SanDisk, Red Hat) are currently working on checksum > support in BlueStore. Part of the reason why BlueStore is the preferred > path is because we will probably never see full checksumming in ext4 or > XFS. > Now this (when done correctly) and BlueStore being a stable default will be a much, MUCH higher motivation for people to migrate to it than terminating support for something that works perfectly well (for my use case at least). > > > Also, by dropping ext4 test coverage in ceph-qa-suite, we can > > > significantly improve time/coverage for FileStore on XFS and on > > > BlueStore. > > > > > Really, isn't that fully automated? > > It is, but hardware and time are finite. Fewer tests on FileStore+ext4 > means more tests on FileStore+XFS or BlueStore. But this is a minor > point. > > > > The long file name handling is problematic anytime someone is > > > storing rados objects with long names. The primary user that does > > > this is RGW, which means any RGW cluster using ext4 should recreate > > > their OSDs to use XFS. Other librados users could be affected too, > > > though, like users with very long rbd image names (e.g., > 100 > > > characters), or custom librados users. > > > > > > How: > > > > > > To make this change as visible as possible, the plan is to make > > > ceph-osd refuse to start if the backend is unable to support the > > > configured max object name (osd_max_object_name_len). The OSD will > > > complain that ext4 cannot store such an object and refuse to start. > > > A user who is only using RBD might decide they don't need long file > > > names to work and can adjust the osd_max_object_name_len setting to > > > something small (say, 64) and run successfully. They would be > > > taking a risk, though, because we would like to stop testing on ext4. > > > > > > Is this reasonable? > > About as reasonable as dropping format 1 support, that is not at all. 
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28070.html > > Fortunately nobody (to my knowledge) has suggested dropping format 1 > support. :) > I suggest you look at that thread and your official release notes: --- * The rbd legacy image format (version 1) is deprecated with the Jewel release. Attempting to create a new version 1 RBD image will result in a warning. Future releases of Ceph will remove support for version 1 RBD images. --- > > I'm officially only allowed to do (preventative) maintenance during > > weekend nights on our main production cluster. > > That would mean 13 ruined weekends at the realistic rate of 1 OSD per > > night, so you can see where my lack of enthusiasm for OSD recreation > > comes from. > > Yeah. :( > > > > If there significant ext4 users that are unwilling > > > to recreate their OSDs, now would be the time to speak up. > > > > > Consider that done. > > Thank you for the feedback! > Thanks for getting back to me so quickly. Christian -- Christian Balzer Network/Systems Engineer chibi@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 36+ messages in thread
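Whether the larger ext4 inodes Christian formatted with ("-I 2048") actually help depends on how much in-inode xattr space they leave. A rough back-of-the-envelope, assuming the usual ext4 layout (128-byte base inode, an i_extra_isize region of typically 32 bytes, a small xattr header, and the remainder available for inline xattrs — treat all figures as approximations, not authoritative ext4 internals):

```python
# Rough estimate of in-inode xattr space on ext4. Assumes the common
# layout: 128-byte base inode + i_extra_isize (typically 32 bytes on
# modern ext4) + a small in-inode xattr header, with the rest usable
# for inline xattrs. Approximations for illustration only.

BASE_INODE = 128      # original ext2-era inode size
EXTRA_ISIZE = 32      # typical i_extra_isize on current ext4
XATTR_HEADER = 4      # the in-inode xattr area starts with a small header

def inline_xattr_space(inode_size: int) -> int:
    """Approximate bytes available for in-inode xattrs."""
    return max(0, inode_size - BASE_INODE - EXTRA_ISIZE - XATTR_HEADER)

for size in (256, 512, 1024, 2048):
    print(f"inode size {size:5d} -> ~{inline_xattr_space(size)} bytes for xattrs")
```

By this estimate a default 256-byte inode leaves well under 100 bytes, which is why long object names plus FileStore's other metadata overflow it, while a 2 KB inode leaves roughly 1.9 KB. As noted in the thread, though, the inode size cannot be changed on an existing filesystem.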
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 2:43 ` [ceph-users] " Christian Balzer @ 2016-04-12 13:56 ` Sage Weil [not found] ` <alpine.DEB.2.11.1604120837120.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Sage Weil @ 2016-04-12 13:56 UTC (permalink / raw) To: Christian Balzer; +Cc: ceph-devel, ceph-users, ceph-maintainers Hi all, I've posted a pull request that updates any mention of ext4 in the docs: https://github.com/ceph/ceph/pull/8556 In particular, I would appreciate any feedback on https://github.com/ceph/ceph/pull/8556/commits/49604303124a2b546e66d6e130ad4fa296602b01 both on substance and delivery. Given the previous lack of clarity around ext4, and that it works well enough for RBD and other short object name workloads, I think the most we can do now is deprecate it to steer any new OSDs away. And at least in the non-RGW case, I mean deprecate in the "recommend alternative" sense of the word, not that it won't be tested or that any code will be removed. https://en.wikipedia.org/wiki/Deprecation#Software_deprecation If there are ext4 + RGW users, that is still a difficult issue, since it is broken now, and expensive to fix. On Tue, 12 Apr 2016, Christian Balzer wrote: > Only RBD on all clusters so far and definitely no plans to change that > for the main, mission critical production cluster. I might want to add > CephFS to the other production cluster at some time, though. That's good to hear. If you continue to use ext4 (by adjusting down the max object length), the only limitation you should hit is an indirect cap on the max RBD image name length. > No RGW, but if/when RGW supports "listing objects quickly" (is what I > vaguely remember from my conversation with Timo Sirainen, the Dovecot > author) we would be very interested in that particular piece of Ceph as > well. On a completely new cluster though, so no issue. 
OT, but I suspect he was referring to something slightly different here. Our conversations about object listing vs the dovecot backend surrounded the *rados* listing semantics (hash-based, not prefix/name based). RGW supports fast sorted/prefix name listings, but you pay for it by maintaining an index (which slows down PUT). The latest RGW in Jewel has experimental support for a non-indexed 'blind' bucket as well for users that need some of the RGW features (ACLs, striping, etc.) but not the ordered object listing and other index-dependent features. > Again, most people that deploy Ceph in a commercial environment (that is > working for a company) will be under pressure by the penny-pinching > department to use their HW for 4-5 years (never mind the pace of > technology and Moore's law). > > So you will want to: > a) Announce the end of FileStore ASAP, but then again you can't really > do that before BlueStore is stable. > b) support FileStore for 4 years at least after BlueStore is the default. > This could be done by having a _real_ LTS release, instead of dragging > Filestore into newer version. Right. Nothing can be done until the preferred alternative is completely stable, and from then it will take quite some time to drop support or remove it given the install base. > > > Which brings me to the reasons why people would want to migrate (NOT > > > talking about starting freshly) to bluestore. > > > > > > 1. Will it be faster (IOPS) than filestore with SSD journals? > > > Don't think so, but feel free to prove me wrong. > > > > It will absolutely faster on the same hardware. Whether BlueStore on > > HDD only is faster than FileStore HDD + SSD journal will depend on the > > workload. > > > Where would the Journal SSDs enter the picture with BlueStore? > Not at all, AFAIK, right? 
BlueStore can use as many as three devices: one for the WAL (journal, though it can be much smaller than FileStores, e.g., 128MB), one for metadata (e.g., an SSD partition), and one for data. > I'm thinking again about people with existing HW again. > What do they do with those SSDs, which aren't necessarily sized in a > fashion to be sensible SSD pools/cache tiers? We can either use them for BlueStore wal and metadata, or as a cache for the data device (e.g., dm-cache, bcache, FlashCache), or some combination of the above. It will take some time to figure out which gives the best performance (and for which workloads). > > > 2. Will it be bit-rot proof? Note the deafening silence from the devs > > > in this thread: > > > http://www.spinics.net/lists/ceph-users/msg26510.html > > > > I missed that thread, sorry. > > > > We (Mirantis, SanDisk, Red Hat) are currently working on checksum > > support in BlueStore. Part of the reason why BlueStore is the preferred > > path is because we will probably never see full checksumming in ext4 or > > XFS. > > > Now this (when done correctly) and BlueStore being a stable default will > be a much, MUCH higher motivation for people to migrate to it than > terminating support for something that works perfectly well (for my use > case at least). Agreed. > > > > How: > > > > > > > > To make this change as visible as possible, the plan is to make > > > > ceph-osd refuse to start if the backend is unable to support the > > > > configured max object name (osd_max_object_name_len). The OSD will > > > > complain that ext4 cannot store such an object and refuse to start. > > > > A user who is only using RBD might decide they don't need long file > > > > names to work and can adjust the osd_max_object_name_len setting to > > > > something small (say, 64) and run successfully. They would be > > > > taking a risk, though, because we would like to stop testing on ext4. > > > > > > > > Is this reasonable? 
> > > About as reasonable as dropping format 1 support, that is not at all. > > > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28070.html > > > > Fortunately nobody (to my knowledge) has suggested dropping format 1 > > support. :) > > > I suggest you look at that thread and your official release notes: > --- > * The rbd legacy image format (version 1) is deprecated with the Jewel release. > Attempting to create a new version 1 RBD image will result in a warning. > Future releases of Ceph will remove support for version 1 RBD images. > --- "Future releases of Ceph *may* remove support" might be more accurate, but it doesn't make for as compelling a warning, and it's pretty likely that *eventually* it will make sense to drop it. That won't happen without a proper conversation about user impact and migration, though. There are real problems with format 1 besides just the lack of new features (e.g., rename vs watchers). This is what 'deprecation' means: we're not dropping support now (that *would* be unreasonable), but we're warning users that at some future point we (probably) will. If there is any reason why new images shouldn't be created with v2, please let us know. Obviously v1 -> v2 image conversion remains an open issue. Thanks- sage ^ permalink raw reply [flat|nested] 36+ messages in thread
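The bit-rot detection being worked on for BlueStore amounts to checksumming each block at write time and verifying at read time. A minimal sketch of the idea, using zlib's crc32 purely as a stand-in (BlueStore's actual checksum algorithms and block sizes may differ):

```python
# Minimal sketch of per-block checksumming for bit-rot detection, the
# mechanism discussed above for BlueStore. zlib.crc32 is a stand-in for
# whatever checksum BlueStore actually uses; the block size is arbitrary.
import zlib

BLOCK = 4096

def write_blocks(data: bytes):
    """Split data into blocks and record a checksum per block at write time."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    sums = [zlib.crc32(b) for b in blocks]
    return blocks, sums

def verify(blocks, sums):
    """Return indices of blocks whose stored checksum no longer matches."""
    return [i for i, (b, s) in enumerate(zip(blocks, sums)) if zlib.crc32(b) != s]

blocks, sums = write_blocks(b"x" * 10000)
assert verify(blocks, sums) == []           # clean read: no mismatches

corrupted = bytearray(blocks[1])
corrupted[0] ^= 0x01                        # simulate a flipped bit on disk
blocks[1] = bytes(corrupted)
print("bad blocks:", verify(blocks, sums))  # the damaged block is detected
```

Doing this inside the OSD's own store is part of why BlueStore is the preferred path: as Sage notes above, full data checksumming will probably never land in ext4 or XFS themselves.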
* Re: Deprecating ext4 support [not found] ` <alpine.DEB.2.11.1604120837120.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org> @ 2016-04-13 3:27 ` Christian Balzer 0 siblings, 0 replies; 36+ messages in thread From: Christian Balzer @ 2016-04-13 3:27 UTC (permalink / raw) To: Sage Weil, ceph-users-Qp0mS5GaXlQ, ceph-maintainers-Qp0mS5GaXlQ Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA Hello, On Tue, 12 Apr 2016 09:56:32 -0400 (EDT) Sage Weil wrote: > Hi all, > > I've posted a pull request that updates any mention of ext4 in the docs: > > https://github.com/ceph/ceph/pull/8556 > > In particular, I would appreciate any feedback on > > https://github.com/ceph/ceph/pull/8556/commits/49604303124a2b546e66d6e130ad4fa296602b01 > > both on substance and delivery. > > Given the previous lack of clarity around ext4, and that it works well > enough for RBD and other short object name workloads, I think the most > we can do now is deprecate it to steer any new OSDs away. > A clear statement of what "short" means in this context and if this (in general) applies to RBD and CephFS would probably be helpful. > And at least in the non-RGW case, I mean deprecate in the "recommend > alternative" sense of the word, not that it won't be tested or that any > code will be removed. > > https://en.wikipedia.org/wiki/Deprecation#Software_deprecation > > If there are ext4 + RGW users, that is still a difficult issue, since it > is broken now, and expensive to fix. > I'm wondering what the cross section of RGW (being "stable" a lot longer than CephFS) and Ext4 users is for this to pop up so late in the game. Also, since Sam didn't pipe up, I'd still would like to know if this is "fixed" by having larger than the default 256Byte Ext4 inodes (2KB in my case) as it isn't purely academic for me. Or maybe other people like "Michael Metz-Martini" who need Ext4 for performance reasons and can't obviously go to BlueStore yet. 
> > On Tue, 12 Apr 2016, Christian Balzer wrote: > > Only RBD on all clusters so far and definitely no plans to change that > > for the main, mission critical production cluster. I might want to add > > CephFS to the other production cluster at some time, though. > > That's good to hear. If you continue to use ext4 (by adjusting down the > max object length), the only limitation you should hit is an indirect > cap on the max RBD image name length. > Just to parse this sentence correctly, is it the name of the object (output of "rados ls"), the name of the image "rbd ls" or either? > > No RGW, but if/when RGW supports "listing objects quickly" (is what I > > vaguely remember from my conversation with Timo Sirainen, the Dovecot > > author) we would be very interested in that particular piece of Ceph as > > well. On a completely new cluster though, so no issue. > > OT, but I suspect he was referring to something slightly different > here. Our conversations about object listing vs the dovecot backend > surrounded the *rados* listing semantics (hash-based, not prefix/name > based). RGW supports fast sorted/prefix name listings, but you pay for > it by maintaining an index (which slows down PUT). The latest RGW in > Jewel has experimental support for a non-indexed 'blind' bucket as well > for users that need some of the RGW features (ACLs, striping, etc.) but > not the ordered object listing and other index-dependent features. > Sorry about the OT, but since the Dovecot (Pro) backend supports S3 I would have thought that RGW would be a logical expansion from there, not going for a completely new (but likely a lot faster) backend using rados. Oh well, I shall go poke them. > > Again, most people that deploy Ceph in a commercial environment (that > > is working for a company) will be under pressure by the penny-pinching > > department to use their HW for 4-5 years (never mind the pace of > > technology and Moore's law). 
> > > > So you will want to: > > a) Announce the end of FileStore ASAP, but then again you can't really > > do that before BlueStore is stable. > > b) support FileStore for 4 years at least after BlueStore is the > > default. This could be done by having a _real_ LTS release, instead of > > dragging Filestore into newer version. > > Right. Nothing can be done until the preferred alternative is > completely stable, and from then it will take quite some time to drop > support or remove it given the install base. > > > > > Which brings me to the reasons why people would want to migrate > > > > (NOT talking about starting freshly) to bluestore. > > > > > > > > 1. Will it be faster (IOPS) than filestore with SSD journals? > > > > Don't think so, but feel free to prove me wrong. > > > > > > It will absolutely faster on the same hardware. Whether BlueStore on > > > HDD only is faster than FileStore HDD + SSD journal will depend on > > > the workload. > > > > > Where would the Journal SSDs enter the picture with BlueStore? > > Not at all, AFAIK, right? > > BlueStore can use as many as three devices: one for the WAL (journal, > though it can be much smaller than FileStores, e.g., 128MB), one for > metadata (e.g., an SSD partition), and one for data. > Right, I blanked on that, despite having read the K/V storage back when they first showed up. Just didn't make the connection with BlueStore. OK, so we have a small write-intent-log, probably even better hosted on NVRAM with new installs. The metadata is the same/similar to what lives in ...current/meta/... on OSDs these days? If so, that's 30MB per PG in my case, so not a lot either. > > I'm thinking again about people with existing HW again. > > What do they do with those SSDs, which aren't necessarily sized in a > > fashion to be sensible SSD pools/cache tiers? 
> > We can either use them for BlueStore wal and metadata, or as a cache for > the data device (e.g., dm-cache, bcache, FlashCache), or some > combination of the above. It will take some time to figure out which > gives the best performance (and for which workloads). > Including finding out which sauce these caching layers prefer when eating your data. ^_- Given the current state of affairs and reports of people here I'll likely take a comfy backseat there. > > > > 2. Will it be bit-rot proof? Note the deafening silence from the > > > > devs in this thread: > > > > http://www.spinics.net/lists/ceph-users/msg26510.html > > > > > > I missed that thread, sorry. > > > > > > We (Mirantis, SanDisk, Red Hat) are currently working on checksum > > > support in BlueStore. Part of the reason why BlueStore is the > > > preferred path is because we will probably never see full > > > checksumming in ext4 or XFS. > > > > > Now this (when done correctly) and BlueStore being a stable default > > will be a much, MUCH higher motivation for people to migrate to it than > > terminating support for something that works perfectly well (for my use > > case at least). > > Agreed. > > > > > > How: > > > > > > > > > > To make this change as visible as possible, the plan is to make > > > > > ceph-osd refuse to start if the backend is unable to support the > > > > > configured max object name (osd_max_object_name_len). The OSD > > > > > will complain that ext4 cannot store such an object and refuse > > > > > to start. A user who is only using RBD might decide they don't > > > > > need long file names to work and can adjust the > > > > > osd_max_object_name_len setting to something small (say, 64) and > > > > > run successfully. They would be taking a risk, though, because > > > > > we would like to stop testing on ext4. > > > > > > > > > > Is this reasonable? > > > > About as reasonable as dropping format 1 support, that is not at > > > > all. 
> > > > https://www.mail-archive.com/ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org/msg28070.html > > > > > > Fortunately nobody (to my knowledge) has suggested dropping format 1 > > > support. :) > > > > > I suggest you look at that thread and your official release notes: > > --- > > * The rbd legacy image format (version 1) is deprecated with the Jewel > > release. Attempting to create a new version 1 RBD image will result in > > a warning. Future releases of Ceph will remove support for version 1 > > RBD images. --- > > "Future releases of Ceph *may* remove support" might be more accurate, > but it doesn't make for as compelling a warning, and it's pretty likely > that *eventually* it will make sense to drop it. That won't happen > without a proper conversation about user impact and migration, though. > There are real problems with format 1 besides just the lack of new > features (e.g., rename vs watchers). > > This is what 'deprecation' means: we're not dropping support now (that > *would* be unreasonable), but we're warning users that at some future > point we (probably) will. If there is any reason why new images > shouldn't be created with v2, please let us know. Obviously v1 -> v2 > image conversion remains an open issue. > Yup, I did change my default format on the other cluster early on to 2, but the mission critical one is a lot older and at 1 with over 450 images/VMs. So having something that will convert things with a light touch is very much needed. Thanks again, Christian -- Christian Balzer Network/Systems Engineer chibi-FW+hd8ioUD0@public.gmane.org Global OnLine Japan/Rakuten Communications http://www.gol.com/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Antw: Re: Deprecating ext4 support [not found] ` <20160412083925.5106311d-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org> @ 2016-04-14 9:43 ` Steffen Weißgerber 0 siblings, 0 replies; 36+ messages in thread From: Steffen Weißgerber @ 2016-04-14 9:43 UTC (permalink / raw) To: chibi-FW+hd8ioUD0 Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ, ceph-maintainers-Qp0mS5GaXlQ, ceph-announce-Qp0mS5GaXlQ >>> Christian Balzer <chibi-FW+hd8ioUD0@public.gmane.org> wrote on Tuesday, 12 April 2016 at 01:39: > Hello, > Hi, > I'm officially only allowed to do (preventative) maintenance during weekend > nights on our main production cluster. > That would mean 13 ruined weekends at the realistic rate of 1 OSD per > night, so you can see where my lack of enthusiasm for OSD recreation comes > from. > I'm quite surprised by that. We introduced Ceph for VMs on RBD precisely so we would not have to push maintenance into night shifts. My understanding is that Ceph was also designed to be reliable storage in the face of hardware failure. So, from the end user's perspective, what is the difference between maintaining an OSD and that same OSD failing? In both cases the impact should be none. Maintaining OSDs should be routine, so that you stay confident your application remains safe while hardware fails within the reserve capacity you have configured. In the end, what happens to your cluster when a complete node fails? Regards Steffen > > Christian > -- > Christian Balzer Network/Systems Engineer > chibi-FW+hd8ioUD0@public.gmane.org Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > _______________________________________________ > ceph-users mailing list > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Klinik-Service Neubrandenburg GmbH Allendestr. 30, 17036 Neubrandenburg Amtsgericht Neubrandenburg, HRB 2457 Geschaeftsfuehrerin: Gudrun Kappich ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Deprecating ext4 support [not found] ` <alpine.DEB.2.11.1604111632520.13448-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org> 2016-04-11 21:42 ` Allen Samuels 2016-04-11 23:39 ` Christian Balzer @ 2016-04-12 7:00 ` Michael Metz-Martini | SpeedPartner GmbH 2016-04-13 2:29 ` [ceph-users] " Christian Balzer 2 siblings, 1 reply; 36+ messages in thread From: Michael Metz-Martini | SpeedPartner GmbH @ 2016-04-12 7:00 UTC (permalink / raw) To: Sage Weil, ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ, ceph-maintainers-Qp0mS5GaXlQ, ceph-announce-Qp0mS5GaXlQ Hi, On 11.04.2016 at 23:39, Sage Weil wrote: > ext4 has never been recommended, but we did test it. After Jewel is out, > we would like explicitly recommend *against* ext4 and stop testing it. Hmmm. We're currently migrating away from xfs as we had some strange performance issues which were resolved, or at least improved, by switching to ext4. We think this is related to our high number of objects (4358 Mobjects according to ceph -s). > Recently we discovered an issue with the long object name handling > that is not fixable without rewriting a significant chunk of > FileStores filename handling. (There is a limit in the amount of > xattr data ext4 can store in the inode, which causes problems in > LFNIndex.) We're only using cephfs so we shouldn't be affected by the bug you discovered, right? -- Kind regards Michael ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 7:00 ` Michael Metz-Martini | SpeedPartner GmbH @ 2016-04-13 2:29 ` Christian Balzer 2016-04-13 12:30 ` Sage Weil 2016-04-13 12:51 ` Michael Metz-Martini | SpeedPartner GmbH 0 siblings, 2 replies; 36+ messages in thread From: Christian Balzer @ 2016-04-13 2:29 UTC (permalink / raw) To: ceph-users Cc: Michael Metz-Martini | SpeedPartner GmbH, Sage Weil, ceph-devel, ceph-maintainers Hello, On Tue, 12 Apr 2016 09:00:19 +0200 Michael Metz-Martini | SpeedPartner GmbH wrote: > Hi, > > On 11.04.2016 at 23:39, Sage Weil wrote: > > ext4 has never been recommended, but we did test it. After Jewel is > > out, we would like explicitly recommend *against* ext4 and stop > > testing it. > Hmmm. We're currently migrating away from xfs as we had some strange > performance-issues which were resolved / got better by switching to > ext4. We think this is related to our high number of objects (4358 > Mobjects according to ceph -s). > It would be interesting to see how this maps out to the OSDs/PGs. I'd guess loads and loads of subdirectories per PG, which is probably where Ext4 performs better than XFS. > > > Recently we discovered an issue with the long object name handling > > that is not fixable without rewriting a significant chunk of > > FileStores filename handling. (There is a limit in the amount of > > xattr data ext4 can store in the inode, which causes problems in > > LFNIndex.) > We're only using cephfs so we shouldn't be affected by your discovered > bug, right? > I don't use CephFS, but you should be able to tell this yourself by doing a "rados -p <poolname> ls" on your data and metadata pools and looking at the resulting name lengths. However since you have so many objects, I'd do that on a test cluster, if you have one. ^o^ If CephFS is using the same/similar hashing to create object names as it does with RBD images I'd imagine you're OK. 
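Christian's "rados ls" suggestion is easy to script. A sketch (the awk helper is mine, not from the thread; the pool name "data" is an assumption, and per his caveat you would run this against a test cluster first on anything with billions of objects):

```shell
longest_name() {
  # print "<length> <name>" for the longest line on stdin
  awk '{ if (length($0) > max) { max = length($0); name = $0 } } END { print max, name }'
}
# On a live cluster (ceph-common installed):
#   rados -p data ls | longest_name
# Sanity check with two RBD-style object names:
printf 'rbd_id.myimage\nrbd_header.abc123\n' | longest_name
```

Compare the reported length against your osd_max_object_name_len before trusting ext4 with the workload.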
Christian -- Christian Balzer Network/Systems Engineer chibi@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ceph-users] Deprecating ext4 support 2016-04-13 2:29 ` [ceph-users] " Christian Balzer @ 2016-04-13 12:30 ` Sage Weil 2016-04-14 0:57 ` Christian Balzer 2016-04-13 12:51 ` Michael Metz-Martini | SpeedPartner GmbH 1 sibling, 1 reply; 36+ messages in thread From: Sage Weil @ 2016-04-13 12:30 UTC (permalink / raw) To: Christian Balzer Cc: ceph-users, Michael Metz-Martini | SpeedPartner GmbH, ceph-devel, ceph-maintainers On Wed, 13 Apr 2016, Christian Balzer wrote: > > > Recently we discovered an issue with the long object name handling > > > that is not fixable without rewriting a significant chunk of > > > FileStores filename handling. (There is a limit in the amount of > > > xattr data ext4 can store in the inode, which causes problems in > > > LFNIndex.) > > We're only using cephfs so we shouldn't be affected by your discovered > > bug, right? > > > I don't use CephFS, but you should be able to tell this yourself by doing > a "rados -p <poolname> ls" on your data and metadata pools and see the > resulting name lengths. > However since you have so many objects, I'd do that on a test cluster, if > you have one. ^o^ > If CephFS is using the same/similar hashing to create object names as it > does with RBD images I'd imagine you're OK. All of CephFS's object names are short, like RBD's. For RBD, there is only one object per image that is long: rbd_id.$name. As long as your RBD image names are "short" (a max length of 256 chars is enough to make ext4 happy) you'll be fine. sage ^ permalink raw reply [flat|nested] 36+ messages in thread
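Sage's 256-character rule of thumb can be audited per pool before deciding an ext4 OSD is safe. A hedged sketch (the pool name and helper are illustrative, not from the thread):

```shell
check_names() {
  # flag any stdin line longer than 256 chars; exit non-zero if one is found
  awk 'length($0) > 256 { print "too long:", $0; bad = 1 } END { exit bad + 0 }'
}
# On a live cluster:
#   rbd ls rbd | check_names && echo "all image names OK"
# Sanity check with typical short image names:
printf 'vm-101-disk-1\nmyimage\n' | check_names && echo "all image names OK"
```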
* Re: [ceph-users] Deprecating ext4 support 2016-04-13 12:30 ` Sage Weil @ 2016-04-14 0:57 ` Christian Balzer 0 siblings, 0 replies; 36+ messages in thread From: Christian Balzer @ 2016-04-14 0:57 UTC (permalink / raw) To: Sage Weil Cc: ceph-users, Michael Metz-Martini | SpeedPartner GmbH, ceph-devel, ceph-maintainers On Wed, 13 Apr 2016 08:30:52 -0400 (EDT) Sage Weil wrote: > On Wed, 13 Apr 2016, Christian Balzer wrote: > > > > Recently we discovered an issue with the long object name handling > > > > that is not fixable without rewriting a significant chunk of > > > > FileStores filename handling. (There is a limit in the amount of > > > > xattr data ext4 can store in the inode, which causes problems in > > > > LFNIndex.) > > > We're only using cephfs so we shouldn't be affected by your > > > discovered bug, right? > > > > > I don't use CephFS, but you should be able to tell this yourself by > > doing a "rados -p <poolname> ls" on your data and metadata pools and > > see the resulting name lengths. > > However since you have so many objects, I'd do that on a test cluster, > > if you have one. ^o^ > > If CephFS is using the same/similar hashing to create object names as > > it does with RBD images I'd imagine you're OK. > > All of CephFS's object names are short, like RBD's. > Sweet! > For RBD, there is only one object per image that is long: rbd_id.$name. > As long as your RBD image names are "short" (a max length of 256 chars > is enough to make ext4 happy) you'll be fine. > No worries there, ganeti definitely creates them way shorter than that and IIRC so do Open(Stack/Nebula). Christian -- Christian Balzer Network/Systems Engineer chibi@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ceph-users] Deprecating ext4 support 2016-04-13 2:29 ` [ceph-users] " Christian Balzer 2016-04-13 12:30 ` Sage Weil @ 2016-04-13 12:51 ` Michael Metz-Martini | SpeedPartner GmbH 1 sibling, 0 replies; 36+ messages in thread From: Michael Metz-Martini | SpeedPartner GmbH @ 2016-04-13 12:51 UTC (permalink / raw) To: Christian Balzer, ceph-users; +Cc: Sage Weil, ceph-devel, ceph-maintainers Hi, On 13.04.2016 at 04:29, Christian Balzer wrote: > On Tue, 12 Apr 2016 09:00:19 +0200 Michael Metz-Martini | SpeedPartner > GmbH wrote: >> On 11.04.2016 at 23:39, Sage Weil wrote: >>> ext4 has never been recommended, but we did test it. After Jewel is >>> out, we would like explicitly recommend *against* ext4 and stop >>> testing it. >> Hmmm. We're currently migrating away from xfs as we had some strange >> performance-issues which were resolved / got better by switching to >> ext4. We think this is related to our high number of objects (4358 >> Mobjects according to ceph -s). > It would be interesting to see how this maps out to the OSDs/PGs. > I'd guess loads and loads of subdirectories per PG, which is probably where > Ext4 performs better than XFS. A simple ls -l takes "ages" on XFS while ext4 lists a directory immediately. According to our findings regarding XFS this seems to be "normal" behavior.

pool name       category        KB              objects
data            -               3240            2265521646
document_root   -               577364          10150
images          -               96197462245     2256616709
metadata        -               1150105         35903724
queue           -               542967346       173865
raw             -               36875247450     13095410
  total of 4736 pgs, 6 pools, 124 TB data, 4359 Mobjects

What would you like to see? A tree? du per directory? As you can see, we have one data object in pool "data" per file saved elsewhere. I'm not sure what this is related to, but maybe this is required by cephfs. -- Kind regards Michael Metz-Martini ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Deprecating ext4 support 2016-04-11 21:39 Deprecating ext4 support Sage Weil [not found] ` <alpine.DEB.2.11.1604111632520.13448-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org> @ 2016-04-11 21:44 ` Sage Weil 2016-04-11 21:57 ` Mark Nelson 2016-04-12 7:45 ` [ceph-users] " Jan Schermer 2016-04-12 6:39 ` [Ceph-maintainers] " Loic Dachary 2016-04-13 14:19 ` [ceph-users] " Francois Lafont 3 siblings, 2 replies; 36+ messages in thread From: Sage Weil @ 2016-04-11 21:44 UTC (permalink / raw) To: ceph-devel, ceph-users, ceph-maintainers, ceph-announce On Mon, 11 Apr 2016, Sage Weil wrote: > Hi, > > ext4 has never been recommended, but we did test it. After Jewel is out, > we would like explicitly recommend *against* ext4 and stop testing it. I should clarify that this is a proposal and solicitation of feedback--we haven't made any decisions yet. Now is the time to weigh in. sage ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Deprecating ext4 support 2016-04-11 21:44 ` Sage Weil @ 2016-04-11 21:57 ` Mark Nelson [not found] ` <570C1DBC.3040408-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2016-04-12 7:45 ` [ceph-users] " Jan Schermer 1 sibling, 1 reply; 36+ messages in thread From: Mark Nelson @ 2016-04-11 21:57 UTC (permalink / raw) To: Sage Weil, ceph-devel, ceph-users, ceph-maintainers, ceph-announce On 04/11/2016 04:44 PM, Sage Weil wrote: > On Mon, 11 Apr 2016, Sage Weil wrote: >> Hi, >> >> ext4 has never been recommended, but we did test it. After Jewel is out, >> we would like explicitly recommend *against* ext4 and stop testing it. > > I should clarify that this is a proposal and solicitation of feedback--we > haven't made any decisions yet. Now is the time to weigh in. To add to this on the performance side, we stopped doing regular performance testing on ext4 (and btrfs) sometime back around when ICE was released to focus specifically on filestore behavior on xfs. There were some cases at the time where ext4 was faster than xfs, but not consistently so. btrfs is often quite fast on fresh fs, but degrades quickly due to fragmentation induced by cow with small-writes-to-large-object workloads (IE RBD small writes). If btrfs auto-defrag is now safe to use in production it might be worth looking at again, but probably not ext4. Set sail for bluestore! Mark > > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Deprecating ext4 support [not found] ` <570C1DBC.3040408-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2016-04-11 22:49 ` Shinobu Kinjo 2016-04-11 23:54 ` [ceph-users] " Robin H. Johnson 2016-04-11 23:09 ` Lionel Bouton 1 sibling, 1 reply; 36+ messages in thread From: Shinobu Kinjo @ 2016-04-11 22:49 UTC (permalink / raw) To: Mark Nelson Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ, ceph-maintainers-Qp0mS5GaXlQ, ceph-announce-Qp0mS5GaXlQ Just to clarify to prevent any confusion. Honestly I've never used ext4 as underlying filesystem for the Ceph cluster, but according to wiki [1], ext4 is recommended -; [1] https://en.wikipedia.org/wiki/Ceph_%28software%29 Shinobu ----- Original Message ----- From: "Mark Nelson" <mnelson-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> To: "Sage Weil" <sage-BnTBU8nroG7k1uMJSBkQmQ@public.gmane.org>, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, ceph-users-Qp0mS5GaXlQ@public.gmane.org, ceph-maintainers-Qp0mS5GaXlQ@public.gmane.org, ceph-announce-Qp0mS5GaXlQ@public.gmane.org Sent: Tuesday, April 12, 2016 6:57:16 AM Subject: Re: [ceph-users] Deprecating ext4 support On 04/11/2016 04:44 PM, Sage Weil wrote: > On Mon, 11 Apr 2016, Sage Weil wrote: >> Hi, >> >> ext4 has never been recommended, but we did test it. After Jewel is out, >> we would like explicitly recommend *against* ext4 and stop testing it. > > I should clarify that this is a proposal and solicitation of feedback--we > haven't made any decisions yet. Now is the time to weigh in. To add to this on the performance side, we stopped doing regular performance testing on ext4 (and btrfs) sometime back around when ICE was released to focus specifically on filestore behavior on xfs. There were some cases at the time where ext4 was faster than xfs, but not consistently so. btrfs is often quite fast on fresh fs, but degrades quickly due to fragmentation induced by cow with small-writes-to-large-object workloads (IE RBD small writes). 
If btrfs auto-defrag is now safe to use in production it might be worth looking at again, but probably not ext4. Set sail for bluestore! Mark > > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > _______________________________________________ ceph-users mailing list ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ceph-users] Deprecating ext4 support 2016-04-11 22:49 ` Shinobu Kinjo @ 2016-04-11 23:54 ` Robin H. Johnson 0 siblings, 0 replies; 36+ messages in thread From: Robin H. Johnson @ 2016-04-11 23:54 UTC (permalink / raw) To: Shinobu Kinjo Cc: Mark Nelson, Sage Weil, ceph-devel, ceph-users, ceph-maintainers, ceph-announce On Mon, Apr 11, 2016 at 06:49:09PM -0400, Shinobu Kinjo wrote: > Just to clarify to prevent any confusion. > > Honestly I've never used ext4 as underlying filesystem for the Ceph cluster, but according to wiki [1], ext4 is recommended -; > > [1] https://en.wikipedia.org/wiki/Ceph_%28software%29 Clearly somebody made a copy&paste error from the actual documentation. Here's the docs on master and the recent LTS releases. http://docs.ceph.com/docs/firefly/rados/configuration/filesystem-recommendations/ http://docs.ceph.com/docs/hammer/rados/configuration/filesystem-recommendations/ http://docs.ceph.com/docs/master2/rados/configuration/filesystem-recommendations/ The documentation has NEVER recommended ext4. Here's a slice of all history for that file: http://dev.gentoo.org/~robbat2/ceph-history-of-filesystem-recommendations.patch Generated with $ git log -C -C -M -p ceph/master -- \ doc/rados/configuration/filesystem-recommendations.rst \ doc/config-cluster/file-system-recommendations.rst \ doc/config-cluster/file_system_recommendations.rst The very first version, back in 2012, said: > ``ext4`` is a poor file system choice if you intend to deploy the > RADOS Gateway or use snapshots on versions earlier than 0.45. -- Robin Hugh Johnson Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee E-Mail : robbat2@gentoo.org GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85 ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Deprecating ext4 support [not found] ` <570C1DBC.3040408-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2016-04-11 22:49 ` Shinobu Kinjo @ 2016-04-11 23:09 ` Lionel Bouton 1 sibling, 0 replies; 36+ messages in thread From: Lionel Bouton @ 2016-04-11 23:09 UTC (permalink / raw) To: Mark Nelson, Sage Weil, ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users-Qp0mS5GaXlQ, ceph-maintainers-Qp0mS5GaXlQ, ceph-announce-Qp0mS5GaXlQ Hi, Le 11/04/2016 23:57, Mark Nelson a écrit : > [...] > To add to this on the performance side, we stopped doing regular > performance testing on ext4 (and btrfs) sometime back around when ICE > was released to focus specifically on filestore behavior on xfs. > There were some cases at the time where ext4 was faster than xfs, but > not consistently so. btrfs is often quite fast on fresh fs, but > degrades quickly due to fragmentation induced by cow with > small-writes-to-large-object workloads (IE RBD small writes). If > btrfs auto-defrag is now safe to use in production it might be worth > looking at again, but probably not ext4. For BTRFS, autodefrag is probably not performance-safe (yet), at least with RBD access patterns. At least it wasn't in 4.1.9 when we tested it last time (the performance degraded slowly but surely over several weeks from an initially good performing filesystem to the point where we measured a 100% increase in average latencies and large spikes and stopped the experiment). I didn't see any patches on linux-btrfs since then (it might have benefited from other modifications, but the autodefrag algorithm wasn't reworked itself AFAIK). That's not an inherent problem of BTRFS but of the autodefrag implementation though. 
Deactivating autodefrag and reimplementing a basic, cautious defragmentation scheduler gave us noticeably better latencies with BTRFS vs XFS (~30% better) on the same hardware and workload long term (as in almost a year and countless full-disk rewrites on the same filesystems due to both normal writes and rebalancing with 3 to 4 months of XFS and BTRFS OSDs coexisting for comparison purposes). I'll certainly remount a subset of our OSDs autodefrag as I did with 4.1.9 when we will deploy 4.4.x or a later LTS kernel. So I might have more up to date information in the coming months. I don't plan to compare BTRFS to XFS anymore though : XFS only saves us from running our defragmentation scheduler, BTRFS is far more suited to our workload and we've seen constant improvements in behavior in the (arguably bumpy until late 3.19 versions) 3.16.x to 4.1.x road. Other things: * If the journal is not on a separate partition (SSD), it should definitely be re-created NoCoW to avoid unnecessary fragmentation. From memory : stop OSD, touch journal.new, chattr +C journal.new, dd if=journal of=journal.new (your dd options here for best perf/least amount of cache eviction), rm journal, mv journal.new journal, start OSD again. * filestore btrfs snap = false is mandatory if you want consistent performance (at least on HDDs). It may not be felt with almost empty OSDs but performance hiccups appear if any non trivial amount of data is added to the filesystems. IIRC, after debugging surprisingly the snapshot creation didn't seem to be the actual cause of the performance problems but the snapshot deletion... It's so bad that the default should probably be false and not true. Lionel ^ permalink raw reply [flat|nested] 36+ messages in thread
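Lionel's journal re-creation steps, as a sketch. The OSD id and paths are assumptions; on a real node you would stop the OSD first (e.g. `systemctl stop ceph-osd@0`), work in `/var/lib/ceph/osd/ceph-0`, and keep the `chattr +C` step, which must hit the file while it is still empty for btrfs to honor NoCoW. Demonstrated here on throwaway files:

```shell
dir=$(mktemp -d) && cd "$dir"
printf 'fake journal contents' > journal   # stand-in for the real OSD journal
touch journal.new                          # create the new file empty...
# chattr +C journal.new                    # ...so NoCoW can apply (btrfs only)
dd if=journal of=journal.new bs=1M status=none
rm journal && mv journal.new journal
cat journal
```

On the real journal you would pick dd options for best performance and least cache eviction, as Lionel notes, then restart the OSD.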
* Re: [ceph-users] Deprecating ext4 support 2016-04-11 21:44 ` Sage Weil 2016-04-11 21:57 ` Mark Nelson @ 2016-04-12 7:45 ` Jan Schermer 2016-04-12 18:00 ` Sage Weil 1 sibling, 1 reply; 36+ messages in thread From: Jan Schermer @ 2016-04-12 7:45 UTC (permalink / raw) To: Sage Weil; +Cc: ceph-devel, ceph-users, ceph-maintainers, ceph-announce I'd like to raise these points, then 1) some people (like me) will never ever use XFS if they have a choice; given no choice, we will not use something that depends on XFS 2) choice is always good 3) doesn't the majority of Ceph users only care about RBD? (Angry rant coming) Even our last performance testing of Ceph (Infernalis) showed abysmal performance. The most damning sign is the consumption of CPU time at an unprecedented rate. Was it faster than Dumpling? Slightly, but it ate more CPU also, so in effect it was not really "faster". It would make *some* sense to only support ZFS or BTRFS because you can offload things like clones/snapshots and consistency to the filesystem - which would make the architecture much simpler and everything much faster. Instead you insist on XFS and reimplement everything in software. I always dismissed this because CPU time was usually cheap, but in practice it simply doesn't work. You duplicate things that filesystems had solved for years now (namely crash consistency - though we have seen that fail as well), instead of letting them do their work and stripping the IO path to the bare necessity and letting someone smarter and faster handle that. IMO, if Ceph was moving in the right direction there would be no "supported filesystem" debate, instead we'd be free to choose whatever is there that provides the guarantees we need from a filesystem (which is usually every filesystem in the kernel) and Ceph would simply distribute our IO around with CRUSH. 
Right now CRUSH (and in effect what it allows us to do with data) is _the_ reason people use Ceph, as there simply wasn't much else to use for distributed storage. This isn't true anymore and the alternatives are orders of magnitude faster and smaller. Jan P.S. If anybody needs a way out I think I found it, with no need to trust a higher power :P > On 11 Apr 2016, at 23:44, Sage Weil <sage@newdream.net> wrote: > > On Mon, 11 Apr 2016, Sage Weil wrote: >> Hi, >> >> ext4 has never been recommended, but we did test it. After Jewel is out, >> we would like explicitly recommend *against* ext4 and stop testing it. > > I should clarify that this is a proposal and solicitation of feedback--we > haven't made any decisions yet. Now is the time to weigh in. > > sage > _______________________________________________ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 7:45 ` [ceph-users] " Jan Schermer @ 2016-04-12 18:00 ` Sage Weil 2016-04-12 19:19 ` Jan Schermer 0 siblings, 1 reply; 36+ messages in thread From: Sage Weil @ 2016-04-12 18:00 UTC (permalink / raw) To: Jan Schermer; +Cc: ceph-devel, ceph-users, ceph-maintainers On Tue, 12 Apr 2016, Jan Schermer wrote: > I'd like to raise these points, then > > 1) some people (like me) will never ever use XFS if they have a choice > given no choice, we will not use something that depends on XFS > > 2) choice is always good Okay! > 3) doesn't majority of Ceph users only care about RBD? Probably that's true now. We shouldn't recommend something that prevents them from adding RGW to an existing cluster in the future, though. > (Angry rant coming) > Even our last performance testing of Ceph (Infernalis) showed abysmal > performance. The most damning sign is the consumption of CPU time at > unprecedented rate. Was it faster than Dumpling? Slightly, but it ate > more CPU also, so in effect it was not really "faster". > > It would make *some* sense to only support ZFS or BTRFS because you can > offload things like clones/snapshots and consistency to the filesystem - > which would make the architecture much simpler and everything much > faster. Instead you insist on XFS and reimplement everything in > software. I always dismissed this because CPU time was ususally cheap, > but in practice it simply doesn't work. You duplicate things that > filesystems had solved for years now (namely crash consistency - though > we have seen that fail as well), instead of letting them do their work > and stripping the IO path to the bare necessity and letting someone > smarter and faster handle that. 
> > IMO, if Ceph was moving in the right direction there would be no > "supported filesystem" debate, instead we'd be free to choose whatever > is there that provides the guarantees we need from a filesystem (which is > usually every filesystem in the kernel) and Ceph would simply distribute > our IO around with CRUSH. > > Right now CRUSH (and in effect what it allows us to do with data) is > _the_ reason people use Ceph, as there simply wasn't much else to use > for distributed storage. This isn't true anymore and the alternatives > are orders of magnitude faster and smaller. This touched on pretty much every reason why we are ditching file systems entirely and moving toward BlueStore. Local kernel file systems maintain their own internal consistency, but they only provide the consistency promises the POSIX interface does--which is almost nothing. That's why every complicated data structure (e.g., database) stored on a file system ever includes its own journal. In our case, what POSIX provides isn't enough. We can't even update a file and its xattr atomically, let alone the much more complicated transitions we need to do. We could "wing it" and hope for the best, then do an expensive crawl and rsync of data on recovery, but we chose very early on not to do that. If you want a system that "just" layers over an existing filesystem, you can try Gluster (although note that they have a different sort of pain with the ordering of xattr updates, and are moving toward a model that looks more like Ceph's backend in their next version). Offloading stuff to the file system doesn't save you CPU--it just makes someone else responsible. What does save you CPU is avoiding the complexity you don't need (i.e., half of what the kernel file system is doing, and everything we have to do to work around an ill-suited interface) and instead implementing exactly the set of features that we need to get the job done. 
FileStore is slow, mostly because of the above, but also because it is an old and not-very-enlightened design. BlueStore is roughly 2x faster in early testing. Finally, remember you *are* completely free to run Ceph on whatever file system you want--and many do. We just aren't going to test them all for you and promise they will all work. Remember that we have hit different bugs in every single one we've tried. It's not as simple as saying they just have to "provide the guarantees we need" given the complexity of the interface, and almost every time we've tried to use "supported" APIs that are remotely unusual (fallocate, zeroing extents... even xattrs) we've hit bugs or undocumented limits and idiosyncrasies on one fs or another. Cheers- sage ^ permalink raw reply [flat|nested] 36+ messages in thread
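Sage's point that a data write plus a metadata update is never one atomic step under POSIX, so the store must record its intent first, is the classic write-ahead pattern. A toy sketch (file names invented; FileStore's real transactions are far more involved, and it keeps metadata in xattrs rather than a sidecar file):

```shell
d=$(mktemp -d) && cd "$d"
printf 'path=obj ver=2' > journal && sync   # 1. durably record the intent
printf 'object data v2' > obj               # 2. apply the data write
printf '2' > obj.meta                       # 3. apply the metadata update
rm journal                                  # 4. commit: drop the journal entry
cat obj
```

A crash between steps 1 and 4 is recoverable by replaying or discarding the journal entry on startup, which is exactly the guarantee the bare filesystem does not give.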
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 18:00 ` Sage Weil @ 2016-04-12 19:19 ` Jan Schermer 2016-04-12 19:58 ` Sage Weil 0 siblings, 1 reply; 36+ messages in thread From: Jan Schermer @ 2016-04-12 19:19 UTC (permalink / raw) To: Sage Weil; +Cc: ceph-devel, ceph-users, ceph-maintainers > On 12 Apr 2016, at 20:00, Sage Weil <sage@newdream.net> wrote: > > On Tue, 12 Apr 2016, Jan Schermer wrote: >> I'd like to raise these points, then >> >> 1) some people (like me) will never ever use XFS if they have a choice >> given no choice, we will not use something that depends on XFS >> >> 2) choice is always good > > Okay! > >> 3) doesn't majority of Ceph users only care about RBD? > > Probably that's true now. We shouldn't recommend something that prevents > them from adding RGW to an existing cluster in the future, though. > >> (Angry rant coming) >> Even our last performance testing of Ceph (Infernalis) showed abysmal >> performance. The most damning sign is the consumption of CPU time at >> unprecedented rate. Was it faster than Dumpling? Slightly, but it ate >> more CPU also, so in effect it was not really "faster". >> >> It would make *some* sense to only support ZFS or BTRFS because you can >> offload things like clones/snapshots and consistency to the filesystem - >> which would make the architecture much simpler and everything much >> faster. Instead you insist on XFS and reimplement everything in >> software. I always dismissed this because CPU time was ususally cheap, >> but in practice it simply doesn't work. You duplicate things that >> filesystems had solved for years now (namely crash consistency - though >> we have seen that fail as well), instead of letting them do their work >> and stripping the IO path to the bare necessity and letting someone >> smarter and faster handle that. 
>> >> IMO, If Ceph was moving in the right direction there would be no >> "supported filesystem" debate, instead we'd be free to choose whatever >> is there that provides the guarantees we need from a filesystem (which is >> usually every filesystem in the kernel) and Ceph would simply distribute >> our IO around with CRUSH. >> >> Right now CRUSH (and in effect what it allows us to do with data) is >> _the_ reason people use Ceph, as there simply wasn't much else to use >> for distributed storage. This isn't true anymore and the alternatives >> are orders of magnitude faster and smaller. > > This touched on pretty much every reason why we are ditching file > systems entirely and moving toward BlueStore. Nooooooooooooooo! > > Local kernel file systems maintain their own internal consistency, but > they only provide the consistency promises the POSIX interface > does--which is almost nothing. ... which is exactly what everyone expects ... which is everything any app needs > That's why every complicated data > structure (e.g., database) stored on a file system ever includes its own > journal. ... see? > In our case, what POSIX provides isn't enough. We can't even > update a file and its xattr atomically, let alone the much more > complicated transitions we need to do. ... have you thought that maybe xattrs weren't meant to be abused this way? Filesystems usually aren't designed to be performant key=value stores. And I still feel (ironically) that you don't understand what journals and commits/flushes are for if you make this argument... Btw I think at least the i_version xattr could be atomic. > We could "wing it" and hope for > the best, then do an expensive crawl and rsync of data on recovery, but we > chose very early on not to do that.
If you want a system that "just" > layers over an existing filesystem, you can try Gluster (although note > that they have a different sort of pain with the ordering of xattr > updates, and are moving toward a model that looks more like Ceph's backend > in their next version). True, which is why we dismissed it. > > Offloading stuff to the file system doesn't save you CPU--it just makes > someone else responsible. What does save you CPU is avoiding the > complexity you don't need (i.e., half of what the kernel file system is > doing, and everything we have to do to work around an ill-suited > interface) and instead implement exactly the set of features that we need > to get the job done. In theory you are right. In practice in-kernel filesystems are fast, and fuse filesystems are slow. Ceph is like that - slow. And you want to be fast by writing more code :) > > FileStore is slow, mostly because of the above, but also because it is an > old and not-very-enlightened design. BlueStore is roughly 2x faster in > early testing. ... which is still literally orders of magnitude slower than a filesystem. I dug into bluestore and how you want to implement it, and from what I understood you are reimplementing what the filesystem journal does... It makes sense it will be 2x faster if you avoid the double-journalling, but I'd be very much surprised if it helped with CPU usage one bit - I certainly don't see my filesystems consuming a significant amount of CPU time on any of my machines, and I seriously doubt you're going to do that better, sorry. > > Finally, remember you *are* completely free to run Ceph on whatever file > system you want--and many do. We just aren't going to test them all for > you and promise they will all work. Remember that we have hit different > bugs in every single one we've tried.
It's not as simple as saying they > just have to "provide the guarantees we need" given the complexity of the > interface, and almost every time we've tried to use "supported" APIs that > are remotely unusual (fallocate, zeroing extents... even xattrs) we've > hit bugs or undocumented limits and idiosyncrasies on one fs or another. This can be a valid point; those are features people either don't use, or use quite differently. But just because you can stress the filesystems until they break doesn't mean you should go write a new one. What makes you think you will do a better job than all the people who made xfs/ext4/...? Anyway, I don't know how else to debunk the "insufficient guarantees in POSIX filesystem transactions" myth that you insist on fixing, so I guess I'll have to wait until you rewrite everything down to the drive firmware to appreciate it :) Jan P.S. A joke for you: How many syscalls does it take for Ceph to write "lightbulb" to the disk? 10 000 ha ha? > > Cheers- > sage ^ permalink raw reply [flat|nested] 36+ messages in thread
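The atomicity point argued back and forth above (Sage: "we can't even update a file and its xattr atomically") can be made concrete with a toy model. This is illustrative Python only, not Ceph code; the class and field names are invented, and the two assignments stand in for the two separate syscalls (write() and setxattr()) that POSIX offers:

```python
# Toy model (hypothetical, simplified): data and xattr are written by two
# distinct syscalls, so a crash can always land in the window between them.

class SimulatedDisk:
    def __init__(self):
        self.data = "A"          # object payload
        self.xattr = {"ver": 1}  # piggybacked metadata (e.g. an object version)

def update(disk, new_data, new_ver, crash_between=False):
    disk.data = new_data                  # syscall 1: write()
    if crash_between:
        raise RuntimeError("power loss")  # crash window between the two syscalls
    disk.xattr["ver"] = new_ver           # syscall 2: setxattr()

disk = SimulatedDisk()
try:
    update(disk, "B", 2, crash_between=True)
except RuntimeError:
    pass

# After "recovery": the data is new but the version xattr still says 1,
# so the metadata no longer describes the data actually on disk.
print(disk.data, disk.xattr["ver"])  # B 1
```

Reordering the two steps only moves the inconsistency (new metadata describing old data); no ordering removes the window, which is the gap a transactional backend has to close.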
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 19:19 ` Jan Schermer @ 2016-04-12 19:58 ` Sage Weil 2016-04-12 20:33 ` Jan Schermer 0 siblings, 1 reply; 36+ messages in thread From: Sage Weil @ 2016-04-12 19:58 UTC (permalink / raw) To: Jan Schermer; +Cc: ceph-devel, ceph-users, ceph-maintainers Okay, I'll bite. On Tue, 12 Apr 2016, Jan Schermer wrote: > > Local kernel file systems maintain their own internal consistency, but > > they only provide what consistency the POSIX interface > > promises--which is almost nothing. > > ... which is exactly what everyone expects > ... which is everything any app needs > > > That's why every complicated data > > structure (e.g., database) stored on a file system ever includes its own > > journal. > ... see? They do this because POSIX doesn't give them what they want. They implement a *second* journal on top. The result is that you get the overhead from both--the fs journal keeping its data structures consistent, and the database keeping its own. If you're not careful, that means the db has to do something like file write, fsync, db journal append, fsync. And both fsyncs turn into a *fs* journal io and flush. (Smart databases often avoid most of the fs overhead by putting everything in a single large file, but at that point the file system isn't actually doing anything except passing IO to the block layer). There is nothing wrong with POSIX file systems. They have the unenviable task of catering to a huge variety of workloads and applications, but are truly optimal for very few. And that's fine. If you want a local file system, you should use ext4 or XFS, not Ceph. But it turns out ceph-osd isn't a generic application--it has a pretty specific workload pattern, and POSIX doesn't give us the interfaces we want (mainly, atomic transactions or ordered object/file enumeration).
> > We coudl "wing it" and hope for > > the best, then do an expensive crawl and rsync of data on recovery, but we > > chose very early on not to do that. If you want a system that "just" > > layers over an existing filesystem, try you can try Gluster (although note > > that they have a different sort of pain with the ordering of xattr > > updates, and are moving toward a model that looks more like Ceph's backend > > in their next version). > > True, which is why we dismissed it. ...and yet it does exactly what you asked for: > > > IMO, If Ceph was moving in the right direction [...] Ceph would > > > simply distribute our IO around with CRUSH. You want ceph to "just use a file system." That's what gluster does--it just layers the distributed namespace right on top of a local namespace. If you didn't care about correctness or data safety, it would be beautiful, and just as fast as the local file system (modulo network). But if you want your data safe, you immediatley realize that local POSIX file systems don't get you want you need: the atomic update of two files on different servers so that you can keep your replicas in sync. Gluster originally took the minimal path to accomplish this: a "simple" prepare/write/commit, using xattrs as transaction markers. We took a heavyweight approach to support arbitrary transactions. And both of us have independently concluded that the local fs is the wrong tool for the job. > > Offloading stuff to the file system doesn't save you CPU--it just makes > > someone else responsible. What does save you CPU is avoiding the > > complexity you don't need (i.e., half of what the kernel file system is > > doing, and everything we have to do to work around an ill-suited > > interface) and instead implement exactly the set of features that we need > > to get the job done. > > In theory you are right. > In practice in-kernel filesystems are fast, and fuse filesystems are slow. > Ceph is like that - slow. 
And you want to be fast by writing more code :) You get fast by writing the *right* code, and eliminating layers of the stack (the local file system, in this case) that are providing functionality you don't want (or more functionality than you need at too high a price). > I dug into bluestore and how you want to implement it, and from what I > understood you are reimplementing what the filesystem journal does... Yes. The difference is that a single journal manages all of the metadata and data consistency in the system, instead of a local fs journal managing just block allocation and a second ceph journal managing ceph's data structures. The main benefit, though, is that we can choose a different set of semantics, like the ability to overwrite data in a file/object and update metadata atomically. You can't do that with POSIX without building a write-ahead journal and double-writing. > Btw I think at least the i_version xattr could be atomic. Nope. All major file systems (other than btrfs) overwrite data in place, which means it is impossible for any piece of metadata to accurately indicate whether you have the old data or the new data (or perhaps a bit of both). > It makes sense it will be 2x faster if you avoid the double-journalling, > but I'd be very much surprised if it helped with CPU usage one bit - I > certainly don't see my filesystems consuming a significant amount of CPU > time on any of my machines, and I seriously doubt you're going to do > that better, sorry. Apples and oranges. The file systems aren't doing what we're doing. But once you combine what we spend now in FileStore + a local fs, BlueStore will absolutely spend less CPU time. > What makes you think you will do a better job than all the people who > made xfs/ext4/...? I don't. XFS et al are great file systems and for the most part I have no complaints about them. The problem is that Ceph doesn't need a file system: it needs a transactional object store with a different set of features.
So that's what we're building. sage ^ permalink raw reply [flat|nested] 36+ messages in thread
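The double-journal write path Sage describes above (file write, fsync, db journal append, fsync) can be shown in miniature. This is an illustrative Python toy, not a real database; the file names are invented, and the point is only that the application pays two fsyncs per transaction, each of which also forces a commit in the file system's *own* journal:

```python
# Sketch of a database journaling on top of a journaling file system.
import os
import tempfile

fsyncs = 0

def fsync(fd):
    # Each application-level fsync also triggers an fs-journal commit + flush.
    global fsyncs
    os.fsync(fd)
    fsyncs += 1

tmp = tempfile.mkdtemp()
data_fd = os.open(os.path.join(tmp, "table"), os.O_CREAT | os.O_WRONLY)
jrnl_fd = os.open(os.path.join(tmp, "db.journal"), os.O_CREAT | os.O_WRONLY)

# One durable "transaction", as seen from the application:
os.write(data_fd, b"row")     # 1. file write
fsync(data_fd)                # 2. flush data (fs journal io + flush)
os.write(jrnl_fd, b"commit")  # 3. db journal append
fsync(jrnl_fd)                # 4. flush journal (fs journal io + flush, again)

os.close(data_fd)
os.close(jrnl_fd)
print(fsyncs)  # 2
```

This is also why the "smart databases" aside holds: putting everything in one preallocated file removes most of the fs metadata traffic, but the double flush per commit remains.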
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 19:58 ` Sage Weil @ 2016-04-12 20:33 ` Jan Schermer 2016-04-12 20:47 ` Sage Weil 0 siblings, 1 reply; 36+ messages in thread From: Jan Schermer @ 2016-04-12 20:33 UTC (permalink / raw) To: Sage Weil; +Cc: ceph-devel, ceph-users, ceph-maintainers Still the answer to most of your points from me is "but who needs that?" Who needs to have exactly the same data in two separate objects (replicas)? Ceph needs it because "consistency"?, but the app (VM filesystem) is fine with whatever version because the flush didn't happen (if it did the contents would be the same). You say "Ceph needs", but I say "the guest VM needs" - there's the problem. > On 12 Apr 2016, at 21:58, Sage Weil <sage@newdream.net> wrote: > > Okay, I'll bite. > > On Tue, 12 Apr 2016, Jan Schermer wrote: >>> Local kernel file systems maintain their own internal consistency, but >>> they only provide what consistency the POSIX interface >>> promises--which is almost nothing. >> >> ... which is exactly what everyone expects >> ... which is everything any app needs >> >>> That's why every complicated data >>> structure (e.g., database) stored on a file system ever includes its own >>> journal. >> ... see? > > They do this because POSIX doesn't give them what they want. They > implement a *second* journal on top. The result is that you get the > overhead from both--the fs journal keeping its data structures consistent, > and the database keeping its own. If you're not careful, that means > the db has to do something like file write, fsync, db journal append, > fsync. It's more like: transaction log write, flush, data write. That's simply because most filesystems don't journal data, but some do. > And both fsyncs turn into a *fs* journal io and flush.
(Smart > databases often avoid most of the fs overhead by putting everything in a > single large file, but at that point the file system isn't actually doing > anything except passing IO to the block layer). > > There is nothing wrong with POSIX file systems. They have the unenviable > task of catering to a huge variety of workloads and applications, but are > truly optimal for very few. And that's fine. If you want a local file > system, you should use ext4 or XFS, not Ceph. > > But it turns out ceph-osd isn't a generic application--it has a pretty > specific workload pattern, and POSIX doesn't give us the interfaces we > want (mainly, atomic transactions or ordered object/file enumeration). The workload (with RBD) is inevitably expecting POSIX. Who needs more than that? To me that indicates unnecessary guarantees. > >>> We could "wing it" and hope for >>> the best, then do an expensive crawl and rsync of data on recovery, but we >>> chose very early on not to do that. If you want a system that "just" >>> layers over an existing filesystem, you can try Gluster (although note >>> that they have a different sort of pain with the ordering of xattr >>> updates, and are moving toward a model that looks more like Ceph's backend >>> in their next version). >> >> True, which is why we dismissed it. > > ...and yet it does exactly what you asked for: I was implying it suffers the same flaws. In any case it wasn't really fast and it seemed overly complex. To be fair it was a while ago when I tried it. Can't talk about consistency - I don't think I ever used it in production as more than a PoC. > >>>> IMO, If Ceph was moving in the right direction [...] Ceph would >>>> simply distribute our IO around with CRUSH. > > You want ceph to "just use a file system." That's what gluster does--it > just layers the distributed namespace right on top of a local namespace.
> If you didn't care about correctness or data safety, it would be > beautiful, and just as fast as the local file system (modulo network). > But if you want your data safe, you immediately realize that local POSIX > file systems don't get you what you need: the atomic update of two files > on different servers so that you can keep your replicas in sync. Gluster > originally took the minimal path to accomplish this: a "simple" > prepare/write/commit, using xattrs as transaction markers. We took a > heavyweight approach to support arbitrary transactions. And both of us > have independently concluded that the local fs is the wrong tool for the > job. > >>> Offloading stuff to the file system doesn't save you CPU--it just makes >>> someone else responsible. What does save you CPU is avoiding the >>> complexity you don't need (i.e., half of what the kernel file system is >>> doing, and everything we have to do to work around an ill-suited >>> interface) and instead implement exactly the set of features that we need >>> to get the job done. >> >> In theory you are right. >> In practice in-kernel filesystems are fast, and fuse filesystems are slow. >> Ceph is like that - slow. And you want to be fast by writing more code :) > > You get fast by writing the *right* code, and eliminating layers of the > stack (the local file system, in this case) that are providing > functionality you don't want (or more functionality than you need at too > high a price). > >> I dug into bluestore and how you want to implement it, and from what I >> understood you are reimplementing what the filesystem journal does... > > Yes. The difference is that a single journal manages all of the metadata > and data consistency in the system, instead of a local fs journal managing > just block allocation and a second ceph journal managing ceph's data > structures.
> > The main benefit, though, is that we can choose a different set of > semantics, like the ability to overwrite data in a file/object and update > metadata atomically. You can't do that with POSIX without building a > write-ahead journal and double-writing. > >> Btw I think at least the i_version xattr could be atomic. > > Nope. All major file systems (other than btrfs) overwrite data in place, > which means it is impossible for any piece of metadata to accurately > indicate whether you have the old data or the new data (or perhaps a bit > of both). > >> It makes sense it will be 2x faster if you avoid the double-journalling, >> but I'd be very much surprised if it helped with CPU usage one bit - I >> certainly don't see my filesystems consuming a significant amount of CPU >> time on any of my machines, and I seriously doubt you're going to do >> that better, sorry. > > Apples and oranges. The file systems aren't doing what we're doing. But > once you combine what we spend now in FileStore + a local fs, > BlueStore will absolutely spend less CPU time. I don't think it's apples and oranges. If I export two files via losetup over iSCSI and make a raid1 swraid out of them in a guest VM, I bet it will still be faster than ceph with bluestore. And yet it will provide the same guarantees and do the same job without eating significant CPU time. True or false? Yes, the filesystem is unnecessary in this scenario, but the performance impact is negligible if you use it right. > >> What makes you think you will do a better job than all the people who >> made xfs/ext4/...? > > I don't. XFS et al are great file systems and for the most part I have no > complaints about them. The problem is that Ceph doesn't need a file > system: it needs a transactional object store with a different set of > features. So that's what we're building. > > sage ^ permalink raw reply [flat|nested] 36+ messages in thread
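Sage's point quoted above, that overwriting data and updating metadata atomically on POSIX requires a write-ahead journal and double-writing, can be sketched with a minimal in-memory model. Illustrative only; in a real store the journal would be an fsync'd on-disk log, and replay would run at startup:

```python
# Minimal write-ahead-journal sketch: journal the whole transaction first,
# then apply it in place (the "double write"). Replay after a crash makes
# the data+metadata pair all-or-nothing.

journal = []                         # append-only WAL (on-disk + fsync in reality)
store = {"data": "A", "meta": 1}     # in-place state

def apply_txn(d, m):
    store["data"], store["meta"] = d, m

def commit(new_data, new_meta, crash_before_apply=False):
    journal.append(("txn", new_data, new_meta))  # 1. journal first (+flush)
    if crash_before_apply:
        return                                    # crash: apply never ran
    apply_txn(new_data, new_meta)                 # 2. second write, in place

def replay():
    # Recovery: re-apply every journaled transaction (idempotent here).
    for _, d, m in journal:
        apply_txn(d, m)

commit("B", 2, crash_before_apply=True)
assert store == {"data": "A", "meta": 1}  # nothing applied before the crash
replay()
print(store)  # {'data': 'B', 'meta': 2} -- both updates land, atomically
```

Every committed byte is written twice (journal, then in place), which is exactly the overhead a backend with its own transactional semantics can avoid for large writes.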
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 20:33 ` Jan Schermer @ 2016-04-12 20:47 ` Sage Weil [not found] ` <alpine.DEB.2.11.1604121639590.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org> ` (2 more replies) 0 siblings, 3 replies; 36+ messages in thread From: Sage Weil @ 2016-04-12 20:47 UTC (permalink / raw) To: Jan Schermer; +Cc: ceph-devel, ceph-users, ceph-maintainers On Tue, 12 Apr 2016, Jan Schermer wrote: > Still the answer to most of your points from me is "but who needs that?" > Who needs to have exactly the same data in two separate objects > (replicas)? Ceph needs it because "consistency"?, but the app (VM > filesystem) is fine with whatever version because the flush didn't > happen (if it did the contents would be the same). If you want a replicated VM store that isn't picky about consistency, try Sheepdog. Or your mdraid over iSCSI proposal. We care about these things because VMs are just one of many users of rados, and because even if we could get away with being sloppy in some (or even most) cases with VMs, we need the strong consistency to build other features people want, like RBD journaling for multi-site async replication. Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that chose rados for a reason. And we want to make sense of an inconsistency when we find one on scrub. (Does it mean the disk is returning bad data, or we just crashed during a write a while back?) ... Cheers- sage ^ permalink raw reply [flat|nested] 36+ messages in thread
[parent not found: <alpine.DEB.2.11.1604121639590.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>]
* Re: Deprecating ext4 support [not found] ` <alpine.DEB.2.11.1604121639590.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org> @ 2016-04-12 21:08 ` Nick Fisk [not found] ` <4f0f087c.9Ro.9Gf.hg.1qX1VyMOaD-ImYt9qTNe79BDgjK7y7TUQ@public.gmane.org> 0 siblings, 1 reply; 36+ messages in thread From: Nick Fisk @ 2016-04-12 21:08 UTC (permalink / raw) To: 'Sage Weil', 'Jan Schermer' Cc: 'ceph-devel', 'ceph-users', ceph-maintainers-Qp0mS5GaXlQ Jan, I would like to echo Sage's response here. It seems you only want a subset of what Ceph offers, whereas RADOS is designed to offer a whole lot more, which requires a lot more intelligence at the lower levels. I must say I have found your attitude to both Sage and the Ceph project as a whole over the last few emails quite disrespectful. I spend a lot of my time trying to sell the benefits of open source, which centre on the openness of the idea/code and not around the fact that you can get it for free. One of the things that I like about open source is the constructive, albeit sometimes abrupt, criticism that results in a better product. Simply shouting Ceph is slow and it's because devs don't understand filesystems is not constructive. I've just come back from an expo at ExCel London where many providers are passionately talking about Ceph. There seems to be a lot of big money sloshing about for something that is inherently "wrong". Sage and the core Ceph team seem like very clever people to me and I trust that, over the years of development, if they have decided that standard FS's are not the ideal backing store for Ceph, this is probably the correct decision. However I am also aware that the human condition "Can't see the wood for the trees" is everywhere and I'm sure if you have any clever insights into filesystem behaviour, the Ceph Dev team would be more than open to suggestions.
Personally I wish I could contribute more to the project as I feel that I (and my company) get more from Ceph than we put in, but it strikes a nerve when there is such negative criticism for what effectively is a free product. Yes, I also suffer from the problem of slow sync writes, but the benefit of being able to shift 1U servers around a Rack/DC compared to a SAS tethered 4U jbod somewhat outweighs that, as well as several other advantages. A new cluster that we are deploying has several hardware choices which go a long way to improve this performance as well. Coupled with the coming Bluestore, the future looks bright. > -----Original Message----- > From: ceph-users [mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org] On Behalf Of > Sage Weil > Sent: 12 April 2016 21:48 > To: Jan Schermer <jan-SB6/BxVxTjHtwjQa/ONI9g@public.gmane.org> > Cc: ceph-devel <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>; ceph-users <ceph- > users-Qp0mS5GaXlQ@public.gmane.org>; ceph-maintainers-Qp0mS5GaXlQ@public.gmane.org > Subject: Re: [ceph-users] Deprecating ext4 support > > On Tue, 12 Apr 2016, Jan Schermer wrote: > > Still the answer to most of your points from me is "but who needs that?" > > Who needs to have exactly the same data in two separate objects > > (replicas)? Ceph needs it because "consistency"?, but the app (VM > > filesystem) is fine with whatever version because the flush didn't > > happen (if it did the contents would be the same). > > If you want a replicated VM store that isn't picky about consistency, try > Sheepdog. Or your mdraid over iSCSI proposal. > > We care about these things because VMs are just one of many users of > rados, and because even if we could get away with being sloppy in some (or > even most) cases with VMs, we need the strong consistency to build other > features people want, like RBD journaling for multi-site async replication.
> > Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that > chose rados for a reason. > > And we want to make sense of an inconsistency when we find one on scrub. > (Does it mean the disk is returning bad data, or we just crashed during a write > a while back?) > > ... > > Cheers- > sage > > _______________________________________________ > ceph-users mailing list > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 36+ messages in thread
[parent not found: <4f0f087c.9Ro.9Gf.hg.1qX1VyMOaD-ImYt9qTNe79BDgjK7y7TUQ@public.gmane.org>]
* Re: Deprecating ext4 support [not found] ` <4f0f087c.9Ro.9Gf.hg.1qX1VyMOaD-ImYt9qTNe79BDgjK7y7TUQ@public.gmane.org> @ 2016-04-12 21:22 ` wido-fspyXLx8qC4 0 siblings, 0 replies; 36+ messages in thread From: wido-fspyXLx8qC4 @ 2016-04-12 21:22 UTC (permalink / raw) To: Nick Fisk; +Cc: ceph-users, ceph-devel, ceph-maintainers-Qp0mS5GaXlQ > On 12 Apr 2016, at 23:09, Nick Fisk <nick-ksME7r3P/wO1Qrn1Bg8BZw@public.gmane.org> wrote: > > Jan, > > I would like to echo Sage's response here. It seems you only want a subset > of what Ceph offers, whereas RADOS is designed to offer a whole lot more, > which requires a lot more intelligence at the lower levels. > I fully agree with your e-mail. I think the Ceph devs have earned their respect over the years and they know what they are talking about. For years I have been wondering why there even was a POSIX filesystem underneath Ceph. > I must say I have found your attitude to both Sage and the Ceph project as a > whole over the last few emails quite disrespectful. I spend a lot of my time > trying to sell the benefits of open source, which centre on the openness of > the idea/code and not around the fact that you can get it for free. One of > the things that I like about open source is the constructive, albeit > sometimes abrupt, criticism that results in a better product. > Simply shouting Ceph is slow and it's because devs don't understand > filesystems is not constructive. > > I've just come back from an expo at ExCel London where many providers are > passionately talking about Ceph. There seems to be a lot of big money > sloshing about for something that is inherently "wrong". > > Sage and the core Ceph team seem like very clever people to me and I trust > that, over the years of development, if they have decided that standard > FS's are not the ideal backing store for Ceph, this is probably the correct > decision.
However I am also aware that the human condition "Can't see the > wood for the trees" is everywhere and I'm sure if you have any clever > insights into filesystem behaviour, the Ceph Dev team would be more than > open to suggestions. > > Personally I wish I could contribute more to the project as I feel that I > (and my company) get more from Ceph than we put in, but it strikes a nerve > when there is such negative criticism for what effectively is a free > product. > > Yes, I also suffer from the problem of slow sync writes, but the benefit of > being able to shift 1U servers around a Rack/DC compared to a SAS tethered > 4U jbod somewhat outweighs that, as well as several other advantages. A new > cluster that we are deploying has several hardware choices which go a long > way to improve this performance as well. Coupled with the coming Bluestore, > the future looks bright. > >> -----Original Message----- >> From: ceph-users [mailto:ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org] On Behalf Of >> Sage Weil >> Sent: 12 April 2016 21:48 >> To: Jan Schermer <jan-SB6/BxVxTjHtwjQa/ONI9g@public.gmane.org> >> Cc: ceph-devel <ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>; ceph-users <ceph- >> users-Qp0mS5GaXlQ@public.gmane.org>; ceph-maintainers-Qp0mS5GaXlQ@public.gmane.org >> Subject: Re: [ceph-users] Deprecating ext4 support >> >>> On Tue, 12 Apr 2016, Jan Schermer wrote: >>> Still the answer to most of your points from me is "but who needs that?" >>> Who needs to have exactly the same data in two separate objects >>> (replicas)? Ceph needs it because "consistency"?, but the app (VM >>> filesystem) is fine with whatever version because the flush didn't >>> happen (if it did the contents would be the same). >> >> If you want a replicated VM store that isn't picky about consistency, try >> Sheepdog. Or your mdraid over iSCSI proposal.
>> >> We care about these things because VMs are just one of many users of >> rados, and because even if we could get away with being sloppy in some (or >> even most) cases with VMs, we need the strong consistency to build other >> features people want, like RBD journaling for multi-site async > replication. >> >> Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that >> chose rados for a reason. >> >> And we want to make sense of an inconsistency when we find one on scrub. >> (Does it mean the disk is returning bad data, or we just crashed during a > write >> a while back?) >> >> ... >> >> Cheers- >> sage >> >> _______________________________________________ >> ceph-users mailing list >> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > _______________________________________________ > ceph-users mailing list > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ^ permalink raw reply [flat|nested] 36+ messages in thread
[parent not found: <362740c3.9Ro.9Gf.hg.1qX1VyMOaE@mailjet.com>]
* Re: [ceph-users] Deprecating ext4 support [not found] ` <362740c3.9Ro.9Gf.hg.1qX1VyMOaE@mailjet.com> @ 2016-04-12 23:12 ` Jan Schermer 2016-04-13 13:13 ` Sage Weil 0 siblings, 1 reply; 36+ messages in thread From: Jan Schermer @ 2016-04-12 23:12 UTC (permalink / raw) To: Nick Fisk; +Cc: Sage Weil, ceph-devel, ceph-users, ceph-maintainers I apologise, I probably should have dialed down a bit. I'd like to personally apologise to Sage, who has been so patient with my ranting. To be clear: We are so lucky to have Ceph. It was something we sorely needed and for the right price (free). It was a dream come true for cloud providers - and it still is. However, working with it in production, spending much time getting to know how ceph works, what it does, and also seeing how and where it fails prompted my interest in where it's going, because big public clouds are one thing, traditional SMB/small enterprise needs are another, and that's where I feel it fails hard. So I tried prodding here on the ML, watched performance talks (which, frankly, reinforced my confirmation bias) and hoped to see some hint of it getting better. That for me equals simpler, faster, not reinventing the wheel. I truly don't see that and it makes me sad. You are talking about the big picture - Ceph for storing anything, a new architecture - and it sounds cool. Given enough money and time it can materialise, I won't elaborate on that. I just hope you don't forget about the measly RBD users like me (I'd guesstimate a silent 90%+ majority, but no idea, hopefully the product manager has a better one) who are frustrated by the current design. I'd like to think I represent those users who used to solve HA with DRBD 10 years ago, who had to battle NFS shares with rsync and inotify scripts, who were the only people on-call every morning at 3AM when logrotate killed their IO, all while having to work with rotting hardware and no budget.
We are still out there and there's nothing for us - RBD is not as fast, simple or reliable as DRBD, the filesystem is not as simple nor as fast as rsync, scrubbing still wakes us at 3AM... I'd very much like Ceph to be my storage system of choice in the future again, which is why I am so vocal with my opinions, and maybe truly selfish with my needs. I have not yet been convinced of the bright future, and - being the sceptical^Wcynical monster I turned into - I expect everything which makes my spidey sense tingle to fail, as it usually does. But that's called confirmation bias, which can make my whole point moot I guess :) Jan > On 12 Apr 2016, at 23:08, Nick Fisk <nick@fisk.me.uk> wrote: > > Jan, > > I would like to echo Sage's response here. It seems you only want a subset > of what Ceph offers, whereas RADOS is designed to offer a whole lot more, > which requires a lot more intelligence at the lower levels. > > I must say I have found your attitude to both Sage and the Ceph project as a > whole over the last few emails quite disrespectful. I spend a lot of my time > trying to sell the benefits of open source, which centre on the openness of > the idea/code and not around the fact that you can get it for free. One of > the things that I like about open source is the constructive, albeit > sometimes abrupt, criticism that results in a better product. > Simply shouting Ceph is slow and it's because devs don't understand > filesystems is not constructive. > > I've just come back from an expo at ExCel London where many providers are > passionately talking about Ceph. There seems to be a lot of big money > sloshing about for something that is inherently "wrong". > > Sage and the core Ceph team seem like very clever people to me and I trust > that, over the years of development, if they have decided that standard > FS's are not the ideal backing store for Ceph, this is probably the correct > decision.
However I am also aware that the human condition "Can't see the > wood for the trees" is everywhere and I'm sure if you have any clever > insights into filesystem behaviour, the Ceph Dev team would be more than > open to suggestions. > > Personally I wish I could contribute more to the project as I feel that I > (and my company) get more from Ceph than we put in, but it strikes a nerve > when there is such negative criticism for what effectively is a free > product. > > Yes, I also suffer from the problem of slow sync writes, but the benefit of > being able to shift 1U servers around a Rack/DC compared to a SAS tethered > 4U jbod somewhat outweighs that, as well as several other advantages. A new > cluster that we are deploying has several hardware choices which go a long > way to improve this performance as well. Coupled with the coming Bluestore, > the future looks bright. > >> -----Original Message----- >> From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of >> Sage Weil >> Sent: 12 April 2016 21:48 >> To: Jan Schermer <jan@schermer.cz> >> Cc: ceph-devel <ceph-devel@vger.kernel.org>; ceph-users <ceph- >> users@ceph.com>; ceph-maintainers@ceph.com >> Subject: Re: [ceph-users] Deprecating ext4 support >> >> On Tue, 12 Apr 2016, Jan Schermer wrote: >>> Still the answer to most of your points from me is "but who needs that?" >>> Who needs to have exactly the same data in two separate objects >>> (replicas)? Ceph needs it because "consistency"?, but the app (VM >>> filesystem) is fine with whatever version because the flush didn't >>> happen (if it did the contents would be the same). >> >> If you want a replicated VM store that isn't picky about consistency, try >> Sheepdog. Or your mdraid over iSCSI proposal.
>> >> We care about these things because VMs are just one of many users of >> rados, and because even if we could get away with being sloppy in some (or >> even most) cases with VMs, we need the strong consistency to build other >> features people want, like RBD journaling for multi-site async >> replication. >> >> Then there's the CephFS MDS, RGW, and a pile of out-of-tree users that >> chose rados for a reason. >> >> And we want to make sense of an inconsistency when we find one on scrub. >> (Does it mean the disk is returning bad data, or did we just crash during a >> write a while back?) >> >> ... >> >> Cheers- >> sage >> >> _______________________________________________ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 23:12 ` [ceph-users] " Jan Schermer @ 2016-04-13 13:13 ` Sage Weil 0 siblings, 0 replies; 36+ messages in thread From: Sage Weil @ 2016-04-13 13:13 UTC (permalink / raw) To: Jan Schermer; +Cc: Nick Fisk, ceph-devel, ceph-users On Wed, 13 Apr 2016, Jan Schermer wrote: > I apologise, I probably should have dialed down a bit. > I'd like to personally apologise to Sage, for being so patient with my > ranting. No worries :) > I just hope you don't forget about the measly RBD users like me (I'd > guesstimate a silent 90%+ majority, but no idea, hopefully the product > manager has a better one) who are frustrated from the current design. Don't worry: RBD users are a pretty clear #1 as far as where our current priorities are, and driving most of the decisions we make in RADOS. They're just not the only priorities. Cheers- sage
* Re: [ceph-users] Deprecating ext4 support 2016-04-12 20:47 ` Sage Weil [not found] ` <alpine.DEB.2.11.1604121639590.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org> [not found] ` <362740c3.9Ro.9Gf.hg.1qX1VyMOaE@mailjet.com> @ 2016-04-13 13:06 ` Sage Weil 2016-04-14 18:05 ` Jianjian Huo 2 siblings, 1 reply; 36+ messages in thread From: Sage Weil @ 2016-04-13 13:06 UTC (permalink / raw) To: Jan Schermer; +Cc: ceph-devel, ceph-users On Tue, 12 Apr 2016, Jan Schermer wrote: > Who needs to have exactly the same data in two separate objects > (replicas)? Ceph needs it because "consistency"?, but the app (VM > filesystem) is fine with whatever version because the flush didn't > happen (if it did the contents would be the same). While we're talking/thinking about this, here's a simple example of why the simple solution (let the replicas be out of sync), which seems reasonable at first, can blow up in your face. If a disk block contains A and you write B over the top of it and then there is a failure (e.g. power loss before you issue a flush), it's okay for the disk to contain either A or B. In a replicated system, let's say 2x mirroring (call them R1 and R2), you might end up with B on R1 and A on R2. If you don't immediately clean it up, then at some point down the line you might switch from reading R1 to reading R2 and the disk block will go "back in time" (previously you read B, now you read A). A single disk/replica will never do that, and applications can break. For example, if the block in question is a journal block, we might see B the first time (valid journal!), then do a bunch of work and journal/write new stuff to the blocks that follow. Then we lose power again, lose R1, replay the journal, read A from R2, and stop journal replay early... missing out on all the new stuff. This can easily corrupt a file system or database or whatever else. 
It might sound unlikely, but keep in mind that writes to these all-important metadata and commit blocks are extremely frequent. It's the kind of thing you can usually get away with, until you don't, and then you have a very bad day... sage
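[Editor's note: the "back in time" failure Sage describes above can be sketched as a toy model. This is purely illustrative Python, not Ceph code; the `Replica` class and `write_unsynced` helper are invented for the example.]

```python
# Toy model of the example: one logical disk block mirrored on two
# replicas. A write reaches R1 but power is lost before R2 is updated.
# If the system does not repair the mismatch, a later read served by R2
# makes the block appear to go "back in time" (B, then A again) --
# something a single disk never does, and which can truncate a journal
# replay as described above.

class Replica:
    def __init__(self, name, block):
        self.name = name
        self.block = block

def write_unsynced(r1, r2, data, crash_before_r2=False):
    """Apply a write to both mirrors; on power loss, only R1 sees it."""
    r1.block = data
    if not crash_before_r2:
        r2.block = data

r1 = Replica("R1", "A")
r2 = Replica("R2", "A")

write_unsynced(r1, r2, "B", crash_before_r2=True)  # power loss mid-write

reads = [r1.block,  # first read served by R1: sees B (valid journal!)
         r2.block]  # later, R1 is lost; read served by R2: sees A again

assert reads == ["B", "A"]  # the block went "back in time"
```

The point of the immediate cleanup Ceph performs is exactly to rule out the final assertion here: after recovery, both mirrors must agree before either is allowed to serve reads.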
* Re: [ceph-users] Deprecating ext4 support 2016-04-13 13:06 ` Sage Weil @ 2016-04-14 18:05 ` Jianjian Huo 2016-04-14 18:30 ` Samuel Just 0 siblings, 1 reply; 36+ messages in thread From: Jianjian Huo @ 2016-04-14 18:05 UTC (permalink / raw) To: Sage Weil; +Cc: Jan Schermer, ceph-devel, ceph-users On Wed, Apr 13, 2016 at 6:06 AM, Sage Weil <sage@newdream.net> wrote: > On Tue, 12 Apr 2016, Jan Schermer wrote: >> Who needs to have exactly the same data in two separate objects >> (replicas)? Ceph needs it because "consistency"?, but the app (VM >> filesystem) is fine with whatever version because the flush didn't >> happen (if it did the contents would be the same). > > While we're talking/thinking about this, here's a simple example of why > the simple solution (let the replicas be out of sync), which seems > reasonable at first, can blow up in your face. > > If a disk block contains A and you write B over the top of it and then > there is a failure (e.g. power loss before you issue a flush), it's okay > for the disk to contain either A or B. In a replicated system, let's say > 2x mirroring (call them R1 and R2), you might end up with B on R1 and A > on R2. If you don't immediately clean it up, then at some point down the > line you might switch from reading R1 to reading R2 and the disk block > will go "back in time" (previously you read B, now you read A). A > single disk/replica will never do that, and applications can break. > > For example, if the block in question is a journal block, we might see B > the first time (valid journal!), then do a bunch of work and > journal/write new stuff to the blocks that follow. Then we lose > power again, lose R1, replay the journal, read A from R2, and stop journal > replay early... missing out on all the new stuff. This can easily corrupt > a file system or database or whatever else. If data is critical, applications use their own replicas (MySQL, Cassandra, MongoDB...); 
if the above scenario happens and one replica is out of sync, they use a quorum-like protocol to guarantee reading the latest data, and repair those out-of-sync replicas. So is eventual consistency in storage acceptable for them? Jianjian > > It might sound unlikely, but keep in mind that writes to these > all-important metadata and commit blocks are extremely frequent. It's the > kind of thing you can usually get away with, until you don't, and then you > have a very bad day... > > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
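[Editor's note: the quorum-read-plus-read-repair idea Jianjian refers to (as used by Dynamo-style stores such as Cassandra) can be sketched as follows. This is a simplified illustration, not the API of any of the named systems; `quorum_read` and the version-tagged replica records are invented for the example.]

```python
# Sketch of quorum read with read repair: each replica stores a
# (version, value) pair; a read consults all replicas, trusts the
# highest version, and repairs any replica that lags. This is how
# eventually consistent stores avoid serving the stale "A" from the
# out-of-sync replica in Sage's example.

def quorum_read(replicas):
    """Return the value with the highest version, repairing stragglers."""
    latest = max(replicas, key=lambda r: r["version"])
    for r in replicas:
        if r["version"] < latest["version"]:      # stale copy found
            r["version"] = latest["version"]       # read repair: bring it
            r["value"] = latest["value"]           # up to date in passing
    return latest["value"]

replicas = [{"version": 2, "value": "B"},   # received the last write
            {"version": 1, "value": "A"}]   # missed it (power loss)

assert quorum_read(replicas) == "B"              # never serves stale "A"
assert all(r["value"] == "B" for r in replicas)  # repaired as a side effect
```

Note this pushes the versioning and repair burden up into the application, which is precisely what RADOS does below the application instead, so that plain block and file consumers (a VM filesystem, for instance) that have no such protocol still get a disk that never goes backwards.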
* Re: [ceph-users] Deprecating ext4 support 2016-04-14 18:05 ` Jianjian Huo @ 2016-04-14 18:30 ` Samuel Just 0 siblings, 0 replies; 36+ messages in thread From: Samuel Just @ 2016-04-14 18:30 UTC (permalink / raw) To: Jianjian Huo; +Cc: Sage Weil, Jan Schermer, ceph-devel, ceph-users It doesn't seem like it would be wise to run such systems on top of rbd. -Sam On Thu, Apr 14, 2016 at 11:05 AM, Jianjian Huo <samuel.huo@gmail.com> wrote: > If data is critical, applications use their own replicas (MySQL, > Cassandra, MongoDB...); if the above scenario happens and one replica is out > of sync, they use a quorum-like protocol to guarantee reading the latest > data, and repair those out-of-sync replicas. So is eventual consistency > in storage acceptable for them? > > Jianjian
* Re: [Ceph-maintainers] Deprecating ext4 support 2016-04-11 21:39 Deprecating ext4 support Sage Weil [not found] ` <alpine.DEB.2.11.1604111632520.13448-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org> 2016-04-11 21:44 ` Sage Weil @ 2016-04-12 6:39 ` Loic Dachary 2016-04-13 14:19 ` [ceph-users] " Francois Lafont 3 siblings, 0 replies; 36+ messages in thread From: Loic Dachary @ 2016-04-12 6:39 UTC (permalink / raw) To: Sage Weil, ceph-devel, ceph-users, ceph-maintainers, ceph-announce Hi Sage, I suspect most people nowadays run tests and develop on ext4. Not supporting ext4 in the future means we'll need to find a convenient way for developers to run tests against the supported file systems. My 2cts :-) On 11/04/2016 23:39, Sage Weil wrote: > Hi, > > ext4 has never been recommended, but we did test it. After Jewel is out, > we would like to explicitly recommend *against* ext4 and stop testing it. > > Why: > > Recently we discovered an issue with the long object name handling that is > not fixable without rewriting a significant chunk of FileStore's filename > handling. (There is a limit in the amount of xattr data ext4 can store in > the inode, which causes problems in LFNIndex.) > > We *could* invest a ton of time rewriting this to fix it, but it only affects > ext4, which we never recommended, and we plan to deprecate FileStore once > BlueStore is stable anyway, so it seems like a waste of time that would be > better spent elsewhere. > > Also, by dropping ext4 test coverage in ceph-qa-suite, we can > significantly improve time/coverage for FileStore on XFS and on BlueStore. > > The long file name handling is problematic anytime someone is storing > rados objects with long names. The primary user that does this is RGW, > which means any RGW cluster using ext4 should recreate their OSDs to use > XFS. Other librados users could be affected too, though, like users > with very long rbd image names (e.g., > 100 characters), or custom > librados users. 
> > How: > > To make this change as visible as possible, the plan is to make ceph-osd > refuse to start if the backend is unable to support the configured max > object name (osd_max_object_name_len). The OSD will complain that ext4 > cannot store such an object and refuse to start. A user who is only using > RBD might decide they don't need long file names to work and can adjust > the osd_max_object_name_len setting to something small (say, 64) and run > successfully. They would be taking a risk, though, because we would like > to stop testing on ext4. > > Is this reasonable? If there are significant ext4 users who are unwilling to > recreate their OSDs, now would be the time to speak up. > > Thanks! > sage > > _______________________________________________ > Ceph-maintainers mailing list > Ceph-maintainers@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-maintainers-ceph.com > -- Loïc Dachary, Artisan Logiciel Libre
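[Editor's note: the startup check quoted in the "How" section above can be sketched as follows. This is an illustration of the idea only; the real check lives inside ceph-osd (C++), and the per-filesystem limits and function names below are invented for the example.]

```python
# Sketch of the proposed behavior: probe the longest object name the
# backing filesystem can store and refuse to start the OSD if it is
# shorter than the configured osd_max_object_name_len. ext4's limited
# in-inode xattr space is what breaks FileStore's long-filename
# (LFNIndex) handling.

def backend_max_name_len(fs_type):
    # Hypothetical limits for illustration -- not the real numbers.
    return {"xfs": 4096, "btrfs": 4096, "ext4": 256}[fs_type]

def osd_startup_check(fs_type, osd_max_object_name_len=2048):
    limit = backend_max_name_len(fs_type)
    if limit < osd_max_object_name_len:
        raise SystemExit(
            f"ERROR: {fs_type} can only store object names up to "
            f"{limit} bytes, but osd_max_object_name_len is "
            f"{osd_max_object_name_len}; refusing to start")

osd_startup_check("xfs")    # fine: backend supports the configured max

try:
    osd_startup_check("ext4")          # complains and refuses to start
except SystemExit as e:
    print(e)

# An RBD-only user accepting the risk could lower the configured max:
osd_startup_check("ext4", osd_max_object_name_len=64)
```

The key design point is that the failure is loud and immediate at startup, rather than a surprising error later when RGW (or any librados user) first tries to write an object with a long name.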
* Re: [ceph-users] Deprecating ext4 support 2016-04-11 21:39 Deprecating ext4 support Sage Weil ` (2 preceding siblings ...) 2016-04-12 6:39 ` [Ceph-maintainers] " Loic Dachary @ 2016-04-13 14:19 ` Francois Lafont 3 siblings, 0 replies; 36+ messages in thread From: Francois Lafont @ 2016-04-13 14:19 UTC (permalink / raw) To: Sage Weil, ceph-devel, ceph-users, ceph-maintainers, ceph-announce Hello, On 11/04/2016 23:39, Sage Weil wrote: > [...] Is this reasonable? [...] Warning: I'm just a ceph user and definitely a non-expert user. 1. Personally, if you read the documentation, the mailing list and/or IRC a little, it seems _clear_ to me that ext4 is not recommended, even if the opposite is mentioned sometimes (personally I don't use ext4 in my ceph cluster; I use xfs as the doc says). 2. I'm not a ceph expert but I can imagine the monstrous work that the development of software such as ceph represents, and I think it can be reasonable sometimes to limit that work when possible. So making ext4 deprecated seems reasonable to me. I think the comfort of the users is important but, in the _long_ term, it seems important to me that the developers can concentrate their work on the important things. -- François Lafont
end of thread, other threads:[~2016-04-14 18:30 UTC | newest]

Thread overview: 36+ messages
2016-04-11 21:39 Deprecating ext4 support Sage Weil
[not found] ` <alpine.DEB.2.11.1604111632520.13448-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
2016-04-11 21:42 ` Allen Samuels
2016-04-11 21:47 ` [ceph-users] " Jan Schermer
2016-04-11 23:39 ` Christian Balzer
2016-04-12 1:12 ` [ceph-users] " Sage Weil
[not found] ` <alpine.DEB.2.11.1604112046570.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
2016-04-12 1:32 ` Shinobu Kinjo
2016-04-12 2:05 ` [Ceph-maintainers] " hp cre
2016-04-12 2:43 ` [ceph-users] " Christian Balzer
2016-04-12 13:56 ` Sage Weil
[not found] ` <alpine.DEB.2.11.1604120837120.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
2016-04-13 3:27 ` Christian Balzer
[not found] ` <20160412083925.5106311d-9yhXNL7Kh0lSCLKNlHTxZM8NsWr+9BEh@public.gmane.org>
2016-04-14 9:43 ` Antw: " Steffen Weißgerber
2016-04-12 7:00 ` Michael Metz-Martini | SpeedPartner GmbH
2016-04-13 2:29 ` [ceph-users] " Christian Balzer
2016-04-13 12:30 ` Sage Weil
2016-04-14 0:57 ` Christian Balzer
2016-04-13 12:51 ` Michael Metz-Martini | SpeedPartner GmbH
2016-04-11 21:44 ` Sage Weil
2016-04-11 21:57 ` Mark Nelson
[not found] ` <570C1DBC.3040408-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2016-04-11 22:49 ` Shinobu Kinjo
2016-04-11 23:54 ` [ceph-users] " Robin H. Johnson
2016-04-11 23:09 ` Lionel Bouton
2016-04-12 7:45 ` [ceph-users] " Jan Schermer
2016-04-12 18:00 ` Sage Weil
2016-04-12 19:19 ` Jan Schermer
2016-04-12 19:58 ` Sage Weil
2016-04-12 20:33 ` Jan Schermer
2016-04-12 20:47 ` Sage Weil
[not found] ` <alpine.DEB.2.11.1604121639590.29593-Wo5lQnKln9t9PHm/lf2LFUEOCMrvLtNR@public.gmane.org>
2016-04-12 21:08 ` Nick Fisk
[not found] ` <4f0f087c.9Ro.9Gf.hg.1qX1VyMOaD-ImYt9qTNe79BDgjK7y7TUQ@public.gmane.org>
2016-04-12 21:22 ` wido-fspyXLx8qC4
[not found] ` <362740c3.9Ro.9Gf.hg.1qX1VyMOaE@mailjet.com>
2016-04-12 23:12 ` [ceph-users] " Jan Schermer
2016-04-13 13:13 ` Sage Weil
2016-04-13 13:06 ` Sage Weil
2016-04-14 18:05 ` Jianjian Huo
2016-04-14 18:30 ` Samuel Just
2016-04-12 6:39 ` [Ceph-maintainers] " Loic Dachary
2016-04-13 14:19 ` [ceph-users] " Francois Lafont