* Best way to store billions of files
From: Roland Rabben @ 2010-08-01 12:08 UTC
  To: ceph-devel

I am researching alternatives to GlusterFS, which I am currently using.
My need is to store billions of files (big and small), and I am trying
to find out if there are any considerations I should make when
planning folder structure and server config using Ceph.

On my GlusterFS system things seem to slow down dramatically as I
grow the number of files. A simple ls takes forever. So I am looking
for alternatives.

Right now my folder structure looks like this:

Users are grouped into folders named /000, /001, ... /999, using a hash.
Each user has its own folder inside the numbered folders.
Inside each user folder, the user's files are stored in folders named
/000, /001, ... /999, also using a hash.
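
In pseudo-code, a file's path is built roughly like this (just an
illustrative sketch -- the hash below is a stand-in for the one we
actually use):

import hashlib

def bucket(key):
    # Map a key to one of the 1000 folders /000 ... /999.
    # MD5 is only an example; the real hash may differ.
    return "%03d" % (int(hashlib.md5(key.encode()).hexdigest(), 16) % 1000)

def path_for(user_id, file_name):
    # /<hash(user)>/<user>/<hash(file)>/<file>
    return "/%s/%s/%s/%s" % (bucket(user_id), user_id,
                             bucket(file_name), file_name)

# path_for("alice", "photo.jpg") -> "/NNN/alice/MMM/photo.jpg"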

Would this folder structure or the number of files become a problem using Ceph?

I generally use 4U storage nodes with 36 x 1.5 TB or 2 TB SATA drives,
an 8-core CPU, and 6 GB of RAM. My application is write-once, read-many.
What recommendations would you give with regard to setting up the
filesystems on the storage nodes? ext3? ext4? LVM? RAID?

Today I am mounting all disks as individual ext3 partitions and tying
them together with GlusterFS. Would this work with Ceph, or would you
recommend making one large LVM volume on each storage node and
exposing that to Ceph?

I know Ceph is not production-ready yet, but from the activity on this
mailing list things look promising.

Best regards
Roland Rabben


* Re: Best way to store billions of files
From: Gregory Farnum @ 2010-08-01 16:17 UTC
  To: Roland Rabben; +Cc: ceph-devel

On Sun, Aug 1, 2010 at 5:08 AM, Roland Rabben <roland@jotta.no> wrote:
> I know Ceph is not production-ready yet, but from the activity on this
> mailing list things look promising.
As you note, Ceph is definitely not production-ready yet. Part of this
means that its testing in large-scale environments is limited, so
there may be bugs or unexpected behaviors. That said:

> I am researching alternatives to GlusterFS, which I am currently using.
> My need is to store billions of files (big and small), and I am trying
> to find out if there are any considerations I should make when
> planning folder structure and server config using Ceph.
>
> On my GlusterFS system things seem to slow down dramatically as I
> grow the number of files. A simple ls takes forever. So I am looking
> for alternatives.
>
> Right now my folder structure looks like this:
>
> Users are grouped into folders named /000, /001, ... /999, using a hash.
> Each user has its own folder inside the numbered folders.
> Inside each user folder, the user's files are stored in folders named
> /000, /001, ... /999, also using a hash.
>
> Would this folder structure or the number of files become a problem using Ceph?
This structure *should* definitely be okay for Ceph -- it stores
dentries as part of the containing inode, and the metadata servers
make extensive use of an in-memory cache, so an ls will generally
require either zero or one on-disk lookups.
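
To illustrate the idea (a toy model only, not Ceph's actual data
structures): because a directory's entries are stored together, listing
it costs at most one fetch, and zero once it's cached.

# Toy model of "dentries stored with the directory" -- not Ceph code.
class ToyMDS:
    def __init__(self, disk):
        self.disk = disk        # dir path -> {name: inode-ish metadata}
        self.cache = {}         # in-memory copy of recently used dirs
        self.disk_reads = 0

    def ls(self, dirpath):
        if dirpath not in self.cache:
            # One read pulls in *all* entries of the directory at once.
            self.cache[dirpath] = self.disk[dirpath]
            self.disk_reads += 1
        return sorted(self.cache[dirpath])

mds = ToyMDS({"/000/alice": {"photo.jpg": {}, "notes.txt": {}}})
mds.ls("/000/alice")   # first ls: one disk read
mds.ls("/000/alice")   # repeat ls: served entirely from cache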

> I generally use 4U storage nodes with 36 x 1.5 TB or 2 TB SATA drives,
> an 8-core CPU, and 6 GB of RAM. My application is write-once, read-many.
> What recommendations would you give with regard to setting up the
> filesystems on the storage nodes? ext3? ext4? LVM? RAID?
Again, going back to the limited testing, I don't think anybody knows
what the best configuration will be with such disk-heavy nodes. But
you'll definitely want to run btrfs on the storage nodes, as it
supports a number of features that let Ceph run faster under some
circumstances and recover more reliably.
Speculating on the best-performing configuration is hard without
knowing your usage patterns, but given the limited memory I would
probably create 2 or 3 btrfs volumes across all the disks (reserving
one extra disk per volume to use as a journal), and then run one OSD
per volume (with an appropriately configured CRUSH map to prevent
replicating data onto the same node!). If you can expand the memory
above 12 GB, I'd stuff the machine full and then run one OSD per 2 GB
(or maybe 1 GB, though your caching will be weaker), partitioning the
drives with btrfs accordingly.
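
Back-of-the-envelope, for a 36-drive box that works out to something
like this (just arithmetic on the numbers above, not a tested
configuration):

# Rough arithmetic for a 36-drive node -- not a tested configuration.
def layout(total_disks, num_osds, ram_gb):
    journal_disks = num_osds  # one journal disk per OSD/btrfs volume
    data_disks_per_volume = (total_disks - journal_disks) // num_osds
    return {
        "osds": num_osds,
        "data disks per btrfs volume": data_disks_per_volume,
        "journal disks": journal_disks,
        "RAM per OSD (GB)": float(ram_gb) / num_osds,
    }

print(layout(36, 3, 6))    # 3 OSDs: 11 data disks + 1 journal each, 2 GB RAM per OSD
print(layout(36, 12, 24))  # with more RAM: 12 OSDs, 2 data disks + 1 journal each, 2 GB per OSD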

> Today I am mounting all disks as individual ext3 partitions and tying
> them together with GlusterFS. Would this work with Ceph, or would you
> recommend making one large LVM volume on each storage node and
> exposing that to Ceph?
This would be a bad idea with Ceph; you'll need to combine the disks
into logical volumes (as I said, btrfs can do this). The reason is
that Ceph can only handle one data directory per OSD instance, and you
don't want to stuff 36 OSDs into 8 cores and 6 GB of RAM. :)
-Greg


* Re: Best way to store billions of files
From: Roland Rabben @ 2010-08-01 18:23 UTC
  To: Gregory Farnum; +Cc: ceph-devel

Thanks. Copying the response back to the list.

Roland

2010/8/1 Gregory Farnum <gregf@hq.newdream.net>:
> On Sun, Aug 1, 2010 at 11:02 AM, Roland Rabben <roland@jotta.no> wrote:
>> Great. I'll have a look at BTRFS. Any drawbacks with BTRFS? It looks
>> pretty young.
> It is pretty young, but we expect it'll be ready (at least for
> replicated storage) as soon as Ceph is. :)
>
>> So if I understand you correctly: use BTRFS to combine the disks into
>> logical volumes. Perhaps 3 logical volumes across 12 disks each? Then
>> run 3 OSDs, each with 4 GB of RAM.
> Well, actually you'd want to do 3 logical volumes across 11 disks
> each, and save one disk per OSD instance to provide a journaling
> device.
>
>> 6 logical volumes across 6 disks each. Then run 6 OSDs with 2 GB of RAM each.
> We don't really have performance data to determine which of these
> setups will be better for you; you'd have to experiment. Each OSD
> daemon will take up between 200 and 800 MB of RAM to do its work, but
> any extra will be used by the kernel to cache file data, and depending
> on your workload that can be a serious performance advantage.
> It's not like you need to manually partition the RAM or anything, though!
>
>> Does BTRFS handle the failure of a disk in a logical volume? Any
>> RAID 5-like features, where it could continue running with a failed
>> disk and rebuild once the failed disk is replaced?
> Hmm, I don't know. I'm sure somebody on the list does though, if you
> want to move the discussion back on-list. :) (We don't get enough
> traffic to need discussions to stay off-list for traffic reasons or
> anything, and if you keep it on-list Sage [lead developer] will see it
> all.)
>
>> Any performance gains with a larger number of disks in a logical BTRFS volume?
> Not sure. I think btrfs can stripe across disks, but depending on your
> network connection that's more likely to be the limiting factor. :)
> -Greg
>
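
Plugging that 200-800 MB per daemon into our 6 GB nodes, the rough
budget looks like this (just arithmetic; the cache numbers will
obviously depend on workload):

# Rough RAM budget per node, using the 200-800 MB per OSD daemon figure above.
def ram_budget(ram_gb, num_osds, per_osd_mb=(200, 800)):
    lo, hi = (num_osds * m / 1024.0 for m in per_osd_mb)
    return {
        "daemons (GB)": (round(lo, 1), round(hi, 1)),
        "left for page cache (GB)": (round(ram_gb - hi, 1), round(ram_gb - lo, 1)),
    }

print(ram_budget(6, 3))  # 3 OSDs: 0.6-2.3 GB for daemons, ~3.7-5.4 GB left for cache
print(ram_budget(6, 6))  # 6 OSDs: 1.2-4.7 GB for daemons, as little as ~1.3 GB of cache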



-- 
Roland Rabben
Founder & CEO Jotta AS
Cell: +47 90 85 85 39
Phone: +47 21 04 29 00
Email: roland@jotta.no


* Re: Best way to store billions of files
From: Anton VG. @ 2010-08-24  8:53 UTC
  To: Roland Rabben; +Cc: ceph-devel

Just got rid of the sh... named Gluster after 1.5 years of being nervous
about its glitches, and switched to a Linux software RAID 6 array for the
time being... while Ceph stabilizes...

On Sunday 01 August 2010 17:08:49 Roland Rabben wrote:
> [...]

