* rbd kernel block driver memory usage
@ 2023-01-25 16:57 Stefan Hajnoczi
  2023-01-26 13:48 ` Ilya Dryomov
From: Stefan Hajnoczi @ 2023-01-25 16:57 UTC (permalink / raw)
  To: Ilya Dryomov, Dongsheng Yang
  Cc: ceph-devel, vromanso, kwolf, mimehta, acardace

Hi,
What sort of memory usage is expected under heavy I/O to an rbd block
device with O_DIRECT?

For example:
- Page cache: none (O_DIRECT)
- Socket snd/rcv buffers: yes
- Internal rbd buffers?

I am trying to understand how Linux rbd block devices compare to local
block devices (like NVMe PCI) in terms of memory consumption.

Thanks,
Stefan


* Re: rbd kernel block driver memory usage
  2023-01-25 16:57 rbd kernel block driver memory usage Stefan Hajnoczi
@ 2023-01-26 13:48 ` Ilya Dryomov
  2023-01-26 14:36   ` Stefan Hajnoczi
From: Ilya Dryomov @ 2023-01-26 13:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Dongsheng Yang, ceph-devel, vromanso, kwolf, mimehta, acardace

On Wed, Jan 25, 2023 at 5:57 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> Hi,
> What sort of memory usage is expected under heavy I/O to an rbd block
> device with O_DIRECT?
>
> For example:
> - Page cache: none (O_DIRECT)
> - Socket snd/rcv buffers: yes

Hi Stefan,

There is a socket open to each OSD (object storage daemon).  A Ceph
cluster may have tens, hundreds or even thousands of OSDs (although the
latter is rare -- usually folks end up with several smaller clusters
instead of a single large cluster).  Under heavy random I/O and given
a big enough RBD image, it's reasonable to assume that most if not all
OSDs would be involved and therefore their sessions would be active.

A thing to note is that, by default, OSD sessions are shared between
RBD devices.  So as long as all RBD images that are mapped on a node
belong to the same cluster, the same set of sockets would be used.

Idle OSD sockets get closed after 60 seconds of inactivity.
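
As a rough illustration of what that can mean for socket buffer memory
(this is not driver code -- the buffer maxima below are placeholders,
the real limits come from net.ipv4.tcp_rmem/tcp_wmem on the client
node), a back-of-envelope upper bound in Python:

# Worst-case socket buffer memory for the kernel RBD client's OSD
# sessions: one TCP socket per OSD, shared across all images mapped
# from the same cluster.  Buffer maxima are illustrative placeholders.
def osd_socket_buffer_bytes(num_osds,
                            rcv_max=6 * 1024 * 1024,   # ~tcp_rmem upper limit
                            snd_max=4 * 1024 * 1024):  # ~tcp_wmem upper limit
    """Upper bound assuming every OSD session is active with full buffers."""
    return num_osds * (rcv_max + snd_max)

for osds in (10, 100, 1000):
    mib = osd_socket_buffer_bytes(osds) / 2**20
    print(f"{osds:5d} OSDs -> up to {mib:.0f} MiB of socket buffers")

In practice TCP autotuning keeps most sockets well below the maximum,
so treat this strictly as a ceiling.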


> - Internal rbd buffers?
>
> I am trying to understand how Linux rbd block devices compare to local
> block devices (like NVMe PCI) in terms of memory consumption.

RBD doesn't do any internal buffering.  Data is read from/written to
the wire directly to/from BIO pages.  The only exception to that is the
"secure" mode -- built-in encryption for Ceph on-the-wire protocol.  In
that case the data is buffered, partly because RBD obviously can't mess
with plaintext data in the BIO and partly because the Linux kernel
crypto API isn't flexible enough.

There is some memory overhead associated with each I/O (OSD request
metadata encoding, mostly).  It's surely larger than in the NVMe PCI
case.  I don't have the exact number but it should be less than 4K per
I/O in almost all cases.  This memory is coming out of private SLAB
caches and could be reclaimable had we set SLAB_RECLAIM_ACCOUNT on
them.
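
To put an illustrative number on that overhead (a sketch only -- it
just multiplies the "less than 4K per I/O" ceiling above by the number
of requests in flight, e.g. the rbd device's block queue depth):

# Illustrative upper bound on per-I/O metadata overhead; 4 KiB is the
# rough ceiling mentioned above, not a measured constant.
PER_IO_METADATA_MAX = 4 * 1024  # bytes

def metadata_overhead_bytes(in_flight_ios):
    return in_flight_ios * PER_IO_METADATA_MAX

for qd in (32, 128, 1024):
    kib = metadata_overhead_bytes(qd) / 1024
    print(f"queue depth {qd:4d} -> up to {kib:.0f} KiB of request metadata")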

Thanks,

                Ilya


* Re: rbd kernel block driver memory usage
  2023-01-26 13:48 ` Ilya Dryomov
@ 2023-01-26 14:36   ` Stefan Hajnoczi
  2023-01-26 15:49     ` Anthony D'Atri
  2023-01-26 18:14     ` Maged Mokhtar
From: Stefan Hajnoczi @ 2023-01-26 14:36 UTC (permalink / raw)
  To: Ilya Dryomov
  Cc: Dongsheng Yang, ceph-devel, vromanso, kwolf, mimehta, acardace

On Thu, Jan 26, 2023 at 02:48:27PM +0100, Ilya Dryomov wrote:
> On Wed, Jan 25, 2023 at 5:57 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > Hi,
> > What sort of memory usage is expected under heavy I/O to an rbd block
> > device with O_DIRECT?
> >
> > For example:
> > - Page cache: none (O_DIRECT)
> > - Socket snd/rcv buffers: yes
> 
> Hi Stefan,
> 
> There is a socket open to each OSD (object storage daemon).  A Ceph
> cluster may have tens, hundreds or even thousands of OSDs (although the
> latter is rare -- usually folks end up with several smaller clusters
> instead of a single large cluster).  Under heavy random I/O and given
> a big enough RBD image, it's reasonable to assume that most if not all
> OSDs would be involved and therefore their sessions would be active.
> 
> A thing to note is that, by default, OSD sessions are shared between
> RBD devices.  So as long as all RBD images that are mapped on a node
> belong to the same cluster, the same set of sockets would be used.
> 
> Idle OSD sockets get closed after 60 seconds of inactivity.
> 
> 
> > - Internal rbd buffers?
> >
> > I am trying to understand how Linux rbd block devices compare to local
> > block devices (like NVMe PCI) in terms of memory consumption.
> 
> RBD doesn't do any internal buffering.  Data is read from/written to
> the wire directly to/from BIO pages.  The only exception to that is the
> "secure" mode -- built-in encryption for Ceph on-the-wire protocol.  In
> that case the data is buffered, partly because RBD obviously can't mess
> with plaintext data in the BIO and partly because the Linux kernel
> crypto API isn't flexible enough.
> 
> There is some memory overhead associated with each I/O (OSD request
> metadata encoding, mostly).  It's surely larger than in the NVMe PCI
> case.  I don't have the exact number but it should be less than 4K per
> I/O in almost all cases.  This memory is coming out of private SLAB
> caches and could be reclaimable had we set SLAB_RECLAIM_ACCOUNT on
> them.

Thanks, this information is very useful. I was trying to get a sense of
whether to look deeper into the rbd driver in an OOM kill scenario.

Stefan


* Re: rbd kernel block driver memory usage
  2023-01-26 14:36   ` Stefan Hajnoczi
@ 2023-01-26 15:49     ` Anthony D'Atri
  2023-01-27  9:58       ` Ilya Dryomov
  2023-01-26 18:14     ` Maged Mokhtar
From: Anthony D'Atri @ 2023-01-26 15:49 UTC (permalink / raw)
  To: ceph-devel

>> 
>> There is a socket open to each OSD (object storage daemon).

I’ve always understood that there were *two* to each OSD, was I misinformed?

>>  A Ceph cluster may have tens, hundreds or even thousands of OSDs (although the
>> latter is rare -- usually folks end up with several smaller clusters
>> instead of a single large cluster).

… though if a client has multiple RBD volumes attached, it may be talking to more than one cluster.  I’ve seen a client exhaust the file descriptor limit on a hypervisor doing this after a cluster expansion.

>> A thing to note is that, by default, OSD sessions are shared between
>> RBD devices.  So as long as all RBD images that are mapped on a node
>> belong to the same cluster, the same set of sockets would be used.

Before … Luminous was it? AIUI they weren’t pooled, so older releases may have higher consumption.

* Re: rbd kernel block driver memory usage
  2023-01-26 14:36   ` Stefan Hajnoczi
  2023-01-26 15:49     ` Anthony D'Atri
@ 2023-01-26 18:14     ` Maged Mokhtar
  2023-01-26 21:51       ` Stefan Hajnoczi
From: Maged Mokhtar @ 2023-01-26 18:14 UTC (permalink / raw)
  To: Stefan Hajnoczi, Ilya Dryomov
  Cc: Dongsheng Yang, ceph-devel, vromanso, kwolf, mimehta, acardace

In the case of the object map, which the driver loads, it takes 2 bits per
4 MB of image size. A 16 TB image requires 1 MB of memory.

>> I was trying to get a sense of whether to look deeper into the rbd driver in an OOM kill scenario.

If you are looking into OOM, maybe look into lowering queue_depth, which you can specify when you map the image. Technically it belongs to the block layer queue rather than the rbd driver itself. If you write with a 4 MB block size and your queue_depth is 1000, you need 4 GB of memory for in-flight data for that single image; if you have many images it could add up. A quick sketch of both estimates is below.
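
Purely illustrative numbers, assuming the default 4 MB object size:

# Object map: 2 bits per object; in-flight data: queue_depth * I/O size.
def object_map_bytes(image_bytes, object_size=4 * 2**20, bits_per_object=2):
    return (image_bytes // object_size) * bits_per_object // 8

def inflight_data_bytes(queue_depth, io_size):
    return queue_depth * io_size

print(object_map_bytes(16 * 2**40) / 2**20)          # 16 TB image -> 1.0 MiB
print(inflight_data_bytes(1000, 4 * 2**20) / 2**30)  # qd 1000, 4 MB I/O -> ~3.9 GiB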

/maged


On 26/01/2023 16:36, Stefan Hajnoczi wrote:
> On Thu, Jan 26, 2023 at 02:48:27PM +0100, Ilya Dryomov wrote:
>> On Wed, Jan 25, 2023 at 5:57 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>> Hi,
>>> What sort of memory usage is expected under heavy I/O to an rbd block
>>> device with O_DIRECT?
>>>
>>> For example:
>>> - Page cache: none (O_DIRECT)
>>> - Socket snd/rcv buffers: yes
>> Hi Stefan,
>>
>> There is a socket open to each OSD (object storage daemon).  A Ceph
>> cluster may have tens, hundreds or even thousands of OSDs (although the
>> latter is rare -- usually folks end up with several smaller clusters
>> instead of a single large cluster).  Under heavy random I/O and given
>> a big enough RBD image, it's reasonable to assume that most if not all
>> OSDs would be involved and therefore their sessions would be active.
>>
>> A thing to note is that, by default, OSD sessions are shared between
>> RBD devices.  So as long as all RBD images that are mapped on a node
>> belong to the same cluster, the same set of sockets would be used.
>>
>> Idle OSD sockets get closed after 60 seconds of inactivity.
>>
>>
>>> - Internal rbd buffers?
>>>
>>> I am trying to understand how Linux rbd block devices compare to local
>>> block devices (like NVMe PCI) in terms of memory consumption.
>> RBD doesn't do any internal buffering.  Data is read from/written to
>> the wire directly to/from BIO pages.  The only exception to that is the
>> "secure" mode -- built-in encryption for Ceph on-the-wire protocol.  In
>> that case the data is buffered, partly because RBD obviously can't mess
>> with plaintext data in the BIO and partly because the Linux kernel
>> crypto API isn't flexible enough.
>>
>> There is some memory overhead associated with each I/O (OSD request
>> metadata encoding, mostly).  It's surely larger than in the NVMe PCI
>> case.  I don't have the exact number but it should be less than 4K per
>> I/O in almost all cases.  This memory is coming out of private SLAB
>> caches and could be reclaimable had we set SLAB_RECLAIM_ACCOUNT on
>> them.
> Thanks, this information is very useful. I was trying to get a sense of
> whether to look deeper into the rbd driver in an OOM kill scenario.
>
> Stefan



* Re: rbd kernel block driver memory usage
  2023-01-26 18:14     ` Maged Mokhtar
@ 2023-01-26 21:51       ` Stefan Hajnoczi
  2023-01-27  9:40         ` Maged Mokhtar
From: Stefan Hajnoczi @ 2023-01-26 21:51 UTC (permalink / raw)
  To: Maged Mokhtar
  Cc: Ilya Dryomov, Dongsheng Yang, ceph-devel, vromanso, kwolf,
	mimehta, acardace

On Thu, Jan 26, 2023 at 08:14:22PM +0200, Maged Mokhtar wrote:
> In the case of the object map, which the driver loads, it takes 2 bits per
> 4 MB of image size. A 16 TB image requires 1 MB of memory.
> 
> > > I was trying to get a sense of whether to look deeper into the rbd driver in an OOM kill scenario.
> 
> If you are looking into OOM, maybe look into lowering queue_depth, which you can specify when you map the image. Technically it belongs to the block layer queue rather than the rbd driver itself. If you write with a 4 MB block size and your queue_depth is 1000, you need 4 GB of memory for in-flight data for that single image; if you have many images it could add up.

Thanks!

Stefan


* Re: rbd kernel block driver memory usage
  2023-01-26 21:51       ` Stefan Hajnoczi
@ 2023-01-27  9:40         ` Maged Mokhtar
From: Maged Mokhtar @ 2023-01-27  9:40 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: ceph-devel


Another thing that can help in OOM cases: avoid using the rbd client on the
same host as OSDs. This was a general recommendation, especially with earlier
kernels (3.x).


On 26/01/2023 23:51, Stefan Hajnoczi wrote:
> On Thu, Jan 26, 2023 at 08:14:22PM +0200, Maged Mokhtar wrote:
>> In the case of the object map, which the driver loads, it takes 2 bits per
>> 4 MB of image size. A 16 TB image requires 1 MB of memory.
>>
>>>> I was trying to get a sense of whether to look deeper into the rbd driver in an OOM kill scenario.
>> If you are looking into OOM, maybe look into lowering queue_depth, which you can specify when you map the image. Technically it belongs to the block layer queue rather than the rbd driver itself. If you write with a 4 MB block size and your queue_depth is 1000, you need 4 GB of memory for in-flight data for that single image; if you have many images it could add up.
> Thanks!
>
> Stefan



* Re: rbd kernel block driver memory usage
  2023-01-26 15:49     ` Anthony D'Atri
@ 2023-01-27  9:58       ` Ilya Dryomov
From: Ilya Dryomov @ 2023-01-27  9:58 UTC (permalink / raw)
  To: Anthony D'Atri; +Cc: ceph-devel

On Thu, Jan 26, 2023 at 5:08 PM Anthony D'Atri <aad@dreamsnake.net> wrote:
>
> >>
> >> There is a socket open to each OSD (object storage daemon).
>
> I’ve always understood that there were *two* to each OSD, was I misinformed?

Hi Anthony,

It looks like you were misinformed -- there is just one client -> OSD
socket.

>
> >>  A Ceph cluster may have tens, hundreds or even thousands of OSDs (although the
> >> latter is rare -- usually folks end up with several smaller clusters
> >> instead of a single large cluster).
>
> … though if a client has multiple RBD volumes attached, it may be talking to more than one cluster.  I’ve seen a client exhaust the file descriptor limit on a hypervisor doing this after a cluster expansion.
>
> >> A thing to note is that, by default, OSD sessions are shared between
> >> RBD devices.  So as long as all RBD images that are mapped on a node
> >> belong to the same cluster, the same set of sockets would be used.
>
> Before … Luminous was it? AIUI they weren’t pooled, so older releases may have higher consumption.

No, this behavior goes back to when RBD was introduced in 2010.  It has
always been enabled by default so nothing changed in this regard around
Luminous.

Thanks,

                Ilya
