* rdma-core device memory leak
@ 2019-07-22  8:10 Gal Pressman
  2019-07-22  9:15 ` Leon Romanovsky
  2019-07-22 11:48 ` Jason Gunthorpe
  0 siblings, 2 replies; 5+ messages in thread
From: Gal Pressman @ 2019-07-22  8:10 UTC
  To: RDMA mailing list
  Cc: Jason Gunthorpe, Doug Ledford, Leon Romanovsky, Maor Gottlieb

Hi all,

I'm seeing memory leaks when running tests with valgrind memcheck tool [1]. It
seems like it's caused due to verbs_device refcount never reaching zero.

Last related commit is 8125fdeb69bb ("verbs: Avoid ibv_device memory leak"),
which seems like it should prevent this issue - but I'm not sure it covers all
cases.

When calling ibv_get_device_list, try_driver will eventually get called and set
the device refcount to one. The refcount for each device will be increased when
iterating the devices list, and on each verbs_init_context call.

In the free flow, the refcount is decreased on verbs_uninit_context and when
iterating the devices list - which brings the refcount back to one, as initially
set by try_driver (hence uninit_device isn't called).
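
The flow above can be sketched as a small self-contained model. This is not
the libibverbs code: struct model_device and the model_* helpers are
hypothetical names used only to illustrate the counting, and they assume
exactly one device list and one context are opened and then released.

```c
#include <assert.h>

/* Hypothetical stand-in for struct verbs_device; only the refcount is modelled. */
struct model_device {
	int refcount;
	int uninit_called;	/* set to 1 when the last reference is dropped */
};

/* Models try_driver: the device starts out holding one reference. */
void model_try_driver(struct model_device *dev)
{
	dev->refcount = 1;
	dev->uninit_called = 0;
}

/* Take a reference (device list iteration or verbs_init_context). */
void model_get(struct model_device *dev)
{
	dev->refcount++;
}

/* Drop a reference; the "uninit_device" step only runs on the last put. */
void model_put(struct model_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->uninit_called = 1;
}

/* Runs the described alloc/free flow and returns the final refcount. */
int model_full_cycle(struct model_device *dev)
{
	model_try_driver(dev);	/* refcount = 1 */
	model_get(dev);		/* device list iteration: 2 */
	model_get(dev);		/* verbs_init_context:    3 */
	model_put(dev);		/* verbs_uninit_context:  2 */
	model_put(dev);		/* freeing the list:      1 */
	return dev->refcount;
}
```

In this model the count ends at one after every reference taken by the
application has been dropped, so the uninit step never runs.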

Is there a reason for initializing refcount to one instead of zero? According to
the cited commit the reference count should be decreased when the device no
longer exists in the sysfs, but the device isn't necessarily removed from the sysfs.

[1]
==35758== HEAP SUMMARY:
==35758==     in use at exit: 27,777 bytes in 88 blocks
==35758==   total heap usage: 295 allocs, 207 frees, 141,751 bytes allocated
==35758==
==35758== 728 bytes in 1 blocks are possibly lost in loss record 3 of 8
==35758==    at 0x4C2A935: calloc (vg_replace_malloc.c:711)
==35758==    by 0x6FF263F: efa_device_alloc (efa.c:161)
==35758==    by 0x4E42B67: try_driver (init.c:365)
==35758==    by 0x4E42DBE: try_drivers (init.c:429)
==35758==    by 0x4E42DBE: try_all_drivers (init.c:519)
==35758==    by 0x4E43798: ibverbs_get_device_list (init.c:584)
==35758==    by 0x4E40870: ibv_get_device_list@@IBVERBS_1.1 (device.c:74)
==35758==    by 0x400691: main (device_list.c:46)
==35758==
==35758== 1,048 bytes in 1 blocks are possibly lost in loss record 6 of 8
==35758==    at 0x4C2A935: calloc (vg_replace_malloc.c:711)
==35758==    by 0x4E425C5: find_sysfs_devs_nl_cb (ibdev_nl.c:156)
==35758==    by 0x5697E4B: nl_recvmsgs_report (in /usr/lib64/libnl-3.so.200.23.0)
==35758==    by 0x56982B8: nl_recvmsgs (in /usr/lib64/libnl-3.so.200.23.0)
==35758==    by 0x4E4734F: rdmanl_get_devices (rdma_nl.c:96)
==35758==    by 0x4E42726: find_sysfs_devs_nl (ibdev_nl.c:205)
==35758==    by 0x4E43501: ibverbs_get_device_list (init.c:538)
==35758==    by 0x4E40870: ibv_get_device_list@@IBVERBS_1.1 (device.c:74)
==35758==    by 0x400691: main (device_list.c:46)

Thanks,
Gal

* Re: rdma-core device memory leak
  2019-07-22  8:10 rdma-core device memory leak Gal Pressman
@ 2019-07-22  9:15 ` Leon Romanovsky
  2019-07-22 11:21   ` Gal Pressman
  2019-07-22 11:48 ` Jason Gunthorpe
  1 sibling, 1 reply; 5+ messages in thread
From: Leon Romanovsky @ 2019-07-22  9:15 UTC
  To: Gal Pressman
  Cc: RDMA mailing list, Jason Gunthorpe, Doug Ledford, Maor Gottlieb

On Mon, Jul 22, 2019 at 11:10:51AM +0300, Gal Pressman wrote:
> Hi all,
>
> I'm seeing memory leaks when running tests with valgrind memcheck tool [1]. It
> seems like it's caused due to verbs_device refcount never reaching zero.
>
> Last related commit is 8125fdeb69bb ("verbs: Avoid ibv_device memory leak"),
> which seems like it should prevent this issue - but I'm not sure it covers all
> cases.
>
> When calling ibv_get_device_list, try_driver will eventually get called and set
> the device refcount to one. The refcount for each device will be increased when
> iterating the devices list, and on each verbs_init_context call.
>
> In the free flow, the refcount is decreased on verbs_uninit_context and when
> iterating the devices list - which brings the refcount back to one, as initially
> set by try_driver (hence uninit_device isn't called).
>
> Is there a reason for initializing refcount to one instead of zero? According to
> the cited commit the reference count should be decreased when the device no
> longer exists in the sysfs, but the device isn't necessarily removed from the sysfs.

Such a scheme lets us avoid reinitializing the rdma-core provider every
time an application "plays" with ibv_get_device_list(). Anyway, the rdma-core
library (libibverbs) won't be unloaded until dlclose() is called and the glibc
reference count reaches zero, so we don't need to release the provider
before that point either.
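
A minimal sketch of why the extra reference avoids reinitialization. The
struct and the model_* functions are hypothetical and only count how often a
provider would be set up across repeated get/free cycles of the device list;
they are not the real libibverbs implementation.

```c
#include <assert.h>

/* Hypothetical model of a cached provider. */
struct cached_provider {
	int refcount;
	int initialized;
	int init_calls;	/* how many times the provider had to be set up */
};

/* One get/free cycle of the device list by the application. */
void model_list_cycle(struct cached_provider *p, int base_ref)
{
	if (!p->initialized) {
		p->initialized = 1;
		p->init_calls++;
		p->refcount = base_ref;	/* reference held by the library cache */
	}
	p->refcount++;	/* list handed to the application */
	p->refcount--;	/* application frees the list */
	if (p->refcount == 0)
		p->initialized = 0;	/* provider is torn down */
}

/* Returns the number of provider initializations over the given cycles. */
int model_count_inits(int base_ref, int cycles)
{
	struct cached_provider p = { 0, 0, 0 };

	for (int i = 0; i < cycles; i++)
		model_list_cycle(&p, base_ref);
	return p.init_calls;
}
```

With a base reference of one the model initializes the provider once no
matter how many cycles run; with a base reference of zero it reinitializes
on every cycle.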

Thanks

* Re: rdma-core device memory leak
  2019-07-22  9:15 ` Leon Romanovsky
@ 2019-07-22 11:21   ` Gal Pressman
  2019-07-22 11:38     ` Leon Romanovsky
  0 siblings, 1 reply; 5+ messages in thread
From: Gal Pressman @ 2019-07-22 11:21 UTC
  To: Leon Romanovsky
  Cc: RDMA mailing list, Jason Gunthorpe, Doug Ledford, Maor Gottlieb

On 22/07/2019 12:15, Leon Romanovsky wrote:
> On Mon, Jul 22, 2019 at 11:10:51AM +0300, Gal Pressman wrote:
>> Hi all,
>>
>> I'm seeing memory leaks when running tests with valgrind memcheck tool [1]. It
>> seems like it's caused due to verbs_device refcount never reaching zero.
>>
>> Last related commit is 8125fdeb69bb ("verbs: Avoid ibv_device memory leak"),
>> which seems like it should prevent this issue - but I'm not sure it covers all
>> cases.
>>
>> When calling ibv_get_device_list, try_driver will eventually get called and set
>> the device refcount to one. The refcount for each device will be increased when
>> iterating the devices list, and on each verbs_init_context call.
>>
>> In the free flow, the refcount is decreased on verbs_uninit_context and when
>> iterating the devices list - which brings the refcount back to one, as initially
>> set by try_driver (hence uninit_device isn't called).
>>
>> Is there a reason for initializing refcount to one instead of zero? According to
>> the cited commit the reference count should be decreased when the device no
>> longer exists in the sysfs, but the device isn't necessarily removed from the sysfs.
> 
> Such a scheme lets us avoid reinitializing the rdma-core provider every
> time an application "plays" with ibv_get_device_list(). Anyway, the rdma-core
> library (libibverbs) won't be unloaded until dlclose() is called and the glibc
> reference count reaches zero, so we don't need to release the provider
> before that point either.

So you consider these valgrind errors false alarms?

* Re: rdma-core device memory leak
  2019-07-22 11:21   ` Gal Pressman
@ 2019-07-22 11:38     ` Leon Romanovsky
  0 siblings, 0 replies; 5+ messages in thread
From: Leon Romanovsky @ 2019-07-22 11:38 UTC
  To: Gal Pressman
  Cc: RDMA mailing list, Jason Gunthorpe, Doug Ledford, Maor Gottlieb

On Mon, Jul 22, 2019 at 02:21:18PM +0300, Gal Pressman wrote:
> On 22/07/2019 12:15, Leon Romanovsky wrote:
> > On Mon, Jul 22, 2019 at 11:10:51AM +0300, Gal Pressman wrote:
> >> Hi all,
> >>
> >> I'm seeing memory leaks when running tests with valgrind memcheck tool [1]. It
> >> seems like it's caused due to verbs_device refcount never reaching zero.
> >>
> >> Last related commit is 8125fdeb69bb ("verbs: Avoid ibv_device memory leak"),
> >> which seems like it should prevent this issue - but I'm not sure it covers all
> >> cases.
> >>
> >> When calling ibv_get_device_list, try_driver will eventually get called and set
> >> the device refcount to one. The refcount for each device will be increased when
> >> iterating the devices list, and on each verbs_init_context call.
> >>
> >> In the free flow, the refcount is decreased on verbs_uninit_context and when
> >> iterating the devices list - which brings the refcount back to one, as initially
> >> set by try_driver (hence uninit_device isn't called).
> >>
> >> Is there a reason for initializing refcount to one instead of zero? According to
> >> the cited commit the reference count should be decreased when the device no
> >> longer exists in the sysfs, but the device isn't necessarily removed from the sysfs.
> >
> > Such a scheme lets us avoid reinitializing the rdma-core provider every
> > time an application "plays" with ibv_get_device_list(). Anyway, the rdma-core
> > library (libibverbs) won't be unloaded until dlclose() is called and the glibc
> > reference count reaches zero, so we don't need to release the provider
> > before that point either.
>
> So you consider these valgrind errors false alarms?

Yes, valgrind checks executed code and is unlikely to check the unload
sequence. In your case, the unload code wasn't called at all.

Thanks

* Re: rdma-core device memory leak
  2019-07-22  8:10 rdma-core device memory leak Gal Pressman
  2019-07-22  9:15 ` Leon Romanovsky
@ 2019-07-22 11:48 ` Jason Gunthorpe
  1 sibling, 0 replies; 5+ messages in thread
From: Jason Gunthorpe @ 2019-07-22 11:48 UTC
  To: Gal Pressman
  Cc: RDMA mailing list, Doug Ledford, Leon Romanovsky, Maor Gottlieb

On Mon, Jul 22, 2019 at 11:10:51AM +0300, Gal Pressman wrote:
> Hi all,
> 
> I'm seeing memory leaks when running tests with valgrind memcheck tool [1]. It
> seems like it's caused due to verbs_device refcount never reaching zero.
> 
> Last related commit is 8125fdeb69bb ("verbs: Avoid ibv_device memory leak"),
> which seems like it should prevent this issue - but I'm not sure it covers all
> cases.
> 
> When calling ibv_get_device_list, try_driver will eventually get called and set
> the device refcount to one. The refcount for each device will be increased when
> iterating the devices list, and on each verbs_init_context call.
> 
> In the free flow, the refcount is decreased on verbs_uninit_context and when
> iterating the devices list - which brings the refcount back to one, as initially
> set by try_driver (hence uninit_device isn't called).

It is supposed to cache the device list in the library
(device.c:device_list), and there is no function to clean up the cache to
silence the valgrind warnings.
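
For illustration only, a sketch of what a library-level cache plus an
explicit cleanup hook could look like. Everything here is hypothetical:
model_cache_head, model_cache_add() and model_cache_cleanup() do not exist in
rdma-core, and the entries carry no device state.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cached entry; a real one would hold the device state. */
struct cached_dev {
	struct cached_dev *next;
};

/* Library-lifetime cache: entries stay allocated between list calls. */
struct cached_dev *model_cache_head;

/* Adds one entry to the cache; returns 0 on success, -1 on allocation failure. */
int model_cache_add(void)
{
	struct cached_dev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return -1;
	dev->next = model_cache_head;
	model_cache_head = dev;
	return 0;
}

/* Hypothetical cleanup hook: frees every cached entry, returns how many. */
int model_cache_cleanup(void)
{
	int freed = 0;

	while (model_cache_head) {
		struct cached_dev *next = model_cache_head->next;

		free(model_cache_head);
		model_cache_head = next;
		freed++;
	}
	return freed;
}
```

Without a call like model_cache_cleanup() before exit, the cached entries
are still allocated when the process ends, which is what a leak checker
reports.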

Jason
