From: Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
To: Yi Zhang <yizhan-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
Subject: Re: mlx4_core 0000:07:00.0: swiotlb buffer is full and OOM observed during stress test on reset_controller
Date: Tue, 14 Mar 2017 18:52:03 +0200	[thread overview]
Message-ID: <0a825b18-df06-9a6d-38c9-402f4ee121f7@mellanox.com> (raw)
In-Reply-To: <860db62d-ae93-d94c-e5fb-88e7b643f737-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>



On 3/14/2017 3:35 PM, Yi Zhang wrote:
>
>
> On 03/13/2017 02:16 AM, Max Gurtovoy wrote:
>>
>>
>> On 3/10/2017 6:52 PM, Leon Romanovsky wrote:
>>> On Thu, Mar 09, 2017 at 12:20:14PM +0800, Yi Zhang wrote:
>>>>
>>>>> I'm using a CX5-LX device and have not seen any issues with it.
>>>>>
>>>>> Would it be possible to retest with kmemleak?
>>>>>
>>>> Here is the device I used.
>>>>
>>>> Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
>>>>
>>>> The issue can always be reproduced within about 1000 iterations.
>>>>
>>>> Another thing: I noticed one strange phenomenon in the log:
>>>>
>>>> Before the OOM occurred, most of the log entries were about "adding queue",
>>>> and after the OOM occurred, most of them were about "nvmet_rdma: freeing
>>>> queue".
>>>>
>>>> It seems the release work, "schedule_work(&queue->release_work);", is not
>>>> executed in a timely manner; I'm not sure whether this is what causes the OOM.
>>>
>>> Sagi,
>>> The release function is placed on the global workqueue. I'm not familiar
>>> with the NVMe design and I don't know all the details, but maybe the
>>> proper way would be to create a dedicated workqueue with the MEM_RECLAIM
>>> flag to ensure forward progress?
>>>
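
For illustration, a minimal sketch of the kind of change Leon suggests: a
dedicated workqueue allocated with WQ_MEM_RECLAIM on which the queue release
work is queued. The workqueue name, init function, and call site below are
assumptions for the sketch, not the actual nvmet-rdma code:

#include <linux/init.h>
#include <linux/workqueue.h>

/* Assumed name; allocated once at module init. WQ_MEM_RECLAIM guarantees a
 * rescuer thread, so the release work can make forward progress even when
 * the system is under memory pressure. */
static struct workqueue_struct *nvmet_rdma_release_wq;

static int example_init(void)
{
	nvmet_rdma_release_wq = alloc_workqueue("nvmet-rdma-release",
						WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
	if (!nvmet_rdma_release_wq)
		return -ENOMEM;
	return 0;
}

/* The release work would then be queued on the dedicated workqueue instead
 * of the system one, i.e. instead of schedule_work(&queue->release_work): */
	queue_work(nvmet_rdma_release_wq, &queue->release_work);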
>>
>> Hi,
>>
>> I was able to repro it in my lab with a ConnectX-3. I added a dedicated
>> workqueue with high priority, but the bug still happens.
>> If I add a "sleep 1" after echo 1
>> >/sys/block/nvme0n1/device/reset_controller, the test passes. So there is
>> no leak IMO, but the allocation process is much faster than the
>> destruction of the resources.
>> On the initiator side we don't wait for the RDMA_CM_EVENT_DISCONNECTED event
>> after we call rdma_disconnect, and we try to connect again immediately.
>> Maybe we need to slow down the storm of connect requests from the
>> initiator somehow to give the target time to settle.
>>
>> Max.
>>
>>
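
As a rough, hypothetical sketch of the pacing idea quoted above (waiting for
the disconnect event before reconnecting), the CM event handler could complete
a per-queue completion that the teardown path then waits on. The structure and
function names below are illustrative assumptions, not the actual nvme-rdma
host code:

#include <linux/completion.h>
#include <rdma/rdma_cm.h>

/* Hypothetical per-queue state; 'disconnected' must be init_completion()'d
 * before rdma_disconnect() is called. */
struct example_queue {
	struct rdma_cm_id	*cm_id;
	struct completion	disconnected;
};

static int example_cm_handler(struct rdma_cm_id *cm_id,
			      struct rdma_cm_event *ev)
{
	struct example_queue *q = cm_id->context;

	switch (ev->event) {
	case RDMA_CM_EVENT_DISCONNECTED:
	case RDMA_CM_EVENT_TIMEWAIT_EXIT:
		complete(&q->disconnected);
		break;
	default:
		break;
	}
	return 0;
}

static void example_teardown(struct example_queue *q)
{
	rdma_disconnect(q->cm_id);
	/* Bound the wait so a lost event cannot hang the reset path. */
	wait_for_completion_timeout(&q->disconnected, msecs_to_jiffies(1000));
	/* ...and only now destroy the queue and start the reconnect. */
}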
> Hi Sagi
> Let's use this mail thread to track the OOM issue. :)
>
> Thanks
> Yi

Hi Yi,
I can't repro the OOM issue with 4.11-rc2 (I don't know why, actually).
Which kernel are you using?

Max.

