From: Oded Gabbay <oded.gabbay@gmail.com>
To: linux-kernel@vger.kernel.org, oshpigelman@habana.ai, ttayar@habana.ai
Cc: gregkh@linuxfoundation.org
Subject: [PATCH 6/6] habanalabs: increase timeout during reset
Date: Sat, 28 Mar 2020 11:52:38 +0300
Message-ID: <20200328085238.3428-6-oded.gabbay@gmail.com>
In-Reply-To: <20200328085238.3428-1-oded.gabbay@gmail.com>

During training, the DL framework (e.g. TensorFlow) performs hundreds of
thousands of memory allocations and mappings. If the driver needs to
perform a hard reset during training, it kills the application and unmaps
all of those memory allocations. Unfortunately, because of the large
number of mappings, the driver can't finish unmapping within the current
timeout (5 seconds). Therefore, increase the timeout to 30 seconds to
avoid a situation where the driver resets the device with active mappings,
which can sometimes cause a kernel bug.

Note that this doesn't mean the driver will always wait the full 30
seconds, because the reset thread checks once per second whether the unmap
operation is done.

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
---
 drivers/misc/habanalabs/habanalabs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/habanalabs/habanalabs.h b/drivers/misc/habanalabs/habanalabs.h
index 199f7835ae46..6c54d0ba0a1d 100644
--- a/drivers/misc/habanalabs/habanalabs.h
+++ b/drivers/misc/habanalabs/habanalabs.h
@@ -23,7 +23,7 @@
 
 #define HL_MMAP_CB_MASK			(0x8000000000000000ull >> PAGE_SHIFT)
 
-#define HL_PENDING_RESET_PER_SEC	5
+#define HL_PENDING_RESET_PER_SEC	30
 
 #define HL_DEVICE_TIMEOUT_USEC		1000000 /* 1 s */
 
-- 
2.17.1
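
The polling behaviour described in the commit message can be sketched in
plain userspace C as shown here. This is only an illustration of "check
once per second, give up after the timeout", not the driver's actual code:
struct fake_dev, unmap_done(), sleep_one_second() and wait_for_unmap() are
hypothetical names, and the sleep is simulated so the sketch runs instantly.
Only HL_PENDING_RESET_PER_SEC and its value come from the patch.

```c
#include <stdbool.h>

/* Value introduced by the patch: upper bound on one-second polls. */
#define HL_PENDING_RESET_PER_SEC	30

/* Hypothetical stand-in for device state; not the driver's structure. */
struct fake_dev {
	int seconds_needed;	/* simulated time the unmapping takes */
	int elapsed;		/* simulated seconds slept so far */
};

/* Hypothetical check: true once all user mappings have been released. */
static bool unmap_done(const struct fake_dev *dev)
{
	return dev->elapsed >= dev->seconds_needed;
}

/* Simulated one-second sleep; only advances the fake clock. */
static void sleep_one_second(struct fake_dev *dev)
{
	dev->elapsed++;
}

/*
 * Poll once per (simulated) second until the unmap completes or
 * timeout_sec seconds have passed. Returns the number of seconds
 * waited, or -1 if mappings are still active at the timeout.
 */
static int wait_for_unmap(struct fake_dev *dev, int timeout_sec)
{
	int waited;

	for (waited = 0; waited < timeout_sec; waited++) {
		if (unmap_done(dev))
			return waited;	/* finished early, stop waiting */
		sleep_one_second(dev);
	}

	/* One last check after the final sleep. */
	return unmap_done(dev) ? waited : -1;
}
```

In this sketch, an unmap that needs 12 seconds times out under the old
limit of 5 but completes after 12 seconds under the new limit, without
waiting for the remaining 18.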

