All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jianfeng Tan <jianfeng.tan@intel.com>
To: dev@dpdk.org
Cc: sergio.gonzalez.monroy@intel.com, nhorman@tuxdriver.com,
	david.marchand@6wind.com, thomas.monjalon@6wind.com,
	Jianfeng Tan <jianfeng.tan@intel.com>
Subject: [PATCH v5] eal: fix allocating all free hugepages
Date: Tue, 31 May 2016 03:37:07 +0000	[thread overview]
Message-ID: <1464665827-24965-1-git-send-email-jianfeng.tan@intel.com> (raw)
In-Reply-To: <1453661393-85704-1-git-send-email-jianfeng.tan@intel.com>

EAL memory init allocates all free hugepages of the whole system,
which seen from sysfs, even when applications do not ask so many.
When there is a limitation on how many hugepages an application can
use (such as cgroup.hugetlb), or hugetlbfs is specified with an
option of size (exceeding the quota of the fs), it just fails to
start even there are enough hugepages allocated.

To fix above issue, this patch:
 - Changes the logic to continue memory init to see if hugetlb
   requirement of application can be addressed by already allocated
   hugepages.
 - To make sure each hugepage is allocated successfully, we add a
   recover mechanism, which relies on a mem access to fault-in
   hugepages, and if it fails with SIGBUS, recover to previously
   saved stack environment with siglongjmp().

For the case of CONFIG_RTE_EAL_SINGLE_FILE_SEGMENTS (enabled by
default when compiling IVSHMEM target), it's indispensable to
mapp all free hugepages in the system. Under this case, it fails
to start when allocating fails.

Test example:
  a. cgcreate -g hugetlb:/test-subgroup
  b. cgset -r hugetlb.1GB.limit_in_bytes=2147483648 test-subgroup
  c. cgexec -g hugetlb:test-subgroup \
          ./examples/helloworld/build/helloworld -c 0x2 -n 4

       
Fixes: af75078fece ("first public release")

Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
---
v5:
 - Make this method as default instead of using an option.
 - When SIGBUS is triggered in the case of RTE_EAL_SINGLE_FILE_SEGMENTS,
   just return error.
 - Add prefix "huge_" to newly added function and static variables.
 - Move the internal_config.memory assignment after the page allocations.
v4:
 - Change map_all_hugepages to return unsigned instead of int.
v3:
 - Reword commit message to include it fixes the hugetlbfs quota issue.
 - setjmp -> sigsetjmp.
 - Fix RTE_LOG complaint from ERR to DEBUG as it does not mean init error
   so far.
 - Fix the second map_all_hugepages's return value check.
v2:
 - Address the compiling error by move setjmp into a wrap method.

 lib/librte_eal/linuxapp/eal/eal.c        |  20 -----
 lib/librte_eal/linuxapp/eal/eal_memory.c | 138 ++++++++++++++++++++++++++++---
 2 files changed, 125 insertions(+), 33 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c
index 8aafd51..4a8dfbd 100644
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -465,24 +465,6 @@ eal_parse_vfio_intr(const char *mode)
 	return -1;
 }
 
-static inline size_t
-eal_get_hugepage_mem_size(void)
-{
-	uint64_t size = 0;
-	unsigned i, j;
-
-	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
-		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
-		if (hpi->hugedir != NULL) {
-			for (j = 0; j < RTE_MAX_NUMA_NODES; j++) {
-				size += hpi->hugepage_sz * hpi->num_pages[j];
-			}
-		}
-	}
-
-	return (size < SIZE_MAX) ? (size_t)(size) : SIZE_MAX;
-}
-
 /* Parse the arguments for --log-level only */
 static void
 eal_log_level_parse(int argc, char **argv)
@@ -766,8 +748,6 @@ rte_eal_init(int argc, char **argv)
 	if (internal_config.memory == 0 && internal_config.force_sockets == 0) {
 		if (internal_config.no_hugetlbfs)
 			internal_config.memory = MEMSIZE_IF_NO_HUGE_PAGE;
-		else
-			internal_config.memory = eal_get_hugepage_mem_size();
 	}
 
 	if (internal_config.vmware_tsc_map == 1) {
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index 5b9132c..dc6f49b 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -80,6 +80,8 @@
 #include <errno.h>
 #include <sys/ioctl.h>
 #include <sys/time.h>
+#include <signal.h>
+#include <setjmp.h>
 
 #include <rte_log.h>
 #include <rte_memory.h>
@@ -309,6 +311,21 @@ get_virtual_area(size_t *size, size_t hugepage_sz)
 	return addr;
 }
 
+static sigjmp_buf huge_jmpenv;
+
+static void huge_sigbus_handler(int signo __rte_unused)
+{
+	siglongjmp(huge_jmpenv, 1);
+}
+
+/* Put setjmp into a wrap method to avoid compiling error. Any non-volatile,
+ * non-static local variable in the stack frame calling sigsetjmp might be
+ * clobbered by a call to longjmp.
+ */
+static int huge_wrap_sigsetjmp(void)
+{
+	return sigsetjmp(huge_jmpenv, 1);
+}
 /*
  * Mmap all hugepages of hugepage table: it first open a file in
  * hugetlbfs, then mmap() hugepage_sz data in it. If orig is set, the
@@ -316,7 +333,7 @@ get_virtual_area(size_t *size, size_t hugepage_sz)
  * in hugepg_tbl[i].final_va. The second mapping (when orig is 0) tries to
  * map continguous physical blocks in contiguous virtual blocks.
  */
-static int
+static unsigned
 map_all_hugepages(struct hugepage_file *hugepg_tbl,
 		struct hugepage_info *hpi, int orig)
 {
@@ -394,9 +411,9 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
 		/* try to create hugepage file */
 		fd = open(hugepg_tbl[i].filepath, O_CREAT | O_RDWR, 0755);
 		if (fd < 0) {
-			RTE_LOG(ERR, EAL, "%s(): open failed: %s\n", __func__,
+			RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n", __func__,
 					strerror(errno));
-			return -1;
+			return i;
 		}
 
 		/* map the segment, and populate page tables,
@@ -404,10 +421,10 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
 		virtaddr = mmap(vma_addr, hugepage_sz, PROT_READ | PROT_WRITE,
 				MAP_SHARED | MAP_POPULATE, fd, 0);
 		if (virtaddr == MAP_FAILED) {
-			RTE_LOG(ERR, EAL, "%s(): mmap failed: %s\n", __func__,
+			RTE_LOG(DEBUG, EAL, "%s(): mmap failed: %s\n", __func__,
 					strerror(errno));
 			close(fd);
-			return -1;
+			return i;
 		}
 
 		if (orig) {
@@ -417,12 +434,33 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
 			hugepg_tbl[i].final_va = virtaddr;
 		}
 
+		if (orig) {
+			/* In linux, hugetlb limitations, like cgroup, are
+			 * enforced at fault time instead of mmap(), even
+			 * with the option of MAP_POPULATE. Kernel will send
+			 * a SIGBUS signal. To avoid to be killed, save stack
+			 * environment here, if SIGBUS happens, we can jump
+			 * back here.
+			 */
+			if (huge_wrap_sigsetjmp()) {
+				RTE_LOG(DEBUG, EAL, "SIGBUS: Cannot mmap more "
+					"hugepages of size %u MB\n",
+					(unsigned)(hugepage_sz / 0x100000));
+				munmap(virtaddr, hugepage_sz);
+				close(fd);
+				unlink(hugepg_tbl[i].filepath);
+				return i;
+			}
+			*(int *)virtaddr = 0;
+		}
+
+
 		/* set shared flock on the file. */
 		if (flock(fd, LOCK_SH | LOCK_NB) == -1) {
-			RTE_LOG(ERR, EAL, "%s(): Locking file failed:%s \n",
+			RTE_LOG(DEBUG, EAL, "%s(): Locking file failed:%s \n",
 				__func__, strerror(errno));
 			close(fd);
-			return -1;
+			return i;
 		}
 
 		close(fd);
@@ -430,7 +468,8 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl,
 		vma_addr = (char *)vma_addr + hugepage_sz;
 		vma_len -= hugepage_sz;
 	}
-	return 0;
+
+	return i;
 }
 
 #ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
@@ -1036,6 +1075,51 @@ calc_num_pages_per_socket(uint64_t * memory,
 	return total_num_pages;
 }
 
+static inline size_t
+eal_get_hugepage_mem_size(void)
+{
+	uint64_t size = 0;
+	unsigned i, j;
+
+	for (i = 0; i < internal_config.num_hugepage_sizes; i++) {
+		struct hugepage_info *hpi = &internal_config.hugepage_info[i];
+		if (hpi->hugedir != NULL) {
+			for (j = 0; j < RTE_MAX_NUMA_NODES; j++) {
+				size += hpi->hugepage_sz * hpi->num_pages[j];
+			}
+		}
+	}
+
+	return (size < SIZE_MAX) ? (size_t)(size) : SIZE_MAX;
+}
+
+static struct sigaction huge_action_old;
+static int huge_need_recover;
+
+static void
+huge_register_sigbus(void)
+{
+	sigset_t mask;
+	struct sigaction action;
+
+	sigemptyset(&mask);
+	sigaddset(&mask, SIGBUS);
+	action.sa_flags = 0;
+	action.sa_mask = mask;
+	action.sa_handler = huge_sigbus_handler;
+
+	huge_need_recover = !sigaction(SIGBUS, &action, &huge_action_old);
+}
+
+static void
+huge_recover_sigbus(void)
+{
+	if (huge_need_recover) {
+		sigaction(SIGBUS, &huge_action_old, NULL);
+		huge_need_recover = 0;
+	}
+}
+
 /*
  * Prepare physical memory mapping: fill configuration structure with
  * these infos, return 0 on success.
@@ -1122,8 +1206,11 @@ rte_eal_hugepage_init(void)
 
 	hp_offset = 0; /* where we start the current page size entries */
 
+	huge_register_sigbus();
+
 	/* map all hugepages and sort them */
 	for (i = 0; i < (int)internal_config.num_hugepage_sizes; i ++){
+		unsigned pages_old, pages_new;
 		struct hugepage_info *hpi;
 
 		/*
@@ -1137,10 +1224,28 @@ rte_eal_hugepage_init(void)
 			continue;
 
 		/* map all hugepages available */
-		if (map_all_hugepages(&tmp_hp[hp_offset], hpi, 1) < 0){
-			RTE_LOG(DEBUG, EAL, "Failed to mmap %u MB hugepages\n",
-					(unsigned)(hpi->hugepage_sz / 0x100000));
+		pages_old = hpi->num_pages[0];
+		pages_new = map_all_hugepages(&tmp_hp[hp_offset], hpi, 1);
+		if (pages_new < pages_old) {
+#ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
+			RTE_LOG(ERR, EAL,
+				"%d not %d hugepages of size %u MB allocated\n",
+				pages_new, pages_old,
+				(unsigned)(hpi->hugepage_sz / 0x100000));
 			goto fail;
+#else
+			RTE_LOG(DEBUG, EAL,
+				"%d not %d hugepages of size %u MB allocated\n",
+				pages_new, pages_old,
+				(unsigned)(hpi->hugepage_sz / 0x100000));
+
+			int pages = pages_old - pages_new;
+
+			nr_hugepages -= pages;
+			hpi->num_pages[0] = pages_new;
+			if (pages_new == 0)
+				continue;
+#endif
 		}
 
 		/* find physical addresses and sockets for each hugepage */
@@ -1172,8 +1277,9 @@ rte_eal_hugepage_init(void)
 		hp_offset += new_pages_count[i];
 #else
 		/* remap all hugepages */
-		if (map_all_hugepages(&tmp_hp[hp_offset], hpi, 0) < 0){
-			RTE_LOG(DEBUG, EAL, "Failed to remap %u MB pages\n",
+		if (map_all_hugepages(&tmp_hp[hp_offset], hpi, 0) !=
+		    hpi->num_pages[0]) {
+			RTE_LOG(ERR, EAL, "Failed to remap %u MB pages\n",
 					(unsigned)(hpi->hugepage_sz / 0x100000));
 			goto fail;
 		}
@@ -1187,6 +1293,11 @@ rte_eal_hugepage_init(void)
 #endif
 	}
 
+	huge_recover_sigbus();
+
+	if (internal_config.memory == 0 && internal_config.force_sockets == 0)
+		internal_config.memory = eal_get_hugepage_mem_size();
+
 #ifdef RTE_EAL_SINGLE_FILE_SEGMENTS
 	nr_hugefiles = 0;
 	for (i = 0; i < (int) internal_config.num_hugepage_sizes; i++) {
@@ -1373,6 +1484,7 @@ rte_eal_hugepage_init(void)
 	return 0;
 
 fail:
+	huge_recover_sigbus();
 	free(tmp_hp);
 	return -1;
 }
-- 
2.1.4

  parent reply	other threads:[~2016-05-31  3:37 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-01-24 18:49 [RFC] eal: add cgroup-aware resource self discovery Jianfeng Tan
2016-01-25 13:46 ` Neil Horman
2016-01-26  2:22   ` Tan, Jianfeng
2016-01-26 14:19     ` Neil Horman
2016-01-27 12:02       ` Tan, Jianfeng
2016-01-27 17:30         ` Neil Horman
2016-01-29 11:22 ` [PATCH] eal: make resource initialization more robust Jianfeng Tan
2016-02-01 18:08   ` Neil Horman
2016-02-22  6:08   ` Tan, Jianfeng
2016-02-22 13:18     ` Neil Horman
2016-02-28 21:12   ` Thomas Monjalon
2016-02-29  1:50     ` Tan, Jianfeng
2016-03-04 10:05 ` [PATCH] eal: add option --avail-cores to detect lcores Jianfeng Tan
2016-03-08  8:54   ` Panu Matilainen
2016-03-08 17:38     ` Tan, Jianfeng
2016-03-09 13:05       ` Panu Matilainen
2016-03-09 13:53         ` Tan, Jianfeng
2016-03-09 14:01           ` Ananyev, Konstantin
2016-03-09 14:17             ` Tan, Jianfeng
2016-03-09 14:44               ` Ananyev, Konstantin
2016-03-09 14:55                 ` Tan, Jianfeng
2016-03-09 15:17                   ` Ananyev, Konstantin
2016-03-09 17:45                     ` Tan, Jianfeng
2016-03-09 19:33                       ` Ananyev, Konstantin
2016-03-10  1:36                         ` Tan, Jianfeng
2016-05-18 12:46         ` David Marchand
2016-05-19  2:25           ` Tan, Jianfeng
2016-06-30 13:43             ` Thomas Monjalon
2016-07-01  0:52               ` Tan, Jianfeng
2016-04-26 12:39   ` Tan, Jianfeng
2016-03-04 10:58 ` [PATCH] eal: make hugetlb initialization more robust Jianfeng Tan
2016-03-08  1:42   ` [PATCH v2] " Jianfeng Tan
2016-03-08  8:46     ` Tan, Jianfeng
2016-05-04 11:07     ` Sergio Gonzalez Monroy
2016-05-04 11:28       ` Tan, Jianfeng
2016-05-04 12:25     ` Sergio Gonzalez Monroy
2016-05-09 10:48   ` [PATCH v3] " Jianfeng Tan
2016-05-10  8:54     ` Sergio Gonzalez Monroy
2016-05-10  9:11       ` Tan, Jianfeng
2016-05-12  0:44   ` [PATCH v4] " Jianfeng Tan
2016-05-17 16:39     ` David Marchand
2016-05-18  7:56       ` Sergio Gonzalez Monroy
2016-05-18  9:34         ` David Marchand
2016-05-19  2:00       ` Tan, Jianfeng
2016-05-17 16:40     ` Thomas Monjalon
2016-05-18  8:06       ` Sergio Gonzalez Monroy
2016-05-18  9:38         ` David Marchand
2016-05-19  2:11         ` Tan, Jianfeng
2016-05-31  3:37 ` Jianfeng Tan [this message]
2016-06-06  2:49   ` [PATCH v5] eal: fix allocating all free hugepages Pei, Yulong
2016-06-08 11:27   ` Sergio Gonzalez Monroy
2016-06-30 13:34     ` Thomas Monjalon
2016-08-31  3:07 ` [PATCH v2] eal: restrict cores detection Jianfeng Tan
2016-08-31 15:30   ` Stephen Hemminger
2016-09-01  1:15     ` Tan, Jianfeng
2016-09-01  1:31 ` [PATCH v3] " Jianfeng Tan
2016-09-02 16:53   ` Bruce Richardson
2016-09-16 14:04     ` Thomas Monjalon
2016-09-16 14:02   ` Thomas Monjalon
2016-12-02 17:48   ` [PATCH v4] eal: restrict cores auto detection Jianfeng Tan
2016-12-08 18:19     ` Thomas Monjalon
2016-12-09 15:14       ` Bruce Richardson
2016-12-21 14:31         ` Thomas Monjalon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1464665827-24965-1-git-send-email-jianfeng.tan@intel.com \
    --to=jianfeng.tan@intel.com \
    --cc=david.marchand@6wind.com \
    --cc=dev@dpdk.org \
    --cc=nhorman@tuxdriver.com \
    --cc=sergio.gonzalez.monroy@intel.com \
    --cc=thomas.monjalon@6wind.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.