* [PATCH 0/2] Fix file locking in EAL memory
@ 2018-04-19 12:26 Anatoly Burakov
  2018-04-19 12:26 ` [PATCH 1/2] mem: add memalloc init stage Anatoly Burakov
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Anatoly Burakov @ 2018-04-19 12:26 UTC (permalink / raw)
  To: dev

This patchset replaces the fcntl()-based locking used in
the original DPDK memory hotplug patchset with an flock()-
and lockfile-based implementation, due to numerous (well,
one, really) problems with how fcntl() locks work.

Long story short, fcntl() locks are dropped if *any* fd
referring to the locked file is closed - even if it's not
the last fd, even if it wasn't the fd that was used to
lock the file in the first place, and even if it wasn't
you who closed that fd, but some other library.

This patchset corrects this saddening design defect in the
original implementation.

One way to work around this would have been OFD locks,
but they are only supported on kernels 3.15+, so we cannot
rely on them if we want to support old kernels. Hence, we
use per-segment lockfiles. We end up opening no more file
descriptors than in the non-single-file-segments case - we
still open the same number of files (one per page), plus
one file per memseg list.

Additionally, since upgrading an flock() from shared to
exclusive is not atomic, we also lock the hugepage dir to
prevent multiple processes from concurrently performing
operations on hugetlbfs mounts.

If you know of a more enlightened way of fixing this
limitation, you are certainly welcome to comment :)

Anatoly Burakov (2):
  mem: add memalloc init stage
  mem: revert to using flock() and add per-segment lockfiles

 lib/librte_eal/bsdapp/eal/eal_memalloc.c        |   6 +
 lib/librte_eal/common/eal_common_memory.c       |   3 +
 lib/librte_eal/common/eal_filesystem.h          |  18 +
 lib/librte_eal/common/eal_memalloc.h            |   3 +
 lib/librte_eal/linuxapp/eal/eal_hugepage_info.c |  28 +-
 lib/librte_eal/linuxapp/eal/eal_memalloc.c      | 604 +++++++++++++++---------
 lib/librte_eal/linuxapp/eal/eal_memory.c        |  22 +-
 7 files changed, 420 insertions(+), 264 deletions(-)

-- 
2.7.4


* [PATCH 1/2] mem: add memalloc init stage
  2018-04-19 12:26 [PATCH 0/2] Fix file locking in EAL memory Anatoly Burakov
@ 2018-04-19 12:26 ` Anatoly Burakov
  2018-04-24 14:06   ` Bruce Richardson
  2018-04-19 12:26 ` [PATCH 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Anatoly Burakov @ 2018-04-19 12:26 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson

Currently, memseg lists for a secondary process are allocated
during the first sync (triggered by init), when they are
accessed for the first time. Move this initialization to a
separate memalloc init stage instead.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/librte_eal/bsdapp/eal/eal_memalloc.c   |  6 +++
 lib/librte_eal/common/eal_common_memory.c  |  3 ++
 lib/librte_eal/common/eal_memalloc.h       |  3 ++
 lib/librte_eal/linuxapp/eal/eal_memalloc.c | 66 ++++++++++++++++++------------
 4 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/lib/librte_eal/bsdapp/eal/eal_memalloc.c b/lib/librte_eal/bsdapp/eal/eal_memalloc.c
index 461732f..f7f07ab 100644
--- a/lib/librte_eal/bsdapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/bsdapp/eal/eal_memalloc.c
@@ -46,3 +46,9 @@ eal_memalloc_sync_with_primary(void)
 	RTE_LOG(ERR, EAL, "Memory hotplug not supported on FreeBSD\n");
 	return -1;
 }
+
+int
+eal_memalloc_init(void)
+{
+	return 0;
+}
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 24a9ed5..dd9062d 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -864,6 +864,9 @@ rte_eal_memory_init(void)
 	if (retval < 0)
 		goto fail;
 
+	if (eal_memalloc_init() < 0)
+		goto fail;
+
 	retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
 			rte_eal_hugepage_init() :
 			rte_eal_hugepage_attach();
diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h
index 6736fa3..662b3b5 100644
--- a/lib/librte_eal/common/eal_memalloc.h
+++ b/lib/librte_eal/common/eal_memalloc.h
@@ -76,4 +76,7 @@ eal_memalloc_mem_alloc_validator_unregister(const char *name, int socket_id);
 int
 eal_memalloc_mem_alloc_validate(int socket_id, size_t new_len);
 
+int
+eal_memalloc_init(void);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
index 1f553dd..162306a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -1060,33 +1060,11 @@ sync_walk(const struct rte_memseg_list *msl, void *arg __rte_unused)
 	struct hugepage_info *hi = NULL;
 	unsigned int i;
 	int msl_idx;
-	bool new_msl = false;
 
 	msl_idx = msl - mcfg->memsegs;
 	primary_msl = &mcfg->memsegs[msl_idx];
 	local_msl = &local_memsegs[msl_idx];
 
-	/* check if secondary has this memseg list set up */
-	if (local_msl->base_va == NULL) {
-		char name[PATH_MAX];
-		int ret;
-		new_msl = true;
-
-		/* create distinct fbarrays for each secondary */
-		snprintf(name, RTE_FBARRAY_NAME_LEN, "%s_%i",
-			primary_msl->memseg_arr.name, getpid());
-
-		ret = rte_fbarray_init(&local_msl->memseg_arr, name,
-			primary_msl->memseg_arr.len,
-			primary_msl->memseg_arr.elt_sz);
-		if (ret < 0) {
-			RTE_LOG(ERR, EAL, "Cannot initialize local memory map\n");
-			return -1;
-		}
-
-		local_msl->base_va = primary_msl->base_va;
-	}
-
 	for (i = 0; i < RTE_DIM(internal_config.hugepage_info); i++) {
 		uint64_t cur_sz =
 			internal_config.hugepage_info[i].hugepage_sz;
@@ -1101,10 +1079,8 @@ sync_walk(const struct rte_memseg_list *msl, void *arg __rte_unused)
 		return -1;
 	}
 
-	/* if versions don't match or if we have just allocated a new
-	 * memseg list, synchronize everything
-	 */
-	if ((new_msl || local_msl->version != primary_msl->version) &&
+	/* if versions don't match, synchronize everything */
+	if (local_msl->version != primary_msl->version &&
 			sync_existing(primary_msl, local_msl, hi, msl_idx))
 		return -1;
 	return 0;
@@ -1122,3 +1098,41 @@ eal_memalloc_sync_with_primary(void)
 		return -1;
 	return 0;
 }
+
+static int
+secondary_msl_create_walk(const struct rte_memseg_list *msl,
+		void *arg __rte_unused)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *primary_msl, *local_msl;
+	char name[PATH_MAX];
+	int msl_idx, ret;
+
+	msl_idx = msl - mcfg->memsegs;
+	primary_msl = &mcfg->memsegs[msl_idx];
+	local_msl = &local_memsegs[msl_idx];
+
+	/* create distinct fbarrays for each secondary */
+	snprintf(name, RTE_FBARRAY_NAME_LEN, "%s_%i",
+		primary_msl->memseg_arr.name, getpid());
+
+	ret = rte_fbarray_init(&local_msl->memseg_arr, name,
+		primary_msl->memseg_arr.len,
+		primary_msl->memseg_arr.elt_sz);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "Cannot initialize local memory map\n");
+		return -1;
+	}
+	local_msl->base_va = primary_msl->base_va;
+
+	return 0;
+}
+
+int
+eal_memalloc_init(void)
+{
+	if (rte_eal_process_type() == RTE_PROC_SECONDARY)
+		if (rte_memseg_list_walk(secondary_msl_create_walk, NULL) < 0)
+			return -1;
+	return 0;
+}
-- 
2.7.4


* [PATCH 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-19 12:26 [PATCH 0/2] Fix file locking in EAL memory Anatoly Burakov
  2018-04-19 12:26 ` [PATCH 1/2] mem: add memalloc init stage Anatoly Burakov
@ 2018-04-19 12:26 ` Anatoly Burakov
  2018-04-24 13:57   ` Bruce Richardson
  2018-04-24 14:07   ` Bruce Richardson
  2018-04-24 15:54 ` [PATCH v2 0/2] Fix file locking in EAL memory Anatoly Burakov
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Anatoly Burakov @ 2018-04-19 12:26 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

The original implementation used flock() locks, but was later
switched to using fcntl() locks for page locking, because
fcntl() locks allow locking parts of a file, which is useful
for single-file segments mode, where locking the entire file
isn't an option because we still need to grow and shrink it.

However, according to fcntl()'s Ubuntu manpage [1], semantics of
fcntl() locks have a giant oversight:

  This interface follows the completely stupid semantics of System
  V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
  locks associated with a file for a given process are removed
  when any file descriptor for that file is closed by that process.
  This semantic means that applications must be aware of any files
  that a subroutine library may access.

Basically, closing *any* fd with an fcntl() lock (which we do because
we don't want to leak fd's) will drop the lock completely.

So, in this commit, we revert back to using flock() locks
everywhere. However, that still leaves the problem of locking
parts of a memseg list file in single-file segments mode, and
we solve it by creating a separate lock file for each page,
and tracking those with flock().

We also remove all of the tailq business and replace it with
a simple array - saving a few bytes is not worth the extra
hassle of dealing with pointers and potential memory
allocation failures. The tailq lock is removed as well, since
it is not needed - these fd lists are per-process, and within
a given process, only one thread ever handles access to
hugetlbfs.

So, the first process to allocate a segment will create a
lockfile and put a shared lock on it. When we're shrinking
the page file, we try to take out a write lock on that
lockfile, which will fail if any other process is also
holding a lock on it. This way, we can know whether we can
shrink the segment file. Also, if no other locks are found in
the lock list for a given memseg list, the memseg list fd is
automatically closed.

One other thing to note is that, according to the flock()
Ubuntu manpage [2], upgrading a lock from shared to exclusive
is implemented by dropping and reacquiring the lock, which is
not atomic and would thus create race conditions. So, when
performing operations in hugetlbfs, we take out a write lock
on the hugetlbfs directory, so that only one process can
perform hugetlbfs operations at a time.

[1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
[2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html

Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
Fixes: 582bed1e1d1d ("mem: support mapping hugepages at runtime")
Fixes: a5ff05d60fc5 ("mem: support unmapping pages at runtime")
Fixes: 2a04139f66b4 ("eal: add single file segments option")
Cc: anatoly.burakov@intel.com

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    We could've used OFD locks [1] instead of flock(), but they are
    only supported on kernel 3.15+ [2], so we're forced to use this
    flock()-based contraption to support older kernels.
    
    [1] https://www.gnu.org/software/libc/manual/html_node/Open-File-Description-Locks.html
    [2] http://man7.org/linux/man-pages/man2/fcntl.2.html

 lib/librte_eal/common/eal_filesystem.h          |  18 +
 lib/librte_eal/linuxapp/eal/eal_hugepage_info.c |  28 +-
 lib/librte_eal/linuxapp/eal/eal_memalloc.c      | 538 +++++++++++++++---------
 lib/librte_eal/linuxapp/eal/eal_memory.c        |  22 +-
 4 files changed, 368 insertions(+), 238 deletions(-)

diff --git a/lib/librte_eal/common/eal_filesystem.h b/lib/librte_eal/common/eal_filesystem.h
index ad059ef..31d0860 100644
--- a/lib/librte_eal/common/eal_filesystem.h
+++ b/lib/librte_eal/common/eal_filesystem.h
@@ -115,6 +115,24 @@ eal_get_hugefile_path(char *buffer, size_t buflen, const char *hugedir, int f_id
 	return buffer;
 }
 
+/** String format for hugepage map lock files. */
+#define HUGEFILE_LOCK_FMT "%s/%smap_%d_%d.lock"
+
+static inline const char *
+eal_get_hugefile_lock_path(char *buffer, size_t buflen, int list_idx,
+		int seg_idx)
+{
+	const char *directory = default_config_dir;
+	const char *home_dir = getenv("HOME");
+
+	if (getuid() != 0 && home_dir != NULL)
+		directory = home_dir;
+	snprintf(buffer, buflen - 1, HUGEFILE_LOCK_FMT, directory,
+			internal_config.hugefile_prefix, list_idx, seg_idx);
+	buffer[buflen - 1] = '\0';
+	return buffer;
+}
+
 /** define the default filename prefix for the %s values above */
 #define HUGEFILE_PREFIX_DEFAULT "rte"
 
diff --git a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
index db5aabd..7eca711 100644
--- a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
+++ b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
@@ -237,18 +237,6 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 }
 
 /*
- * uses fstat to report the size of a file on disk
- */
-static off_t
-get_file_size(int fd)
-{
-	struct stat st;
-	if (fstat(fd, &st) < 0)
-		return 0;
-	return st.st_size;
-}
-
-/*
  * Clear the hugepage directory of whatever hugepage files
  * there are. Checks if the file is locked (i.e.
  * if it's in use by another DPDK process).
@@ -278,8 +266,6 @@ clear_hugedir(const char * hugedir)
 	}
 
 	while(dirent != NULL){
-		struct flock lck = {0};
-
 		/* skip files that don't match the hugepage pattern */
 		if (fnmatch(filter, dirent->d_name, 0) > 0) {
 			dirent = readdir(dir);
@@ -296,19 +282,11 @@ clear_hugedir(const char * hugedir)
 		}
 
 		/* non-blocking lock */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = get_file_size(fd);
+		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
-		lck_result = fcntl(fd, F_SETLK, &lck);
-
-		/* if lock succeeds, unlock and remove the file */
-		if (lck_result != -1) {
-			lck.l_type = F_UNLCK;
-			fcntl(fd, F_SETLK, &lck);
+		/* if lock succeeds, remove the file */
+		if (lck_result != -1)
 			unlinkat(dir_fd, dirent->d_name, 0);
-		}
 		close (fd);
 		dirent = readdir(dir);
 	}
diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
index 162306a..9161496 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -46,24 +46,25 @@
  */
 static int fallocate_supported = -1; /* unknown */
 
-/*
- * If each page is in a separate file, we can close fd's since we need each fd
- * only once. However, in single file segments mode, we can get away with using
- * a single fd for entire segments, but we need to store them somewhere. Each
- * fd is different within each process, so we'll store them in a local tailq.
+/* for single-file segments, we need some kind of mechanism to keep track of
+ * which hugepages can be freed back to the system, and which cannot. we cannot
+ * use flock() because they don't allow locking parts of a file, and we cannot
+ * use fcntl() due to issues with their semantics, so we will have to rely on a
+ * bunch of lockfiles for each page.
+ *
+ * we cannot know how many pages a system will have in advance, but we do know
+ * that they come in lists, and we know lengths of these lists. so, simply store
+ * a malloc'd array of fd's indexed by list and segment index.
+ *
+ * they will be initialized at startup, and filled as we allocate/deallocate
+ * segments. also, use this to track memseg list proper fd.
  */
-struct msl_entry {
-	TAILQ_ENTRY(msl_entry) next;
-	unsigned int msl_idx;
-	int fd;
-};
-
-/** Double linked list of memseg list fd's. */
-TAILQ_HEAD(msl_entry_list, msl_entry);
-
-static struct msl_entry_list msl_entry_list =
-		TAILQ_HEAD_INITIALIZER(msl_entry_list);
-static rte_spinlock_t tailq_lock = RTE_SPINLOCK_INITIALIZER;
+static struct {
+	int *fds; /**< dynamically allocated array of segment lock fd's */
+	int memseg_list_fd; /**< memseg list fd */
+	int len; /**< total length of the array */
+	int count; /**< entries used in an array */
+} lock_fds[RTE_MAX_MEMSEG_LISTS];
 
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
@@ -158,35 +159,6 @@ resotre_numa(int *oldpolicy, struct bitmask *oldmask)
 }
 #endif
 
-static struct msl_entry *
-get_msl_entry_by_idx(unsigned int list_idx)
-{
-	struct msl_entry *te;
-
-	rte_spinlock_lock(&tailq_lock);
-
-	TAILQ_FOREACH(te, &msl_entry_list, next) {
-		if (te->msl_idx == list_idx)
-			break;
-	}
-	if (te == NULL) {
-		/* doesn't exist, so create it and set fd to -1 */
-
-		te = malloc(sizeof(*te));
-		if (te == NULL) {
-			RTE_LOG(ERR, EAL, "%s(): cannot allocate tailq entry for memseg list\n",
-				__func__);
-			goto unlock;
-		}
-		te->msl_idx = list_idx;
-		te->fd = -1;
-		TAILQ_INSERT_TAIL(&msl_entry_list, te, next);
-	}
-unlock:
-	rte_spinlock_unlock(&tailq_lock);
-	return te;
-}
-
 /*
  * uses fstat to report the size of a file on disk
  */
@@ -199,19 +171,6 @@ get_file_size(int fd)
 	return st.st_size;
 }
 
-/*
- * uses fstat to check if file size on disk is zero (regular fstat won't show
- * true file size due to how fallocate works)
- */
-static bool
-is_zero_length(int fd)
-{
-	struct stat st;
-	if (fstat(fd, &st) < 0)
-		return false;
-	return st.st_blocks == 0;
-}
-
 /* we cannot use rte_memseg_list_walk() here because we will be holding a
  * write lock whenever we enter every function in this file, however copying
  * the same iteration code everywhere is not ideal as well. so, use a lockless
@@ -238,6 +197,101 @@ memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg)
 	return 0;
 }
 
+/* returns 1 on successful lock, 0 on unsuccessful lock, -1 on error */
+static int lock(int fd, int type)
+{
+	int ret;
+
+	/* flock may be interrupted */
+	do {
+		ret = flock(fd, type | LOCK_NB);
+	} while (ret && errno == EINTR);
+
+	if (ret && errno == EWOULDBLOCK) {
+		/* couldn't lock */
+		return 0;
+	} else if (ret) {
+		RTE_LOG(ERR, EAL, "%s(): error calling flock(): %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+	/* lock was successful */
+	return 1;
+}
+
+static int get_lockfile(int list_idx, int seg_idx)
+{
+	char path[PATH_MAX] = {0};
+	int fd;
+
+	if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds))
+		return -1;
+	if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len)
+		return -1;
+
+	fd = lock_fds[list_idx].fds[seg_idx];
+	/* does this lock already exist? */
+	if (fd >= 0)
+		return fd;
+
+	eal_get_hugefile_lock_path(path, sizeof(path), list_idx, seg_idx);
+
+	fd = open(path, O_CREAT | O_RDWR, 0660);
+	if (fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): error creating lockfile '%s': %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) != 1) {
+		RTE_LOG(ERR, EAL, "%s(): failed to take out a readlock on '%s': %s\n",
+			__func__, path, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	/* store it for future reference */
+	lock_fds[list_idx].fds[seg_idx] = fd;
+	lock_fds[list_idx].count++;
+	return fd;
+}
+
+static int put_lockfile(int list_idx, int seg_idx)
+{
+	int fd, ret;
+
+	if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds))
+		return -1;
+	if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len)
+		return -1;
+
+	fd = lock_fds[list_idx].fds[seg_idx];
+
+	/* upgrade lock to exclusive to see if we can remove the lockfile */
+	ret = lock(fd, LOCK_EX);
+	if (ret == 1) {
+		/* we've succeeded in taking exclusive lock, this lockfile may
+		 * be removed.
+		 */
+		char path[PATH_MAX] = {0};
+		eal_get_hugefile_lock_path(path, sizeof(path), list_idx,
+				seg_idx);
+		if (unlink(path)) {
+			RTE_LOG(ERR, EAL, "%s(): error removing lockfile '%s': %s\n",
+					__func__, path, strerror(errno));
+		}
+	}
+	/* we don't want to leak the fd, so even if we fail to lock, close fd
+	 * and remove it from list anyway.
+	 */
+	close(fd);
+	lock_fds[list_idx].fds[seg_idx] = -1;
+	lock_fds[list_idx].count--;
+
+	if (ret < 0)
+		return -1;
+	return 0;
+}
+
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		unsigned int list_idx, unsigned int seg_idx)
@@ -245,31 +299,29 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 	int fd;
 
 	if (internal_config.single_file_segments) {
-		/*
-		 * try to find a tailq entry, for this memseg list, or create
-		 * one if it doesn't exist.
-		 */
-		struct msl_entry *te = get_msl_entry_by_idx(list_idx);
-		if (te == NULL) {
-			RTE_LOG(ERR, EAL, "%s(): cannot allocate tailq entry for memseg list\n",
-				__func__);
-			return -1;
-		} else if (te->fd < 0) {
-			/* create a hugepage file */
-			eal_get_hugefile_path(path, buflen, hi->hugedir,
-					list_idx);
+		/* create a hugepage file path */
+		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
+
+		fd = lock_fds[list_idx].memseg_list_fd;
+
+		if (fd < 0) {
 			fd = open(path, O_CREAT | O_RDWR, 0600);
 			if (fd < 0) {
-				RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
+				RTE_LOG(ERR, EAL, "%s(): open failed: %s\n",
 					__func__, strerror(errno));
 				return -1;
 			}
-			te->fd = fd;
-		} else {
-			fd = te->fd;
+			/* take out a read lock and keep it indefinitely */
+			if (lock(fd, LOCK_SH) < 0) {
+				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+					__func__, strerror(errno));
+				close(fd);
+				return -1;
+			}
+			lock_fds[list_idx].memseg_list_fd = fd;
 		}
 	} else {
-		/* one file per page, just create it */
+		/* create a hugepage file path */
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
 		fd = open(path, O_CREAT | O_RDWR, 0600);
@@ -278,48 +330,27 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 					strerror(errno));
 			return -1;
 		}
+		/* take out a read lock */
+		if (lock(fd, LOCK_SH) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+				__func__, strerror(errno));
+			close(fd);
+			return -1;
+		}
 	}
 	return fd;
 }
 
-/* returns 1 on successful lock, 0 on unsuccessful lock, -1 on error */
-static int lock(int fd, uint64_t offset, uint64_t len, int type)
-{
-	struct flock lck;
-	int ret;
-
-	memset(&lck, 0, sizeof(lck));
-
-	lck.l_type = type;
-	lck.l_whence = SEEK_SET;
-	lck.l_start = offset;
-	lck.l_len = len;
-
-	ret = fcntl(fd, F_SETLK, &lck);
-
-	if (ret && (errno == EAGAIN || errno == EACCES)) {
-		/* locked by another process, not an error */
-		return 0;
-	} else if (ret) {
-		RTE_LOG(ERR, EAL, "%s(): error calling fcntl(): %s\n",
-			__func__, strerror(errno));
-		/* we've encountered an unexpected error */
-		return -1;
-	}
-	return 1;
-}
-
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+resize_hugefile(int fd, char *path, int list_idx, int seg_idx,
+		uint64_t fa_offset, uint64_t page_sz, bool grow)
 {
 	bool again = false;
 	do {
 		if (fallocate_supported == 0) {
 			/* we cannot deallocate memory if fallocate() is not
-			 * supported, but locks are still needed to prevent
-			 * primary process' initialization from clearing out
-			 * huge pages used by this process.
+			 * supported, and hugepage file is already locked at
+			 * creation, so no further synchronization needed.
 			 */
 
 			if (!grow) {
@@ -337,13 +368,10 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
 					__func__, strerror(errno));
 				return -1;
 			}
-			/* not being able to take out a read lock is an error */
-			if (lock(fd, fa_offset, page_sz, F_RDLCK) != 1)
-				return -1;
 		} else {
 			int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE |
 					FALLOC_FL_KEEP_SIZE;
-			int ret;
+			int ret, lock_fd;
 
 			/* if fallocate() is supported, we need to take out a
 			 * read lock on allocate (to prevent other processes
@@ -351,20 +379,69 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
 			 * lock on deallocate (to ensure nobody else is using
 			 * this page).
 			 *
-			 * we can't use flock() for this, as we actually need to
-			 * lock part of the file, not the entire file.
+			 * read locks on page itself are already taken out at
+			 * file creation, in get_seg_fd().
+			 *
+			 * we cannot rely on simple use of flock() call, because
+			 * we need to be able to lock a section of the file,
+			 * and we cannot use fcntl() locks, because of numerous
+			 * problems with their semantics, so we will use
+			 * deterministically named lock files for each section
+			 * of the file.
+			 *
+			 * if we're shrinking the file, we want to upgrade our
+			 * lock from shared to exclusive.
+			 *
+			 * lock_fd is an fd for a lockfile, not for the segment
+			 * list.
 			 */
+			lock_fd = get_lockfile(list_idx, seg_idx);
 
 			if (!grow) {
-				ret = lock(fd, fa_offset, page_sz, F_WRLCK);
+				/* we are using this lockfile to determine
+				 * whether this particular page is locked, as we
+				 * are in single file segments mode and thus
+				 * cannot use regular flock() to get this info.
+				 *
+				 * we want to try and take out an exclusive lock
+				 * on the lock file to determine if we're the
+				 * last ones using this page, and if not, we
+				 * won't be shrinking it, and will instead exit
+				 * prematurely.
+				 */
+				ret = lock(lock_fd, LOCK_EX);
+
+				/* drop the lock on the lockfile, so that even
+				 * if we couldn't shrink the file ourselves, we
+				 * are signalling to other processes that we're
+				 * no longer using this page.
+				 */
+				if (put_lockfile(list_idx, seg_idx))
+					RTE_LOG(ERR, EAL, "Could not unlock segment\n");
+
+				/* additionally, if this was the last lock on
+				 * this segment list, we can safely close the
+				 * page file fd, so that one of the processes
+				 * could then delete the file after shrinking.
+				 */
+				if (ret < 1 && lock_fds[list_idx].count == 0) {
+					close(fd);
+					lock_fds[list_idx].memseg_list_fd = -1;
+				}
 
-				if (ret < 0)
+				if (ret < 0) {
+					RTE_LOG(ERR, EAL, "Could not lock segment\n");
 					return -1;
-				else if (ret == 0)
-					/* failed to lock, not an error */
+				}
+				if (ret == 0)
+					/* failed to lock, not an error. */
 					return 0;
 			}
-			if (fallocate(fd, flags, fa_offset, page_sz) < 0) {
+
+			/* grow or shrink the file */
+			ret = fallocate(fd, flags, fa_offset, page_sz);
+
+			if (ret < 0) {
 				if (fallocate_supported == -1 &&
 						errno == ENOTSUP) {
 					RTE_LOG(ERR, EAL, "%s(): fallocate() not supported, hugepage deallocation will be disabled\n",
@@ -380,16 +457,18 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
 			} else {
 				fallocate_supported = 1;
 
-				if (grow) {
-					/* if can't read lock, it's an error */
-					if (lock(fd, fa_offset, page_sz,
-							F_RDLCK) != 1)
-						return -1;
-				} else {
-					/* if can't unlock, it's an error */
-					if (lock(fd, fa_offset, page_sz,
-							F_UNLCK) != 1)
-						return -1;
+				/* we've grew/shrunk the file, and we hold an
+				 * exclusive lock now. check if there are no
+				 * more segments active in this segment list,
+				 * and remove the file if there aren't.
+				 */
+				if (lock_fds[list_idx].count == 0) {
+					if (unlink(path))
+						RTE_LOG(ERR, EAL, "%s(): unlinking '%s' failed: %s\n",
+							__func__, path,
+							strerror(errno));
+					close(fd);
+					lock_fds[list_idx].memseg_list_fd = -1;
 				}
 			}
 		}
@@ -411,15 +490,19 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	int fd;
 	size_t alloc_sz;
 
+	/* takes out a read lock on segment or segment list */
 	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
-	if (fd < 0)
+	if (fd < 0) {
+		RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 		return -1;
+	}
 
 	alloc_sz = hi->hugepage_sz;
 	if (internal_config.single_file_segments) {
 		map_offset = seg_idx * alloc_sz;
-		ret = resize_hugefile(fd, map_offset, alloc_sz, true);
-		if (ret < 1)
+		ret = resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
+				alloc_sz, true);
+		if (ret < 0)
 			goto resized;
 	} else {
 		map_offset = 0;
@@ -428,28 +511,6 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 				__func__, strerror(errno));
 			goto resized;
 		}
-		/* we've allocated a page - take out a read lock. we're using
-		 * fcntl() locks rather than flock() here because doing that
-		 * gives us one huge advantage - fcntl() locks are per-process,
-		 * not per-file descriptor, which means that we don't have to
-		 * keep the original fd's around to keep a lock on the file.
-		 *
-		 * this is useful, because when it comes to unmapping pages, we
-		 * will have to take out a write lock (to figure out if another
-		 * process still has this page mapped), and to do itwith flock()
-		 * we'll have to use original fd, as lock is associated with
-		 * that particular fd. with fcntl(), this is not necessary - we
-		 * can open a new fd and use fcntl() on that.
-		 */
-		ret = lock(fd, map_offset, alloc_sz, F_RDLCK);
-
-		/* this should not fail */
-		if (ret != 1) {
-			RTE_LOG(ERR, EAL, "%s(): error locking file: %s\n",
-				__func__,
-				strerror(errno));
-			goto resized;
-		}
 	}
 
 	/*
@@ -518,19 +579,14 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	munmap(addr, alloc_sz);
 resized:
 	if (internal_config.single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
-		if (is_zero_length(fd)) {
-			struct msl_entry *te = get_msl_entry_by_idx(list_idx);
-			if (te != NULL && te->fd >= 0) {
-				close(te->fd);
-				te->fd = -1;
-			}
-			/* ignore errors, can't make it any worse */
-			unlink(path);
-		}
+		resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
+				alloc_sz, false);
+		/* ignore failure, can't make it any worse */
 	} else {
+		/* only remove file if we can take out a write lock */
+		if (lock(fd, LOCK_EX) == 1)
+			unlink(path);
 		close(fd);
-		unlink(path);
 	}
 	return -1;
 }
@@ -546,6 +602,14 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	/* erase page data */
 	memset(ms->addr, 0, ms->len);
 
+	/* if we are not in single file segments mode, we're going to unmap the
+	 * segment and thus drop the lock on original fd, so take out another
+	 * shared lock before we do that.
+	 */
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	if (fd < 0)
+		return -1;
+
 	if (mmap(ms->addr, ms->len, PROT_READ,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
 				MAP_FAILED) {
@@ -553,41 +617,23 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
-	if (fd < 0)
-		return -1;
-
 	if (internal_config.single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
+				ms->len, false))
 			return -1;
-		/* if file is zero-length, we've already shrunk it, so it's
-		 * safe to remove.
-		 */
-		if (is_zero_length(fd)) {
-			struct msl_entry *te = get_msl_entry_by_idx(list_idx);
-			if (te != NULL && te->fd >= 0) {
-				close(te->fd);
-				te->fd = -1;
-			}
-			unlink(path);
-		}
 		ret = 0;
 	} else {
 		/* if we're able to take out a write lock, we're the last one
 		 * holding onto this page.
 		 */
-
-		ret = lock(fd, 0, ms->len, F_WRLCK);
+		ret = lock(fd, LOCK_EX);
 		if (ret >= 0) {
 			/* no one else is using this page */
 			if (ret == 1)
 				unlink(path);
-			ret = lock(fd, 0, ms->len, F_UNLCK);
-			if (ret != 1)
-				RTE_LOG(ERR, EAL, "%s(): unable to unlock file %s\n",
-					__func__, path);
 		}
+		/* closing fd will drop the lock */
 		close(fd);
 	}
 
@@ -612,7 +658,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	struct alloc_walk_param *wa = arg;
 	struct rte_memseg_list *cur_msl;
 	size_t page_sz;
-	int cur_idx, start_idx, j;
+	int cur_idx, start_idx, j, dir_fd;
 	unsigned int msl_idx, need, i;
 
 	if (msl->page_sz != wa->page_sz)
@@ -633,6 +679,25 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 		return 0;
 	start_idx = cur_idx;
 
+	/* do not allow any page allocations during the time we're allocating,
+	 * because file creation and locking operations are not atomic,
+	 * and we might be the first or the last ones to use a particular page,
+	 * so we need to ensure atomicity of every operation.
+	 */
+	dir_fd = open(wa->hi->hugedir, O_RDONLY);
+	if (dir_fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		return -1;
+	}
+	/* blocking writelock */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot lock '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		close(dir_fd);
+		return -1;
+	}
+
 	for (i = 0; i < need; i++, cur_idx++) {
 		struct rte_memseg *cur;
 		void *map_addr;
@@ -668,6 +733,8 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 			/* clear the list */
 			if (wa->ms)
 				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			close(dir_fd);
 			return -1;
 		}
 		if (wa->ms)
@@ -679,6 +746,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	wa->segs_allocated = i;
 	if (i > 0)
 		cur_msl->version++;
+	close(dir_fd);
 	return 1;
 }
 
@@ -693,7 +761,7 @@ free_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	struct rte_memseg_list *found_msl;
 	struct free_walk_param *wa = arg;
 	uintptr_t start_addr, end_addr;
-	int msl_idx, seg_idx;
+	int msl_idx, seg_idx, ret, dir_fd;
 
 	start_addr = (uintptr_t) msl->base_va;
 	end_addr = start_addr + msl->memseg_arr.len * (size_t)msl->page_sz;
@@ -708,11 +776,34 @@ free_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	/* msl is const */
 	found_msl = &mcfg->memsegs[msl_idx];
 
+	/* do not allow any page allocations during the time we're freeing,
+	 * because file creation and locking operations are not atomic,
+	 * and we might be the first or the last ones to use a particular page,
+	 * so we need to ensure atomicity of every operation.
+	 */
+	dir_fd = open(wa->hi->hugedir, O_RDONLY);
+	if (dir_fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		return -1;
+	}
+	/* blocking writelock */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot lock '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		close(dir_fd);
+		return -1;
+	}
+
 	found_msl->version++;
 
 	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
 
-	if (free_seg(wa->ms, wa->hi, msl_idx, seg_idx))
+	ret = free_seg(wa->ms, wa->hi, msl_idx, seg_idx);
+
+	close(dir_fd);
+
+	if (ret < 0)
 		return -1;
 
 	return 1;
@@ -1034,22 +1125,46 @@ sync_existing(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
 		unsigned int msl_idx)
 {
-	int ret;
+	int ret, dir_fd;
+
+	/* do not allow any page allocations during the time we're allocating,
+	 * because file creation and locking operations are not atomic,
+	 * and we might be the first or the last ones to use a particular page,
+	 * so we need to ensure atomicity of every operation.
+	 */
+	dir_fd = open(hi->hugedir, O_RDONLY);
+	if (dir_fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", __func__,
+			hi->hugedir, strerror(errno));
+		return -1;
+	}
+	/* blocking writelock */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot lock '%s': %s\n", __func__,
+			hi->hugedir, strerror(errno));
+		close(dir_fd);
+		return -1;
+	}
 
 	/* ensure all allocated space is the same in both lists */
 	ret = sync_status(primary_msl, local_msl, hi, msl_idx, true);
 	if (ret < 0)
-		return -1;
+		goto fail;
 
 	/* ensure all unallocated space is the same in both lists */
 	ret = sync_status(primary_msl, local_msl, hi, msl_idx, false);
 	if (ret < 0)
-		return -1;
+		goto fail;
 
 	/* update version number */
 	local_msl->version = primary_msl->version;
 
+	close(dir_fd);
+
 	return 0;
+fail:
+	close(dir_fd);
+	return -1;
 }
 
 static int
@@ -1128,11 +1243,46 @@ secondary_msl_create_walk(const struct rte_memseg_list *msl,
 	return 0;
 }
 
+static int
+secondary_lock_list_create_walk(const struct rte_memseg_list *msl,
+		void *arg __rte_unused)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	unsigned int i, len;
+	int msl_idx;
+	int *data;
+
+	msl_idx = msl - mcfg->memsegs;
+	len = msl->memseg_arr.len;
+
+	/* ensure we have space to store lock fd per each possible segment */
+	data = malloc(sizeof(int) * len);
+	if (data == NULL) {
+		RTE_LOG(ERR, EAL, "Unable to allocate space for lock descriptors\n");
+		return -1;
+	}
+	/* set all fd's as invalid */
+	for (i = 0; i < len; i++)
+		data[i] = -1;
+
+	lock_fds[msl_idx].fds = data;
+	lock_fds[msl_idx].len = len;
+	lock_fds[msl_idx].count = 0;
+	lock_fds[msl_idx].memseg_list_fd = -1;
+
+	return 0;
+}
+
 int
 eal_memalloc_init(void)
 {
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY)
 		if (rte_memseg_list_walk(secondary_msl_create_walk, NULL) < 0)
 			return -1;
+
+	/* initialize all of the lock fd lists */
+	if (internal_config.single_file_segments)
+		if (rte_memseg_list_walk(secondary_lock_list_create_walk, NULL))
+			return -1;
 	return 0;
 }
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index fadc1de..ad3add8 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -259,7 +259,6 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
 	int fd;
 	unsigned i;
 	void *virtaddr;
-	struct flock lck = {0};
 #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
 	int node_id = -1;
 	int essential_prev = 0;
@@ -378,13 +377,8 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
 		}
 		*(int *)virtaddr = 0;
 
-
 		/* set shared lock on the file. */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = hugepage_sz;
-		if (fcntl(fd, F_SETLK, &lck) == -1) {
+		if (flock(fd, LOCK_SH) < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): Locking file failed:%s \n",
 				__func__, strerror(errno));
 			close(fd);
@@ -708,7 +702,6 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
 #endif
 		struct hugepage_file *hfile = &hugepages[cur_page];
 		struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx);
-		struct flock lck;
 		void *addr;
 		int fd;
 
@@ -719,11 +712,7 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
 			return -1;
 		}
 		/* set shared lock on the file. */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = page_sz;
-		if (fcntl(fd, F_SETLK, &lck) == -1) {
+		if (flock(fd, LOCK_SH) < 0) {
 			RTE_LOG(DEBUG, EAL, "Could not lock '%s': %s\n",
 					hfile->filepath, strerror(errno));
 			close(fd);
@@ -1732,7 +1721,6 @@ eal_legacy_hugepage_attach(void)
 		struct hugepage_file *hf = &hp[i];
 		size_t map_sz = hf->size;
 		void *map_addr = hf->final_va;
-		struct flock lck;
 
 		/* if size is zero, no more pages left */
 		if (map_sz == 0)
@@ -1754,11 +1742,7 @@ eal_legacy_hugepage_attach(void)
 		}
 
 		/* set shared lock on the file. */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = map_sz;
-		if (fcntl(fd, F_SETLK, &lck) == -1) {
+		if (flock(fd, LOCK_SH) < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): Locking file failed: %s\n",
 				__func__, strerror(errno));
 			close(fd);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-19 12:26 ` [PATCH 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
@ 2018-04-24 13:57   ` Bruce Richardson
  2018-04-24 14:07   ` Bruce Richardson
  1 sibling, 0 replies; 22+ messages in thread
From: Bruce Richardson @ 2018-04-24 13:57 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev

On Thu, Apr 19, 2018 at 01:26:29PM +0100, Anatoly Burakov wrote:
> The original implementation used flock() locks, but was later
> switched to using fcntl() locks for page locking, because
> fcntl() locks allow locking parts of a file, which is useful
> for single-file segments mode, where locking the entire file
> isn't as useful because we still need to grow and shrink it.
> 
> However, according to fcntl()'s Ubuntu manpage [1], semantics of
> fcntl() locks have a giant oversight:
> 
>   This interface follows the completely stupid semantics of System
>   V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
>   locks associated with a file for a given process are removed
>   when any file descriptor for that file is closed by that process.
>   This semantic means that applications must be aware of any files
>   that a subroutine library may access.
> 
> Basically, closing *any* fd with an fcntl() lock (which we do because
> we don't want to leak fd's) will drop the lock completely.
> 
> So, in this commit, we will be reverting back to using flock() locks
> everywhere. However, that still leaves the problem of locking parts
> of a memseg list file in single file segments mode, and we will be
> solving it with creating separate lock files per each page, and
> tracking those with flock().
> 
> We will also be removing all of this tailq business and replacing it
> with a simple array - saving a few bytes is not worth the extra
> hassle of dealing with pointers and potential memory allocation
> failures. Also, remove the tailq lock since it is not needed - these
> fd lists are per-process, and within a given process, it is always
> only one thread handling access to hugetlbfs.
> 
> So, first one to allocate a segment will create a lockfile, and put
> a shared lock on it. When we're shrinking the page file, we will be
> trying to take out a write lock on that lockfile, which would fail if
> any other process is holding onto the lockfile as well. This way, we
> can know if we can shrink the segment file. Also, if no other locks
> are found in the lock list for a given memseg list, the memseg list
> fd is automatically closed.
> 
> One other thing to note is, according to flock() Ubuntu manpage [2],
> upgrading the lock from shared to exclusive is implemented by dropping
> and reacquiring the lock, which is not atomic and thus would have
> created race conditions. So, on attempting to perform operations in
> hugetlbfs, we will take out a writelock on hugetlbfs directory, so
> that only one process could perform hugetlbfs operations concurrently.
> 
> [1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
> [2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html
> 
> Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
> Fixes: 582bed1e1d1d ("mem: support mapping hugepages at runtime")
> Fixes: a5ff05d60fc5 ("mem: support unmapping pages at runtime")
> Fixes: 2a04139f66b4 ("eal: add single file segments option")
> Cc: anatoly.burakov@intel.com
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
<snip>
> +
> +static int get_lockfile(int list_idx, int seg_idx)

Minor nit:
Not sure about this name, since it returns an fd rather than a filename or
filepath. How about get_lock_fd()?

> +{
> +	char path[PATH_MAX] = {0};
> +	int fd;
> +
> +	if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds))
> +		return -1;
> +	if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len)
> +		return -1;
> +
> +	fd = lock_fds[list_idx].fds[seg_idx];
> +	/* does this lock already exist? */
> +	if (fd >= 0)
> +		return fd;
> +
> +	eal_get_hugefile_lock_path(path, sizeof(path), list_idx, seg_idx);
> +
> +	fd = open(path, O_CREAT | O_RDWR, 0660);
> +	if (fd < 0) {
> +		RTE_LOG(ERR, EAL, "%s(): error creating lockfile '%s': %s\n",
> +			__func__, path, strerror(errno));
> +		return -1;
> +	}
> +	/* take out a read lock */
> +	if (lock(fd, LOCK_SH) != 1) {
> +		RTE_LOG(ERR, EAL, "%s(): failed to take out a readlock on '%s': %s\n",
> +			__func__, path, strerror(errno));
> +		close(fd);
> +		return -1;
> +	}
> +	/* store it for future reference */
> +	lock_fds[list_idx].fds[seg_idx] = fd;
> +	lock_fds[list_idx].count++;
> +	return fd;
> +}
> +
> +static int put_lockfile(int list_idx, int seg_idx)

This name doesn't really tell me much. I realise that "put" is the opposite
of "get", but if we get a file descriptor from the previous function, it's
not exactly something we need to "put back". Can you come up with a more
descriptive name here?

> +{
> +	int fd, ret;
> +
> +	if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds))
> +		return -1;
> +	if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len)
> +		return -1;
> +
> +	fd = lock_fds[list_idx].fds[seg_idx];
> +
> +	/* upgrade lock to exclusive to see if we can remove the lockfile */
> +	ret = lock(fd, LOCK_EX);
> +	if (ret == 1) {
> +		/* we've succeeded in taking exclusive lock, this lockfile may
> +		 * be removed.
> +		 */
> +		char path[PATH_MAX] = {0};
> +		eal_get_hugefile_lock_path(path, sizeof(path), list_idx,
> +				seg_idx);
> +		if (unlink(path)) {
> +			RTE_LOG(ERR, EAL, "%s(): error removing lockfile '%s': %s\n",
> +					__func__, path, strerror(errno));
> +		}
> +	}
> +	/* we don't want to leak the fd, so even if we fail to lock, close fd
> +	 * and remove it from list anyway.
> +	 */
> +	close(fd);
> +	lock_fds[list_idx].fds[seg_idx] = -1;
> +	lock_fds[list_idx].count--;
> +
> +	if (ret < 0)
> +		return -1;
> +	return 0;
> +}
> +
<snip>

* Re: [PATCH 1/2] mem: add memalloc init stage
  2018-04-19 12:26 ` [PATCH 1/2] mem: add memalloc init stage Anatoly Burakov
@ 2018-04-24 14:06   ` Bruce Richardson
  0 siblings, 0 replies; 22+ messages in thread
From: Bruce Richardson @ 2018-04-24 14:06 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev

On Thu, Apr 19, 2018 at 01:26:28PM +0100, Anatoly Burakov wrote:
> Currently, memseg lists for secondary process are allocated on
> sync (triggered by init), when they are accessed for the first
> time. Move this initialization to a separate init stage for
> memalloc.
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
>  lib/librte_eal/bsdapp/eal/eal_memalloc.c   |  6 +++
>  lib/librte_eal/common/eal_common_memory.c  |  3 ++
>  lib/librte_eal/common/eal_memalloc.h       |  3 ++
>  lib/librte_eal/linuxapp/eal/eal_memalloc.c | 66 ++++++++++++++++++------------
>  4 files changed, 52 insertions(+), 26 deletions(-)
> 
Looks ok to me.

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

* Re: [PATCH 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-19 12:26 ` [PATCH 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
  2018-04-24 13:57   ` Bruce Richardson
@ 2018-04-24 14:07   ` Bruce Richardson
  1 sibling, 0 replies; 22+ messages in thread
From: Bruce Richardson @ 2018-04-24 14:07 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev

On Thu, Apr 19, 2018 at 01:26:29PM +0100, Anatoly Burakov wrote:
> The original implementation used flock() locks, but was later
> switched to using fcntl() locks for page locking, because
> fcntl() locks allow locking parts of a file, which is useful
> for single-file segments mode, where locking the entire file
> isn't as useful because we still need to grow and shrink it.
> 
> However, according to fcntl()'s Ubuntu manpage [1], semantics of
> fcntl() locks have a giant oversight:
> 
>   This interface follows the completely stupid semantics of System
>   V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
>   locks associated with a file for a given process are removed
>   when any file descriptor for that file is closed by that process.
>   This semantic means that applications must be aware of any files
>   that a subroutine library may access.
> 
> Basically, closing *any* fd with an fcntl() lock (which we do because
> we don't want to leak fd's) will drop the lock completely.
> 
> So, in this commit, we will be reverting back to using flock() locks
> everywhere. However, that still leaves the problem of locking parts
> of a memseg list file in single file segments mode, and we will be
> solving it with creating separate lock files per each page, and
> tracking those with flock().
> 
> We will also be removing all of this tailq business and replacing it
> with a simple array - saving a few bytes is not worth the extra
> hassle of dealing with pointers and potential memory allocation
> failures. Also, remove the tailq lock since it is not needed - these
> fd lists are per-process, and within a given process, it is always
> only one thread handling access to hugetlbfs.
> 
> So, first one to allocate a segment will create a lockfile, and put
> a shared lock on it. When we're shrinking the page file, we will be
> trying to take out a write lock on that lockfile, which would fail if
> any other process is holding onto the lockfile as well. This way, we
> can know if we can shrink the segment file. Also, if no other locks
> are found in the lock list for a given memseg list, the memseg list
> fd is automatically closed.
> 
> One other thing to note is, according to flock() Ubuntu manpage [2],
> upgrading the lock from shared to exclusive is implemented by dropping
> and reacquiring the lock, which is not atomic and thus would have
> created race conditions. So, on attempting to perform operations in
> hugetlbfs, we will take out a writelock on hugetlbfs directory, so
> that only one process could perform hugetlbfs operations concurrently.
> 
> [1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
> [2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html
> 
> Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
> Fixes: 582bed1e1d1d ("mem: support mapping hugepages at runtime")
> Fixes: a5ff05d60fc5 ("mem: support unmapping pages at runtime")
> Fixes: 2a04139f66b4 ("eal: add single file segments option")
> Cc: anatoly.burakov@intel.com
> 
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---

While the memory subsystem in DPDK has changed a lot since I last looked at
it, thereby preventing me from doing an in-depth review, this change makes
sense to me. So for this and any future versions:

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

* [PATCH v2 0/2] Fix file locking in EAL memory
  2018-04-19 12:26 [PATCH 0/2] Fix file locking in EAL memory Anatoly Burakov
  2018-04-19 12:26 ` [PATCH 1/2] mem: add memalloc init stage Anatoly Burakov
  2018-04-19 12:26 ` [PATCH 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
@ 2018-04-24 15:54 ` Anatoly Burakov
  2018-04-24 16:32   ` Stephen Hemminger
                     ` (3 more replies)
  2018-04-24 15:54 ` [PATCH v2 1/2] mem: add memalloc init stage Anatoly Burakov
  2018-04-24 15:54 ` [PATCH v2 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
  4 siblings, 4 replies; 22+ messages in thread
From: Anatoly Burakov @ 2018-04-24 15:54 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, thomas

This patchset replaces the fcntl()-based locking used in
the original DPDK memory hotplug patchset with an flock()-
and lockfile-based implementation, due to numerous (well,
one, really) problems with how fcntl() locks work.

Long story short, fcntl() locks are dropped if any fd
referring to the locked file is closed - even if it's not
the last fd, even if it wasn't the fd that was used to
lock the file in the first place, and even if it wasn't
you who closed that fd, but some other library.

This patchset corrects this saddening design defect in the
original implementation.

One of the ways to work around this was using OFD locks,
but they are only supported on kernels 3.15+, so we cannot
rely on them if we want to support old kernels. Hence, we
use per-segment lockfiles. The number of file descriptors
we open ends up no higher than in the non-single-file-segments
case - we still open the same number of files (one per
page), plus one file per memseg list.

Additionally, since upgrading an flock() lock from shared
to exclusive is not atomic, we also lock the hugepage dir
to prevent multiple processes from concurrently performing
operations on hugetlbfs mounts.

If you know of a more enlightened way of fixing this
limitation, you are certainly welcome to comment :)

v2:
- Fixes as per review comments
- Make lockfiles hidden by default

Anatoly Burakov (2):
  mem: add memalloc init stage
  mem: revert to using flock() and add per-segment lockfiles

 lib/librte_eal/bsdapp/eal/eal_memalloc.c        |   6 +
 lib/librte_eal/common/eal_common_memory.c       |   3 +
 lib/librte_eal/common/eal_filesystem.h          |  18 +
 lib/librte_eal/common/eal_memalloc.h            |   3 +
 lib/librte_eal/linuxapp/eal/eal_hugepage_info.c |  28 +-
 lib/librte_eal/linuxapp/eal/eal_memalloc.c      | 604 +++++++++++++++---------
 lib/librte_eal/linuxapp/eal/eal_memory.c        |  22 +-
 7 files changed, 420 insertions(+), 264 deletions(-)

-- 
2.7.4

* [PATCH v2 1/2] mem: add memalloc init stage
  2018-04-19 12:26 [PATCH 0/2] Fix file locking in EAL memory Anatoly Burakov
                   ` (2 preceding siblings ...)
  2018-04-24 15:54 ` [PATCH v2 0/2] Fix file locking in EAL memory Anatoly Burakov
@ 2018-04-24 15:54 ` Anatoly Burakov
  2018-04-24 15:54 ` [PATCH v2 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
  4 siblings, 0 replies; 22+ messages in thread
From: Anatoly Burakov @ 2018-04-24 15:54 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, thomas

Currently, memseg lists for secondary process are allocated on
sync (triggered by init), when they are accessed for the first
time. Move this initialization to a separate init stage for
memalloc.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_eal/bsdapp/eal/eal_memalloc.c   |  6 +++
 lib/librte_eal/common/eal_common_memory.c  |  3 ++
 lib/librte_eal/common/eal_memalloc.h       |  3 ++
 lib/librte_eal/linuxapp/eal/eal_memalloc.c | 66 ++++++++++++++++++------------
 4 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/lib/librte_eal/bsdapp/eal/eal_memalloc.c b/lib/librte_eal/bsdapp/eal/eal_memalloc.c
index 461732f..f7f07ab 100644
--- a/lib/librte_eal/bsdapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/bsdapp/eal/eal_memalloc.c
@@ -46,3 +46,9 @@ eal_memalloc_sync_with_primary(void)
 	RTE_LOG(ERR, EAL, "Memory hotplug not supported on FreeBSD\n");
 	return -1;
 }
+
+int
+eal_memalloc_init(void)
+{
+	return 0;
+}
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 24a9ed5..dd9062d 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -864,6 +864,9 @@ rte_eal_memory_init(void)
 	if (retval < 0)
 		goto fail;
 
+	if (eal_memalloc_init() < 0)
+		goto fail;
+
 	retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
 			rte_eal_hugepage_init() :
 			rte_eal_hugepage_attach();
diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h
index 6736fa3..662b3b5 100644
--- a/lib/librte_eal/common/eal_memalloc.h
+++ b/lib/librte_eal/common/eal_memalloc.h
@@ -76,4 +76,7 @@ eal_memalloc_mem_alloc_validator_unregister(const char *name, int socket_id);
 int
 eal_memalloc_mem_alloc_validate(int socket_id, size_t new_len);
 
+int
+eal_memalloc_init(void);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
index 1f553dd..162306a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -1060,33 +1060,11 @@ sync_walk(const struct rte_memseg_list *msl, void *arg __rte_unused)
 	struct hugepage_info *hi = NULL;
 	unsigned int i;
 	int msl_idx;
-	bool new_msl = false;
 
 	msl_idx = msl - mcfg->memsegs;
 	primary_msl = &mcfg->memsegs[msl_idx];
 	local_msl = &local_memsegs[msl_idx];
 
-	/* check if secondary has this memseg list set up */
-	if (local_msl->base_va == NULL) {
-		char name[PATH_MAX];
-		int ret;
-		new_msl = true;
-
-		/* create distinct fbarrays for each secondary */
-		snprintf(name, RTE_FBARRAY_NAME_LEN, "%s_%i",
-			primary_msl->memseg_arr.name, getpid());
-
-		ret = rte_fbarray_init(&local_msl->memseg_arr, name,
-			primary_msl->memseg_arr.len,
-			primary_msl->memseg_arr.elt_sz);
-		if (ret < 0) {
-			RTE_LOG(ERR, EAL, "Cannot initialize local memory map\n");
-			return -1;
-		}
-
-		local_msl->base_va = primary_msl->base_va;
-	}
-
 	for (i = 0; i < RTE_DIM(internal_config.hugepage_info); i++) {
 		uint64_t cur_sz =
 			internal_config.hugepage_info[i].hugepage_sz;
@@ -1101,10 +1079,8 @@ sync_walk(const struct rte_memseg_list *msl, void *arg __rte_unused)
 		return -1;
 	}
 
-	/* if versions don't match or if we have just allocated a new
-	 * memseg list, synchronize everything
-	 */
-	if ((new_msl || local_msl->version != primary_msl->version) &&
+	/* if versions don't match, synchronize everything */
+	if (local_msl->version != primary_msl->version &&
 			sync_existing(primary_msl, local_msl, hi, msl_idx))
 		return -1;
 	return 0;
@@ -1122,3 +1098,41 @@ eal_memalloc_sync_with_primary(void)
 		return -1;
 	return 0;
 }
+
+static int
+secondary_msl_create_walk(const struct rte_memseg_list *msl,
+		void *arg __rte_unused)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *primary_msl, *local_msl;
+	char name[PATH_MAX];
+	int msl_idx, ret;
+
+	msl_idx = msl - mcfg->memsegs;
+	primary_msl = &mcfg->memsegs[msl_idx];
+	local_msl = &local_memsegs[msl_idx];
+
+	/* create distinct fbarrays for each secondary */
+	snprintf(name, RTE_FBARRAY_NAME_LEN, "%s_%i",
+		primary_msl->memseg_arr.name, getpid());
+
+	ret = rte_fbarray_init(&local_msl->memseg_arr, name,
+		primary_msl->memseg_arr.len,
+		primary_msl->memseg_arr.elt_sz);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "Cannot initialize local memory map\n");
+		return -1;
+	}
+	local_msl->base_va = primary_msl->base_va;
+
+	return 0;
+}
+
+int
+eal_memalloc_init(void)
+{
+	if (rte_eal_process_type() == RTE_PROC_SECONDARY)
+		if (rte_memseg_list_walk(secondary_msl_create_walk, NULL) < 0)
+			return -1;
+	return 0;
+}
-- 
2.7.4

* [PATCH v2 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-19 12:26 [PATCH 0/2] Fix file locking in EAL memory Anatoly Burakov
                   ` (3 preceding siblings ...)
  2018-04-24 15:54 ` [PATCH v2 1/2] mem: add memalloc init stage Anatoly Burakov
@ 2018-04-24 15:54 ` Anatoly Burakov
  4 siblings, 0 replies; 22+ messages in thread
From: Anatoly Burakov @ 2018-04-24 15:54 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, thomas, anatoly.burakov

The original implementation used flock() locks, but was later
switched to using fcntl() locks for page locking, because
fcntl() locks allow locking parts of a file, which is useful
for single-file segments mode, where locking the entire file
isn't as useful because we still need to grow and shrink it.

However, according to fcntl()'s Ubuntu manpage [1], the
semantics of fcntl() locks have a giant oversight:

  This interface follows the completely stupid semantics of System
  V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
  locks associated with a file for a given process are removed
  when any file descriptor for that file is closed by that process.
  This semantic means that applications must be aware of any files
  that a subroutine library may access.

Basically, closing *any* fd with an fcntl() lock (which we do because
we don't want to leak fd's) will drop the lock completely.

So, in this commit, we will be reverting to using flock()
locks everywhere. However, that still leaves the problem of
locking parts of a memseg list file in single file segments
mode, and we will be solving it by creating a separate lock
file for each page and tracking those with flock().

We will also be removing all of this tailq business and
replacing it with a simple array - saving a few bytes is
not worth the extra hassle of dealing with pointers and
potential memory allocation failures. Also, remove the
tailq lock since it is not needed - these fd lists are
per-process, and within a given process, only one thread
ever handles access to hugetlbfs.

So, the first process to allocate a segment will create a
lockfile and put a shared lock on it. When we're shrinking
the page file, we will be trying to take out a write lock
on that lockfile, which will fail if any other process is
holding onto the lockfile as well. This way, we can know
whether it is safe to shrink the segment file. Also, if no
other locks are found in the lock list for a given memseg
list, the memseg list fd is automatically closed.

One other thing to note is, according to the flock() Ubuntu
manpage [2], upgrading the lock from shared to exclusive is
implemented by dropping and reacquiring the lock, which is
not atomic and thus would have created race conditions. So,
before attempting to perform operations in hugetlbfs, we
take out a write lock on the hugetlbfs directory, so that
only one process can perform hugetlbfs operations at a time.

[1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
[2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html

Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
Fixes: 582bed1e1d1d ("mem: support mapping hugepages at runtime")
Fixes: a5ff05d60fc5 ("mem: support unmapping pages at runtime")
Fixes: 2a04139f66b4 ("eal: add single file segments option")
Cc: anatoly.burakov@intel.com

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---

Notes:
    v2:
    - Make lockfiles hidden
    - renamed functions as per Bruce's comments
    
    We could've used OFD locks [1] instead of flock(), but they are
    only supported on kernel 3.15+ [2], so we're forced to use this
    flock()-based contraption to support older kernels.
    
    [1] https://www.gnu.org/software/libc/manual/html_node/Open-File-Description-Locks.html
    [2] http://man7.org/linux/man-pages/man2/fcntl.2.html

 lib/librte_eal/common/eal_filesystem.h          |  18 +
 lib/librte_eal/linuxapp/eal/eal_hugepage_info.c |  28 +-
 lib/librte_eal/linuxapp/eal/eal_memalloc.c      | 538 +++++++++++++++---------
 lib/librte_eal/linuxapp/eal/eal_memory.c        |  22 +-
 4 files changed, 368 insertions(+), 238 deletions(-)

diff --git a/lib/librte_eal/common/eal_filesystem.h b/lib/librte_eal/common/eal_filesystem.h
index ad059ef..8e4d793 100644
--- a/lib/librte_eal/common/eal_filesystem.h
+++ b/lib/librte_eal/common/eal_filesystem.h
@@ -115,6 +115,24 @@ eal_get_hugefile_path(char *buffer, size_t buflen, const char *hugedir, int f_id
 	return buffer;
 }
 
+/** String format for hugepage map lock files. */
+#define HUGEFILE_LOCK_FMT "%s/.%smap_%d_%d.lock"
+
+static inline const char *
+eal_get_hugefile_lock_path(char *buffer, size_t buflen, int list_idx,
+		int seg_idx)
+{
+	const char *directory = default_config_dir;
+	const char *home_dir = getenv("HOME");
+
+	if (getuid() != 0 && home_dir != NULL)
+		directory = home_dir;
+	snprintf(buffer, buflen - 1, HUGEFILE_LOCK_FMT, directory,
+			internal_config.hugefile_prefix, list_idx, seg_idx);
+	buffer[buflen - 1] = '\0';
+	return buffer;
+}
+
 /** define the default filename prefix for the %s values above */
 #define HUGEFILE_PREFIX_DEFAULT "rte"
 
diff --git a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
index db5aabd..7eca711 100644
--- a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
+++ b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
@@ -237,18 +237,6 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 }
 
 /*
- * uses fstat to report the size of a file on disk
- */
-static off_t
-get_file_size(int fd)
-{
-	struct stat st;
-	if (fstat(fd, &st) < 0)
-		return 0;
-	return st.st_size;
-}
-
-/*
  * Clear the hugepage directory of whatever hugepage files
  * there are. Checks if the file is locked (i.e.
  * if it's in use by another DPDK process).
@@ -278,8 +266,6 @@ clear_hugedir(const char * hugedir)
 	}
 
 	while(dirent != NULL){
-		struct flock lck = {0};
-
 		/* skip files that don't match the hugepage pattern */
 		if (fnmatch(filter, dirent->d_name, 0) > 0) {
 			dirent = readdir(dir);
@@ -296,19 +282,11 @@ clear_hugedir(const char * hugedir)
 		}
 
 		/* non-blocking lock */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = get_file_size(fd);
+		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
-		lck_result = fcntl(fd, F_SETLK, &lck);
-
-		/* if lock succeeds, unlock and remove the file */
-		if (lck_result != -1) {
-			lck.l_type = F_UNLCK;
-			fcntl(fd, F_SETLK, &lck);
+		/* if lock succeeds, remove the file */
+		if (lck_result != -1)
 			unlinkat(dir_fd, dirent->d_name, 0);
-		}
 		close (fd);
 		dirent = readdir(dir);
 	}
diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
index 162306a..efbb760 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -46,24 +46,25 @@
  */
 static int fallocate_supported = -1; /* unknown */
 
-/*
- * If each page is in a separate file, we can close fd's since we need each fd
- * only once. However, in single file segments mode, we can get away with using
- * a single fd for entire segments, but we need to store them somewhere. Each
- * fd is different within each process, so we'll store them in a local tailq.
+/* for single-file segments, we need some kind of mechanism to keep track of
+ * which hugepages can be freed back to the system, and which cannot. we cannot
+ * use flock() because it doesn't allow locking parts of a file, and we cannot
+ * use fcntl() due to issues with its semantics, so we will have to rely on a
+ * separate lockfile for each page.
+ *
+ * we cannot know how many pages a system will have in advance, but we do know
+ * that they come in lists, and we know lengths of these lists. so, simply store
+ * a malloc'd array of fd's indexed by list and segment index.
+ *
+ * they will be initialized at startup, and filled as we allocate/deallocate
+ * segments. also, use this to track the memseg list's own fd.
  */
-struct msl_entry {
-	TAILQ_ENTRY(msl_entry) next;
-	unsigned int msl_idx;
-	int fd;
-};
-
-/** Double linked list of memseg list fd's. */
-TAILQ_HEAD(msl_entry_list, msl_entry);
-
-static struct msl_entry_list msl_entry_list =
-		TAILQ_HEAD_INITIALIZER(msl_entry_list);
-static rte_spinlock_t tailq_lock = RTE_SPINLOCK_INITIALIZER;
+static struct {
+	int *fds; /**< dynamically allocated array of segment lock fd's */
+	int memseg_list_fd; /**< memseg list fd */
+	int len; /**< total length of the array */
+	int count; /**< entries used in the array */
+} lock_fds[RTE_MAX_MEMSEG_LISTS];
 
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
@@ -158,35 +159,6 @@ resotre_numa(int *oldpolicy, struct bitmask *oldmask)
 }
 #endif
 
-static struct msl_entry *
-get_msl_entry_by_idx(unsigned int list_idx)
-{
-	struct msl_entry *te;
-
-	rte_spinlock_lock(&tailq_lock);
-
-	TAILQ_FOREACH(te, &msl_entry_list, next) {
-		if (te->msl_idx == list_idx)
-			break;
-	}
-	if (te == NULL) {
-		/* doesn't exist, so create it and set fd to -1 */
-
-		te = malloc(sizeof(*te));
-		if (te == NULL) {
-			RTE_LOG(ERR, EAL, "%s(): cannot allocate tailq entry for memseg list\n",
-				__func__);
-			goto unlock;
-		}
-		te->msl_idx = list_idx;
-		te->fd = -1;
-		TAILQ_INSERT_TAIL(&msl_entry_list, te, next);
-	}
-unlock:
-	rte_spinlock_unlock(&tailq_lock);
-	return te;
-}
-
 /*
  * uses fstat to report the size of a file on disk
  */
@@ -199,19 +171,6 @@ get_file_size(int fd)
 	return st.st_size;
 }
 
-/*
- * uses fstat to check if file size on disk is zero (regular fstat won't show
- * true file size due to how fallocate works)
- */
-static bool
-is_zero_length(int fd)
-{
-	struct stat st;
-	if (fstat(fd, &st) < 0)
-		return false;
-	return st.st_blocks == 0;
-}
-
 /* we cannot use rte_memseg_list_walk() here because we will be holding a
  * write lock whenever we enter every function in this file, however copying
  * the same iteration code everywhere is not ideal as well. so, use a lockless
@@ -238,6 +197,101 @@ memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg)
 	return 0;
 }
 
+/* returns 1 on successful lock, 0 on unsuccessful lock, -1 on error */
+static int lock(int fd, int type)
+{
+	int ret;
+
+	/* flock may be interrupted */
+	do {
+		ret = flock(fd, type | LOCK_NB);
+	} while (ret && errno == EINTR);
+
+	if (ret && errno == EWOULDBLOCK) {
+		/* couldn't lock */
+		return 0;
+	} else if (ret) {
+		RTE_LOG(ERR, EAL, "%s(): error calling flock(): %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+	/* lock was successful */
+	return 1;
+}
+
+static int get_segment_lock_fd(int list_idx, int seg_idx)
+{
+	char path[PATH_MAX] = {0};
+	int fd;
+
+	if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds))
+		return -1;
+	if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len)
+		return -1;
+
+	fd = lock_fds[list_idx].fds[seg_idx];
+	/* does this lock already exist? */
+	if (fd >= 0)
+		return fd;
+
+	eal_get_hugefile_lock_path(path, sizeof(path), list_idx, seg_idx);
+
+	fd = open(path, O_CREAT | O_RDWR, 0660);
+	if (fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): error creating lockfile '%s': %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) != 1) {
+		RTE_LOG(ERR, EAL, "%s(): failed to take out a readlock on '%s': %s\n",
+			__func__, path, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	/* store it for future reference */
+	lock_fds[list_idx].fds[seg_idx] = fd;
+	lock_fds[list_idx].count++;
+	return fd;
+}
+
+static int unlock_segment(int list_idx, int seg_idx)
+{
+	int fd, ret;
+
+	if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds))
+		return -1;
+	if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len)
+		return -1;
+
+	fd = lock_fds[list_idx].fds[seg_idx];
+
+	/* upgrade lock to exclusive to see if we can remove the lockfile */
+	ret = lock(fd, LOCK_EX);
+	if (ret == 1) {
+		/* we've succeeded in taking exclusive lock, this lockfile may
+		 * be removed.
+		 */
+		char path[PATH_MAX] = {0};
+		eal_get_hugefile_lock_path(path, sizeof(path), list_idx,
+				seg_idx);
+		if (unlink(path)) {
+			RTE_LOG(ERR, EAL, "%s(): error removing lockfile '%s': %s\n",
+					__func__, path, strerror(errno));
+		}
+	}
+	/* we don't want to leak the fd, so even if we fail to lock, close fd
+	 * and remove it from list anyway.
+	 */
+	close(fd);
+	lock_fds[list_idx].fds[seg_idx] = -1;
+	lock_fds[list_idx].count--;
+
+	if (ret < 0)
+		return -1;
+	return 0;
+}
+
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		unsigned int list_idx, unsigned int seg_idx)
@@ -245,31 +299,29 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 	int fd;
 
 	if (internal_config.single_file_segments) {
-		/*
-		 * try to find a tailq entry, for this memseg list, or create
-		 * one if it doesn't exist.
-		 */
-		struct msl_entry *te = get_msl_entry_by_idx(list_idx);
-		if (te == NULL) {
-			RTE_LOG(ERR, EAL, "%s(): cannot allocate tailq entry for memseg list\n",
-				__func__);
-			return -1;
-		} else if (te->fd < 0) {
-			/* create a hugepage file */
-			eal_get_hugefile_path(path, buflen, hi->hugedir,
-					list_idx);
+		/* create a hugepage file path */
+		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
+
+		fd = lock_fds[list_idx].memseg_list_fd;
+
+		if (fd < 0) {
 			fd = open(path, O_CREAT | O_RDWR, 0600);
 			if (fd < 0) {
-				RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
+				RTE_LOG(ERR, EAL, "%s(): open failed: %s\n",
 					__func__, strerror(errno));
 				return -1;
 			}
-			te->fd = fd;
-		} else {
-			fd = te->fd;
+			/* take out a read lock and keep it indefinitely */
+			if (lock(fd, LOCK_SH) < 0) {
+				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+					__func__, strerror(errno));
+				close(fd);
+				return -1;
+			}
+			lock_fds[list_idx].memseg_list_fd = fd;
 		}
 	} else {
-		/* one file per page, just create it */
+		/* create a hugepage file path */
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
 		fd = open(path, O_CREAT | O_RDWR, 0600);
@@ -278,48 +330,27 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 					strerror(errno));
 			return -1;
 		}
+		/* take out a read lock */
+		if (lock(fd, LOCK_SH) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+				__func__, strerror(errno));
+			close(fd);
+			return -1;
+		}
 	}
 	return fd;
 }
 
-/* returns 1 on successful lock, 0 on unsuccessful lock, -1 on error */
-static int lock(int fd, uint64_t offset, uint64_t len, int type)
-{
-	struct flock lck;
-	int ret;
-
-	memset(&lck, 0, sizeof(lck));
-
-	lck.l_type = type;
-	lck.l_whence = SEEK_SET;
-	lck.l_start = offset;
-	lck.l_len = len;
-
-	ret = fcntl(fd, F_SETLK, &lck);
-
-	if (ret && (errno == EAGAIN || errno == EACCES)) {
-		/* locked by another process, not an error */
-		return 0;
-	} else if (ret) {
-		RTE_LOG(ERR, EAL, "%s(): error calling fcntl(): %s\n",
-			__func__, strerror(errno));
-		/* we've encountered an unexpected error */
-		return -1;
-	}
-	return 1;
-}
-
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+resize_hugefile(int fd, char *path, int list_idx, int seg_idx,
+		uint64_t fa_offset, uint64_t page_sz, bool grow)
 {
 	bool again = false;
 	do {
 		if (fallocate_supported == 0) {
 			/* we cannot deallocate memory if fallocate() is not
-			 * supported, but locks are still needed to prevent
-			 * primary process' initialization from clearing out
-			 * huge pages used by this process.
+			 * supported, and hugepage file is already locked at
+			 * creation, so no further synchronization needed.
 			 */
 
 			if (!grow) {
@@ -337,13 +368,10 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
 					__func__, strerror(errno));
 				return -1;
 			}
-			/* not being able to take out a read lock is an error */
-			if (lock(fd, fa_offset, page_sz, F_RDLCK) != 1)
-				return -1;
 		} else {
 			int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE |
 					FALLOC_FL_KEEP_SIZE;
-			int ret;
+			int ret, lock_fd;
 
 			/* if fallocate() is supported, we need to take out a
 			 * read lock on allocate (to prevent other processes
@@ -351,20 +379,69 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
 			 * lock on deallocate (to ensure nobody else is using
 			 * this page).
 			 *
-			 * we can't use flock() for this, as we actually need to
-			 * lock part of the file, not the entire file.
+			 * read locks on page itself are already taken out at
+			 * file creation, in get_seg_fd().
+			 *
+			 * we cannot rely on simple use of flock() call, because
+			 * we need to be able to lock a section of the file,
+			 * and we cannot use fcntl() locks, because of numerous
+			 * problems with their semantics, so we will use
+			 * deterministically named lock files for each section
+			 * of the file.
+			 *
+			 * if we're shrinking the file, we want to upgrade our
+			 * lock from shared to exclusive.
+			 *
+			 * lock_fd is an fd for a lockfile, not for the segment
+			 * list.
 			 */
+			lock_fd = get_segment_lock_fd(list_idx, seg_idx);
 
 			if (!grow) {
-				ret = lock(fd, fa_offset, page_sz, F_WRLCK);
+				/* we are using this lockfile to determine
+				 * whether this particular page is locked, as we
+				 * are in single file segments mode and thus
+				 * cannot use regular flock() to get this info.
+				 *
+				 * we want to try and take out an exclusive lock
+				 * on the lock file to determine if we're the
+				 * last ones using this page, and if not, we
+				 * won't be shrinking it, and will instead exit
+				 * prematurely.
+				 */
+				ret = lock(lock_fd, LOCK_EX);
+
+				/* drop the lock on the lockfile, so that even
+				 * if we couldn't shrink the file ourselves, we
+				 * are signalling to other processes that we're
+				 * no longer using this page.
+				 */
+				if (unlock_segment(list_idx, seg_idx))
+					RTE_LOG(ERR, EAL, "Could not unlock segment\n");
+
+				/* additionally, if this was the last lock on
+				 * this segment list, we can safely close the
+				 * page file fd, so that one of the processes
+				 * could then delete the file after shrinking.
+				 */
+				if (ret < 1 && lock_fds[list_idx].count == 0) {
+					close(fd);
+					lock_fds[list_idx].memseg_list_fd = -1;
+				}
 
-				if (ret < 0)
+				if (ret < 0) {
+					RTE_LOG(ERR, EAL, "Could not lock segment\n");
 					return -1;
-				else if (ret == 0)
-					/* failed to lock, not an error */
+				}
+				if (ret == 0)
+					/* failed to lock, not an error. */
 					return 0;
 			}
-			if (fallocate(fd, flags, fa_offset, page_sz) < 0) {
+
+			/* grow or shrink the file */
+			ret = fallocate(fd, flags, fa_offset, page_sz);
+
+			if (ret < 0) {
 				if (fallocate_supported == -1 &&
 						errno == ENOTSUP) {
 					RTE_LOG(ERR, EAL, "%s(): fallocate() not supported, hugepage deallocation will be disabled\n",
@@ -380,16 +457,18 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
 			} else {
 				fallocate_supported = 1;
 
-				if (grow) {
-					/* if can't read lock, it's an error */
-					if (lock(fd, fa_offset, page_sz,
-							F_RDLCK) != 1)
-						return -1;
-				} else {
-					/* if can't unlock, it's an error */
-					if (lock(fd, fa_offset, page_sz,
-							F_UNLCK) != 1)
-						return -1;
+				/* we've grown/shrunk the file, and we hold an
+				 * exclusive lock now. check if there are no
+				 * more segments active in this segment list,
+				 * and remove the file if there aren't.
+				 */
+				if (lock_fds[list_idx].count == 0) {
+					if (unlink(path))
+						RTE_LOG(ERR, EAL, "%s(): unlinking '%s' failed: %s\n",
+							__func__, path,
+							strerror(errno));
+					close(fd);
+					lock_fds[list_idx].memseg_list_fd = -1;
 				}
 			}
 		}
@@ -411,15 +490,19 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	int fd;
 	size_t alloc_sz;
 
+	/* takes out a read lock on segment or segment list */
 	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
-	if (fd < 0)
+	if (fd < 0) {
+		RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 		return -1;
+	}
 
 	alloc_sz = hi->hugepage_sz;
 	if (internal_config.single_file_segments) {
 		map_offset = seg_idx * alloc_sz;
-		ret = resize_hugefile(fd, map_offset, alloc_sz, true);
-		if (ret < 1)
+		ret = resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
+				alloc_sz, true);
+		if (ret < 0)
 			goto resized;
 	} else {
 		map_offset = 0;
@@ -428,28 +511,6 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 				__func__, strerror(errno));
 			goto resized;
 		}
-		/* we've allocated a page - take out a read lock. we're using
-		 * fcntl() locks rather than flock() here because doing that
-		 * gives us one huge advantage - fcntl() locks are per-process,
-		 * not per-file descriptor, which means that we don't have to
-		 * keep the original fd's around to keep a lock on the file.
-		 *
-		 * this is useful, because when it comes to unmapping pages, we
-		 * will have to take out a write lock (to figure out if another
-		 * process still has this page mapped), and to do itwith flock()
-		 * we'll have to use original fd, as lock is associated with
-		 * that particular fd. with fcntl(), this is not necessary - we
-		 * can open a new fd and use fcntl() on that.
-		 */
-		ret = lock(fd, map_offset, alloc_sz, F_RDLCK);
-
-		/* this should not fail */
-		if (ret != 1) {
-			RTE_LOG(ERR, EAL, "%s(): error locking file: %s\n",
-				__func__,
-				strerror(errno));
-			goto resized;
-		}
 	}
 
 	/*
@@ -518,19 +579,14 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	munmap(addr, alloc_sz);
 resized:
 	if (internal_config.single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
-		if (is_zero_length(fd)) {
-			struct msl_entry *te = get_msl_entry_by_idx(list_idx);
-			if (te != NULL && te->fd >= 0) {
-				close(te->fd);
-				te->fd = -1;
-			}
-			/* ignore errors, can't make it any worse */
-			unlink(path);
-		}
+		resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
+				alloc_sz, false);
+		/* ignore failure, can't make it any worse */
 	} else {
+		/* only remove file if we can take out a write lock */
+		if (lock(fd, LOCK_EX) == 1)
+			unlink(path);
 		close(fd);
-		unlink(path);
 	}
 	return -1;
 }
@@ -546,6 +602,14 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	/* erase page data */
 	memset(ms->addr, 0, ms->len);
 
+	/* if we are not in single file segments mode, we're going to unmap the
+	 * segment and thus drop the lock on original fd, so take out another
+	 * shared lock before we do that.
+	 */
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	if (fd < 0)
+		return -1;
+
 	if (mmap(ms->addr, ms->len, PROT_READ,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
 				MAP_FAILED) {
@@ -553,41 +617,23 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
-	if (fd < 0)
-		return -1;
-
 	if (internal_config.single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
+				ms->len, false))
 			return -1;
-		/* if file is zero-length, we've already shrunk it, so it's
-		 * safe to remove.
-		 */
-		if (is_zero_length(fd)) {
-			struct msl_entry *te = get_msl_entry_by_idx(list_idx);
-			if (te != NULL && te->fd >= 0) {
-				close(te->fd);
-				te->fd = -1;
-			}
-			unlink(path);
-		}
 		ret = 0;
 	} else {
 		/* if we're able to take out a write lock, we're the last one
 		 * holding onto this page.
 		 */
-
-		ret = lock(fd, 0, ms->len, F_WRLCK);
+		ret = lock(fd, LOCK_EX);
 		if (ret >= 0) {
 			/* no one else is using this page */
 			if (ret == 1)
 				unlink(path);
-			ret = lock(fd, 0, ms->len, F_UNLCK);
-			if (ret != 1)
-				RTE_LOG(ERR, EAL, "%s(): unable to unlock file %s\n",
-					__func__, path);
 		}
+		/* closing fd will drop the lock */
 		close(fd);
 	}
 
@@ -612,7 +658,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	struct alloc_walk_param *wa = arg;
 	struct rte_memseg_list *cur_msl;
 	size_t page_sz;
-	int cur_idx, start_idx, j;
+	int cur_idx, start_idx, j, dir_fd;
 	unsigned int msl_idx, need, i;
 
 	if (msl->page_sz != wa->page_sz)
@@ -633,6 +679,25 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 		return 0;
 	start_idx = cur_idx;
 
+	/* do not allow any page allocations during the time we're allocating,
+	 * because file creation and locking operations are not atomic,
+	 * and we might be the first or the last ones to use a particular page,
+	 * so we need to ensure atomicity of every operation.
+	 */
+	dir_fd = open(wa->hi->hugedir, O_RDONLY);
+	if (dir_fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		return -1;
+	}
+	/* blocking writelock */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot lock '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		close(dir_fd);
+		return -1;
+	}
+
 	for (i = 0; i < need; i++, cur_idx++) {
 		struct rte_memseg *cur;
 		void *map_addr;
@@ -668,6 +733,8 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 			/* clear the list */
 			if (wa->ms)
 				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			close(dir_fd);
 			return -1;
 		}
 		if (wa->ms)
@@ -679,6 +746,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	wa->segs_allocated = i;
 	if (i > 0)
 		cur_msl->version++;
+	close(dir_fd);
 	return 1;
 }
 
@@ -693,7 +761,7 @@ free_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	struct rte_memseg_list *found_msl;
 	struct free_walk_param *wa = arg;
 	uintptr_t start_addr, end_addr;
-	int msl_idx, seg_idx;
+	int msl_idx, seg_idx, ret, dir_fd;
 
 	start_addr = (uintptr_t) msl->base_va;
 	end_addr = start_addr + msl->memseg_arr.len * (size_t)msl->page_sz;
@@ -708,11 +776,34 @@ free_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	/* msl is const */
 	found_msl = &mcfg->memsegs[msl_idx];
 
+	/* do not allow any page allocations during the time we're freeing,
+	 * because file creation and locking operations are not atomic,
+	 * and we might be the first or the last ones to use a particular page,
+	 * so we need to ensure atomicity of every operation.
+	 */
+	dir_fd = open(wa->hi->hugedir, O_RDONLY);
+	if (dir_fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		return -1;
+	}
+	/* blocking writelock */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot lock '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		close(dir_fd);
+		return -1;
+	}
+
 	found_msl->version++;
 
 	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
 
-	if (free_seg(wa->ms, wa->hi, msl_idx, seg_idx))
+	ret = free_seg(wa->ms, wa->hi, msl_idx, seg_idx);
+
+	close(dir_fd);
+
+	if (ret < 0)
 		return -1;
 
 	return 1;
@@ -1034,22 +1125,46 @@ sync_existing(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
 		unsigned int msl_idx)
 {
-	int ret;
+	int ret, dir_fd;
+
+	/* do not allow any page allocations during the time we're allocating,
+	 * because file creation and locking operations are not atomic,
+	 * and we might be the first or the last ones to use a particular page,
+	 * so we need to ensure atomicity of every operation.
+	 */
+	dir_fd = open(hi->hugedir, O_RDONLY);
+	if (dir_fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", __func__,
+			hi->hugedir, strerror(errno));
+		return -1;
+	}
+	/* blocking writelock */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot lock '%s': %s\n", __func__,
+			hi->hugedir, strerror(errno));
+		close(dir_fd);
+		return -1;
+	}
 
 	/* ensure all allocated space is the same in both lists */
 	ret = sync_status(primary_msl, local_msl, hi, msl_idx, true);
 	if (ret < 0)
-		return -1;
+		goto fail;
 
 	/* ensure all unallocated space is the same in both lists */
 	ret = sync_status(primary_msl, local_msl, hi, msl_idx, false);
 	if (ret < 0)
-		return -1;
+		goto fail;
 
 	/* update version number */
 	local_msl->version = primary_msl->version;
 
+	close(dir_fd);
+
 	return 0;
+fail:
+	close(dir_fd);
+	return -1;
 }
 
 static int
@@ -1128,11 +1243,46 @@ secondary_msl_create_walk(const struct rte_memseg_list *msl,
 	return 0;
 }
 
+static int
+secondary_lock_list_create_walk(const struct rte_memseg_list *msl,
+		void *arg __rte_unused)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	unsigned int i, len;
+	int msl_idx;
+	int *data;
+
+	msl_idx = msl - mcfg->memsegs;
+	len = msl->memseg_arr.len;
+
+	/* ensure we have space to store lock fd per each possible segment */
+	data = malloc(sizeof(int) * len);
+	if (data == NULL) {
+		RTE_LOG(ERR, EAL, "Unable to allocate space for lock descriptors\n");
+		return -1;
+	}
+	/* set all fd's as invalid */
+	for (i = 0; i < len; i++)
+		data[i] = -1;
+
+	lock_fds[msl_idx].fds = data;
+	lock_fds[msl_idx].len = len;
+	lock_fds[msl_idx].count = 0;
+	lock_fds[msl_idx].memseg_list_fd = -1;
+
+	return 0;
+}
+
 int
 eal_memalloc_init(void)
 {
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY)
 		if (rte_memseg_list_walk(secondary_msl_create_walk, NULL) < 0)
 			return -1;
+
+	/* initialize all of the lock fd lists */
+	if (internal_config.single_file_segments)
+		if (rte_memseg_list_walk(secondary_lock_list_create_walk, NULL))
+			return -1;
 	return 0;
 }
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index fadc1de..ad3add8 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -259,7 +259,6 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
 	int fd;
 	unsigned i;
 	void *virtaddr;
-	struct flock lck = {0};
 #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
 	int node_id = -1;
 	int essential_prev = 0;
@@ -378,13 +377,8 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
 		}
 		*(int *)virtaddr = 0;
 
-
 		/* set shared lock on the file. */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = hugepage_sz;
-		if (fcntl(fd, F_SETLK, &lck) == -1) {
+		if (flock(fd, LOCK_SH) < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): Locking file failed:%s \n",
 				__func__, strerror(errno));
 			close(fd);
@@ -708,7 +702,6 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
 #endif
 		struct hugepage_file *hfile = &hugepages[cur_page];
 		struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx);
-		struct flock lck;
 		void *addr;
 		int fd;
 
@@ -719,11 +712,7 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
 			return -1;
 		}
 		/* set shared lock on the file. */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = page_sz;
-		if (fcntl(fd, F_SETLK, &lck) == -1) {
+		if (flock(fd, LOCK_SH) < 0) {
 			RTE_LOG(DEBUG, EAL, "Could not lock '%s': %s\n",
 					hfile->filepath, strerror(errno));
 			close(fd);
@@ -1732,7 +1721,6 @@ eal_legacy_hugepage_attach(void)
 		struct hugepage_file *hf = &hp[i];
 		size_t map_sz = hf->size;
 		void *map_addr = hf->final_va;
-		struct flock lck;
 
 		/* if size is zero, no more pages left */
 		if (map_sz == 0)
@@ -1754,11 +1742,7 @@ eal_legacy_hugepage_attach(void)
 		}
 
 		/* set shared lock on the file. */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = map_sz;
-		if (fcntl(fd, F_SETLK, &lck) == -1) {
+		if (flock(fd, LOCK_SH) < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): Locking file failed: %s\n",
 				__func__, strerror(errno));
 			close(fd);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/2] Fix file locking in EAL memory
  2018-04-24 15:54 ` [PATCH v2 0/2] Fix file locking in EAL memory Anatoly Burakov
@ 2018-04-24 16:32   ` Stephen Hemminger
  2018-04-24 17:25     ` Burakov, Anatoly
  2018-04-25 10:36   ` [PATCH v3 " Anatoly Burakov
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 22+ messages in thread
From: Stephen Hemminger @ 2018-04-24 16:32 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev, bruce.richardson, thomas

On Tue, 24 Apr 2018 16:54:21 +0100
Anatoly Burakov <anatoly.burakov@intel.com> wrote:

> One of the ways to work around this was using OFD locks,
> but they are only supported on kernels 3.15+,

Maybe time to move to next LTS kernel (3.16)?

* Re: [PATCH v2 0/2] Fix file locking in EAL memory
  2018-04-24 16:32   ` Stephen Hemminger
@ 2018-04-24 17:25     ` Burakov, Anatoly
  2018-04-24 20:05       ` Thomas Monjalon
  0 siblings, 1 reply; 22+ messages in thread
From: Burakov, Anatoly @ 2018-04-24 17:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, bruce.richardson, thomas

On 24-Apr-18 5:32 PM, Stephen Hemminger wrote:
> On Tue, 24 Apr 2018 16:54:21 +0100
> Anatoly Burakov <anatoly.burakov@intel.com> wrote:
> 
>> One of the ways to work around this was using OFD locks,
>> but they are only supported on kernels 3.15+,
> 
> Maybe time to move to next LTS kernel (3.16)?
> 

I'm all for it, but it's not my call to make :(

-- 
Thanks,
Anatoly

* Re: [PATCH v2 0/2] Fix file locking in EAL memory
  2018-04-24 17:25     ` Burakov, Anatoly
@ 2018-04-24 20:05       ` Thomas Monjalon
  2018-04-24 20:34         ` Stephen Hemminger
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Monjalon @ 2018-04-24 20:05 UTC (permalink / raw)
  To: Burakov, Anatoly; +Cc: Stephen Hemminger, dev, bruce.richardson

24/04/2018 19:25, Burakov, Anatoly:
> On 24-Apr-18 5:32 PM, Stephen Hemminger wrote:
> > On Tue, 24 Apr 2018 16:54:21 +0100
> > Anatoly Burakov <anatoly.burakov@intel.com> wrote:
> > 
> >> One of the ways to work around this was using OFD locks,
> >> but they are only supported on kernels 3.15+,
> > 
> > Maybe time to move to next LTS kernel (3.16)?
> > 
> 
> I'm all for it, but it's not my call to make :(

I am in favor of upgrading kernel requirements.
The discussion already started few months ago:
	https://dpdk.org/patch/34778

We just need to consider how to support the latest versions
of distributions Debian, Ubuntu, SLES and RHEL.

* Re: [PATCH v2 0/2] Fix file locking in EAL memory
  2018-04-24 20:05       ` Thomas Monjalon
@ 2018-04-24 20:34         ` Stephen Hemminger
  0 siblings, 0 replies; 22+ messages in thread
From: Stephen Hemminger @ 2018-04-24 20:34 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: Burakov, Anatoly, dev, bruce.richardson

On Tue, 24 Apr 2018 22:05:10 +0200
Thomas Monjalon <thomas@monjalon.net> wrote:

> 24/04/2018 19:25, Burakov, Anatoly:
> > On 24-Apr-18 5:32 PM, Stephen Hemminger wrote:  
> > > On Tue, 24 Apr 2018 16:54:21 +0100
> > > Anatoly Burakov <anatoly.burakov@intel.com> wrote:
> > >   
> > >> One of the ways to work around this was using OFD locks,
> > >> but they are only supported on kernels 3.15+,  
> > > 
> > > Maybe time to move to next LTS kernel (3.16)?
> > >   
> > 
> > I'm all for it, but it's not my call to make :(  
> 
> I am in favor of upgrading kernel requirements.
> The discussion already started few months ago:
> 	https://dpdk.org/patch/34778
> 
> We just need to consider how to support the latest versions
> of distributions Debian, Ubuntu, SLES and RHEL.

Does anyone have a quick map for major distributions, current stable versions?

* [PATCH v3 0/2] Fix file locking in EAL memory
  2018-04-24 15:54 ` [PATCH v2 0/2] Fix file locking in EAL memory Anatoly Burakov
  2018-04-24 16:32   ` Stephen Hemminger
@ 2018-04-25 10:36   ` Anatoly Burakov
  2018-04-27 21:50     ` Thomas Monjalon
  2018-04-25 10:36   ` [PATCH v3 1/2] mem: add memalloc init stage Anatoly Burakov
  2018-04-25 10:36   ` [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
  3 siblings, 1 reply; 22+ messages in thread
From: Anatoly Burakov @ 2018-04-25 10:36 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, thomas

This patchset replaces the fcntl()-based locking used in
the original DPDK memory hotplug patchset with a flock()-
and lockfile-based implementation, due to numerous (well,
one, really) problems with how fcntl() locks work.

Long story short, fcntl() locks are dropped if any fd
referring to the locked file is closed - even if it's not
the last fd, even if it wasn't the fd that was used to
lock the file in the first place, even if it wasn't you
who closed that fd, but some other library.

This patchset corrects this saddening design defect in the
original implementation.

One way to work around this would have been OFD locks,
but they are only supported on kernels 3.15+, so we cannot
rely on them if we want to support old kernels. Hence, we
use per-segment lockfiles. The number of file descriptors
we open ends up no higher than in the non-single-file-
segments case - we still open the same number of files (a
file per page), plus a file per memseg list.

Additionally, since flock() is not atomic, we also lock the
hugepage dir to prevent multiple processes from concurrently
performing operations on hugetlbfs mounts.

If you know of a more enlightened way of fixing this
limitation, you are certainly welcome to comment :)

v3:
- Made lockfile naming more consistent with hugetlbfs file naming

v2:
- Fixes as per review comments
- Make lockfiles hidden by default

Anatoly Burakov (2):
  mem: add memalloc init stage
  mem: revert to using flock() and add per-segment lockfiles

 lib/librte_eal/bsdapp/eal/eal_memalloc.c        |   6 +
 lib/librte_eal/common/eal_common_memory.c       |   3 +
 lib/librte_eal/common/eal_filesystem.h          |  17 +
 lib/librte_eal/common/eal_memalloc.h            |   3 +
 lib/librte_eal/linuxapp/eal/eal_hugepage_info.c |  28 +-
 lib/librte_eal/linuxapp/eal/eal_memalloc.c      | 605 +++++++++++++++---------
 lib/librte_eal/linuxapp/eal/eal_memory.c        |  22 +-
 7 files changed, 420 insertions(+), 264 deletions(-)

-- 
2.7.4


* [PATCH v3 1/2] mem: add memalloc init stage
  2018-04-24 15:54 ` [PATCH v2 0/2] Fix file locking in EAL memory Anatoly Burakov
  2018-04-24 16:32   ` Stephen Hemminger
  2018-04-25 10:36   ` [PATCH v3 " Anatoly Burakov
@ 2018-04-25 10:36   ` Anatoly Burakov
  2018-04-25 10:36   ` [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
  3 siblings, 0 replies; 22+ messages in thread
From: Anatoly Burakov @ 2018-04-25 10:36 UTC (permalink / raw)
  To: dev; +Cc: Bruce Richardson, thomas

Currently, memseg lists for secondary processes are
allocated on sync (triggered by init), when they are
accessed for the first time. Move this initialization to a
separate memalloc init stage.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_eal/bsdapp/eal/eal_memalloc.c   |  6 +++
 lib/librte_eal/common/eal_common_memory.c  |  3 ++
 lib/librte_eal/common/eal_memalloc.h       |  3 ++
 lib/librte_eal/linuxapp/eal/eal_memalloc.c | 66 ++++++++++++++++++------------
 4 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/lib/librte_eal/bsdapp/eal/eal_memalloc.c b/lib/librte_eal/bsdapp/eal/eal_memalloc.c
index 461732f..f7f07ab 100644
--- a/lib/librte_eal/bsdapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/bsdapp/eal/eal_memalloc.c
@@ -46,3 +46,9 @@ eal_memalloc_sync_with_primary(void)
 	RTE_LOG(ERR, EAL, "Memory hotplug not supported on FreeBSD\n");
 	return -1;
 }
+
+int
+eal_memalloc_init(void)
+{
+	return 0;
+}
diff --git a/lib/librte_eal/common/eal_common_memory.c b/lib/librte_eal/common/eal_common_memory.c
index 24a9ed5..dd9062d 100644
--- a/lib/librte_eal/common/eal_common_memory.c
+++ b/lib/librte_eal/common/eal_common_memory.c
@@ -864,6 +864,9 @@ rte_eal_memory_init(void)
 	if (retval < 0)
 		goto fail;
 
+	if (eal_memalloc_init() < 0)
+		goto fail;
+
 	retval = rte_eal_process_type() == RTE_PROC_PRIMARY ?
 			rte_eal_hugepage_init() :
 			rte_eal_hugepage_attach();
diff --git a/lib/librte_eal/common/eal_memalloc.h b/lib/librte_eal/common/eal_memalloc.h
index 6736fa3..662b3b5 100644
--- a/lib/librte_eal/common/eal_memalloc.h
+++ b/lib/librte_eal/common/eal_memalloc.h
@@ -76,4 +76,7 @@ eal_memalloc_mem_alloc_validator_unregister(const char *name, int socket_id);
 int
 eal_memalloc_mem_alloc_validate(int socket_id, size_t new_len);
 
+int
+eal_memalloc_init(void);
+
 #endif /* EAL_MEMALLOC_H */
diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
index 1f553dd..162306a 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -1060,33 +1060,11 @@ sync_walk(const struct rte_memseg_list *msl, void *arg __rte_unused)
 	struct hugepage_info *hi = NULL;
 	unsigned int i;
 	int msl_idx;
-	bool new_msl = false;
 
 	msl_idx = msl - mcfg->memsegs;
 	primary_msl = &mcfg->memsegs[msl_idx];
 	local_msl = &local_memsegs[msl_idx];
 
-	/* check if secondary has this memseg list set up */
-	if (local_msl->base_va == NULL) {
-		char name[PATH_MAX];
-		int ret;
-		new_msl = true;
-
-		/* create distinct fbarrays for each secondary */
-		snprintf(name, RTE_FBARRAY_NAME_LEN, "%s_%i",
-			primary_msl->memseg_arr.name, getpid());
-
-		ret = rte_fbarray_init(&local_msl->memseg_arr, name,
-			primary_msl->memseg_arr.len,
-			primary_msl->memseg_arr.elt_sz);
-		if (ret < 0) {
-			RTE_LOG(ERR, EAL, "Cannot initialize local memory map\n");
-			return -1;
-		}
-
-		local_msl->base_va = primary_msl->base_va;
-	}
-
 	for (i = 0; i < RTE_DIM(internal_config.hugepage_info); i++) {
 		uint64_t cur_sz =
 			internal_config.hugepage_info[i].hugepage_sz;
@@ -1101,10 +1079,8 @@ sync_walk(const struct rte_memseg_list *msl, void *arg __rte_unused)
 		return -1;
 	}
 
-	/* if versions don't match or if we have just allocated a new
-	 * memseg list, synchronize everything
-	 */
-	if ((new_msl || local_msl->version != primary_msl->version) &&
+	/* if versions don't match, synchronize everything */
+	if (local_msl->version != primary_msl->version &&
 			sync_existing(primary_msl, local_msl, hi, msl_idx))
 		return -1;
 	return 0;
@@ -1122,3 +1098,41 @@ eal_memalloc_sync_with_primary(void)
 		return -1;
 	return 0;
 }
+
+static int
+secondary_msl_create_walk(const struct rte_memseg_list *msl,
+		void *arg __rte_unused)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	struct rte_memseg_list *primary_msl, *local_msl;
+	char name[PATH_MAX];
+	int msl_idx, ret;
+
+	msl_idx = msl - mcfg->memsegs;
+	primary_msl = &mcfg->memsegs[msl_idx];
+	local_msl = &local_memsegs[msl_idx];
+
+	/* create distinct fbarrays for each secondary */
+	snprintf(name, RTE_FBARRAY_NAME_LEN, "%s_%i",
+		primary_msl->memseg_arr.name, getpid());
+
+	ret = rte_fbarray_init(&local_msl->memseg_arr, name,
+		primary_msl->memseg_arr.len,
+		primary_msl->memseg_arr.elt_sz);
+	if (ret < 0) {
+		RTE_LOG(ERR, EAL, "Cannot initialize local memory map\n");
+		return -1;
+	}
+	local_msl->base_va = primary_msl->base_va;
+
+	return 0;
+}
+
+int
+eal_memalloc_init(void)
+{
+	if (rte_eal_process_type() == RTE_PROC_SECONDARY)
+		if (rte_memseg_list_walk(secondary_msl_create_walk, NULL) < 0)
+			return -1;
+	return 0;
+}
-- 
2.7.4


* [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-24 15:54 ` [PATCH v2 0/2] Fix file locking in EAL memory Anatoly Burakov
                     ` (2 preceding siblings ...)
  2018-04-25 10:36   ` [PATCH v3 1/2] mem: add memalloc init stage Anatoly Burakov
@ 2018-04-25 10:36   ` Anatoly Burakov
  2018-04-28  9:38     ` Andrew Rybchenko
  3 siblings, 1 reply; 22+ messages in thread
From: Anatoly Burakov @ 2018-04-25 10:36 UTC (permalink / raw)
  To: dev; +Cc: bruce.richardson, thomas, anatoly.burakov

The original implementation used flock() locks, but was later
switched to using fcntl() locks for page locking, because
fcntl() locks allow locking parts of a file, which is useful
for single-file segments mode, where locking the entire file
isn't an option because we still need to grow and shrink it.

However, according to fcntl()'s Ubuntu manpage [1], the
semantics of fcntl() locks contain a giant oversight:

  This interface follows the completely stupid semantics of System
  V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
  locks associated with a file for a given process are removed
  when any file descriptor for that file is closed by that process.
  This semantic means that applications must be aware of any files
  that a subroutine library may access.

Basically, closing *any* fd with an fcntl() lock (which we do because
we don't want to leak fds) will drop the lock completely.

So, in this commit, we revert back to using flock() locks
everywhere. However, that still leaves the problem of locking
parts of a memseg list file in single-file segments mode, and
we solve it by creating a separate lock file for each page,
and tracking those with flock().

We also remove all of the tailq business and replace it with
a simple array - saving a few bytes is not worth the extra
hassle of dealing with pointers and potential memory allocation
failures. Also, remove the tailq lock, since it is not needed -
these fd lists are per-process, and within a given process,
only one thread ever handles access to hugetlbfs.

So, the first process to allocate a segment will create a
lockfile and put a shared lock on it. When we're shrinking the
page file, we will try to take out a write lock on that
lockfile, which will fail if any other process is holding onto
the lockfile as well. This way, we can know whether we can
shrink the segment file. Also, if no other locks are found in
the lock list for a given memseg list, the memseg list fd is
automatically closed.

One other thing to note is that, according to the flock()
Ubuntu manpage [2], upgrading a lock from shared to exclusive
is implemented by dropping and reacquiring the lock, which is
not atomic and thus would have created race conditions. So,
before performing operations on hugetlbfs, we take out a write
lock on the hugetlbfs directory, so that only one process can
perform hugetlbfs operations concurrently.

[1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
[2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html

Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
Fixes: 582bed1e1d1d ("mem: support mapping hugepages at runtime")
Fixes: a5ff05d60fc5 ("mem: support unmapping pages at runtime")
Fixes: 2a04139f66b4 ("eal: add single file segments option")
Cc: anatoly.burakov@intel.com

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---

Notes:
    v3:
    - Make lock naming consistent with hugetlbfs file naming
    
    v2:
    - Make lockfiles hidden
    - renamed functions as per Bruce's comments
    
    We could've used OFD locks [1] instead of flock(), but they are
    only supported on kernel 3.15+ [2], so we're forced to use this
    flock()-based contraption to support older kernels.
    
    [1] https://www.gnu.org/software/libc/manual/html_node/Open-File-Description-Locks.html
    [2] http://man7.org/linux/man-pages/man2/fcntl.2.html

 lib/librte_eal/common/eal_filesystem.h          |  17 +
 lib/librte_eal/linuxapp/eal/eal_hugepage_info.c |  28 +-
 lib/librte_eal/linuxapp/eal/eal_memalloc.c      | 539 +++++++++++++++---------
 lib/librte_eal/linuxapp/eal/eal_memory.c        |  22 +-
 4 files changed, 368 insertions(+), 238 deletions(-)

diff --git a/lib/librte_eal/common/eal_filesystem.h b/lib/librte_eal/common/eal_filesystem.h
index ad059ef..0a82f89 100644
--- a/lib/librte_eal/common/eal_filesystem.h
+++ b/lib/librte_eal/common/eal_filesystem.h
@@ -115,6 +115,23 @@ eal_get_hugefile_path(char *buffer, size_t buflen, const char *hugedir, int f_id
 	return buffer;
 }
 
+/** String format for hugepage map lock files. */
+#define HUGEFILE_LOCK_FMT "%s/.%smap_%d.lock"
+
+static inline const char *
+eal_get_hugefile_lock_path(char *buffer, size_t buflen, int f_id)
+{
+	const char *directory = default_config_dir;
+	const char *home_dir = getenv("HOME");
+
+	if (getuid() != 0 && home_dir != NULL)
+		directory = home_dir;
+	snprintf(buffer, buflen - 1, HUGEFILE_LOCK_FMT, directory,
+			internal_config.hugefile_prefix, f_id);
+	buffer[buflen - 1] = '\0';
+	return buffer;
+}
+
 /** define the default filename prefix for the %s values above */
 #define HUGEFILE_PREFIX_DEFAULT "rte"
 
diff --git a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
index db5aabd..7eca711 100644
--- a/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
+++ b/lib/librte_eal/linuxapp/eal/eal_hugepage_info.c
@@ -237,18 +237,6 @@ get_hugepage_dir(uint64_t hugepage_sz, char *hugedir, int len)
 }
 
 /*
- * uses fstat to report the size of a file on disk
- */
-static off_t
-get_file_size(int fd)
-{
-	struct stat st;
-	if (fstat(fd, &st) < 0)
-		return 0;
-	return st.st_size;
-}
-
-/*
  * Clear the hugepage directory of whatever hugepage files
  * there are. Checks if the file is locked (i.e.
  * if it's in use by another DPDK process).
@@ -278,8 +266,6 @@ clear_hugedir(const char * hugedir)
 	}
 
 	while(dirent != NULL){
-		struct flock lck = {0};
-
 		/* skip files that don't match the hugepage pattern */
 		if (fnmatch(filter, dirent->d_name, 0) > 0) {
 			dirent = readdir(dir);
@@ -296,19 +282,11 @@ clear_hugedir(const char * hugedir)
 		}
 
 		/* non-blocking lock */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = get_file_size(fd);
+		lck_result = flock(fd, LOCK_EX | LOCK_NB);
 
-		lck_result = fcntl(fd, F_SETLK, &lck);
-
-		/* if lock succeeds, unlock and remove the file */
-		if (lck_result != -1) {
-			lck.l_type = F_UNLCK;
-			fcntl(fd, F_SETLK, &lck);
+		/* if lock succeeds, remove the file */
+		if (lck_result != -1)
 			unlinkat(dir_fd, dirent->d_name, 0);
-		}
 		close (fd);
 		dirent = readdir(dir);
 	}
diff --git a/lib/librte_eal/linuxapp/eal/eal_memalloc.c b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
index 162306a..39e00c6 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memalloc.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memalloc.c
@@ -46,24 +46,25 @@
  */
 static int fallocate_supported = -1; /* unknown */
 
-/*
- * If each page is in a separate file, we can close fd's since we need each fd
- * only once. However, in single file segments mode, we can get away with using
- * a single fd for entire segments, but we need to store them somewhere. Each
- * fd is different within each process, so we'll store them in a local tailq.
+/* for single-file segments, we need some kind of mechanism to keep track of
+ * which hugepages can be freed back to the system, and which cannot. we cannot
+ * use flock() because they don't allow locking parts of a file, and we cannot
+ * use fcntl() due to issues with their semantics, so we will have to rely on a
+ * bunch of lockfiles for each page.
+ *
+ * we cannot know how many pages a system will have in advance, but we do know
+ * that they come in lists, and we know lengths of these lists. so, simply store
+ * a malloc'd array of fd's indexed by list and segment index.
+ *
+ * they will be initialized at startup, and filled as we allocate/deallocate
+ * segments. also, use this to track memseg list proper fd.
  */
-struct msl_entry {
-	TAILQ_ENTRY(msl_entry) next;
-	unsigned int msl_idx;
-	int fd;
-};
-
-/** Double linked list of memseg list fd's. */
-TAILQ_HEAD(msl_entry_list, msl_entry);
-
-static struct msl_entry_list msl_entry_list =
-		TAILQ_HEAD_INITIALIZER(msl_entry_list);
-static rte_spinlock_t tailq_lock = RTE_SPINLOCK_INITIALIZER;
+static struct {
+	int *fds; /**< dynamically allocated array of segment lock fd's */
+	int memseg_list_fd; /**< memseg list fd */
+	int len; /**< total length of the array */
+	int count; /**< entries used in an array */
+} lock_fds[RTE_MAX_MEMSEG_LISTS];
 
 /** local copy of a memory map, used to synchronize memory hotplug in MP */
 static struct rte_memseg_list local_memsegs[RTE_MAX_MEMSEG_LISTS];
@@ -158,35 +159,6 @@ resotre_numa(int *oldpolicy, struct bitmask *oldmask)
 }
 #endif
 
-static struct msl_entry *
-get_msl_entry_by_idx(unsigned int list_idx)
-{
-	struct msl_entry *te;
-
-	rte_spinlock_lock(&tailq_lock);
-
-	TAILQ_FOREACH(te, &msl_entry_list, next) {
-		if (te->msl_idx == list_idx)
-			break;
-	}
-	if (te == NULL) {
-		/* doesn't exist, so create it and set fd to -1 */
-
-		te = malloc(sizeof(*te));
-		if (te == NULL) {
-			RTE_LOG(ERR, EAL, "%s(): cannot allocate tailq entry for memseg list\n",
-				__func__);
-			goto unlock;
-		}
-		te->msl_idx = list_idx;
-		te->fd = -1;
-		TAILQ_INSERT_TAIL(&msl_entry_list, te, next);
-	}
-unlock:
-	rte_spinlock_unlock(&tailq_lock);
-	return te;
-}
-
 /*
  * uses fstat to report the size of a file on disk
  */
@@ -199,19 +171,6 @@ get_file_size(int fd)
 	return st.st_size;
 }
 
-/*
- * uses fstat to check if file size on disk is zero (regular fstat won't show
- * true file size due to how fallocate works)
- */
-static bool
-is_zero_length(int fd)
-{
-	struct stat st;
-	if (fstat(fd, &st) < 0)
-		return false;
-	return st.st_blocks == 0;
-}
-
 /* we cannot use rte_memseg_list_walk() here because we will be holding a
  * write lock whenever we enter every function in this file, however copying
  * the same iteration code everywhere is not ideal as well. so, use a lockless
@@ -238,6 +197,102 @@ memseg_list_walk_thread_unsafe(rte_memseg_list_walk_t func, void *arg)
 	return 0;
 }
 
+/* returns 1 on successful lock, 0 on unsuccessful lock, -1 on error */
+static int lock(int fd, int type)
+{
+	int ret;
+
+	/* flock may be interrupted */
+	do {
+		ret = flock(fd, type | LOCK_NB);
+	} while (ret && errno == EINTR);
+
+	if (ret && errno == EWOULDBLOCK) {
+		/* couldn't lock */
+		return 0;
+	} else if (ret) {
+		RTE_LOG(ERR, EAL, "%s(): error calling flock(): %s\n",
+			__func__, strerror(errno));
+		return -1;
+	}
+	/* lock was successful */
+	return 1;
+}
+
+static int get_segment_lock_fd(int list_idx, int seg_idx)
+{
+	char path[PATH_MAX] = {0};
+	int fd;
+
+	if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds))
+		return -1;
+	if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len)
+		return -1;
+
+	fd = lock_fds[list_idx].fds[seg_idx];
+	/* does this lock already exist? */
+	if (fd >= 0)
+		return fd;
+
+	eal_get_hugefile_lock_path(path, sizeof(path),
+			list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+
+	fd = open(path, O_CREAT | O_RDWR, 0660);
+	if (fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): error creating lockfile '%s': %s\n",
+			__func__, path, strerror(errno));
+		return -1;
+	}
+	/* take out a read lock */
+	if (lock(fd, LOCK_SH) != 1) {
+		RTE_LOG(ERR, EAL, "%s(): failed to take out a readlock on '%s': %s\n",
+			__func__, path, strerror(errno));
+		close(fd);
+		return -1;
+	}
+	/* store it for future reference */
+	lock_fds[list_idx].fds[seg_idx] = fd;
+	lock_fds[list_idx].count++;
+	return fd;
+}
+
+static int unlock_segment(int list_idx, int seg_idx)
+{
+	int fd, ret;
+
+	if (list_idx < 0 || list_idx >= (int)RTE_DIM(lock_fds))
+		return -1;
+	if (seg_idx < 0 || seg_idx >= lock_fds[list_idx].len)
+		return -1;
+
+	fd = lock_fds[list_idx].fds[seg_idx];
+
+	/* upgrade lock to exclusive to see if we can remove the lockfile */
+	ret = lock(fd, LOCK_EX);
+	if (ret == 1) {
+		/* we've succeeded in taking exclusive lock, this lockfile may
+		 * be removed.
+		 */
+		char path[PATH_MAX] = {0};
+		eal_get_hugefile_lock_path(path, sizeof(path),
+				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
+		if (unlink(path)) {
+			RTE_LOG(ERR, EAL, "%s(): error removing lockfile '%s': %s\n",
+					__func__, path, strerror(errno));
+		}
+	}
+	/* we don't want to leak the fd, so even if we fail to lock, close fd
+	 * and remove it from list anyway.
+	 */
+	close(fd);
+	lock_fds[list_idx].fds[seg_idx] = -1;
+	lock_fds[list_idx].count--;
+
+	if (ret < 0)
+		return -1;
+	return 0;
+}
+
 static int
 get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 		unsigned int list_idx, unsigned int seg_idx)
@@ -245,31 +300,29 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 	int fd;
 
 	if (internal_config.single_file_segments) {
-		/*
-		 * try to find a tailq entry, for this memseg list, or create
-		 * one if it doesn't exist.
-		 */
-		struct msl_entry *te = get_msl_entry_by_idx(list_idx);
-		if (te == NULL) {
-			RTE_LOG(ERR, EAL, "%s(): cannot allocate tailq entry for memseg list\n",
-				__func__);
-			return -1;
-		} else if (te->fd < 0) {
-			/* create a hugepage file */
-			eal_get_hugefile_path(path, buflen, hi->hugedir,
-					list_idx);
+		/* create a hugepage file path */
+		eal_get_hugefile_path(path, buflen, hi->hugedir, list_idx);
+
+		fd = lock_fds[list_idx].memseg_list_fd;
+
+		if (fd < 0) {
 			fd = open(path, O_CREAT | O_RDWR, 0600);
 			if (fd < 0) {
-				RTE_LOG(DEBUG, EAL, "%s(): open failed: %s\n",
+				RTE_LOG(ERR, EAL, "%s(): open failed: %s\n",
 					__func__, strerror(errno));
 				return -1;
 			}
-			te->fd = fd;
-		} else {
-			fd = te->fd;
+			/* take out a read lock and keep it indefinitely */
+			if (lock(fd, LOCK_SH) < 0) {
+				RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+					__func__, strerror(errno));
+				close(fd);
+				return -1;
+			}
+			lock_fds[list_idx].memseg_list_fd = fd;
 		}
 	} else {
-		/* one file per page, just create it */
+		/* create a hugepage file path */
 		eal_get_hugefile_path(path, buflen, hi->hugedir,
 				list_idx * RTE_MAX_MEMSEG_PER_LIST + seg_idx);
 		fd = open(path, O_CREAT | O_RDWR, 0600);
@@ -278,48 +331,27 @@ get_seg_fd(char *path, int buflen, struct hugepage_info *hi,
 					strerror(errno));
 			return -1;
 		}
+		/* take out a read lock */
+		if (lock(fd, LOCK_SH) < 0) {
+			RTE_LOG(ERR, EAL, "%s(): lock failed: %s\n",
+				__func__, strerror(errno));
+			close(fd);
+			return -1;
+		}
 	}
 	return fd;
 }
 
-/* returns 1 on successful lock, 0 on unsuccessful lock, -1 on error */
-static int lock(int fd, uint64_t offset, uint64_t len, int type)
-{
-	struct flock lck;
-	int ret;
-
-	memset(&lck, 0, sizeof(lck));
-
-	lck.l_type = type;
-	lck.l_whence = SEEK_SET;
-	lck.l_start = offset;
-	lck.l_len = len;
-
-	ret = fcntl(fd, F_SETLK, &lck);
-
-	if (ret && (errno == EAGAIN || errno == EACCES)) {
-		/* locked by another process, not an error */
-		return 0;
-	} else if (ret) {
-		RTE_LOG(ERR, EAL, "%s(): error calling fcntl(): %s\n",
-			__func__, strerror(errno));
-		/* we've encountered an unexpected error */
-		return -1;
-	}
-	return 1;
-}
-
 static int
-resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
-		bool grow)
+resize_hugefile(int fd, char *path, int list_idx, int seg_idx,
+		uint64_t fa_offset, uint64_t page_sz, bool grow)
 {
 	bool again = false;
 	do {
 		if (fallocate_supported == 0) {
 			/* we cannot deallocate memory if fallocate() is not
-			 * supported, but locks are still needed to prevent
-			 * primary process' initialization from clearing out
-			 * huge pages used by this process.
+			 * supported, and hugepage file is already locked at
+			 * creation, so no further synchronization needed.
 			 */
 
 			if (!grow) {
@@ -337,13 +369,10 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
 					__func__, strerror(errno));
 				return -1;
 			}
-			/* not being able to take out a read lock is an error */
-			if (lock(fd, fa_offset, page_sz, F_RDLCK) != 1)
-				return -1;
 		} else {
 			int flags = grow ? 0 : FALLOC_FL_PUNCH_HOLE |
 					FALLOC_FL_KEEP_SIZE;
-			int ret;
+			int ret, lock_fd;
 
 			/* if fallocate() is supported, we need to take out a
 			 * read lock on allocate (to prevent other processes
@@ -351,20 +380,69 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
 			 * lock on deallocate (to ensure nobody else is using
 			 * this page).
 			 *
-			 * we can't use flock() for this, as we actually need to
-			 * lock part of the file, not the entire file.
+			 * read locks on page itself are already taken out at
+			 * file creation, in get_seg_fd().
+			 *
+			 * we cannot rely on simple use of flock() call, because
+			 * we need to be able to lock a section of the file,
+			 * and we cannot use fcntl() locks, because of numerous
+			 * problems with their semantics, so we will use
+			 * deterministically named lock files for each section
+			 * of the file.
+			 *
+			 * if we're shrinking the file, we want to upgrade our
+			 * lock from shared to exclusive.
+			 *
+			 * lock_fd is an fd for a lockfile, not for the segment
+			 * list.
 			 */
+			lock_fd = get_segment_lock_fd(list_idx, seg_idx);
 
 			if (!grow) {
-				ret = lock(fd, fa_offset, page_sz, F_WRLCK);
+				/* we are using this lockfile to determine
+				 * whether this particular page is locked, as we
+				 * are in single file segments mode and thus
+				 * cannot use regular flock() to get this info.
+				 *
+				 * we want to try and take out an exclusive lock
+				 * on the lock file to determine if we're the
+				 * last ones using this page, and if not, we
+				 * won't be shrinking it, and will instead exit
+				 * prematurely.
+				 */
+				ret = lock(lock_fd, LOCK_EX);
+
+				/* drop the lock on the lockfile, so that even
+				 * if we couldn't shrink the file ourselves, we
+				 * are signalling to other processes that we're
+				 * no longer using this page.
+				 */
+				if (unlock_segment(list_idx, seg_idx))
+					RTE_LOG(ERR, EAL, "Could not unlock segment\n");
+
+				/* additionally, if this was the last lock on
+				 * this segment list, we can safely close the
+				 * page file fd, so that one of the processes
+				 * could then delete the file after shrinking.
+				 */
+				if (ret < 1 && lock_fds[list_idx].count == 0) {
+					close(fd);
+					lock_fds[list_idx].memseg_list_fd = -1;
+				}
 
-				if (ret < 0)
+				if (ret < 0) {
+					RTE_LOG(ERR, EAL, "Could not lock segment\n");
 					return -1;
-				else if (ret == 0)
-					/* failed to lock, not an error */
+				}
+				if (ret == 0)
+					/* failed to lock, not an error. */
 					return 0;
 			}
-			if (fallocate(fd, flags, fa_offset, page_sz) < 0) {
+
+			/* grow or shrink the file */
+			ret = fallocate(fd, flags, fa_offset, page_sz);
+
+			if (ret < 0) {
 				if (fallocate_supported == -1 &&
 						errno == ENOTSUP) {
 					RTE_LOG(ERR, EAL, "%s(): fallocate() not supported, hugepage deallocation will be disabled\n",
@@ -380,16 +458,18 @@ resize_hugefile(int fd, uint64_t fa_offset, uint64_t page_sz,
 			} else {
 				fallocate_supported = 1;
 
-				if (grow) {
-					/* if can't read lock, it's an error */
-					if (lock(fd, fa_offset, page_sz,
-							F_RDLCK) != 1)
-						return -1;
-				} else {
-					/* if can't unlock, it's an error */
-					if (lock(fd, fa_offset, page_sz,
-							F_UNLCK) != 1)
-						return -1;
+				/* we've grown/shrunk the file, and we hold an
+				 * exclusive lock now. check if there are no
+				 * more segments active in this segment list,
+				 * and remove the file if there aren't.
+				 */
+				if (lock_fds[list_idx].count == 0) {
+					if (unlink(path))
+						RTE_LOG(ERR, EAL, "%s(): unlinking '%s' failed: %s\n",
+							__func__, path,
+							strerror(errno));
+					close(fd);
+					lock_fds[list_idx].memseg_list_fd = -1;
 				}
 			}
 		}
@@ -411,15 +491,19 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	int fd;
 	size_t alloc_sz;
 
+	/* takes out a read lock on segment or segment list */
 	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
-	if (fd < 0)
+	if (fd < 0) {
+		RTE_LOG(ERR, EAL, "Couldn't get fd on hugepage file\n");
 		return -1;
+	}
 
 	alloc_sz = hi->hugepage_sz;
 	if (internal_config.single_file_segments) {
 		map_offset = seg_idx * alloc_sz;
-		ret = resize_hugefile(fd, map_offset, alloc_sz, true);
-		if (ret < 1)
+		ret = resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
+				alloc_sz, true);
+		if (ret < 0)
 			goto resized;
 	} else {
 		map_offset = 0;
@@ -428,28 +512,6 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 				__func__, strerror(errno));
 			goto resized;
 		}
-		/* we've allocated a page - take out a read lock. we're using
-		 * fcntl() locks rather than flock() here because doing that
-		 * gives us one huge advantage - fcntl() locks are per-process,
-		 * not per-file descriptor, which means that we don't have to
-		 * keep the original fd's around to keep a lock on the file.
-		 *
-		 * this is useful, because when it comes to unmapping pages, we
-		 * will have to take out a write lock (to figure out if another
-		 * process still has this page mapped), and to do itwith flock()
-		 * we'll have to use original fd, as lock is associated with
-		 * that particular fd. with fcntl(), this is not necessary - we
-		 * can open a new fd and use fcntl() on that.
-		 */
-		ret = lock(fd, map_offset, alloc_sz, F_RDLCK);
-
-		/* this should not fail */
-		if (ret != 1) {
-			RTE_LOG(ERR, EAL, "%s(): error locking file: %s\n",
-				__func__,
-				strerror(errno));
-			goto resized;
-		}
 	}
 
 	/*
@@ -518,19 +580,14 @@ alloc_seg(struct rte_memseg *ms, void *addr, int socket_id,
 	munmap(addr, alloc_sz);
 resized:
 	if (internal_config.single_file_segments) {
-		resize_hugefile(fd, map_offset, alloc_sz, false);
-		if (is_zero_length(fd)) {
-			struct msl_entry *te = get_msl_entry_by_idx(list_idx);
-			if (te != NULL && te->fd >= 0) {
-				close(te->fd);
-				te->fd = -1;
-			}
-			/* ignore errors, can't make it any worse */
-			unlink(path);
-		}
+		resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
+				alloc_sz, false);
+		/* ignore failure, can't make it any worse */
 	} else {
+		/* only remove file if we can take out a write lock */
+		if (lock(fd, LOCK_EX) == 1)
+			unlink(path);
 		close(fd);
-		unlink(path);
 	}
 	return -1;
 }
@@ -546,6 +603,14 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 	/* erase page data */
 	memset(ms->addr, 0, ms->len);
 
+	/* if we are not in single file segments mode, we're going to unmap the
+	 * segment and thus drop the lock on original fd, so take out another
+	 * shared lock before we do that.
+	 */
+	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
+	if (fd < 0)
+		return -1;
+
 	if (mmap(ms->addr, ms->len, PROT_READ,
 			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) ==
 				MAP_FAILED) {
@@ -553,41 +618,23 @@ free_seg(struct rte_memseg *ms, struct hugepage_info *hi,
 		return -1;
 	}
 
-	fd = get_seg_fd(path, sizeof(path), hi, list_idx, seg_idx);
-	if (fd < 0)
-		return -1;
-
 	if (internal_config.single_file_segments) {
 		map_offset = seg_idx * ms->len;
-		if (resize_hugefile(fd, map_offset, ms->len, false))
+		if (resize_hugefile(fd, path, list_idx, seg_idx, map_offset,
+				ms->len, false))
 			return -1;
-		/* if file is zero-length, we've already shrunk it, so it's
-		 * safe to remove.
-		 */
-		if (is_zero_length(fd)) {
-			struct msl_entry *te = get_msl_entry_by_idx(list_idx);
-			if (te != NULL && te->fd >= 0) {
-				close(te->fd);
-				te->fd = -1;
-			}
-			unlink(path);
-		}
 		ret = 0;
 	} else {
 		/* if we're able to take out a write lock, we're the last one
 		 * holding onto this page.
 		 */
-
-		ret = lock(fd, 0, ms->len, F_WRLCK);
+		ret = lock(fd, LOCK_EX);
 		if (ret >= 0) {
 			/* no one else is using this page */
 			if (ret == 1)
 				unlink(path);
-			ret = lock(fd, 0, ms->len, F_UNLCK);
-			if (ret != 1)
-				RTE_LOG(ERR, EAL, "%s(): unable to unlock file %s\n",
-					__func__, path);
 		}
+		/* closing fd will drop the lock */
 		close(fd);
 	}
 
@@ -612,7 +659,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	struct alloc_walk_param *wa = arg;
 	struct rte_memseg_list *cur_msl;
 	size_t page_sz;
-	int cur_idx, start_idx, j;
+	int cur_idx, start_idx, j, dir_fd;
 	unsigned int msl_idx, need, i;
 
 	if (msl->page_sz != wa->page_sz)
@@ -633,6 +680,25 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 		return 0;
 	start_idx = cur_idx;
 
+	/* do not allow any page allocations during the time we're allocating,
+	 * because file creation and locking operations are not atomic,
+	 * and we might be the first or the last ones to use a particular page,
+	 * so we need to ensure atomicity of every operation.
+	 */
+	dir_fd = open(wa->hi->hugedir, O_RDONLY);
+	if (dir_fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		return -1;
+	}
+	/* blocking writelock */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot lock '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		close(dir_fd);
+		return -1;
+	}
+
 	for (i = 0; i < need; i++, cur_idx++) {
 		struct rte_memseg *cur;
 		void *map_addr;
@@ -668,6 +734,8 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 			/* clear the list */
 			if (wa->ms)
 				memset(wa->ms, 0, sizeof(*wa->ms) * wa->n_segs);
+
+			close(dir_fd);
 			return -1;
 		}
 		if (wa->ms)
@@ -679,6 +747,7 @@ alloc_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	wa->segs_allocated = i;
 	if (i > 0)
 		cur_msl->version++;
+	close(dir_fd);
 	return 1;
 }
 
@@ -693,7 +762,7 @@ free_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	struct rte_memseg_list *found_msl;
 	struct free_walk_param *wa = arg;
 	uintptr_t start_addr, end_addr;
-	int msl_idx, seg_idx;
+	int msl_idx, seg_idx, ret, dir_fd;
 
 	start_addr = (uintptr_t) msl->base_va;
 	end_addr = start_addr + msl->memseg_arr.len * (size_t)msl->page_sz;
@@ -708,11 +777,34 @@ free_seg_walk(const struct rte_memseg_list *msl, void *arg)
 	/* msl is const */
 	found_msl = &mcfg->memsegs[msl_idx];
 
+	/* do not allow any page allocations during the time we're freeing,
+	 * because file creation and locking operations are not atomic,
+	 * and we might be the first or the last ones to use a particular page,
+	 * so we need to ensure atomicity of every operation.
+	 */
+	dir_fd = open(wa->hi->hugedir, O_RDONLY);
+	if (dir_fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		return -1;
+	}
+	/* blocking writelock */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot lock '%s': %s\n", __func__,
+			wa->hi->hugedir, strerror(errno));
+		close(dir_fd);
+		return -1;
+	}
+
 	found_msl->version++;
 
 	rte_fbarray_set_free(&found_msl->memseg_arr, seg_idx);
 
-	if (free_seg(wa->ms, wa->hi, msl_idx, seg_idx))
+	ret = free_seg(wa->ms, wa->hi, msl_idx, seg_idx);
+
+	close(dir_fd);
+
+	if (ret < 0)
 		return -1;
 
 	return 1;
@@ -1034,22 +1126,46 @@ sync_existing(struct rte_memseg_list *primary_msl,
 		struct rte_memseg_list *local_msl, struct hugepage_info *hi,
 		unsigned int msl_idx)
 {
-	int ret;
+	int ret, dir_fd;
+
+	/* do not allow any page allocations during the time we're allocating,
+	 * because file creation and locking operations are not atomic,
+	 * and we might be the first or the last ones to use a particular page,
+	 * so we need to ensure atomicity of every operation.
+	 */
+	dir_fd = open(hi->hugedir, O_RDONLY);
+	if (dir_fd < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot open '%s': %s\n", __func__,
+			hi->hugedir, strerror(errno));
+		return -1;
+	}
+	/* blocking writelock */
+	if (flock(dir_fd, LOCK_EX)) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot lock '%s': %s\n", __func__,
+			hi->hugedir, strerror(errno));
+		close(dir_fd);
+		return -1;
+	}
 
 	/* ensure all allocated space is the same in both lists */
 	ret = sync_status(primary_msl, local_msl, hi, msl_idx, true);
 	if (ret < 0)
-		return -1;
+		goto fail;
 
 	/* ensure all unallocated space is the same in both lists */
 	ret = sync_status(primary_msl, local_msl, hi, msl_idx, false);
 	if (ret < 0)
-		return -1;
+		goto fail;
 
 	/* update version number */
 	local_msl->version = primary_msl->version;
 
+	close(dir_fd);
+
 	return 0;
+fail:
+	close(dir_fd);
+	return -1;
 }
 
 static int
@@ -1128,11 +1244,46 @@ secondary_msl_create_walk(const struct rte_memseg_list *msl,
 	return 0;
 }
 
+static int
+secondary_lock_list_create_walk(const struct rte_memseg_list *msl,
+		void *arg __rte_unused)
+{
+	struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
+	unsigned int i, len;
+	int msl_idx;
+	int *data;
+
+	msl_idx = msl - mcfg->memsegs;
+	len = msl->memseg_arr.len;
+
+	/* ensure we have space to store lock fd per each possible segment */
+	data = malloc(sizeof(int) * len);
+	if (data == NULL) {
+		RTE_LOG(ERR, EAL, "Unable to allocate space for lock descriptors\n");
+		return -1;
+	}
+	/* set all fd's as invalid */
+	for (i = 0; i < len; i++)
+		data[i] = -1;
+
+	lock_fds[msl_idx].fds = data;
+	lock_fds[msl_idx].len = len;
+	lock_fds[msl_idx].count = 0;
+	lock_fds[msl_idx].memseg_list_fd = -1;
+
+	return 0;
+}
+
 int
 eal_memalloc_init(void)
 {
 	if (rte_eal_process_type() == RTE_PROC_SECONDARY)
 		if (rte_memseg_list_walk(secondary_msl_create_walk, NULL) < 0)
 			return -1;
+
+	/* initialize all of the lock fd lists */
+	if (internal_config.single_file_segments)
+		if (rte_memseg_list_walk(secondary_lock_list_create_walk, NULL))
+			return -1;
 	return 0;
 }
diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
index fadc1de..ad3add8 100644
--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
@@ -259,7 +259,6 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
 	int fd;
 	unsigned i;
 	void *virtaddr;
-	struct flock lck = {0};
 #ifdef RTE_EAL_NUMA_AWARE_HUGEPAGES
 	int node_id = -1;
 	int essential_prev = 0;
@@ -378,13 +377,8 @@ map_all_hugepages(struct hugepage_file *hugepg_tbl, struct hugepage_info *hpi,
 		}
 		*(int *)virtaddr = 0;
 
-
 		/* set shared lock on the file. */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = hugepage_sz;
-		if (fcntl(fd, F_SETLK, &lck) == -1) {
+		if (flock(fd, LOCK_SH) < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): Locking file failed:%s \n",
 				__func__, strerror(errno));
 			close(fd);
@@ -708,7 +702,6 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
 #endif
 		struct hugepage_file *hfile = &hugepages[cur_page];
 		struct rte_memseg *ms = rte_fbarray_get(arr, ms_idx);
-		struct flock lck;
 		void *addr;
 		int fd;
 
@@ -719,11 +712,7 @@ remap_segment(struct hugepage_file *hugepages, int seg_start, int seg_end)
 			return -1;
 		}
 		/* set shared lock on the file. */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = page_sz;
-		if (fcntl(fd, F_SETLK, &lck) == -1) {
+		if (flock(fd, LOCK_SH) < 0) {
 			RTE_LOG(DEBUG, EAL, "Could not lock '%s': %s\n",
 					hfile->filepath, strerror(errno));
 			close(fd);
@@ -1732,7 +1721,6 @@ eal_legacy_hugepage_attach(void)
 		struct hugepage_file *hf = &hp[i];
 		size_t map_sz = hf->size;
 		void *map_addr = hf->final_va;
-		struct flock lck;
 
 		/* if size is zero, no more pages left */
 		if (map_sz == 0)
@@ -1754,11 +1742,7 @@ eal_legacy_hugepage_attach(void)
 		}
 
 		/* set shared lock on the file. */
-		lck.l_type = F_RDLCK;
-		lck.l_whence = SEEK_SET;
-		lck.l_start = 0;
-		lck.l_len = map_sz;
-		if (fcntl(fd, F_SETLK, &lck) == -1) {
+		if (flock(fd, LOCK_SH) < 0) {
 			RTE_LOG(DEBUG, EAL, "%s(): Locking file failed: %s\n",
 				__func__, strerror(errno));
 			close(fd);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v3 0/2] Fix file locking in EAL memory
  2018-04-25 10:36   ` [PATCH v3 " Anatoly Burakov
@ 2018-04-27 21:50     ` Thomas Monjalon
  0 siblings, 0 replies; 22+ messages in thread
From: Thomas Monjalon @ 2018-04-27 21:50 UTC (permalink / raw)
  To: Anatoly Burakov; +Cc: dev, bruce.richardson

> Anatoly Burakov (2):
>   mem: add memalloc init stage
>   mem: revert to using flock() and add per-segment lockfiles

Applied, thanks


* Re: [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-25 10:36   ` [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
@ 2018-04-28  9:38     ` Andrew Rybchenko
  2018-04-30  8:29       ` Burakov, Anatoly
  2018-04-30 11:31       ` Burakov, Anatoly
  0 siblings, 2 replies; 22+ messages in thread
From: Andrew Rybchenko @ 2018-04-28  9:38 UTC (permalink / raw)
  To: Anatoly Burakov, dev; +Cc: bruce.richardson, thomas

On 04/25/2018 01:36 PM, Anatoly Burakov wrote:
> The original implementation used flock() locks, but was later
> switched to using fcntl() locks for page locking, because
> fcntl() locks allow locking parts of a file, which is useful
> for single-file segments mode, where locking the entire file
> isn't as useful because we still need to grow and shrink it.
>
> However, according to fcntl()'s Ubuntu manpage [1], semantics of
> fcntl() locks have a giant oversight:
>
>    This interface follows the completely stupid semantics of System
>    V and IEEE Std 1003.1-1988 (“POSIX.1”) that require that all
>    locks associated with a file for a given process are removed
>    when any file descriptor for that file is closed by that process.
>    This semantic means that applications must be aware of any files
>    that a subroutine library may access.
>
> Basically, closing *any* fd with an fcntl() lock (which we do because
> we don't want to leak fd's) will drop the lock completely.
>
> So, in this commit, we will be reverting back to using flock() locks
> everywhere. However, that still leaves the problem of locking parts
> of a memseg list file in single file segments mode, and we will be
> solving it with creating separate lock files per each page, and
> tracking those with flock().
>
> We will also be removing all of this tailq business and replacing it
> with a simple array - saving a few bytes is not worth the extra
> hassle of dealing with pointers and potential memory allocation
> failures. Also, remove the tailq lock since it is not needed - these
> fd lists are per-process, and within a given process, it is always
> only one thread handling access to hugetlbfs.
>
> So, first one to allocate a segment will create a lockfile, and put
> a shared lock on it. When we're shrinking the page file, we will be
> trying to take out a write lock on that lockfile, which would fail if
> any other process is holding onto the lockfile as well. This way, we
> can know if we can shrink the segment file. Also, if no other locks
> are found in the lock list for a given memseg list, the memseg list
> fd is automatically closed.
>
> One other thing to note is, according to flock() Ubuntu manpage [2],
> upgrading the lock from shared to exclusive is implemented by dropping
> and reacquiring the lock, which is not atomic and thus would have
> created race conditions. So, on attempting to perform operations in
> hugetlbfs, we will take out a writelock on hugetlbfs directory, so
> that only one process could perform hugetlbfs operations concurrently.
>
> [1] http://manpages.ubuntu.com/manpages/artful/en/man2/fcntl.2freebsd.html
> [2] http://manpages.ubuntu.com/manpages/bionic/en/man2/flock.2.html
>
> Fixes: 66cc45e293ed ("mem: replace memseg with memseg lists")
> Fixes: 582bed1e1d1d ("mem: support mapping hugepages at runtime")
> Fixes: a5ff05d60fc5 ("mem: support unmapping pages at runtime")
> Fixes: 2a04139f66b4 ("eal: add single file segments option")
> Cc: anatoly.burakov@intel.com
>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> Acked-by: Bruce Richardson <bruce.richardson@intel.com>

We have a problem with the changeset if EAL option -m or --socket-mem is 
used.
EAL initialization hangs just after EAL: Probing VFIO support...
strace points to flock(7, LOCK_EX
List of file descriptors:
# ls /proc/25452/fd -l
total 0
lrwx------ 1 root root 64 Apr 28 10:34 0 -> /dev/pts/0
lrwx------ 1 root root 64 Apr 28 10:34 1 -> /dev/pts/0
lrwx------ 1 root root 64 Apr 28 10:32 2 -> /dev/pts/0
lrwx------ 1 root root 64 Apr 28 10:34 3 -> /run/.rte_config
lrwx------ 1 root root 64 Apr 28 10:34 4 -> socket:[154166]
lrwx------ 1 root root 64 Apr 28 10:34 5 -> socket:[154158]
lr-x------ 1 root root 64 Apr 28 10:34 6 -> /dev/hugepages
lr-x------ 1 root root 64 Apr 28 10:34 7 -> /dev/hugepages

I guess the problem is that there are two fds open on /dev/hugepages and
it hangs on the second flock().

Ideas how to solve it?

Andrew.


* Re: [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-28  9:38     ` Andrew Rybchenko
@ 2018-04-30  8:29       ` Burakov, Anatoly
  2018-04-30 11:31       ` Burakov, Anatoly
  1 sibling, 0 replies; 22+ messages in thread
From: Burakov, Anatoly @ 2018-04-30  8:29 UTC (permalink / raw)
  To: Andrew Rybchenko, dev; +Cc: bruce.richardson, thomas

On 28-Apr-18 10:38 AM, Andrew Rybchenko wrote:
> On 04/25/2018 01:36 PM, Anatoly Burakov wrote:
>> [commit message snipped]
> 
> We have a problem with the changeset if EAL option -m or --socket-mem is 
> used.
> EAL initialization hangs just after EAL: Probing VFIO support...
> strace points to flock(7, LOCK_EX
> List of file descriptors:
> # ls /proc/25452/fd -l
> total 0
> lrwx------ 1 root root 64 Apr 28 10:34 0 -> /dev/pts/0
> lrwx------ 1 root root 64 Apr 28 10:34 1 -> /dev/pts/0
> lrwx------ 1 root root 64 Apr 28 10:32 2 -> /dev/pts/0
> lrwx------ 1 root root 64 Apr 28 10:34 3 -> /run/.rte_config
> lrwx------ 1 root root 64 Apr 28 10:34 4 -> socket:[154166]
> lrwx------ 1 root root 64 Apr 28 10:34 5 -> socket:[154158]
> lr-x------ 1 root root 64 Apr 28 10:34 6 -> /dev/hugepages
> lr-x------ 1 root root 64 Apr 28 10:34 7 -> /dev/hugepages
> 
> I guess the problem is that there are two /dev/hugepages and
> it hangs on the second.
> 
> Ideas how to solve it?
> 
> Andrew.
> 

Seeing similar reports from validation. I'm looking into it.

-- 
Thanks,
Anatoly


* Re: [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-28  9:38     ` Andrew Rybchenko
  2018-04-30  8:29       ` Burakov, Anatoly
@ 2018-04-30 11:31       ` Burakov, Anatoly
  2018-04-30 11:51         ` Maxime Coquelin
  2018-04-30 13:08         ` Andrew Rybchenko
  1 sibling, 2 replies; 22+ messages in thread
From: Burakov, Anatoly @ 2018-04-30 11:31 UTC (permalink / raw)
  To: Andrew Rybchenko, dev; +Cc: bruce.richardson, thomas

On 28-Apr-18 10:38 AM, Andrew Rybchenko wrote:
> [quoted commit message and hang report snipped]

Hi Andrew,

Please try the following patch:

http://dpdk.org/dev/patchwork/patch/39166/

This should fix the issue.

-- 
Thanks,
Anatoly


* Re: [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-30 11:31       ` Burakov, Anatoly
@ 2018-04-30 11:51         ` Maxime Coquelin
  2018-04-30 13:08         ` Andrew Rybchenko
  1 sibling, 0 replies; 22+ messages in thread
From: Maxime Coquelin @ 2018-04-30 11:51 UTC (permalink / raw)
  To: Burakov, Anatoly, Andrew Rybchenko, dev; +Cc: bruce.richardson, thomas



On 04/30/2018 01:31 PM, Burakov, Anatoly wrote:
> [quoted commit message and hang report snipped]
> Hi Andrew,
> 
> Please try the following patch:
> 
> http://dpdk.org/dev/patchwork/patch/39166/
> 
> This should fix the issue.
> 

I hit the same regression on my test bench; your patch fixes the issue
in my case:

Tested-by: Maxime Coquelin <maxime.coquelin@redhat.com>

Thanks,
Maxime


* Re: [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles
  2018-04-30 11:31       ` Burakov, Anatoly
  2018-04-30 11:51         ` Maxime Coquelin
@ 2018-04-30 13:08         ` Andrew Rybchenko
  1 sibling, 0 replies; 22+ messages in thread
From: Andrew Rybchenko @ 2018-04-30 13:08 UTC (permalink / raw)
  To: Burakov, Anatoly, dev; +Cc: bruce.richardson, thomas

On 04/30/2018 02:31 PM, Burakov, Anatoly wrote:
> [quoted commit message and hang report snipped]
> Hi Andrew,
>
> Please try the following patch:
>
> http://dpdk.org/dev/patchwork/patch/39166/
>
> This should fix the issue.

Hi Anatoly,

yes, it fixes the issue.

Thanks,
Andrew.


Thread overview: 22+ messages
2018-04-19 12:26 [PATCH 0/2] Fix file locking in EAL memory Anatoly Burakov
2018-04-19 12:26 ` [PATCH 1/2] mem: add memalloc init stage Anatoly Burakov
2018-04-24 14:06   ` Bruce Richardson
2018-04-19 12:26 ` [PATCH 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
2018-04-24 13:57   ` Bruce Richardson
2018-04-24 14:07   ` Bruce Richardson
2018-04-24 15:54 ` [PATCH v2 0/2] Fix file locking in EAL memory Anatoly Burakov
2018-04-24 16:32   ` Stephen Hemminger
2018-04-24 17:25     ` Burakov, Anatoly
2018-04-24 20:05       ` Thomas Monjalon
2018-04-24 20:34         ` Stephen Hemminger
2018-04-25 10:36   ` [PATCH v3 " Anatoly Burakov
2018-04-27 21:50     ` Thomas Monjalon
2018-04-25 10:36   ` [PATCH v3 1/2] mem: add memalloc init stage Anatoly Burakov
2018-04-25 10:36   ` [PATCH v3 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
2018-04-28  9:38     ` Andrew Rybchenko
2018-04-30  8:29       ` Burakov, Anatoly
2018-04-30 11:31       ` Burakov, Anatoly
2018-04-30 11:51         ` Maxime Coquelin
2018-04-30 13:08         ` Andrew Rybchenko
2018-04-24 15:54 ` [PATCH v2 1/2] mem: add memalloc init stage Anatoly Burakov
2018-04-24 15:54 ` [PATCH v2 2/2] mem: revert to using flock() and add per-segment lockfiles Anatoly Burakov
