[PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory

* [PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory
@ 2009-05-27 11:12 ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2009-05-27 11:12 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton, stable, Linux Memory Management List
  Cc: Linux Kernel Mailing List, Hugh Dickins, Lee Schermerhorn,
	KOSAKI Motohiro, starlight, Eric B Munson, Adam Litke,
	Andy Whitcroft, wli

The following two patches are required to fix problems reported by
starlight@binnacle.cx. The test cases both involve two processes interacting
with shared memory segments backed by hugetlbfs.

Patch 1 fixes an x86-specific problem where regions sharing page tables
are not being reference counted properly. The page tables get freed early
resulting in bad PMD messages printed to the kernel log and the hugetlb
counters getting corrupted. Strictly speaking, this affects mainline but
the problem is masked by UNEVICTABLE_LRU as it never leaves VM_LOCKED set for
hugetlbfs-backed mappings. This does affect the stable branch of 2.6.27 and
distributions based on that kernel such as SLES 11. This patch is required
for 2.6.27-stable and while it is optional for mainline, it should be merged
so that the stable branch does not contain patches that are not in mainline.

Patch 2 fixes a general hugetlbfs problem where it is using VM_SHARED instead
of VM_MAYSHARE to detect if the mapping was MAP_SHARED or MAP_PRIVATE. This
causes hugetlbfs to attempt reserving more pages than is required for
MAP_SHARED and mmap() fails when it should succeed. This patch is needed
for 2.6.30 and -stable. It rejects against 2.6.27.24 but the reject is
trivially resolved by changing the last VM_SHARED in hugetlb_reserve_pages()
to VM_MAYSHARE.

Starlight, if you are still watching, can you reconfirm that these patches
fix the problems you were having?

 arch/x86/mm/hugetlbpage.c |    6 +++++-
 mm/hugetlb.c              |   26 +++++++++++++-------------
 2 files changed, 18 insertions(+), 14 deletions(-)

* [PATCH 1/2] x86: Ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not
  2009-05-27 11:12 ` Mel Gorman
@ 2009-05-27 11:12   ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2009-05-27 11:12 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton, stable, Linux Memory Management List
  Cc: Linux Kernel Mailing List, Hugh Dickins, Lee Schermerhorn,
	KOSAKI Motohiro, starlight, Eric B Munson, Adam Litke,
	Andy Whitcroft, wli

On x86 and x86-64, it is possible that page tables are shared between shared
mappings backed by hugetlbfs. As part of this, page_table_shareable() checks
a pair of vma->vm_flags and they must match if they are to be shared. All
VMA flags are taken into account, including VM_LOCKED.

The problem is that VM_LOCKED is cleared on fork(). When a process with a
shared memory segment forks() to exec() a helper, there will be shared VMAs
with different flags. The impact is that the shared segment is sometimes
considered shareable and other times not, depending on what process is
checking.

What happens is that the segment page tables are being shared but the count is
inaccurate depending on the ordering of events. As the page tables are freed
with put_page(), bad pmd's are found when some of the children exit. The
hugepage counters also get corrupted and the Total and Free count will
no longer match even when all the hugepage-backed regions are freed. This
requires a reboot of the machine to "fix".

This patch addresses the problem by comparing all flags except VM_LOCKED when
deciding if pagetables should be shared or not for hugetlbfs-backed mappings.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
---
 arch/x86/mm/hugetlbpage.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 8f307d9..f46c340 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -26,12 +26,16 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	unsigned long sbase = saddr & PUD_MASK;
 	unsigned long s_end = sbase + PUD_SIZE;

+	/* Allow segments to share if only one is marked locked */
+	unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
+	unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED;
+
 	/*
 	 * match the virtual addresses, permission and the alignment of the
 	 * page table page.
 	 */
 	if (pmd_index(addr) != pmd_index(saddr) ||
-	    vma->vm_flags != svma->vm_flags ||
+	    vm_flags != svm_flags ||
 	    sbase < svma->vm_start || svma->vm_end < s_end)
 		return 0;

-- 
1.5.6.5

* [PATCH 2/2] mm: Account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in hugetlbfs
  2009-05-27 11:12 ` Mel Gorman
@ 2009-05-27 11:12   ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2009-05-27 11:12 UTC (permalink / raw)
  To: Ingo Molnar, Andrew Morton, stable, Linux Memory Management List
  Cc: Linux Kernel Mailing List, Hugh Dickins, Lee Schermerhorn,
	KOSAKI Motohiro, starlight, Eric B Munson, Adam Litke,
	Andy Whitcroft, wli

hugetlbfs reserves huge pages but does not fault them at mmap() time to ensure
that future faults succeed. The reservation behaviour differs depending on
whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. For MAP_SHARED
mappings, hugepages are reserved when mmap() is first called and are tracked
based on information associated with the inode. Other processes mapping
MAP_SHARED use the same reservation. MAP_PRIVATE mappings track their reservations
based on the VMA created as part of the mmap() operation. Each process
mapping MAP_PRIVATE must make its own reservation.

hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag and
not VM_MAYSHARE.  For file-backed mappings, such as hugetlbfs, VM_SHARED is
set only if the mapping is MAP_SHARED and the file was opened read-write. If a
shared memory mapping was mapped shared-read-write to populate data and
mapped shared-read-only by other processes, then hugetlbfs would account for
the mapping as if it was MAP_PRIVATE.  This causes processes to fail to map
the file MAP_SHARED even though it should succeed as the reservation is there.

This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when
the intent of the code was to check whether the VMA was mapped MAP_SHARED
or MAP_PRIVATE.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/hugetlb.c |   26 +++++++++++++-------------
 1 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28c655b..e83ad2c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -316,7 +316,7 @@ static void resv_map_release(struct kref *ref)
 static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	if (!(vma->vm_flags & VM_SHARED))
+	if (!(vma->vm_flags & VM_MAYSHARE))
 		return (struct resv_map *)(get_vma_private_data(vma) &
 							~HPAGE_RESV_MASK);
 	return NULL;
@@ -325,7 +325,7 @@ static struct resv_map *vma_resv_map(struct vm_area_struct *vma)
 static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	VM_BUG_ON(vma->vm_flags & VM_SHARED);
+	VM_BUG_ON(vma->vm_flags & VM_MAYSHARE);
 
 	set_vma_private_data(vma, (get_vma_private_data(vma) &
 				HPAGE_RESV_MASK) | (unsigned long)map);
@@ -334,7 +334,7 @@ static void set_vma_resv_map(struct vm_area_struct *vma, struct resv_map *map)
 static void set_vma_resv_flags(struct vm_area_struct *vma, unsigned long flags)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	VM_BUG_ON(vma->vm_flags & VM_SHARED);
+	VM_BUG_ON(vma->vm_flags & VM_MAYSHARE);
 
 	set_vma_private_data(vma, get_vma_private_data(vma) | flags);
 }
@@ -353,7 +353,7 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
 	if (vma->vm_flags & VM_NORESERVE)
 		return;
 
-	if (vma->vm_flags & VM_SHARED) {
+	if (vma->vm_flags & VM_MAYSHARE) {
 		/* Shared mappings always use reserves */
 		h->resv_huge_pages--;
 	} else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
@@ -369,14 +369,14 @@ static void decrement_hugepage_resv_vma(struct hstate *h,
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
 {
 	VM_BUG_ON(!is_vm_hugetlb_page(vma));
-	if (!(vma->vm_flags & VM_SHARED))
+	if (!(vma->vm_flags & VM_MAYSHARE))
 		vma->vm_private_data = (void *)0;
 }
 
 /* Returns true if the VMA has associated reserve pages */
 static int vma_has_reserves(struct vm_area_struct *vma)
 {
-	if (vma->vm_flags & VM_SHARED)
+	if (vma->vm_flags & VM_MAYSHARE)
 		return 1;
 	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER))
 		return 1;
@@ -924,7 +924,7 @@ static long vma_needs_reservation(struct hstate *h,
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
-	if (vma->vm_flags & VM_SHARED) {
+	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		return region_chg(&inode->i_mapping->private_list,
 							idx, idx + 1);
@@ -949,7 +949,7 @@ static void vma_commit_reservation(struct hstate *h,
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct inode *inode = mapping->host;
 
-	if (vma->vm_flags & VM_SHARED) {
+	if (vma->vm_flags & VM_MAYSHARE) {
 		pgoff_t idx = vma_hugecache_offset(h, vma, addr);
 		region_add(&inode->i_mapping->private_list, idx, idx + 1);
 
@@ -1893,7 +1893,7 @@ retry_avoidcopy:
 	 * at the time of fork() could consume its reserves on COW instead
 	 * of the full address range.
 	 */
-	if (!(vma->vm_flags & VM_SHARED) &&
+	if (!(vma->vm_flags & VM_MAYSHARE) &&
 			is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
 			old_page != pagecache_page)
 		outside_reserve = 1;
@@ -2000,7 +2000,7 @@ retry:
 		clear_huge_page(page, address, huge_page_size(h));
 		__SetPageUptodate(page);
 
-		if (vma->vm_flags & VM_SHARED) {
+		if (vma->vm_flags & VM_MAYSHARE) {
 			int err;
 			struct inode *inode = mapping->host;
 
@@ -2104,7 +2104,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			goto out_mutex;
 		}
 
-		if (!(vma->vm_flags & VM_SHARED))
+		if (!(vma->vm_flags & VM_MAYSHARE))
 			pagecache_page = hugetlbfs_pagecache_page(h,
 								vma, address);
 	}
@@ -2289,7 +2289,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * to reserve the full area even if read-only as mprotect() may be
 	 * called to make the mapping read-write. Assume !vma is a shm mapping
 	 */
-	if (!vma || vma->vm_flags & VM_SHARED)
+	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		chg = region_chg(&inode->i_mapping->private_list, from, to);
 	else {
 		struct resv_map *resv_map = resv_map_alloc();
@@ -2330,7 +2330,7 @@ int hugetlb_reserve_pages(struct inode *inode,
 	 * consumed reservations are stored in the map. Hence, nothing
 	 * else has to be done for private mappings here
 	 */
-	if (!vma || vma->vm_flags & VM_SHARED)
+	if (!vma || vma->vm_flags & VM_MAYSHARE)
 		region_add(&inode->i_mapping->private_list, from, to);
 	return 0;
 }
-- 
1.5.6.5


* Re: [PATCH 1/2] x86: Ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not
  2009-05-27 11:12   ` Mel Gorman
  (?)
@ 2009-05-27 16:38   ` Eric B Munson
  -1 siblings, 0 replies; 40+ messages in thread
From: Eric B Munson @ 2009-05-27 16:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Andrew Morton, stable, Linux Memory Management List,
	Linux Kernel Mailing List, Hugh Dickins, Lee Schermerhorn,
	KOSAKI Motohiro, starlight, Adam Litke, Andy Whitcroft, wli

[-- Attachment #1: Type: text/plain, Size: 1577 bytes --]

On Wed, 27 May 2009, Mel Gorman wrote:

> On x86 and x86-64, it is possible that page tables are shared between shared
> mappings backed by hugetlbfs. As part of this, page_table_shareable() checks
> a pair of vma->vm_flags and they must match if they are to be shared. All
> VMA flags are taken into account, including VM_LOCKED.
> 
> The problem is that VM_LOCKED is cleared on fork(). When a process with a
> shared memory segment forks() to exec() a helper, there will be shared VMAs
> with different flags. The impact is that the shared segment is sometimes
> considered shareable and other times not, depending on what process is
> checking.
> 
> What happens is that the segment page tables are being shared but the count is
> inaccurate depending on the ordering of events. As the page tables are freed
> with put_page(), bad pmd's are found when some of the children exit. The
> hugepage counters also get corrupted and the Total and Free count will
> no longer match even when all the hugepage-backed regions are freed. This
> requires a reboot of the machine to "fix".
> 
> This patch addresses the problem by comparing all flags except VM_LOCKED when
> deciding if pagetables should be shared or not for hugetlbfs-backed mapping.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>

I tested this patch using 2.6.30-rc7 and the libhugetlbfs test suite on x86_64.
Everything looks good to me.

Acked-by: Eric B Munson <ebmunson@us.ibm.com>
Tested-by: Eric B Munson <ebmunson@us.ibm.com>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

* Re: [PATCH 2/2] mm: Account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in hugetlbfs
  2009-05-27 11:12   ` Mel Gorman
  (?)
@ 2009-05-27 16:40   ` Eric B Munson
  -1 siblings, 0 replies; 40+ messages in thread
From: Eric B Munson @ 2009-05-27 16:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Andrew Morton, stable, Linux Memory Management List,
	Linux Kernel Mailing List, Hugh Dickins, Lee Schermerhorn,
	KOSAKI Motohiro, starlight, Adam Litke, Andy Whitcroft, wli

[-- Attachment #1: Type: text/plain, Size: 1655 bytes --]

On Wed, 27 May 2009, Mel Gorman wrote:

> hugetlbfs reserves huge pages but does not fault them at mmap() time to ensure
> that future faults succeed. The reservation behaviour differs depending on
> whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. For MAP_SHARED
> mappings, hugepages are reserved when mmap() is first called and are tracked
> based on information associated with the inode. Other processes mapping
> MAP_SHARED use the same reservation. MAP_PRIVATE track the reservations
> based on the VMA created as part of the mmap() operation. Each process
> mapping MAP_PRIVATE must make its own reservation.
> 
> hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag and
> not VM_MAYSHARE.  For file-backed mappings, such as hugetlbfs, VM_SHARED is
> set only if the mapping is MAP_SHARED and the file was opened read-write. If a
> shared memory mapping was mapped shared-read-write for populating of data and
> mapped shared-read-only by other processes, then hugetlbfs would account for
> the mapping as if it was MAP_PRIVATE.  This causes processes to fail to map
> the file MAP_SHARED even though it should succeed as the reservation is there.
> 
> This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when
> the intent of the code was to check whether the VMA was mapped MAP_SHARED
> or MAP_PRIVATE.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

I tested this patch on both x86_64 and ppc64 using 2.6.30-rc7 with the libhugetlbfs
test suite and everything looks good.

Acked-by: Eric B Munson <ebmunson@us.ibm.com>
Tested-by: Eric B Munson <ebmunson@us.ibm.com>

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

* Re: [PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory
  2009-05-27 11:12 ` Mel Gorman
@ 2009-05-27 20:14   ` Andrew Morton
  -1 siblings, 0 replies; 40+ messages in thread
From: Andrew Morton @ 2009-05-27 20:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: mingo, stable, linux-mm, linux-kernel, hugh.dickins,
	Lee.Schermerhorn, kosaki.motohiro, starlight, ebmunson, agl, apw,
	wli

On Wed, 27 May 2009 12:12:27 +0100
Mel Gorman <mel@csn.ul.ie> wrote:

> The following two patches are required to fix problems reported by
> starlight@binnacle.cx. The test cases both involve two processes interacting
> with shared memory segments backed by hugetlbfs.

Thanks.

Both of these address http://bugzilla.kernel.org/show_bug.cgi?id=13302, yes?
I added that info to the changelogs, to close the loop.

Ingo, I'd propose merging both these together rather than routing one
via the x86 tree, OK?

Question is: when?  Are we confident enough to merge it into 2.6.30
now, or should we hold off for 2.6.30.1?  I guess we have a week or
more, and if the changes do break something, we can fix that in
2.6.30.1 ;)

* Re: [PATCH 1/2] x86: Ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not
  2009-05-27 11:12   ` Mel Gorman
@ 2009-05-27 23:18     ` Ingo Molnar
  -1 siblings, 0 replies; 40+ messages in thread
From: Ingo Molnar @ 2009-05-27 23:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, stable, Linux Memory Management List,
	Linux Kernel Mailing List, Hugh Dickins, Lee Schermerhorn,
	KOSAKI Motohiro, starlight, Eric B Munson, Adam Litke,
	Andy Whitcroft, wli


* Mel Gorman <mel@csn.ul.ie> wrote:

> On x86 and x86-64, it is possible that page tables are shared 
> between shared mappings backed by hugetlbfs. As part of this, 
> page_table_shareable() checks a pair of vma->vm_flags and they 
> must match if they are to be shared. All VMA flags are taken into 
> account, including VM_LOCKED.
> 
> The problem is that VM_LOCKED is cleared on fork(). When a process 
> with a shared memory segment forks() to exec() a helper, there 
> will be shared VMAs with different flags. The impact is that the 
> shared segment is sometimes considered shareable and other times 
> not, depending on what process is checking.
> 
> What happens is that the segment page tables are being shared but 
> the count is inaccurate depending on the ordering of events. As 
> the page tables are freed with put_page(), bad pmd's are found 
> when some of the children exit. The hugepage counters also get 
> corrupted and the Total and Free count will no longer match even 
> when all the hugepage-backed regions are freed. This requires a 
> reboot of the machine to "fix".
> 
> This patch addresses the problem by comparing all flags except 
> VM_LOCKED when deciding if pagetables should be shared or not for 
> hugetlbfs-backed mapping.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> ---
>  arch/x86/mm/hugetlbpage.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)

i suspect it would be best to do this via -mm, due to the (larger) 
mm/hugetlb.c cross section, right?

	Ingo
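
A minimal user-space sketch of the flag comparison described in the
quoted changelog. The flag values mirror the kernel's VM_* bits but are
redefined here only so the sketch builds outside the kernel, and both
helper names are hypothetical. The real check lives in
page_table_shareable() and is not reproduced here.

```c
#include <stdbool.h>

/* Illustrative values only, redefined for a user-space build. */
#define VM_READ    0x0001UL
#define VM_WRITE   0x0002UL
#define VM_SHARED  0x0008UL
#define VM_LOCKED  0x2000UL

/* Hypothetical helper: pre-patch behaviour, every flag must match. */
static bool flags_match_strict(unsigned long a, unsigned long b)
{
	return a == b;
}

/* Hypothetical helper: patched behaviour, VM_LOCKED is masked out of
 * both sides before the comparison. */
static bool flags_match_ignoring_locked(unsigned long a, unsigned long b)
{
	return (a & ~VM_LOCKED) == (b & ~VM_LOCKED);
}
```

With a parent VMA that has VM_LOCKED set and a forked child where it
has been cleared, the strict comparison reports a mismatch while the
masked comparison still treats the two mappings as shareable.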

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory
  2009-05-27 20:14   ` Andrew Morton
@ 2009-05-27 23:19     ` Ingo Molnar
  -1 siblings, 0 replies; 40+ messages in thread
From: Ingo Molnar @ 2009-05-27 23:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, stable, linux-mm, linux-kernel, hugh.dickins,
	Lee.Schermerhorn, kosaki.motohiro, starlight, ebmunson, agl, apw,
	wli


* Andrew Morton <akpm@linux-foundation.org> wrote:

> On Wed, 27 May 2009 12:12:27 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The following two patches are required to fix problems reported by
> > starlight@binnacle.cx. The test cases both involve two processes interacting
> > with shared memory segments backed by hugetlbfs.
> 
> Thanks.
> 
> Both of these address 
> http://bugzilla.kernel.org/show_bug.cgi?id=13302, yes? I added 
> that info to the changelogs, to close the loop.
> 
> Ingo, I'd propose merging both these together rather than routing 
> one via the x86 tree, OK?

sure.

> Question is: when?  Are we confident enough to merge it into 
> 2.6.30 now, or should we hold off for 2.6.30.1?  I guess we have a 
> week or more, and if the changes do break something, we can fix 
> that in 2.6.30.1 ;)

With an Acked-by from Hugh i feel pretty confident about it - and as 
long as it gets into -rc8 i think we should do it in .30.

	Ingo 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] x86: Ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not
  2009-05-27 23:18     ` Ingo Molnar
@ 2009-05-28  8:55       ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2009-05-28  8:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, stable, Linux Memory Management List,
	Linux Kernel Mailing List, Hugh Dickins, Lee Schermerhorn,
	KOSAKI Motohiro, starlight, Eric B Munson, Adam Litke,
	Andy Whitcroft, wli

On Thu, May 28, 2009 at 01:18:03AM +0200, Ingo Molnar wrote:
> 
> * Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > On x86 and x86-64, it is possible that page tables are shared 
> > between shared mappings backed by hugetlbfs. As part of this, 
> > page_table_shareable() checks a pair of vma->vm_flags and they 
> > must match if they are to be shared. All VMA flags are taken into 
> > account, including VM_LOCKED.
> > 
> > The problem is that VM_LOCKED is cleared on fork(). When a process 
> > with a shared memory segment forks() to exec() a helper, there 
> > will be shared VMAs with different flags. The impact is that the 
> > shared segment is sometimes considered shareable and other times 
> > not, depending on what process is checking.
> > 
> > What happens is that the segment page tables are being shared but 
> > the count is inaccurate depending on the ordering of events. As 
> > the page tables are freed with put_page(), bad pmd's are found 
> > when some of the children exit. The hugepage counters also get 
> > corrupted and the Total and Free count will no longer match even 
> > when all the hugepage-backed regions are freed. This requires a 
> > reboot of the machine to "fix".
> > 
> > This patch addresses the problem by comparing all flags except 
> > VM_LOCKED when deciding if pagetables should be shared or not for 
> > hugetlbfs-backed mapping.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> > ---
> >  arch/x86/mm/hugetlbpage.c |    6 +++++-
> >  1 files changed, 5 insertions(+), 1 deletions(-)
> 
> i suspect it would be best to do this via -mm, due to the (larger) 
> mm/hugetlb.c cross section, right?
> 

I'm happy with that approach. Almost all hugetlbfs-related patches have
gone through -mm to date AFAIK even when they have been arch specific
like this.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory
  2009-05-27 20:14   ` Andrew Morton
@ 2009-05-28  8:56     ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2009-05-28  8:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, stable, linux-mm, linux-kernel, hugh.dickins,
	Lee.Schermerhorn, kosaki.motohiro, starlight, ebmunson, agl, apw,
	wli

On Wed, May 27, 2009 at 01:14:37PM -0700, Andrew Morton wrote:
> On Wed, 27 May 2009 12:12:27 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The following two patches are required to fix problems reported by
> > starlight@binnacle.cx. The test cases both involve two processes interacting
> > with shared memory segments backed by hugetlbfs.
> 
> Thanks.
> 
> Both of these address http://bugzilla.kernel.org/show_bug.cgi?id=13302, yes?
> I added that info to the changelogs, to close the loop.
> 

Yes. I'm sorry, I should have included that information in the leader. I
had a niggling feeling I was forgetting something to add to the changelog -
this was it :)

> Ingo, I'd propose merging both these together rather than routing one
> via the x86 tree, OK?
> 
> Question is: when?  Are we confident enough to merge it into 2.6.30
> now, or should we hold off for 2.6.30.1?  I guess we have a week or
> more, and if the changes do break something, we can fix that in
> 2.6.30.1 ;)
> 

FWIW, I'm reasonably confident based on libhugetlbfs regression testing that
I haven't broken something new. If they make it into 2.6.30-rc8, so much
the better. Thanks.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory
  2009-05-27 11:12 ` Mel Gorman
@ 2009-06-08  1:25   ` starlight
  -1 siblings, 0 replies; 40+ messages in thread
From: starlight @ 2009-06-08  1:25 UTC (permalink / raw)
  To: Mel Gorman, Ingo Molnar, Andrew Morton, stable,
	Linux Memory Management List
  Cc: Linux Kernel Mailing List, Hugh Dickins, Lee Schermerhorn,
	KOSAKI Motohiro, Eric B Munson, Adam Litke, Andy Whitcroft, wli

Mel,

Tried out the two new patches on 2.6.26.4 and everything is 
working now.  The application that uncovered the issue works 
perfectly and hugepages function sanely.

Thank you for the fix.

Regards


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory
  2009-06-08  1:25   ` starlight
@ 2009-06-08 10:24     ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2009-06-08 10:24 UTC (permalink / raw)
  To: starlight
  Cc: Ingo Molnar, Andrew Morton, stable, Linux Memory Management List,
	Linux Kernel Mailing List, Hugh Dickins, Lee Schermerhorn,
	KOSAKI Motohiro, Eric B Munson, Adam Litke, Andy Whitcroft, wli

On Sun, Jun 07, 2009 at 09:25:06PM -0400, starlight@binnacle.cx wrote:
> Mel,
> 
> Tried out the two new patches on 2.6.26.4 and everything is 
> working now.  The application that uncovered the issue works 
> perfectly and hugepages function sanely.
> 

Very cool. Thanks for testing.

> Thank you for the fix.
> 

Thank you for persisting with the problem and coming up with the test cases that
reproduce it. Without both, this fix would not have been forthcoming. It's
very much appreciated.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread

* QUESTION: can netdev_alloc_skb() errors be reduced by tuning?
  2009-05-27 23:19     ` Ingo Molnar
@ 2009-06-16  0:19       ` starlight
  -1 siblings, 0 replies; 40+ messages in thread
From: starlight @ 2009-06-16  0:19 UTC (permalink / raw)
  To: linux-kernel, Mel Gorman, linux-mm, hugh.dickins,
	Lee.Schermerhorn, kosaki.motohiro, ebmunson, agl, apw, wli

Hello,

I submitted a test case for a hugepages bug that has been 
successfully resolved.  Have an apparently obscure question 
related to MM, and so I am asking anyone who might have some idea 
on this.  Nothing much turned up via Google and digging into
the KMEM code looks daunting.

Running Intel 82598/ixgbe 10 gig Ethernet under heavy stress. 
Generally is working well after tuning IRQ affinities, but a 
fair number of buffer allocation failures are occurring in the 
'ixgbe' device driver and are reported via 'ethtool' statistics. 
This may be causing data loss.

The kernel primitive returning the error is netdev_alloc_skb().

Are any tuneable parameters available that can reduce or 
eliminate these allocation failures?  Have about eleven 
gigabytes of free memory, though most of that is consumed 
by non-dirty file cache data.  Total system memory is 16GB with 
4GB allocated to hugepages.  Zero swap usage and activity though
swap is enabled.  Most application memory is hugepage or is
'mlock()'ed.

Thank you.

System rebooted before test run.

Dual Xeon E5430, 16GB FB-DIMM RAM.

$ cat /proc/meminfo
MemTotal:     16443828 kB
MemFree:        281176 kB
Buffers:         53896 kB
Cached:       11331924 kB
SwapCached:          0 kB
Active:         200740 kB
Inactive:     11284312 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     16443828 kB
LowFree:        281176 kB
SwapTotal:     2031608 kB
SwapFree:      2031400 kB
Dirty:               4 kB
Writeback:           0 kB
AnonPages:      104464 kB
Mapped:          14644 kB
Slab:           440452 kB
PageTables:       4032 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   8156368 kB
Committed_AS:   122452 kB
VmallocTotal: 34359738367 kB
VmallocUsed:    266872 kB
VmallocChunk: 34359471043 kB
HugePages_Total:  2048
HugePages_Free:    735
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
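
The figures quoted in the message can be cross-checked against the
/proc/meminfo dump above. A small sketch follows; the helper names are
hypothetical, and treating MemFree + Buffers + Cached as "available"
memory is a rough assumption, since not all page cache can necessarily
be reclaimed on demand.

```c
#include <stdint.h>

/* Size of the hugepage pool in kB: page count times page size. */
static uint64_t hugepage_pool_kb(uint64_t pages, uint64_t page_kb)
{
	return pages * page_kb;
}

/* Rough "free or cheaply reclaimable" estimate in kB. This is an
 * assumption for illustration, not a kernel-defined quantity. */
static uint64_t free_plus_cache_kb(uint64_t mem_free_kb,
				   uint64_t buffers_kb,
				   uint64_t cached_kb)
{
	return mem_free_kb + buffers_kb + cached_kb;
}
```

With the values above, hugepage_pool_kb(2048, 2048) gives 4194304 kB,
i.e. the 4GB hugepage pool, and free_plus_cache_kb(281176, 53896,
11331924) gives 11666996 kB, which is consistent with "about eleven
gigabytes", nearly all of it page cache rather than MemFree.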

# ethtool -S eth2 | egrep -v ': 0$'
NIC statistics:
     rx_packets: 724246449
     tx_packets: 229847
     rx_bytes: 152691992335
     tx_bytes: 10573426
     multicast: 725997241
     broadcast: 6
     rx_csum_offload_good: 723051776
     alloc_rx_buff_failed: 7119
     tx_queue_0_packets: 229847
     tx_queue_0_bytes: 10573426
     rx_queue_0_packets: 340698332
     rx_queue_0_bytes: 70844299683
     rx_queue_1_packets: 385298923
     rx_queue_1_bytes: 82276167594

ixgbe driver fragment
=====================
    struct sk_buff *skb = netdev_alloc_skb(adapter->netdev, bufsz);

    if (!skb) {
        adapter->alloc_rx_buff_failed++;
        goto no_buffers;
    }

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: QUESTION: can netdev_alloc_skb() errors be reduced by  tuning?
  2009-06-16  0:19       ` starlight
@ 2009-06-16  2:26         ` Eric Dumazet
  -1 siblings, 0 replies; 40+ messages in thread
From: Eric Dumazet @ 2009-06-16  2:26 UTC (permalink / raw)
  To: starlight
  Cc: linux-kernel, Mel Gorman, linux-mm, hugh.dickins,
	Lee.Schermerhorn, kosaki.motohiro, ebmunson, agl, apw, wli

starlight@binnacle.cx a écrit :
> Hello,
> 
> I submitted a test case for a hugepages bug that has been 
> successfully resolved.  Have an apparently obscure question 
> related to MM, and so I am asking anyone who might have some idea 
> on this.  Nothing much turned up via Google and digging into
> the KMEM code looks daunting.
> 
> Running Intel 82598/ixgbe 10 gig Ethernet under heavy stress. 
> Generally is working well after tuning IRQ affinities, but a 
> fair number of buffer allocation failures are occurring in the 
> 'ixgbe' device driver and are reported via 'ethtool' statistics. 
> This may be causing data loss.
> 
> The kernel primitive returning the error is netdev_alloc_skb().
> 
> Are any tuneable parameters available that can reduce or 
> eliminate these allocation failures?  Have about eleven 
> gigabytes of free memory, though most of that is consumed 
> by non-dirty file cache data.  Total system memory is 16GB with 
> 4GB allocated to hugepages.  Zero swap usage and activity though
> swap is enabled.  Most application memory is hugepage or is
> 'mlock()'ed.
> 
> Thank you.
> 
> 
> 
> 
> 
> System rebooted before test run.
> 
> Dual Xeon E5430, 16GB FB-DIMM RAM.
> 
> 
> $ cat /proc/meminfo
> MemTotal:     16443828 kB
> MemFree:        281176 kB
> Buffers:         53896 kB
> Cached:       11331924 kB
> SwapCached:          0 kB
> Active:         200740 kB
> Inactive:     11284312 kB
> HighTotal:           0 kB
> HighFree:            0 kB
> LowTotal:     16443828 kB
> LowFree:        281176 kB
> SwapTotal:     2031608 kB
> SwapFree:      2031400 kB
> Dirty:               4 kB
> Writeback:           0 kB
> AnonPages:      104464 kB
> Mapped:          14644 kB
> Slab:           440452 kB
> PageTables:       4032 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:   8156368 kB
> Committed_AS:   122452 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:    266872 kB
> VmallocChunk: 34359471043 kB
> HugePages_Total:  2048
> HugePages_Free:    735
> HugePages_Rsvd:      0
> Hugepagesize:     2048 kB
> 
> 
> # ethtool -S eth2 | egrep -v ': 0$'
> NIC statistics:
>      rx_packets: 724246449
>      tx_packets: 229847
>      rx_bytes: 152691992335
>      tx_bytes: 10573426
>      multicast: 725997241
>      broadcast: 6
>      rx_csum_offload_good: 723051776
>      alloc_rx_buff_failed: 7119
>      tx_queue_0_packets: 229847
>      tx_queue_0_bytes: 10573426
>      rx_queue_0_packets: 340698332
>      rx_queue_0_bytes: 70844299683
>      rx_queue_1_packets: 385298923
>      rx_queue_1_bytes: 82276167594
> 
> 
> ixgbe driver fragment
> =====================
>     struct sk_buff *skb = netdev_alloc_skb(adapter->netdev, bufsz);
> 
>     if (!skb) {
>         adapter->alloc_rx_buff_failed++;
>         goto no_buffers;
>     }
> 

152691992335/724246449 = 210 bytes per rx packet on average
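
The division above, and the matching failure rate, can be reproduced
from the ethtool counters with a short sketch. The helper names are
hypothetical, and the integer division truncates the same way the
figure above does.

```c
#include <stdint.h>

/* Average bytes per received packet, truncated to whole bytes. */
static uint64_t avg_bytes_per_packet(uint64_t rx_bytes, uint64_t rx_packets)
{
	return rx_packets ? rx_bytes / rx_packets : 0;
}

/* Buffer allocation failures per million received packets. */
static double failures_per_million(uint64_t failures, uint64_t rx_packets)
{
	return rx_packets ? (double)failures * 1e6 / (double)rx_packets : 0.0;
}
```

Fed rx_bytes and rx_packets from the statistics quoted above, the first
helper returns 210. The second, given alloc_rx_buff_failed = 7119, comes
out a little under ten failures per million packets received.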

It could make sense to add copybreak feature in this driver to reduce memory needs,
but that also would consume more cpu cycles, and slow down forwarding setups.

Maybe this packet trimming could be done generically in UDP stack input path,
before queueing packet into a receive queue, if amount of available memory
is under a given threshold.
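
As a rough illustration of the copybreak idea mentioned above, here is
a user-space model, not driver code. The struct, the function name and
the use of malloc() are all assumptions made for the sketch; a real
driver would work with sk_buffs and its own receive ring.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one full-size receive ring buffer. */
struct rx_buf {
	unsigned char *data;	/* full-size buffer owned by the ring */
	size_t size;		/* capacity of data, assumed >= len */
};

/*
 * Packets shorter than 'copybreak' are copied into a right-sized
 * allocation, so the large ring buffer can be reused immediately.
 * Longer packets are passed up as-is, and the ring then needs a new
 * full-size buffer. Returns the buffer to hand up the stack, or NULL
 * if the small allocation failed (the packet would be dropped).
 */
static unsigned char *rx_copybreak(struct rx_buf *buf, size_t len,
				   size_t copybreak, int *ring_buf_reusable)
{
	if (len < copybreak) {
		unsigned char *small = malloc(len ? len : 1);

		*ring_buf_reusable = 1;	/* big buffer stays in the ring */
		if (small)
			memcpy(small, buf->data, len);
		return small;
	}

	*ring_buf_reusable = 0;		/* ring must allocate a replacement */
	return buf->data;
}
```

With an average packet of about 210 bytes and an illustrative threshold
of 256, most packets in this workload would take the copy path, which
is where both the memory saving and the extra CPU cost come from.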

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: QUESTION: can netdev_alloc_skb() errors be reduced by  tuning?
@ 2009-06-16  2:26         ` Eric Dumazet
  0 siblings, 0 replies; 40+ messages in thread
From: Eric Dumazet @ 2009-06-16  2:26 UTC (permalink / raw)
  To: starlight
  Cc: linux-kernel, Mel Gorman, linux-mm, hugh.dickins,
	Lee.Schermerhorn, kosaki.motohiro, ebmunson, agl, apw, wli

starlight@binnacle.cx a ecrit :
> Hello,
> 
> I submitted testcase for a hugepages bug that has been 
> successfully resolved.  Have an apparently obscure question 
> related to MM, and so I am asking anyone who might have some idea 
> on this.  Nothing much turned up via Google and digging into
> the KMEM code looks daunting.
> 
> Running Intel 82598/ixgbe 10 gig Ethernet under heavy stress. 
> Generally is working well after tuning IRQ affinities, but a 
> fair number of buffer allocation failures are occurring in the 
> 'ixgbe' device driver and are reported via 'ethtool' statistics. 
>  This may be causing data loss.
> 
> The kernel primitive returning the error is netdev_alloc_skb().
> 
> Are any tuneable parameters available that can reduce or 
> eliminate these allocation failures?  Have about eleven 
> gigabytes of free memory, though most of that is consumed 
> by non-dirty file cache data.  Total system memory is 16GB with 
> 4GB allocated to hugepages.  Zero swap usage and activity though
> swap is enabled.  Most application memory is hugepage or is
> 'mlock()'ed.
> 
> Thank you.
> 
> 
> 
> 
> 
> System rebooted before test run.
> 
> Dual Xeon E5430, 16GB FB-DIMM RAM.
> 
> 
> $ cat /proc/meminfo
> MemTotal:     16443828 kB
> MemFree:        281176 kB
> Buffers:         53896 kB
> Cached:       11331924 kB
> SwapCached:          0 kB
> Active:         200740 kB
> Inactive:     11284312 kB
> HighTotal:           0 kB
> HighFree:            0 kB
> LowTotal:     16443828 kB
> LowFree:        281176 kB
> SwapTotal:     2031608 kB
> SwapFree:      2031400 kB
> Dirty:               4 kB
> Writeback:           0 kB
> AnonPages:      104464 kB
> Mapped:          14644 kB
> Slab:           440452 kB
> PageTables:       4032 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:   8156368 kB
> Committed_AS:   122452 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:    266872 kB
> VmallocChunk: 34359471043 kB
> HugePages_Total:  2048
> HugePages_Free:    735
> HugePages_Rsvd:      0
> Hugepagesize:     2048 kB
> 
> 
> # ethtool -S eth2 | egrep -v ': 0$'
> NIC statistics:
>      rx_packets: 724246449
>      tx_packets: 229847
>      rx_bytes: 152691992335
>      tx_bytes: 10573426
>      multicast: 725997241
>      broadcast: 6
>      rx_csum_offload_good: 723051776
>      alloc_rx_buff_failed: 7119
>      tx_queue_0_packets: 229847
>      tx_queue_0_bytes: 10573426
>      rx_queue_0_packets: 340698332
>      rx_queue_0_bytes: 70844299683
>      rx_queue_1_packets: 385298923
>      rx_queue_1_bytes: 82276167594
> 
> 
> ixgbe driver fragment
> =====================
>     struct sk_buff *skb = netdev_alloc_skb(adapter->netdev, bufsz);
> 
>     if (!skb) {
>         adapter->alloc_rx_buff_failed++;
>         goto no_buffers;
>     }
> 

152691992335/724246449 = 210 bytes per rx packet on average
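
The figure above can be reproduced from the quoted ethtool counters. A minimal sketch in Python; the variable names are illustrative, and the second ratio (packets per failed allocation) is an extra derived number, not something stated in the thread:

```python
# Counters copied from the quoted `ethtool -S eth2` output.
rx_bytes = 152691992335
rx_packets = 724246449
alloc_rx_buff_failed = 7119

# Average bytes per received packet, truncated to whole bytes.
avg_bytes_per_packet = rx_bytes // rx_packets

# Number of packets received for each failed rx buffer allocation.
packets_per_alloc_failure = rx_packets // alloc_rx_buff_failed

print(avg_bytes_per_packet, packets_per_alloc_failure)
```

Integer division is used on purpose, since the thread only quotes the average as a whole number of bytes.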

It could make sense to add a copybreak feature to this driver to reduce memory needs,
but that would also consume more CPU cycles and slow down forwarding setups.

Maybe this packet trimming could be done generically in the UDP stack input path,
before queueing the packet into a receive queue, if the amount of available memory
is under a given threshold.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: QUESTION: can netdev_alloc_skb() errors be reduced by  tuning?
  2009-06-16  2:26         ` Eric Dumazet
@ 2009-06-16  4:12           ` starlight
  -1 siblings, 0 replies; 40+ messages in thread
From: starlight @ 2009-06-16  4:12 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-kernel, Mel Gorman, linux-mm, hugh.dickins,
	Lee.Schermerhorn, kosaki.motohiro, ebmunson, agl, apw, wli

Eric,

Great thought--thank you.  Running a similar server with 
82571/e1000e and it does not exhibit the problem.  'e1000e' has 
default copybreak=256 while 'ixgbe' has no copybreak.  Rationale 
given is

   http://osdir.com/ml/linux.drivers.e1000.devel/2008-01/msg00103.html

But the comparison is a bit apples-and-oranges since the 'e1000e' 
system is dual Opteron 2354 while the 'ixgbe' system is Xeon 
E5430 (a painful choice thus far).  Also 'e1000e' system passes 
data via a PACKET socket while the 'ixgbe' system passes data 
via UDP (a configurable option).

I'm not fully up on how this all works: am I to understand that 
the error could result from RX ring-queue buffers not freeing 
quickly enough because they have a use-count held non-zero as
the packet travels the stack?

I've just doubled some SLAB tuneables that seem relevant, but 
if the cause is the aforementioned, this won't help.  Will
have the answer on the tweaks by the end of Tuesday.

David

At 04:26 AM 6/16/2009 +0200, Eric Dumazet wrote:
>
>152691992335/724246449 = 210 bytes per rx packet on average
>
>It could make sense to add copybreak feature in this driver to 
>reduce memory needs, but that also would consume more cpu 
>cycles, and slow down forwarding setups.
>
>Maybe this packet trimming could be done generically in UDP 
>stack input path, before queueing packet into a receive queue, 
>if amount of available memory is under a given threshold.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: QUESTION: can netdev_alloc_skb() errors be reduced  by  tuning?
  2009-06-16  4:12           ` starlight
  (?)
@ 2009-06-16  6:12             ` Eric Dumazet
  -1 siblings, 0 replies; 40+ messages in thread
From: Eric Dumazet @ 2009-06-16  6:12 UTC (permalink / raw)
  To: starlight
  Cc: Eric Dumazet, linux-kernel, Mel Gorman, linux-mm, hugh.dickins,
	Lee.Schermerhorn, kosaki.motohiro, ebmunson, agl, apw, wli,
	Linux Netdev List

Please don't top-post, we prefer it the other way around :)

starlight@binnacle.cx wrote:
> Eric,
> 
> Great thought--thank you.  Running a similar server with 
> 82571/e1000e and it does not exhibit the problem.  'e1000e' has 
> default copybreak=256 while 'ixgbe' has no copybreak.  Rationale 
> given is
> 
>    http://osdir.com/ml/linux.drivers.e1000.devel/2008-01/msg00103.html
> 
> But the comparison is a bit apples-and-oranges since the 'e1000e' 
> system is dual Opteron 2354 while the 'ixgbe' system is Xeon 
> E5430 (a painful choice thus far).  Also 'e1000e' system passes 
> data via a PACKET socket while the 'ixgbe' system passes data 
> via UDP (a configurable option).
> 
> I'm not fully up on how this all works: am I to understand that 
> the error could result from RX ring-queue buffers not freeing 
> quickly enough because they have a use-count held non-zero as
> the packet travels the stack?

Well, the error is normal in a stress situation, when no more kernel
memory is available.

cat /proc/net/udp

can show you (in the last column) sockets where packets were dropped
by the UDP stack because their receive queue was full.
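
A minimal sketch of reading that last column programmatically, assuming the running kernel exports a drops counter as the final field of each row. The function name and the sample rows are made up for illustration and do not come from the systems discussed here:

```python
def udp_drops(proc_net_udp_text):
    """Return {local_port: drops} for sockets with a non-zero drop count."""
    result = {}
    lines = proc_net_udp_text.strip().splitlines()
    for line in lines[1:]:              # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        # local_address is hex "ADDR:PORT"; keep only the port.
        port = int(fields[1].rsplit(":", 1)[1], 16)
        drops = int(fields[-1])         # assumed: drops is the last column
        if drops:
            result[port] = result.get(port, 0) + drops
    return result


# Hypothetical sample in the usual /proc/net/udp layout.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode ref pointer drops
   0: 00000000:1388 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 11111 2 ffff880000000000 42
   1: 00000000:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 22222 2 ffff880000000001 0
"""

print(udp_drops(SAMPLE))
```

On a live system the input would be the contents of /proc/net/udp; a growing count for the application's port would point at a full socket receive queue rather than at the driver.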

> 
> I've just doubled some SLAB tuneables that seem relevant, but 
> if the cause is the aforementioned, this won't help.  Will
> have the answer on the tweaks by the end of Tuesday.
> 
> David

copybreak in the drivers themselves is nice because the driver can recycle
its rx skbs much faster, but that is suboptimal in forwarding (router)
workloads. It's also a lot of duplicated code in every driver.
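
The copybreak idea can be sketched as a user-space model (not driver code). The 256-byte threshold is the e1000e default mentioned earlier in the thread; the 2048-byte receive buffer size and the strict "less than" comparison are assumptions made for illustration:

```python
COPYBREAK = 256         # threshold in bytes (e1000e default quoted in the thread)
RX_BUFFER_SIZE = 2048   # assumed size of one full-size receive buffer


def deliver(packet_len, copybreak=COPYBREAK):
    """Model one received packet.

    Returns (bytes_held_by_stack, rx_buffer_recycled).
    """
    if packet_len < copybreak:
        # Small packet: copy it into a right-sized buffer and hand the
        # large receive buffer straight back to the ring.
        return packet_len, True
    # Large packet: pass the original buffer up the stack; the ring
    # needs a freshly allocated replacement.
    return RX_BUFFER_SIZE, False


print(deliver(210), deliver(1500))
```

With an average packet of about 210 bytes, most packets in this workload would take the copy path, which costs CPU time for the copy but keeps the large buffers on the ring.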

So we could do the skb trimming (i.e. reallocating the data portion to exactly
the size of the packet) in the core network stack, when we know the packet must
be handled by an application, and not dropped or forwarded by the kernel.

Because of slab rounding, this reallocation should be done only if the resulting
data portion is really smaller (50 %) than the original skb.
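
A minimal sketch of that 50 % rule, assuming power-of-two allocation size classes starting at 32 bytes. Real kmalloc caches also include other sizes and an skb carries extra per-buffer overhead, so both function names and the size model are simplifications for illustration:

```python
def kmalloc_bucket(size, smallest=32):
    """Round size up to the next power-of-two size class (assumed model)."""
    bucket = smallest
    while bucket < size:
        bucket *= 2
    return bucket


def worth_trimming(packet_len, buffer_len, ratio=0.5):
    """True if a reallocated data area would occupy at most `ratio`
    of the memory the original buffer occupies after slab rounding."""
    return kmalloc_bucket(packet_len) <= ratio * kmalloc_bucket(buffer_len)


print(worth_trimming(210, 2048), worth_trimming(1500, 2048))
```

Comparing rounded sizes rather than raw lengths is the point of the heuristic: a 1500-byte packet in a 2048-byte buffer would round back up to the same size class, so copying it would save nothing.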

> 
> 
> 
> At 04:26 AM 6/16/2009 +0200, Eric Dumazet wrote:
>> 152691992335/724246449 = 210 bytes per rx packet on average
>>
>> It could make sense to add copybreak feature in this driver to 
>> reduce memory needs, but that also would consume more cpu 
>> cycles, and slow down forwarding setups.
>>
>> Maybe this packet trimming could be done generically in UDP 
>> stack input path, before queueing packet into a receive queue, 
>> if amount of available memory is under a given threshold.
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: QUESTION: can netdev_alloc_skb() errors be reduced by tuning?
  2009-06-16  0:19       ` starlight
@ 2009-06-16  9:19         ` Mel Gorman
  -1 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2009-06-16  9:19 UTC (permalink / raw)
  To: starlight
  Cc: linux-kernel, linux-mm, hugh.dickins, Lee.Schermerhorn,
	kosaki.motohiro, ebmunson, agl, apw, wli

On Mon, Jun 15, 2009 at 08:19:33PM -0400, starlight@binnacle.cx wrote:
> Hello,
> 
> I submitted testcase for a hugepages bug that has been 
> successfully resolved.  Have an apparently obscure question 
> related to MM, and so I am asking anyone who might have some idea 
> on this.  Nothing much turned up via Google and digging into
> the KMEM code looks daunting.
> 
> Running Intel 82598/ixgbe 10 gig Ethernet under heavy stress. 
> Generally is working well after tuning IRQ affinities, but a 
> fair number of buffer allocation failures are occurring in the 
> 'ixgbe' device driver and are reported via 'ethtool' statistics. 
>  This may be causing data loss.
> 

Can you give an example of an allocation failure? Specifically, I want to
see what sort of allocation it was and what order.

For reliable protocols, an allocation failure should recover and the
data get through but obviously there is a drop in network performance
when this happens.

> The kernel primitive returning the error is netdev_alloc_skb().
> 
> Are any tuneable parameters available that can reduce or 
> eliminate these allocation failures?  Have about eleven 
> gigabytes of free memory, though most of that is consumed 
> by non-dirty file cache data.  Total system memory is 16GB with 
> 4GB allocated to hugepages.  Zero swap usage and activity though
> swap is enabled.  Most application memory is hugepage or is
> 'mlock()'ed.
> 

If the allocations are high-order and atomic, increasing min_free_kbytes
can help, particularly in situations where there is a burst of network
traffic. I won't know if they are atomic until I see an error message
though.

> Thank you.
> 
> 
> 
> 
> 
> System rebooted before test run.
> 
> Dual Xeon E5430, 16GB FB-DIMM RAM.
> 
> 
> $ cat /proc/meminfo
> MemTotal:     16443828 kB
> MemFree:        281176 kB
> Buffers:         53896 kB
> Cached:       11331924 kB
> SwapCached:          0 kB
> Active:         200740 kB
> Inactive:     11284312 kB
> HighTotal:           0 kB
> HighFree:            0 kB
> LowTotal:     16443828 kB
> LowFree:        281176 kB
> SwapTotal:     2031608 kB
> SwapFree:      2031400 kB
> Dirty:               4 kB
> Writeback:           0 kB
> AnonPages:      104464 kB
> Mapped:          14644 kB
> Slab:           440452 kB
> PageTables:       4032 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:   8156368 kB
> Committed_AS:   122452 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:    266872 kB
> VmallocChunk: 34359471043 kB
> HugePages_Total:  2048
> HugePages_Free:    735
> HugePages_Rsvd:      0
> Hugepagesize:     2048 kB
> 
> 
> # ethtool -S eth2 | egrep -v ': 0$'
> NIC statistics:
>      rx_packets: 724246449
>      tx_packets: 229847
>      rx_bytes: 152691992335
>      tx_bytes: 10573426
>      multicast: 725997241
>      broadcast: 6
>      rx_csum_offload_good: 723051776
>      alloc_rx_buff_failed: 7119
>      tx_queue_0_packets: 229847
>      tx_queue_0_bytes: 10573426
>      rx_queue_0_packets: 340698332
>      rx_queue_0_bytes: 70844299683
>      rx_queue_1_packets: 385298923
>      rx_queue_1_bytes: 82276167594
> 
> 
> ixgbe driver fragment
> =====================
>     struct sk_buff *skb = netdev_alloc_skb(adapter->netdev, bufsz);
> 
>     if (!skb) {
>         adapter->alloc_rx_buff_failed++;
>         goto no_buffers;
>     }
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: QUESTION: can netdev_alloc_skb() errors be reduced by tuning?
  2009-06-16  9:19         ` Mel Gorman
@ 2009-06-16 15:25           ` starlight
  -1 siblings, 0 replies; 40+ messages in thread
From: starlight @ 2009-06-16 15:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, hugh.dickins, Lee.Schermerhorn,
	kosaki.motohiro, ebmunson, agl, apw, wli

At 10:19 AM 6/16/2009 +0100, Mel Gorman wrote:

>Can you give an example of an allocation failure? Specifically, I want to
>see what sort of allocation it was and what order.

I think it's just the basic buffer allocation for
Ethernet frames arriving in the 'ixgbe' driver.  Seems
like it's one allocation per frame.  Per the original
message the allocations are made with the 'netdev_alloc_skb()'
kernel call.  The function where this code appears is
named 'ixgbe_alloc_rx_buffers()' and the comment is
"Replace used receive buffers."

The code path in question does not generate an error.  It just
increments the 'alloc_rx_buff_failed' counter for the ethX
device.  In addition it appears that the frame is dropped
only if the PCIe hardware ring-queue associated with each
interface is full.  So on the next interrupt the allocation
is retried and appears to be successful 99% of the time.
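
The behaviour described above can be modelled with a toy receive ring. This is a user-space sketch of the description only, not of the actual ixgbe code; the class and its fields are hypothetical, and "ring full" is modelled as "no posted buffers left for the NIC":

```python
class RxRing:
    """Toy model: the NIC consumes posted buffers, the driver refills them."""

    def __init__(self, size):
        self.size = size
        self.posted = size        # buffers currently available to the NIC
        self.alloc_failed = 0     # mirrors the alloc_rx_buff_failed counter
        self.dropped = 0

    def receive(self, frames):
        """Accept up to `posted` frames; the rest are dropped."""
        accepted = min(frames, self.posted)
        self.posted -= accepted
        self.dropped += frames - accepted
        return accepted

    def refill(self, alloc):
        """Repost buffers until the ring is full or `alloc()` fails."""
        while self.posted < self.size:
            if not alloc():
                self.alloc_failed += 1
                break             # give up until the next interrupt
            self.posted += 1


ring = RxRing(size=4)
ring.receive(2)
ring.refill(lambda: False)        # allocation failure: counted, nothing lost yet
ring.receive(3)                   # only 2 buffers left, so 1 frame is dropped
print(ring.alloc_failed, ring.dropped)
```

In this model a failed allocation costs nothing by itself; loss only appears when a burst arrives before a later refill succeeds.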

>For reliable protocols, an allocation failure should recover and the
>data get through but obviously there is a drop in network performance
>when this happens.

This is for a specialized high-volume UDP multicast application
where data loss of any kind is unacceptable.

>If the allocations are high-order and atomic, increasing min_free_kbytes
>can help, particularly in situations where there is a burst of network
>traffic. I won't know if they are atomic until I see an error message
>though.

Doesn't the use of 'netdev_alloc_skb()' kernel primitive
imply what the nature of the allocation is?  I followed the
call graph down into "kmem" land, but it's a complex place
and so I abandoned the review.

My impression is that 'min_free_kbytes' relates mainly to systems
where significant paging pressure exists.  The servers have zero
paging pressure and lots of free memory, though mostly in the
form of instantly discardable file data cache pages.  In the
past disabling the program that generates the cache pressure
has had no effect on data loss, though I haven't tried it in
relation to this specific issue.
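
To put numbers on the "free versus cache" split, here is a minimal sketch that parses the relevant lines of the /proc/meminfo output quoted earlier in the thread. The function name is illustrative, and treating Buffers plus Cached as "cache" is a simplification:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


# Lines copied from the /proc/meminfo output quoted in the thread.
MEMINFO = """\
MemTotal:     16443828 kB
MemFree:        281176 kB
Buffers:         53896 kB
Cached:       11331924 kB
"""

info = parse_meminfo(MEMINFO)
free_mib = info["MemFree"] // 1024
cache_mib = (info["Buffers"] + info["Cached"]) // 1024
print(free_mib, cache_mib)
```

Only a few hundred MiB are free in the strict sense; the remainder has to be reclaimed from the page cache before it can be reused, which may matter if the allocation cannot wait for reclaim.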

Tried increasing a few /proc/slabinfo tuneable parameters today
and this appears to have fixed the issue so far today.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: QUESTION: can netdev_alloc_skb() errors be reduced  by  tuning?
  2009-06-16  6:12             ` Eric Dumazet
  (?)
@ 2009-07-05  3:44               ` Herbert Xu
  -1 siblings, 0 replies; 40+ messages in thread
From: Herbert Xu @ 2009-07-05  3:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: starlight, eric.dumazet, linux-kernel, mel, linux-mm,
	hugh.dickins, Lee.Schermerhorn, kosaki.motohiro, ebmunson, agl,
	apw, wli, netdev

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> Because of slab rounding, this reallocation should be done only if resulting data
> portion is really smaller (50 %) than original skb.

If we're going to do this in the core then we should only do it
in the spots where the packet may be held indefinitely.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: QUESTION: can netdev_alloc_skb() errors be reduced by tuning?
@ 2009-06-16 17:24 ` starlight
  0 siblings, 0 replies; 40+ messages in thread
From: starlight @ 2009-06-16 17:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-kernel, linux-mm, hugh.dickins, Lee.Schermerhorn,
	kosaki.motohiro, ebmunson, agl, apw, wli

>Tried increasing a few /proc/slabinfo tuneable parameters today
>and this appears to have fixed the issue so far today.

Spoke too soon.  A burst of allocation failures appeared
and some incoming data was lost.  The 'e1000e' system had
no problem.


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2009-07-05  3:45 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-27 11:12 [PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory Mel Gorman
2009-05-27 11:12 ` Mel Gorman
2009-05-27 11:12 ` [PATCH 1/2] x86: Ignore VM_LOCKED when determining if hugetlb-backed page tables can be shared or not Mel Gorman
2009-05-27 11:12   ` Mel Gorman
2009-05-27 16:38   ` Eric B Munson
2009-05-27 23:18   ` Ingo Molnar
2009-05-27 23:18     ` Ingo Molnar
2009-05-28  8:55     ` Mel Gorman
2009-05-28  8:55       ` Mel Gorman
2009-05-27 11:12 ` [PATCH 2/2] mm: Account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in hugetlbfs Mel Gorman
2009-05-27 11:12   ` Mel Gorman
2009-05-27 16:40   ` Eric B Munson
2009-05-27 20:14 ` [PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory Andrew Morton
2009-05-27 20:14   ` Andrew Morton
2009-05-27 23:19   ` Ingo Molnar
2009-05-27 23:19     ` Ingo Molnar
2009-06-16  0:19     ` QUESTION: can netdev_alloc_skb() errors be reduced by tuning? starlight
2009-06-16  0:19       ` starlight
2009-06-16  2:26       ` Eric Dumazet
2009-06-16  2:26         ` Eric Dumazet
2009-06-16  4:12         ` starlight
2009-06-16  4:12           ` starlight
2009-06-16  6:12           ` Eric Dumazet
2009-06-16  6:12             ` Eric Dumazet
2009-06-16  6:12             ` Eric Dumazet
2009-07-05  3:44             ` Herbert Xu
2009-07-05  3:44               ` Herbert Xu
2009-07-05  3:44               ` Herbert Xu
2009-06-16  9:19       ` Mel Gorman
2009-06-16  9:19         ` Mel Gorman
2009-06-16 15:25         ` starlight
2009-06-16 15:25           ` starlight
2009-05-28  8:56   ` [PATCH 0/2] Fixes for hugetlbfs-related problems on shared memory Mel Gorman
2009-05-28  8:56     ` Mel Gorman
2009-06-08  1:25 ` starlight
2009-06-08  1:25   ` starlight
2009-06-08 10:24   ` Mel Gorman
2009-06-08 10:24     ` Mel Gorman
2009-06-16 17:24 QUESTION: can netdev_alloc_skb() errors be reduced by tuning? starlight
2009-06-16 17:24 ` starlight
