From 9ced9810f0450a7f05eccb40dce4f9e4616c0fb6 Mon Sep 17 00:00:00 2001
From: Mel Gorman <mel@csn.ul.ie>
Date: Wed, 24 Nov 2010 22:18:23 -0500
Subject: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

Commit aa45484 ("calculate a better estimate of NR_FREE_PAGES when memory
is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
avoid synchronization overhead, these counters are maintained on a per-cpu
basis and drained both periodically and when a per-cpu delta exceeds a
threshold.  On systems with many CPUs, the difference between the estimate
and the real value of NR_FREE_PAGES can be very high.  The system can get
into a state where pages are allocated far below the min watermark,
potentially causing livelock issues.  The commit solved the problem by
taking a better reading of NR_FREE_PAGES when memory was low.

Unfortunately, as reported by Shaohua Li, this accurate reading can
consume a large amount of CPU time on systems with many sockets due to
cache line bouncing.  This patch takes a different approach.  For large
machines where counter drift might be unsafe and while kswapd is awake,
the per-cpu thresholds for the target pgdat are reduced to limit the level
of drift to what should be a safe level.  This incurs a performance
penalty under heavy memory pressure by a factor that depends on the
workload and the machine, but the machine should function correctly
without accidentally exhausting all memory on a node.  There is an
additional cost when kswapd wakes and sleeps, but the event is not
expected to be frequent - in Shaohua's test case, at least one sleep and
wake event was recorded.

To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
introduced that takes a more accurate reading of NR_FREE_PAGES when called
from wakeup_kswapd(), when deciding whether it is really safe to go back
to sleep in sleeping_prematurely() and when deciding whether a zone is
really balanced in balance_pgdat().  We are still using an expensive
function, but we limit how often it is called.

When the test case is reproduced, the time spent in the watermark
functions is reduced.  The following report shows the percentage of time
spent cumulatively in the functions zone_nr_free_pages(),
zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
zone_page_state_snapshot() and zone_page_state().

vanilla                      11.6615%
disable-threshold            0.2584%

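As a rough worked example of the reduced threshold (illustrative numbers,
not taken from the report above): on a zone whose low and min watermarks
are 1000 pages apart, with 64 CPUs online, calculate_pressure_threshold()
below yields max(1, 1000 / 64) = 15 pages per CPU, so the worst-case
unflushed per-cpu drift while kswapd is awake is roughly 64 * 15 = 960
pages and stays within the watermark gap; the normal per-cpu threshold
(up to 125 pages) could allow drift of up to 64 * 125 = 8000 pages on the
same machine.
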
Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mmzone.h |   10 ++-----
 include/linux/vmstat.h |    5 +++
 mm/mmzone.c            |   21 ---------------
 mm/page_alloc.c        |   35 +++++++++++++++++++-----
 mm/vmscan.c            |   26 ++++++++++--------
 mm/vmstat.c            |   68 +++++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 117 insertions(+), 48 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8b2db3d..1e3d0b4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -463,12 +463,6 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }
 
-#ifdef CONFIG_SMP
-unsigned long zone_nr_free_pages(struct zone *zone);
-#else
-#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
-#endif /* CONFIG_SMP */
-
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
@@ -668,7 +662,9 @@ void get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free);
 void build_all_zonelists(void *data);
 void wakeup_kswapd(struct zone *zone, int order);
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		int classzone_idx, int alloc_flags);
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
 enum memmap_context {
 	MEMMAP_EARLY,
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index eaaea37..e4cc21c 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -254,6 +254,8 @@ extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);
 
 void refresh_cpu_vm_stats(int);
+void reduce_pgdat_percpu_threshold(pg_data_t *pgdat);
+void restore_pgdat_percpu_threshold(pg_data_t *pgdat);
 #else /* CONFIG_SMP */
 
 /*
@@ -298,6 +300,9 @@ static inline void __dec_zone_page_state(struct page *page,
 #define dec_zone_page_state __dec_zone_page_state
 #define mod_zone_page_state __mod_zone_page_state
 
+static inline void reduce_pgdat_percpu_threshold(pg_data_t *pgdat) { }
+static inline void restore_pgdat_percpu_threshold(pg_data_t *pgdat) { }
+
 static inline void refresh_cpu_vm_stats(int cpu) { }
 #endif
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index e35bfb8..f5b7d17 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,24 +87,3 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
-
-#ifdef CONFIG_SMP
-/* Called when a more accurate view of NR_FREE_PAGES is needed */
-unsigned long zone_nr_free_pages(struct zone *zone)
-{
-	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
-
-	/*
-	 * While kswapd is awake, it is considered the zone is under some
-	 * memory pressure. Under pressure, there is a risk that
-	 * per-cpu-counter-drift will allow the min watermark to be breached
-	 * potentially causing a live-lock. While kswapd is awake and
-	 * free pages are low, get a better estimate for free pages
-	 */
-	if (nr_free_pages < zone->percpu_drift_mark &&
-			!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
-		return zone_page_state_snapshot(zone, NR_FREE_PAGES);
-
-	return nr_free_pages;
-}
-#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f7cc624..cf5d4c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1454,24 +1454,24 @@ static inline int should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
 /*
- * Return 1 if free pages are above 'mark'. This takes into account the order
+ * Return true if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
-int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags)
+static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags, long free_pages)
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;
 
+	free_pages -= (1 << order) + 1;
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
 
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
-		return 0;
+		return false;
 	for (o = 0; o < order; o++) {
 		/* At the next order, this order's pages become unavailable */
 		free_pages -= z->free_area[o].nr_free << o;
@@ -1480,9 +1480,28 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		min >>= 1;
 
 		if (free_pages <= min)
-			return 0;
+			return false;
 	}
-	return 1;
+	return true;
+}
+
+bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+					zone_page_state(z, NR_FREE_PAGES));
+}
+
+bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
+		      int classzone_idx, int alloc_flags)
+{
+	long free_pages = zone_page_state(z, NR_FREE_PAGES);
+
+	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
+		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
+
+	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+								free_pages);
 }
 
 #ifdef CONFIG_NUMA
@@ -2425,7 +2444,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_nr_free_pages(zone)),
+			K(zone_page_state(zone, NR_FREE_PAGES)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9753626..18f4038 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2007,7 +2007,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 		if (zone->all_unreclaimable)
 			continue;
 
-		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
+		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
								0, 0))
 			return 1;
 	}
@@ -2104,7 +2104,7 @@ loop_again:
 				shrink_active_list(SWAP_CLUSTER_MAX, zone,
							&sc, priority, 0);
 
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
 				break;
@@ -2155,7 +2155,7 @@ loop_again:
 			 * We put equal pressure on every zone, unless one
 			 * zone has way too many pages free already.
 			 */
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
					8*high_wmark_pages(zone), end_zone, 0))
 				shrink_zone(priority, zone, &sc);
 			reclaim_state->reclaimed_slab = 0;
@@ -2176,7 +2176,7 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;
 
-			if (!zone_watermark_ok(zone, order,
+			if (!zone_watermark_ok_safe(zone, order,
					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
 				/*
@@ -2184,7 +2184,7 @@ loop_again:
 				 * means that we have a GFP_ATOMIC allocation
 				 * failure risk. Hurry up!
 				 */
-				if (!zone_watermark_ok(zone, order,
+				if (!zone_watermark_ok_safe(zone, order,
					    min_wmark_pages(zone), end_zone, 0))
 					has_under_min_watermark_zone = 1;
 			}
@@ -2326,9 +2326,11 @@ static int kswapd(void *p)
 				 * premature sleep. If not, then go fully
 				 * to sleep until explicitly woken up
 				 */
-				if (!sleeping_prematurely(pgdat, order, remaining))
+				if (!sleeping_prematurely(pgdat, order, remaining)) {
+					restore_pgdat_percpu_threshold(pgdat);
 					schedule();
-				else {
+					reduce_pgdat_percpu_threshold(pgdat);
+				} else {
 					if (remaining)
 						count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 					else
@@ -2364,15 +2366,16 @@ void wakeup_kswapd(struct zone *zone, int order)
 	if (!populated_zone(zone))
 		return;
 
-	pgdat = zone->zone_pgdat;
-	if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
+	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 		return;
+	pgdat = zone->zone_pgdat;
 	if (pgdat->kswapd_max_order < order)
 		pgdat->kswapd_max_order = order;
-	if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-		return;
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
+	if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+		return;
+
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 26d5716..41dc8cd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -81,6 +81,30 @@ EXPORT_SYMBOL(vm_stat);
 
 #ifdef CONFIG_SMP
 
+static int calculate_pressure_threshold(struct zone *zone)
+{
+	int threshold;
+	int watermark_distance;
+
+	/*
+	 * As vmstats are not up to date, there is drift between the estimated
+	 * and real values. For high thresholds and a high number of CPUs, it
+	 * is possible for the min watermark to be breached while the estimated
+	 * value looks fine. The pressure threshold is a reduced value such
+	 * that even the maximum amount of drift will not accidentally breach
+	 * the min watermark
+	 */
+	watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
+	threshold = max(1, (int)(watermark_distance / num_online_cpus()));
+
+	/*
+	 * Maximum threshold is 125
+	 */
+	threshold = min(125, threshold);
+
+	return threshold;
+}
+
 static int calculate_threshold(struct zone *zone)
 {
 	int threshold;
@@ -159,6 +183,48 @@ static void refresh_zone_stat_thresholds(void)
 	}
 }
 
+void reduce_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	get_online_cpus();
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_pressure_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+	put_online_cpus();
+}
+
+void restore_pgdat_percpu_threshold(pg_data_t *pgdat)
+{
+	struct zone *zone;
+	int cpu;
+	int threshold;
+	int i;
+
+	get_online_cpus();
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		zone = &pgdat->node_zones[i];
+		if (!zone->percpu_drift_mark)
+			continue;
+
+		threshold = calculate_threshold(zone);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+							= threshold;
+	}
+	put_online_cpus();
+}
+
 /*
  * For use when we know that interrupts are disabled.
  */
@@ -826,7 +892,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_nr_free_pages(zone),
+		   zone_page_state(zone, NR_FREE_PAGES),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
-- 
1.7.3.2