From: Markus Guertler <mguertler@novell.com>
Subject: Introduce (optional) pagecache limit
References: FATE309111, FATE325794
Patch-mainline: Never, SUSE specific

SLE12-SP3->SLE12-SP4
- move from per-zone to per-node reclaim

SLE12->SLE12-SP2
- move the slab shrinking into shrink_all_zones because slab shrinkers
  are NUMA aware now, so we need a zone to get the proper node.

SLE11-SP3->SLE12 changes & remarks by mhocko@suse.cz:
- The feature should be deprecated and eventually dropped, replaced by
  the memory cgroup controller.
- vm_swappiness was always broken because shrink_list didn't and doesn't
  consider it.
- sc->nr_to_reclaim is updated once per shrink_all_zones, which means
  that we might end up reclaiming more than expected.

Notes on forward port to SLE12
- shrink_all_zones has to be memcg aware now because there is no global
  LRU anymore. Put this into shrink_zone_per_memcg
- priority is a part of scan_control now
- swappiness is no longer in scan_control but as mentioned above it didn't
  have any effect anyway

Original changelog (as per 11sp3):
----------------------------------
There are apps that consume lots of memory and touch some of their
pages very infrequently; yet those pages are very important for the
overall performance of the app and should not be paged out in favor
of pagecache. The kernel can't know this and makes the wrong decisions,
even with low swappiness values.

This sysctl allows setting a limit for the non-mapped page cache;
non-mapped meaning that it will not affect shared memory or files
that are mmap()ed -- just unmapped file system cache.
Above this limit, the kernel will always consider removing pages from
the page cache first.

The limit that ends up being enforced depends on the amount of free
memory; if we have lots of it, the effective limit is much higher -- only
when free memory gets scarce do we become strict about the unmapped
page cache. This should make the setting much more attractive to use.
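
As a rough illustration (not part of the patch itself), the effective limit
mirrors the pagecache_over_limit() calculation introduced below. The following
userspace sketch assumes 4KiB pages and the FREE_TO_PAGECACHE_RATIO of 8 added
by this patch:

    #include <stdio.h>

    #define PAGE_SIZE               4096UL
    #define FREE_TO_PAGECACHE_RATIO 8

    /* Illustration only: mirrors pagecache_over_limit() from this patch. */
    static unsigned long pages_over_limit(unsigned long limit_mb,
                                          unsigned long pgcache_pages,
                                          unsigned long free_pages)
    {
            unsigned long limit = limit_mb * ((1024 * 1024UL) / PAGE_SIZE) +
                                  FREE_TO_PAGECACHE_RATIO * free_pages;

            return pgcache_pages > limit ? pgcache_pages - limit : 0;
    }

    int main(void)
    {
            /* 512MB baseline, 2GiB of unmapped page cache, 64MB free */
            printf("%lu pages over the limit\n",
                   pages_over_limit(512, 2048UL * 256, 64UL * 256));
            return 0;
    }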

[Reworked by Kurt Garloff and Nick Piggin]

Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Michal Hocko <mhocko@suse.cz>

---
 Documentation/vm/pagecache-limit |   51 +++++++++
 include/linux/pagemap.h          |    1 
 include/linux/swap.h             |    4 
 kernel/sysctl.c                  |    7 +
 mm/filemap.c                     |    3 
 mm/page_alloc.c                  |   19 +++
 mm/shmem.c                       |    5 
 mm/vmscan.c                      |  207 +++++++++++++++++++++++++++++++++++++++
 8 files changed, 296 insertions(+), 1 deletion(-)

--- /dev/null
+++ b/Documentation/vm/pagecache-limit
@@ -0,0 +1,51 @@
+Functionality:
+-------------
+The patch introduces a new tunable in the proc filesystem:
+
+/proc/sys/vm/pagecache_limit_mb
+
+This tunable sets a limit, in megabytes, on the unmapped pages in the pagecache.
+If non-zero, it should not be set below 4 (4MB), or the system might behave erratically. In real life, much larger limits (a few percent of system RAM / around a hundred MB) will be useful.
+
+Examples:
+echo 512 >/proc/sys/vm/pagecache_limit_mb
+
+This sets a baseline limit of 0.5GiB for the page cache (not the buffer cache!).
+As we only consider pagecache pages that are unmapped, currently mapped pages (files that are mmap'ed, such as binaries and libraries, as well as SysV shared memory) are not limited by this.
+NOTE: The real limit depends on the amount of free memory: the page cache may grow beyond the set baseline by eight times (FREE_TO_PAGECACHE_RATIO) the amount of free memory. As soon as that free memory is needed, we free up page cache again.
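+For example, assuming 4KB pages: with pagecache_limit_mb set to 512 and 256MB
+of memory free, the effective limit is 512MB + 8*256MB = 2560MB of unmapped
+page cache; only above that does the kernel start freeing page cache.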
+
+
+How it works:
+------------
+The heart of this patch is a new function called shrink_page_cache(). It is called from balance_pgdat (which is the worker for kswapd) if the pagecache is above the limit.
+The function is also called in __alloc_pages_slowpath.
+
+shrink_page_cache() calculates the number of pages by which the cache exceeds its limit. It reduces this number by a factor (so you have to call it several times to get down to the target), then shrinks the pagecache (using the kernel's LRU lists).
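+For example, if the page cache is 100MB over the limit, a single call will try
+to reclaim roughly 50MB (but at least 8*SWAP_CLUSTER_MAX pages), so a few calls
+may be needed to get all the way back below the limit.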
+
+shrink_page_cache does two passes:
+- In the first pass, only inactive pagecache memory is reclaimed.
+  This is fast -- but it might not find enough free pages; if that
+  happens, the second pass will be run.
+- In the second pass, pages from the active list will also be
+  considered.
+
+In all passes, only unmapped pages will be considered.
+
+
+How it changes memory management:
+--------------------------------
+If the pagecache_limit_mb is set to zero (default), nothing changes.
+
+If set to a positive value, there will be three different operating modes:
+(1) If we still have plenty of free pages, the pagecache limit will NOT be enforced. Memory management decisions are taken as normal.
+(2) However, as soon as someone consumes those free pages, we'll start freeing pagecache -- as those pages are returned to the free page pool, freeing a few pages from the pagecache will return us to state (1) -- if, however, someone consumes these free pages quickly, we'll continue freeing up pages from the pagecache until we reach pagecache_limit_mb.
+(3) Once we are at or below the limit, pagecache_limit_mb, the pages in the page cache will be governed by normal memory management decisions; if the page cache starts growing above the limit (corrected by the free pages), we'll free some of it up again.
+
+This feature is useful for machines that have large workloads, carefully sized to eat most of the memory. Depending on the application's page access pattern, the kernel may too easily swap application memory out in favor of pagecache. This can happen even for low values of swappiness. With this feature, the admin can tell the kernel that only a certain amount of pagecache is really considered useful and that it should otherwise favor the application's memory.
+
+
+Foreground vs. background shrinking:
+-----------------------------------
+
+Usually, the Linux kernel reclaims memory using the kernel thread kswapd, which works in the background. If it can't reclaim memory fast enough, it retries with a higher priority, and if this still doesn't succeed, the direct reclaim path is used.
+
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -12,6 +12,7 @@
 #include <linux/uaccess.h>
 #include <linux/gfp.h>
 #include <linux/bitops.h>
+#include <linux/swap.h>
 #include <linux/hardirq.h> /* for in_interrupt() */
 #include <linux/hugetlb_inline.h>
 
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -332,6 +332,10 @@ extern unsigned long mem_cgroup_shrink_n
 						unsigned long *nr_scanned);
 extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
+#define FREE_TO_PAGECACHE_RATIO 8
+extern unsigned long pagecache_over_limit(void);
+extern void shrink_page_cache(gfp_t mask, struct page *page);
+extern unsigned int vm_pagecache_limit_mb;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1370,6 +1370,13 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+	{
+		.procname	= "pagecache_limit_mb",
+		.data		= &vm_pagecache_limit_mb,
+		.maxlen		= sizeof(vm_pagecache_limit_mb),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 #ifdef CONFIG_HUGETLB_PAGE
 	{
 		.procname	= "nr_hugepages",
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -900,6 +900,9 @@ int add_to_page_cache_lru(struct page *p
 	void *shadow = NULL;
 	int ret;
 
+	if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+		shrink_page_cache(gfp_mask, page);
+
 	__SetPageLocked(page);
 	ret = __add_to_page_cache_locked(page, mapping, offset,
 					 gfp_mask, &shadow);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7790,6 +7790,25 @@ void zone_pcp_reset(struct zone *zone)
 	local_irq_restore(flags);
 }
 
+/* Returns a number that's positive if the pagecache is above
+ * the set limit. Note that we allow the pagecache to grow
+ * larger if there's plenty of free pages.
+ */
+unsigned long pagecache_over_limit(void)
+{
+	/* We only want to limit unmapped page cache pages */
+	unsigned long pgcache_pages = global_page_state(NR_FILE_PAGES)
+				    - global_page_state(NR_FILE_MAPPED);
+	unsigned long free_pages = global_page_state(NR_FREE_PAGES);
+	unsigned long limit;
+
+	limit = vm_pagecache_limit_mb * ((1024*1024UL)/PAGE_SIZE) +
+		FREE_TO_PAGECACHE_RATIO * free_pages;
+	if (pgcache_pages > limit)
+		return pgcache_pages - limit;
+	return 0;
+}
+
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /*
  * All pages in the range must be in a single zone and isolated
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1229,8 +1229,11 @@ int shmem_unuse(swp_entry_t swap, struct
 		if (error != -ENOMEM)
 			error = 0;
 		mem_cgroup_cancel_charge(page, memcg, false);
-	} else
+	} else {
 		mem_cgroup_commit_charge(page, memcg, true, false);
+		if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+			shrink_page_cache(GFP_KERNEL, page);
+	}
 out:
 	unlock_page(page);
 	put_page(page);
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -149,6 +149,7 @@ struct scan_control {
  * From 0 .. 100.  Higher means more swappy.
  */
 int vm_swappiness = 60;
+unsigned int vm_pagecache_limit_mb __read_mostly = 0;
 /*
  * The total number of pages which are beyond the high watermark within all
  * zones.
@@ -3152,6 +3153,8 @@ static void clear_pgdat_congested(pg_dat
 	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
 }
 
+static void __shrink_page_cache(gfp_t mask);
+
 /*
  * Prepare kswapd for sleeping. This verifies that there are no processes
  * waiting in throttle_direct_reclaim() and that watermarks have been met.
@@ -3260,6 +3263,10 @@ static int balance_pgdat(pg_data_t *pgda
 	};
 	count_vm_event(PAGEOUTRUN);
 
+	/* this reclaims from all nodes so don't count it in sc.nr_reclaimed */
+	if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+		__shrink_page_cache(GFP_KERNEL);
+
 	/*
 	 * Account for the reclaim boost. Note that the zone boost is left in
 	 * place so that parallel allocations that are near the watermark will
@@ -3426,6 +3433,12 @@ static void kswapd_try_to_sleep(pg_data_
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 	}
 
+	/* We do not need to loop_again if we have not achieved our
+	 * pagecache target (i.e. && pagecache_over_limit() > 0) because
+	 * the limit will be checked next time a page is added to the page
+	 * cache. This might cause a short stall but we should rather not
+	 * keep kswapd awake.
+	 */
 	/*
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
@@ -3626,6 +3639,200 @@ unsigned long shrink_all_memory(unsigned
 }
 #endif /* CONFIG_HIBERNATION */
 
+
+/*
+ * Similar to shrink_node but it has a different consumer - pagecache limit
+ * so we cannot reuse the original function - and we do not want to clobber
+ * that code path so we have to live with this code duplication.
+ *
+ * In short this simply scans through the given lru for all cgroups for the
+ * given node.
+ *
+ * returns true if we managed to cumulatively reclaim (via nr_reclaimed)
+ * the given nr_to_reclaim pages, false otherwise. The caller knows that
+ * it doesn't have to touch other nodes if the target was hit already.
+ *
+ * DO NOT USE OUTSIDE of shrink_all_nodes unless you have a really really
+ * really good reason.
+ */
+static bool shrink_node_per_memcg(struct pglist_data *pgdat, enum lru_list lru,
+		unsigned long nr_to_scan, unsigned long nr_to_reclaim,
+		unsigned long *nr_reclaimed, struct scan_control *sc)
+{
+	struct mem_cgroup *root = sc->target_mem_cgroup;
+	struct mem_cgroup *memcg;
+	struct mem_cgroup_reclaim_cookie reclaim = {
+		.pgdat = pgdat,
+		.priority = sc->priority,
+	};
+
+	memcg = mem_cgroup_iter(root, NULL, &reclaim);
+	do {
+		struct lruvec *lruvec;
+
+		lruvec = mem_cgroup_lruvec(pgdat, memcg);
+		*nr_reclaimed += shrink_list(lru, nr_to_scan, lruvec, memcg, sc);
+		if (*nr_reclaimed >= nr_to_reclaim) {
+			mem_cgroup_iter_break(root, memcg);
+			return true;
+		}
+
+		memcg = mem_cgroup_iter(root, memcg, &reclaim);
+	} while (memcg);
+
+	return false;
+}
+
+/*
+ * We had to resurrect this function for __shrink_page_cache (upstream has
+ * removed it and reworked shrink_all_memory by 7b51755c).
+ *
+ * Tries to reclaim 'nr_pages' pages from LRU lists system-wide, for given
+ * pass.
+ *
+ * For pass > 3 we also try to shrink the LRU lists that contain only a few pages
+ */
+static void shrink_all_nodes(unsigned long nr_pages, int pass,
+		struct scan_control *sc)
+{
+	unsigned long nr_reclaimed = 0;
+	int nid;
+
+	for_each_online_node(nid) {
+		struct pglist_data *pgdat = NODE_DATA(nid);
+		enum lru_list lru;
+
+		for_each_evictable_lru(lru) {
+			enum node_stat_item ls = NR_LRU_BASE + lru;
+			unsigned long lru_pages = node_page_state(pgdat, ls);
+
+			/* For pass = 0, we don't shrink the active list */
+			if (pass == 0 && (lru == LRU_ACTIVE_ANON ||
+						lru == LRU_ACTIVE_FILE))
+				continue;
+
+			/* Original code relied on nr_saved_scan which is no
+			 * longer present, so we are just considering LRU pages.
+			 * This means that the node has to have a quite large
+			 * LRU list for the default priority and the minimum
+			 * nr_pages size (8*SWAP_CLUSTER_MAX). In the end we
+			 * will tend to reclaim more from large nodes than from
+			 * small ones. This should be OK because
+			 * shrink_page_cache is called when we are getting into
+			 * a short-memory condition, so LRUs tend to be large.
+			 */
+			if (((lru_pages >> sc->priority) + 1) >= nr_pages || pass > 3) {
+				unsigned long nr_to_scan;
+				struct reclaim_state reclaim_state;
+				unsigned long scanned = sc->nr_scanned;
+				struct reclaim_state *old_rs = current->reclaim_state;
+
+				/* shrink_list takes lru_lock with IRQ off so we
+				 * should be careful about really huge nr_to_scan
+				 */
+				nr_to_scan = min(nr_pages, lru_pages);
+
+				/*
+				 * A bit of a hack but the code has always been
+				 * updating sc->nr_reclaimed once per shrink_all_nodes
+				 * rather than accumulating it for all calls to shrink
+				 * lru. This costs us an additional argument to
+				 * shrink_node_per_memcg but well...
+				 *
+				 * Let's stick with this for bug-to-bug compatibility
+				 */
+				if (shrink_node_per_memcg(pgdat, lru,
+					nr_to_scan, nr_pages, &nr_reclaimed, sc)) {
+					sc->nr_reclaimed += nr_reclaimed;
+					return;
+				}
+
+				current->reclaim_state = &reclaim_state;
+				reclaim_state.reclaimed_slab = 0;
+				shrink_slab(sc->gfp_mask, nid, NULL,
+					    sc->nr_scanned - scanned, lru_pages);
+				sc->nr_reclaimed += reclaim_state.reclaimed_slab;
+				current->reclaim_state = old_rs;
+			}
+		}
+	}
+	sc->nr_reclaimed += nr_reclaimed;
+}
+
+/*
+ * Function to shrink the page cache
+ *
+ * This function calculates the number of pages (nr_pages) the page
+ * cache is over its limit and shrinks the page cache accordingly.
+ *
+ * One call of this function aims to shrink the page cache by half the
+ * number of pages it is over the limit (at least 8*SWAP_CLUSTER_MAX),
+ * so it may take several calls to get back below vm_pagecache_limit_mb.
+ *
+ * This function is similar to shrink_all_memory, except that it may never
+ * swap out mapped pages and only does two passes.
+ */
+static void __shrink_page_cache(gfp_t mask)
+{
+	unsigned long ret = 0;
+	int pass;
+	struct scan_control sc = {
+		.gfp_mask = mask,
+		.may_swap = 0,
+		.may_unmap = 0,
+		.may_writepage = 0,
+		.target_mem_cgroup = NULL,
+		.reclaim_idx = gfp_zone(mask),
+	};
+	long nr_pages;
+
+	/* How many pages are we over the limit?
+	 * But don't enforce limit if there's plenty of free mem */
+	nr_pages = pagecache_over_limit();
+
+	/* Don't need to go there in one step; as the freed
+	 * pages are counted FREE_TO_PAGECACHE_RATIO times, this
+	 * is still more than minimally needed. */
+	nr_pages /= 2;
+
+	/* Return early if there's no work to do */
+	if (nr_pages <= 0)
+		return;
+	/* But do a few at least */
+	nr_pages = max_t(unsigned long, nr_pages, 8*SWAP_CLUSTER_MAX);
+
+	/*
+	 * Shrink the LRU in 2 passes:
+	 * 0 = Reclaim from inactive_list only (fast)
+	 * 1 = Reclaim from active list too, but never reclaim
+	 *     mapped pages (not that fast)
+	 */
+	for (pass = 0; pass < 2; pass++) {
+		for (sc.priority = DEF_PRIORITY; sc.priority >= 0; sc.priority--) {
+			unsigned long nr_to_scan = nr_pages - ret;
+
+			sc.nr_scanned = 0;
+			/* sc.swap_cluster_max = nr_to_scan; */
+			shrink_all_nodes(nr_to_scan, pass, &sc);
+			ret += sc.nr_reclaimed;
+			if (ret >= nr_pages)
+				return;
+		}
+	}
+}
+
+void shrink_page_cache(gfp_t mask, struct page *page)
+{
+	/* FIXME: As we only want to get rid of non-mapped pagecache
+	 * pages and we know we have too many of them, we should not
+	 * need kswapd. */
+	/*
+	wakeup_kswapd(page_zone(page), 0);
+	*/
+
+	__shrink_page_cache(mask);
+}
+
 /* It's optimal to keep kswapds on the same CPUs as their memory, but
    not required for correctness.  So if the last cpu in a node goes
    away, we get changed to run anywhere: as the first one comes back,