From: Michal Hocko <mhocko@suse.cz>
Subject: pagecache limit: Fix the shmem deadlock
Patch-mainline: never, SUSE specific
References: bnc#755537

SLE12->SLE12-SP2
- __GFP_WAIT -> __GFP_DIRECT_RECLAIM

See the original patch description for SLE11-SP2 below:
This patch is strictly not needed in SLE11-SP3 because we no longer call
add_to_page_cache under the info->lock spinlock, but shmem_unuse_inode
is still called with shmem_swaplist_mutex held and uses GFP_NOWAIT when
adding to the page cache. We have taken a conservative approach and
shrink the cache proactively in this case instead.

Original patch description for reference:

shmem_getpage uses the info->lock spinlock to make sure (among other things)
that the shmem inode information is synchronized with the page cache
status, so we call add_to_page_cache_lru with the lock held.
Unfortunately, add_to_page_cache_lru calls add_to_page_cache, which enforces
the page cache limit and might end up in direct reclaim, which in turn might
sleep even though the given gfp_mask is GFP_NOWAIT. So we end up sleeping
in an atomic context -> kaboom.
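
For illustration, the problematic pattern looked roughly like this (a
simplified sketch of the old shmem_getpage path, variable names approximate,
not the exact code):

	spin_lock(&info->lock);
	/*
	 * GFP_NOWAIT, yet add_to_page_cache used to evaluate the pagecache
	 * limit and could drop into direct reclaim, i.e. sleep here
	 */
	error = add_to_page_cache_lru(filepage, mapping, idx, GFP_NOWAIT);
	spin_unlock(&info->lock);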

Let's fix this by enforcing that add_to_page_cache doesn't go into reclaim
if the gfp_mask says it should be atomic. The caller is then responsible for
the shrinking, and we do that when we preallocate a page for shmem. Other
callers do not rely on GFP_NOWAIT when adding to the page cache.
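
The resulting check in add_to_page_cache boils down to something like this (a
sketch of the idea rather than the exact hunk, using the same helpers as the
diff below; SLE11 would test __GFP_WAIT instead of __GFP_DIRECT_RECLAIM):

	/* only shrink on behalf of callers that are allowed to sleep */
	if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0 &&
	    (gfp_mask & __GFP_DIRECT_RECLAIM))
		shrink_page_cache(gfp_mask, page);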

I really do _hate_ abusing the page (NULL) parameter but I was too lazy to
prepare a cleanup patch which would get rid of __shrink_page_cache and merge
it with shrink_page_cache so that we could drop the page argument, which is
of no use.

Also be strict and BUG_ON when we get into __shrink_page_cache with an atomic
gfp_mask.

Please also note that this change might lead to more extensive reclaim if
more threads are fighting for the same shmem page, because then the shrinking
is not serialized by the lock, so they might race with the limit evaluation
and start reclaiming all at once. The risk is not that big though, because we
would end up reclaiming NR_CPUs * over_limit pages at maximum.
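
To put a rough, purely illustrative number on that (assuming 4kB pages): with
16 racing threads and the cache 25600 pages (100MB) over the limit, the worst
case is 16 * 25600 pages, i.e. about 1.6GB reclaimed in one burst.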

Signed-off-by: Michal Hocko <mhocko@suse.cz>

---
 mm/shmem.c  |   11 +++++++++++
 mm/vmscan.c |    5 +++++
 2 files changed, 16 insertions(+)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1211,6 +1211,17 @@ int shmem_unuse(swp_entry_t swap, struct
 	/* No radix_tree_preload: swap entry keeps a place for page in tree */
 	error = -EAGAIN;
 
+	/*
+	 * Try to shrink the page cache proactively even though the page
+	 * might already be in the page cache (in which case the shrinking
+	 * is not necessary), because this is much easier than dropping
+	 * the lock in shmem_unuse_inode before add_to_page_cache_lru.
+	 * GFP_NOWAIT makes sure that we do not shrink when adding
+	 * to the page cache.
+	 */
+	if (unlikely(vm_pagecache_limit_mb) && pagecache_over_limit() > 0)
+		shrink_page_cache(GFP_KERNEL, NULL);
+
 	mutex_lock(&shmem_swaplist_mutex);
 	list_for_each_safe(this, next, &shmem_swaplist) {
 		info = list_entry(this, struct shmem_inode_info, swaplist);
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3786,6 +3786,11 @@ static void __shrink_page_cache(gfp_t ma
 	};
 	long nr_pages;
 
+	/* We might sleep during direct reclaim, so calling this from
+	 * an atomic context is certainly a bug.
+	 */
+	BUG_ON(!(mask & __GFP_DIRECT_RECLAIM));
+
 	/* How many pages are we over the limit?
 	 * But don't enforce limit if there's plenty of free mem */
 	nr_pages = pagecache_over_limit();