Jiri Slaby c5b460
From: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Jiri Slaby c5b460
Date: Wed, 3 May 2023 17:12:00 +0200
Jiri Slaby c5b460
Subject: [PATCH] mm: fix zswap writeback race condition
Jiri Slaby c5b460
References: bsc#1012628
Jiri Slaby c5b460
Patch-mainline: 6.3.4
Jiri Slaby c5b460
Git-commit: 04fc7816089c5a32c29a04ec94b998e219dfb946
Jiri Slaby c5b460
Jiri Slaby c5b460
commit 04fc7816089c5a32c29a04ec94b998e219dfb946 upstream.
Jiri Slaby c5b460
Jiri Slaby c5b460
The zswap writeback mechanism can cause a race condition resulting in
Jiri Slaby c5b460
memory corruption, where a swapped out page gets swapped in with data that
Jiri Slaby c5b460
was written to a different page.
Jiri Slaby c5b460
Jiri Slaby c5b460
The race unfolds like this:
Jiri Slaby c5b460
1. a page with data A and swap offset X is stored in zswap
Jiri Slaby c5b460
2. page A is removed off the LRU by zpool driver for writeback in
Jiri Slaby c5b460
   zswap-shrink work, data for A is mapped by zpool driver
Jiri Slaby c5b460
3. user space program faults and invalidates page entry A, offset X is
Jiri Slaby c5b460
   considered free
Jiri Slaby c5b460
4. kswapd stores page B at offset X in zswap (zswap could also be
Jiri Slaby c5b460
   full, if so, page B would then be IOed to X, then skip step 5.)
Jiri Slaby c5b460
5. entry A is replaced by B in tree->rbroot, this doesn't affect the
Jiri Slaby c5b460
   local reference held by zswap-shrink work
Jiri Slaby c5b460
6. zswap-shrink work writes back A at X, and frees zswap entry A
Jiri Slaby c5b460
7. swapin of slot X brings A in memory instead of B
Jiri Slaby c5b460
Jiri Slaby c5b460
The fix:
Jiri Slaby c5b460
Once the swap page cache has been allocated (case ZSWAP_SWAPCACHE_NEW),
Jiri Slaby c5b460
zswap-shrink work just checks that the local zswap_entry reference is
Jiri Slaby c5b460
still the same as the one in the tree.  If it's not the same it means that
Jiri Slaby c5b460
it's either been invalidated or replaced, in both cases the writeback is
Jiri Slaby c5b460
aborted because the local entry contains stale data.
Jiri Slaby c5b460
Jiri Slaby c5b460
Reproducer:
Jiri Slaby c5b460
I originally found this by running `stress` overnight to validate my work
Jiri Slaby c5b460
on the zswap writeback mechanism, it manifested after hours on my test
Jiri Slaby c5b460
machine.  The key to make it happen is having zswap writebacks, so
Jiri Slaby c5b460
whatever setup pumps /sys/kernel/debug/zswap/written_back_pages should do
Jiri Slaby c5b460
the trick.
Jiri Slaby c5b460
Jiri Slaby c5b460
In order to reproduce this faster on a vm, I setup a system with ~100M of
Jiri Slaby c5b460
available memory and a 500M swap file, then running `stress --vm 1
Jiri Slaby c5b460
--vm-bytes 300000000 --vm-stride 4000` makes it happen in matter of tens
Jiri Slaby c5b460
of minutes.  One can speed things up even more by swinging
Jiri Slaby c5b460
/sys/module/zswap/parameters/max_pool_percent up and down between, say, 20
Jiri Slaby c5b460
and 1; this makes it reproduce in tens of seconds.  It's crucial to set
Jiri Slaby c5b460
`--vm-stride` to something other than 4096 otherwise `stress` won't
Jiri Slaby c5b460
realize that memory has been corrupted because all pages would have the
Jiri Slaby c5b460
same data.
Jiri Slaby c5b460
Jiri Slaby c5b460
Link: https://lkml.kernel.org/r/20230503151200.19707-1-cerasuolodomenico@gmail.com
Jiri Slaby c5b460
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Jiri Slaby c5b460
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Jiri Slaby c5b460
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
Jiri Slaby c5b460
Cc: Dan Streetman <ddstreet@ieee.org>
Jiri Slaby c5b460
Cc: Johannes Weiner <hannes@cmpxchg.org>
Jiri Slaby c5b460
Cc: Minchan Kim <minchan@kernel.org>
Jiri Slaby c5b460
Cc: Nitin Gupta <ngupta@vflare.org>
Jiri Slaby c5b460
Cc: Seth Jennings <sjenning@redhat.com>
Jiri Slaby c5b460
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Jiri Slaby c5b460
Cc: <stable@vger.kernel.org>
Jiri Slaby c5b460
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiri Slaby c5b460
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jiri Slaby c5b460
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Jiri Slaby c5b460
---
Jiri Slaby c5b460
 mm/zswap.c | 16 ++++++++++++++++
Jiri Slaby c5b460
 1 file changed, 16 insertions(+)
Jiri Slaby c5b460
Jiri Slaby c5b460
diff --git a/mm/zswap.c b/mm/zswap.c
Jiri Slaby c5b460
index f6c89049..5d5977c9 100644
Jiri Slaby c5b460
--- a/mm/zswap.c
Jiri Slaby c5b460
+++ b/mm/zswap.c
Jiri Slaby c5b460
@@ -995,6 +995,22 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
Jiri Slaby c5b460
 		goto fail;
Jiri Slaby c5b460
 
Jiri Slaby c5b460
 	case ZSWAP_SWAPCACHE_NEW: /* page is locked */
Jiri Slaby c5b460
+		/*
Jiri Slaby c5b460
+		 * Having a local reference to the zswap entry doesn't exclude
Jiri Slaby c5b460
+		 * swapping from invalidating and recycling the swap slot. Once
Jiri Slaby c5b460
+		 * the swapcache is secured against concurrent swapping to and
Jiri Slaby c5b460
+		 * from the slot, recheck that the entry is still current before
Jiri Slaby c5b460
+		 * writing.
Jiri Slaby c5b460
+		 */
Jiri Slaby c5b460
+		spin_lock(&tree->lock);
Jiri Slaby c5b460
+		if (zswap_rb_search(&tree->rbroot, entry->offset) != entry) {
Jiri Slaby c5b460
+			spin_unlock(&tree->lock);
Jiri Slaby c5b460
+			delete_from_swap_cache(page_folio(page));
Jiri Slaby c5b460
+			ret = -ENOMEM;
Jiri Slaby c5b460
+			goto fail;
Jiri Slaby c5b460
+		}
Jiri Slaby c5b460
+		spin_unlock(&tree->lock);
Jiri Slaby c5b460
+
Jiri Slaby c5b460
 		/* decompress */
Jiri Slaby c5b460
 		acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
Jiri Slaby c5b460
 		dlen = PAGE_SIZE;
Jiri Slaby c5b460
-- 
Jiri Slaby c5b460
2.35.3
Jiri Slaby c5b460