|
Jiri Slaby |
c5b460 |
From: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
|
|
Jiri Slaby |
c5b460 |
Date: Wed, 3 May 2023 17:12:00 +0200
|
|
Jiri Slaby |
c5b460 |
Subject: [PATCH] mm: fix zswap writeback race condition
|
|
Jiri Slaby |
c5b460 |
References: bsc#1012628
|
|
Jiri Slaby |
c5b460 |
Patch-mainline: 6.3.4
|
|
Jiri Slaby |
c5b460 |
Git-commit: 04fc7816089c5a32c29a04ec94b998e219dfb946
|
|
Jiri Slaby |
c5b460 |
|
|
Jiri Slaby |
c5b460 |
commit 04fc7816089c5a32c29a04ec94b998e219dfb946 upstream.
|
|
Jiri Slaby |
c5b460 |
|
|
Jiri Slaby |
c5b460 |
The zswap writeback mechanism can cause a race condition resulting in
|
|
Jiri Slaby |
c5b460 |
memory corruption, where a swapped out page gets swapped in with data that
|
|
Jiri Slaby |
c5b460 |
was written to a different page.
|
|
Jiri Slaby |
c5b460 |
|
|
Jiri Slaby |
c5b460 |
The race unfolds like this:
|
|
Jiri Slaby |
c5b460 |
1. a page with data A and swap offset X is stored in zswap
|
|
Jiri Slaby |
c5b460 |
2. page A is removed off the LRU by zpool driver for writeback in
|
|
Jiri Slaby |
c5b460 |
zswap-shrink work, data for A is mapped by zpool driver
|
|
Jiri Slaby |
c5b460 |
3. user space program faults and invalidates page entry A, offset X is
|
|
Jiri Slaby |
c5b460 |
considered free
|
|
Jiri Slaby |
c5b460 |
4. kswapd stores page B at offset X in zswap (zswap could also be
|
|
Jiri Slaby |
c5b460 |
full, if so, page B would then be IOed to X, then skip step 5.)
|
|
Jiri Slaby |
c5b460 |
5. entry A is replaced by B in tree->rbroot, this doesn't affect the
|
|
Jiri Slaby |
c5b460 |
local reference held by zswap-shrink work
|
|
Jiri Slaby |
c5b460 |
6. zswap-shrink work writes back A at X, and frees zswap entry A
|
|
Jiri Slaby |
c5b460 |
7. swapin of slot X brings A in memory instead of B
|
|
Jiri Slaby |
c5b460 |
|
|
Jiri Slaby |
c5b460 |
The fix:
|
|
Jiri Slaby |
c5b460 |
Once the swap page cache has been allocated (case ZSWAP_SWAPCACHE_NEW),
|
|
Jiri Slaby |
c5b460 |
zswap-shrink work just checks that the local zswap_entry reference is
|
|
Jiri Slaby |
c5b460 |
still the same as the one in the tree. If it's not the same it means that
|
|
Jiri Slaby |
c5b460 |
it's either been invalidated or replaced, in both cases the writeback is
|
|
Jiri Slaby |
c5b460 |
aborted because the local entry contains stale data.
|
|
Jiri Slaby |
c5b460 |
|
|
Jiri Slaby |
c5b460 |
Reproducer:
|
|
Jiri Slaby |
c5b460 |
I originally found this by running `stress` overnight to validate my work
|
|
Jiri Slaby |
c5b460 |
on the zswap writeback mechanism, it manifested after hours on my test
|
|
Jiri Slaby |
c5b460 |
machine. The key to make it happen is having zswap writebacks, so
|
|
Jiri Slaby |
c5b460 |
whatever setup pumps /sys/kernel/debug/zswap/written_back_pages should do
|
|
Jiri Slaby |
c5b460 |
the trick.
|
|
Jiri Slaby |
c5b460 |
|
|
Jiri Slaby |
c5b460 |
In order to reproduce this faster on a vm, I setup a system with ~100M of
|
|
Jiri Slaby |
c5b460 |
available memory and a 500M swap file, then running `stress --vm 1
|
|
Jiri Slaby |
c5b460 |
--vm-bytes 300000000 --vm-stride 4000` makes it happen in matter of tens
|
|
Jiri Slaby |
c5b460 |
of minutes. One can speed things up even more by swinging
|
|
Jiri Slaby |
c5b460 |
/sys/module/zswap/parameters/max_pool_percent up and down between, say, 20
|
|
Jiri Slaby |
c5b460 |
and 1; this makes it reproduce in tens of seconds. It's crucial to set
|
|
Jiri Slaby |
c5b460 |
`--vm-stride` to something other than 4096 otherwise `stress` won't
|
|
Jiri Slaby |
c5b460 |
realize that memory has been corrupted because all pages would have the
|
|
Jiri Slaby |
c5b460 |
same data.
|
|
Jiri Slaby |
c5b460 |
|
|
Jiri Slaby |
c5b460 |
Link: https://lkml.kernel.org/r/20230503151200.19707-1-cerasuolodomenico@gmail.com
|
|
Jiri Slaby |
c5b460 |
Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
|
|
Jiri Slaby |
c5b460 |
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
|
|
Jiri Slaby |
c5b460 |
Reviewed-by: Chris Li (Google) <chrisl@kernel.org>
|
|
Jiri Slaby |
c5b460 |
Cc: Dan Streetman <ddstreet@ieee.org>
|
|
Jiri Slaby |
c5b460 |
Cc: Johannes Weiner <hannes@cmpxchg.org>
|
|
Jiri Slaby |
c5b460 |
Cc: Minchan Kim <minchan@kernel.org>
|
|
Jiri Slaby |
c5b460 |
Cc: Nitin Gupta <ngupta@vflare.org>
|
|
Jiri Slaby |
c5b460 |
Cc: Seth Jennings <sjenning@redhat.com>
|
|
Jiri Slaby |
c5b460 |
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
|
|
Jiri Slaby |
c5b460 |
Cc: <stable@vger.kernel.org>
|
|
Jiri Slaby |
c5b460 |
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Jiri Slaby |
c5b460 |
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Jiri Slaby |
c5b460 |
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
|
|
Jiri Slaby |
c5b460 |
---
|
|
Jiri Slaby |
c5b460 |
mm/zswap.c | 16 ++++++++++++++++
|
|
Jiri Slaby |
c5b460 |
1 file changed, 16 insertions(+)
|
|
Jiri Slaby |
c5b460 |
|
|
Jiri Slaby |
c5b460 |
diff --git a/mm/zswap.c b/mm/zswap.c
|
|
Jiri Slaby |
c5b460 |
index f6c89049..5d5977c9 100644
|
|
Jiri Slaby |
c5b460 |
--- a/mm/zswap.c
|
|
Jiri Slaby |
c5b460 |
+++ b/mm/zswap.c
|
|
Jiri Slaby |
c5b460 |
@@ -995,6 +995,22 @@ static int zswap_writeback_entry(struct zpool *pool, unsigned long handle)
|
|
Jiri Slaby |
c5b460 |
goto fail;
|
|
Jiri Slaby |
c5b460 |
|
|
Jiri Slaby |
c5b460 |
case ZSWAP_SWAPCACHE_NEW: /* page is locked */
|
|
Jiri Slaby |
c5b460 |
+ /*
|
|
Jiri Slaby |
c5b460 |
+ * Having a local reference to the zswap entry doesn't exclude
|
|
Jiri Slaby |
c5b460 |
+ * swapping from invalidating and recycling the swap slot. Once
|
|
Jiri Slaby |
c5b460 |
+ * the swapcache is secured against concurrent swapping to and
|
|
Jiri Slaby |
c5b460 |
+ * from the slot, recheck that the entry is still current before
|
|
Jiri Slaby |
c5b460 |
+ * writing.
|
|
Jiri Slaby |
c5b460 |
+ */
|
|
Jiri Slaby |
c5b460 |
+ spin_lock(&tree->lock);
|
|
Jiri Slaby |
c5b460 |
+ if (zswap_rb_search(&tree->rbroot, entry->offset) != entry) {
|
|
Jiri Slaby |
c5b460 |
+ spin_unlock(&tree->lock);
|
|
Jiri Slaby |
c5b460 |
+ delete_from_swap_cache(page_folio(page));
|
|
Jiri Slaby |
c5b460 |
+ ret = -ENOMEM;
|
|
Jiri Slaby |
c5b460 |
+ goto fail;
|
|
Jiri Slaby |
c5b460 |
+ }
|
|
Jiri Slaby |
c5b460 |
+ spin_unlock(&tree->lock);
|
|
Jiri Slaby |
c5b460 |
+
|
|
Jiri Slaby |
c5b460 |
/* decompress */
|
|
Jiri Slaby |
c5b460 |
acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
|
|
Jiri Slaby |
c5b460 |
dlen = PAGE_SIZE;
|
|
Jiri Slaby |
c5b460 |
--
|
|
Jiri Slaby |
c5b460 |
2.35.3
|
|
Jiri Slaby |
c5b460 |
|