From 5ef3f04687a5fdca7ccf1a40c426d9db4498d2c7 Mon Sep 17 00:00:00 2001
From: Matt Fleming <matt@codeblueprint.co.uk>
Date: Thu, 7 May 2020 12:15:43 +0100
Subject: [PATCH] x86/asm/64: Align start of __clear_user() loop to 16-bytes
Patch-mainline: v5.8-rc3
Git-commit: bb5570ad3b54e7930997aec76ab68256d5236d94
References: bsc#1168461

x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window. The loop in __clear_user() appears to have started straddling
a fetch window in SLE15-SP2 with the following commit,

  1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")

which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across two
16-byte lines.

Aligning the start of the loop to a 16-byte boundary makes it fit
neatly inside a single instruction fetch window again and improves the
performance of __clear_user(), which is used heavily when reading from
/dev/zero.

Zen 1 (Naples)

libmicro-file
                                  4.12.14               5.3.18                5.3.18
                           default-gc5a4c91e56cd default-g91efaf24bedf align16-g91efaf24bedf
Time mean95-pread_z100k      5.8464 (   0.00%)     9.6102 ( -64.38%)     6.0099 (  -2.80%)
Time mean95-pread_z10k       0.7180 (   0.00%)     1.0994 ( -53.12%)     0.7476 (  -4.13%)
Time mean95-pread_z1k        0.2089 (   0.00%)     0.2568 ( -22.96%)     0.2255 (  -7.98%)
Time mean95-pread_zw100k     5.9265 (   0.00%)     9.6931 ( -63.55%)     6.0915 (  -2.78%)
Time mean95-read_z100k       5.8447 (   0.00%)     9.6114 ( -64.45%)     6.0107 (  -2.84%)
Time mean95-read_z10k        0.7173 (   0.00%)     1.1065 ( -54.26%)     0.7493 (  -4.46%)
Time mean95-read_z1k         0.2090 (   0.00%)     0.2514 ( -20.29%)     0.2275 (  -8.86%)
Time mean95-read_zw100k      5.9275 (   0.00%)     9.7080 ( -63.78%)     6.0951 (  -2.83%)

Zen 2 (Rome)

libmicro-file
                                  4.12.14               5.3.18                5.3.18
                           default-gc5a4c91e56cd default-g91efaf24bedf align16-g91efaf24bedf
Time mean95-pread_z100k      5.1452 (   0.00%)     8.2803 ( -60.93%)     5.1848 (  -0.77%)
Time mean95-pread_z10k       0.6274 (   0.00%)     0.9439 ( -50.45%)     0.6358 (  -1.34%)
Time mean95-pread_z1k        0.1771 (   0.00%)     0.2044 ( -15.44%)     0.1905 (  -7.60%)
Time mean95-pread_zw100k     5.2426 (   0.00%)     8.3323 ( -58.93%)     5.2394 (   0.06%)
Time mean95-read_z100k       5.2124 (   0.00%)     8.2830 ( -58.91%)     5.1860 (   0.51%)
Time mean95-read_z10k        0.6211 (   0.00%)     0.9455 ( -52.22%)     0.6378 (  -2.68%)
Time mean95-read_z1k         0.1770 (   0.00%)     0.2069 ( -16.84%)     0.1929 (  -8.97%)
Time mean95-read_zw100k      5.2022 (   0.00%)     8.3377 ( -60.27%)     5.2489 (  -0.90%)

Note that this doesn't affect Haswell or Broadwell microarchitectures,
which can avoid the alignment issue by executing the loop straight out
of the Loop Stream Detector.

Fixes: 1153933703d9 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Matt Fleming <mfleming@suse.de>
---
arch/x86/lib/usercopy_64.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
index fff28c6f73a2..b0dfac3d3df7 100644
--- a/arch/x86/lib/usercopy_64.c
+++ b/arch/x86/lib/usercopy_64.c
@@ -24,6 +24,7 @@ unsigned long __clear_user(void __user *addr, unsigned long size)
asm volatile(
" testq %[size8],%[size8]\n"
" jz 4f\n"
+ " .align 16\n"
"0: movq $0,(%[dst])\n"
" addq $8,%[dst]\n"
" decl %%ecx ; jnz 0b\n"
--
2.16.4