From: Dexuan Cui <decui@microsoft.com>
Date: Thu, 14 Sep 2023 12:13:31 -0700
Subject: x86/hyperv: Add hv_write_efer() for a TDX VM with the paravisor
Patch-mainline: never, workaround for host bug
References: bsc#1206453
This is a temporary hack. The latest Hyper-V dev build has been
fixed, but the fix won't roll out to Azure until the end of 2023.
It's safe to have the hack on future fixed Hyper-V.
Note: the original workaround (see commit
78b17e1bd229 ("x86/hyperv: Add hv_write_efer() for a TDX VM with the paravisor"))
adds CPUHP_AP_HYPERV_FORCE_EFER_WRITE, which turns out to be unnecessary:
the first hypercall invoked on a non-boot VP is the IPI hypercall:
[ 5.238850][ T1] smp: Bringing up secondary CPUs ...
[ 5.241406][ T1] x86: Booting SMP configuration:
[ 5.241818][ T1] .... node #0, CPUs: #1
[ 5.243876][ T18] ------------[ cut here ]------------
[ 5.245804][ T18] cdx: first hypercall
[ 5.245804][ T18] WARNING: CPU: 1 PID: 18 at arch/x86/hyperv/hv_apic.c:243 __send_ipi_one+0x1b4/0x200
...
[ 5.245804][ T18] CPU: 1 PID: 18 Comm: cpuhp/1 Not tainted 5.14.21-150400.22-default-decui-no-w-efer+ #6 SLE15-SP4 fefda0b177da1a9470efefee04adbe432f317fe6
[ 5.245804][ T18] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 07/11/2023
[ 5.245804][ T18] RIP: 0010:__send_ipi_one+0x1b4/0x200
...
[ 5.245804][ T18] Call Trace:
[ 5.245804][ T18] <TASK>
[ 5.245804][ T18] hv_send_ipi+0x12/0x40
[ 5.245804][ T18] ttwu_queue_wakelist+0xef/0x110
[ 5.245804][ T18] try_to_wake_up+0x196/0x590
[ 5.245804][ T18] ? sched_cpu_activate+0xcd/0x180
[ 5.245804][ T18] swake_up_locked.part.0+0x13/0x30
[ 5.245804][ T18] complete+0x2f/0x40
[ 5.245804][ T18] cpuhp_thread_fun+0xb6/0x150
[ 5.245804][ T18] ? sort_range+0x20/0x20
[ 5.245804][ T18] smpboot_thread_fn+0xd8/0x1c0
[ 5.245804][ T18] kthread+0x15a/0x190
[ 5.245804][ T18] ? set_kthread_struct+0x50/0x50
[ 5.245804][ T18] ret_from_fork+0x22/0x30
[ 5.245804][ T18] </TASK>
[ 5.245804][ T18] ---[ end trace 20694464dde8a516 ]---
...
[ 5.245804][ T18] BUG: unable to handle page fault for address: 0000000000011003
[ 5.245804][ T18] #PF: supervisor instruction fetch in kernel mode
[ 5.245804][ T18] #PF: error_code(0x0010) - not-present page
[ 5.245804][ T18] PGD 0 P4D 0
[ 5.245804][ T18] Oops: 0010 [#1] PREEMPT SMP PTI
...
[ 5.245804][ T18] RIP: 0010:0x11003
[ 5.245804][ T18] Code: Unable to access opcode bytes at RIP 0x10fd9.
[ 5.245804][ T18] RSP: 0000:ffffb997035e7db0 EFLAGS: 00010086
[ 5.245804][ T18] RAX: 0000000000000002 RBX: 0000000000000000 RCX: 000000000001000b
[ 5.245804][ T18] RDX: 0000000000000000 RSI: 00000000fffbffff RDI: 0000000000000001
[ 5.245804][ T18] RBP: 00000000000000fb R08: 0000000000000001 R09: ffffb997035e7be8
[ 5.245804][ T18] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
[ 5.245804][ T18] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 5.245804][ T18] FS: 0000000000000000(0000) GS:ffff9252fce40000(0000) knlGS:0000000000000000
[ 5.245804][ T18] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.245804][ T18] CR2: 0000000000011003 CR3: 00000003b4628001 CR4: 0000000000370ee0
[ 5.245804][ T18] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5.245804][ T18] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 5.245804][ T18] Call Trace:
[ 5.245804][ T18] <TASK>
[ 5.245804][ T18] ? __send_ipi_one+0xc7/0x200
[ 5.245804][ T18] ? hv_send_ipi+0x12/0x40
[ 5.245804][ T18] ? ttwu_queue_wakelist+0xef/0x110
[ 5.245804][ T18] ? try_to_wake_up+0x196/0x590
[ 5.245804][ T18] ? sched_cpu_activate+0xcd/0x180
[ 5.245804][ T18] ? swake_up_locked.part.0+0x13/0x30
[ 5.245804][ T18] ? complete+0x2f/0x40
[ 5.245804][ T18] ? cpuhp_thread_fun+0xb6/0x150
[ 5.245804][ T18] ? sort_range+0x20/0x20
[ 5.245804][ T18] ? smpboot_thread_fn+0xd8/0x1c0
[ 5.245804][ T18] ? kthread+0x15a/0x190
[ 5.245804][ T18] ? set_kthread_struct+0x50/0x50
[ 5.245804][ T18] ? ret_from_fork+0x22/0x30
[ 5.245804][ T18] </TASK>
[ 5.245804][ T18] Modules linked in:
[ 5.245804][ T18] Supported: Yes
[ 5.245804][ T18] CR2: 0000000000011003
[ 5.245804][ T18] ---[ end trace 20694464dde8a519 ]---
[ 5.245804][ T18] RIP: 0010:0x11003
[ 5.245804][ T18] Code: Unable to access opcode bytes at RIP 0x10fd9.
According to my testing, as long as the EFER write happens before
CPUHP_AP_ONLINE, the fatal page fault doesn't happen.
CPUHP_AP_HYPERV_TIMER_STARTING happens before CPUHP_AP_ONLINE, so let's
call hv_write_efer() from hv_stimer_init() (refer to hv_stimer_alloc())
so that we can avoid adding CPUHP_AP_HYPERV_FORCE_EFER_WRITE.
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Acked-by: Olaf Hering <ohering@suse.de>
---
drivers/clocksource/hyperv_timer.c | 33 ++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
--- a/drivers/clocksource/hyperv_timer.c
+++ b/drivers/clocksource/hyperv_timer.c
@@ -123,6 +123,37 @@ static int hv_ce_set_oneshot(struct clock_event_device *evt)
return 0;
}
+static int hv_write_efer(void)
+{
+ unsigned long long efer;
+
+ if (!hv_isolation_type_tdx() || !ms_hyperv.paravisor_present)
+ return 0;
+
+ /*
+ * Write EFER by force, otherwise the paravisor's hypercall
+ * handler thinks that the VP is in 32-bit mode, and the
+ * return RIP is truncated to 32 bits, causing a fatal
+ * page fault. This is a TDX-specific issue because it looks
+ * like the initial default value of EFER on non-boot VPs
+ * already has the EFER.LMA bit, and when the reading of
+ * EFER on a non-boot VP is the same as the value of EFER
+ * on VP0, Linux doesn't write the EFER register on a
+ * non-boot VP: see the code in arch/x86/kernel/head_64.S
+ * ("Avoid writing EFER if no change was made (for TDX guest)").
+ * Also see commit 77a512e35db7 ("x86/boot: Avoid #VE during boot for TDX platforms").
+ * Work around the issue for now by forcing an EFER write.
+ *
+ * This is a temporary hack. The latest Hyper-V dev build has been
+ * fixed, but the fix won't be ported to Azure until the end of 2023.
+ * It's safe to have the hack on future fixed Hyper-V.
+ */
+ rdmsrl(MSR_EFER, efer);
+ wrmsrl(MSR_EFER, efer);
+
+ return 0;
+}
+
/*
* hv_stimer_init - Per-cpu initialization of the clockevent
*/
@@ -130,6 +161,8 @@ static int hv_stimer_init(unsigned int cpu)
{
struct clock_event_device *ce;
+ hv_write_efer();
+
if (!hv_clock_event)
return 0;