In the Linux kernel, the following vulnerability has been resolved:
powerpc/qspinlock: Fix deadlock in MCS queue
If an interrupt occurs in queued_spin_lock_slowpath() after we increment qnodesp->count and before node->lock is initialized, another CPU might see stale lock values in get_tail_qnode(). If the stale lock value happens to match the lock on that CPU, then we write to the "next" pointer of the wrong qnode. This causes a deadlock as the former CPU, once it becomes the head of the MCS queue, will spin indefinitely until its "next" pointer is set by its successor in the queue.
Running stress-ng on a 16 core (16EC/16VP) shared LPAR results in occasional lockups similar to the following:
$ stress-ng --all 128 --vm-bytes 80% --aggressive \
      --maximize --oomable --verify --syslog \
      --metrics --times --timeout 5m
watchdog: CPU 15 Hard LOCKUP
......
NIP [c0000000000b78f4] queued_spin_lock_slowpath+0x1184/0x1490
LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
Call Trace:
  0xc000002cfffa3bf0 (unreliable)
  _raw_spin_lock+0x6c/0x90
  raw_spin_rq_lock_nested.part.135+0x4c/0xd0
  sched_ttwu_pending+0x60/0x1f0
  __flush_smp_call_function_queue+0x1dc/0x670
  smp_ipi_demux_relaxed+0xa4/0x100
  xive_muxed_ipi_action+0x20/0x40
  __handle_irq_event_percpu+0x80/0x240
  handle_irq_event_percpu+0x2c/0x80
  handle_percpu_irq+0x84/0xd0
  generic_handle_irq+0x54/0x80
  __do_irq+0xac/0x210
  __do_IRQ+0x74/0xd0
  0x0
  do_IRQ+0x8c/0x170
  hardware_interrupt_common_virt+0x29c/0x2a0
--- interrupt: 500 at queued_spin_lock_slowpath+0x4b8/0x1490
......
NIP [c0000000000b6c28] queued_spin_lock_slowpath+0x4b8/0x1490
LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
--- interrupt: 500
  0xc0000029c1a41d00 (unreliable)
  _raw_spin_lock+0x6c/0x90
  futex_wake+0x100/0x260
  do_futex+0x21c/0x2a0
  sys_futex+0x98/0x270
  system_call_exception+0x14c/0x2f0
  system_call_vectored_common+0x15c/0x2ec
The following code flow illustrates how the deadlock occurs. For the sake of brevity, assume that both locks (A and B) are contended and we call the queued_spin_lock_slowpath() function.
 CPU0                                     CPU1
 ----                                     ----
 spin_lock_irqsave(A)                      |
 spin_unlock_irqrestore(A)                 |
 spin_lock(B)                              |
         |                                 |
         ▼                                 |
 id = qnodesp->count++;                    |
 (Note that nodes[0].lock == A)            |
         |                                 |
         ▼                                 |
      Interrupt                            |
 (happens before "nodes[0].lock = B")      |
         |                                 |
         ▼                                 |
 spin_lock_irqsave(A)                      |
         |                                 |
         ▼                                 |
 id = qnodesp->count++                     |
 nodes[1].lock = A                         |
         |                                 |
         ▼                                 |
 Tail of MCS queue                         |
         |                        spin_lock_irqsave(A)
         ▼                                 |
 Head of MCS queue                         ▼
         |                        CPU0 is previous tail
         ▼                                 |
 Spin indefinitely                         ▼
 (until "nodes[1].next != NULL")  prev = get_tail_qnode(A, CPU0)
                                           |
                                           ▼
                            prev == &qnodes[CPU0].nodes[0] (as qnodes ---truncated---