Ticket #1030 (closed defect: fixed)

Opened 21 months ago

Last modified 20 months ago

Failure with conditional-store

Reported by: jlawrence Owned by: gb
Priority: normal Milestone:
Component: Runtime (threads, GC) Version: trunk
Keywords: Cc:

Description

(This is assuming ccl::conditional-store should work as a typical compare-and-swap, even though it's not exported.)

The RUN function below causes an assertion failure within 2000 iterations on my Linux-x86 machine. An svref place has the same issue. I did not see a problem on Darwin.

Removing the assertion (but keeping the conditional-store) results in a hang. Replacing that conditional-store (the second one) with setf also produces a hang.

(defstruct stack
  (head nil)
  (lock nil))

(defun call-with-spin-lock (stack fn)
  (loop :until (ccl::conditional-store
                (stack-lock stack) nil t))
  (unwind-protect
       (funcall fn)
    (assert (ccl::conditional-store
             (stack-lock stack) t nil))))

(defmacro with-spin-lock ((stack) &body body)
  `(call-with-spin-lock ,stack (lambda () ,@body)))

(defun push-stack (value stack)
  (with-spin-lock (stack)
    (push value (stack-head stack))))

(defun pop-stack (stack)
  (with-spin-lock (stack)
    (pop (stack-head stack))))

(defun test (thread-count)
  (let ((stack (make-stack)))
    (loop :repeat thread-count
          :do (ccl:process-run-function "test" #'push-stack t stack))
    (loop :repeat thread-count
          :do (loop :until (pop-stack stack)))))

(defun run ()
  (loop
     (test 1)
     (format t ".")))

Change History

comment:1 Changed 21 months ago by jlawrence

  • Component changed from IDE to Runtime (threads, GC)

comment:2 Changed 21 months ago by gb

  • Owner set to gb
  • Status changed from new to assigned

This seems to be x8632-specific; it doesn't seem to be Linux-specific (e.g., also fails on Darwin and I would expect it to do so on other OSes as well.)

I was not able to get this to fail on x8664.

As ticket:994 notes, CCL::CONDITIONAL-STORE isn't ready for prime time. At first glance, this seems to be a problem with an underlying primitive (CCL::STORE-GVECTOR-CONDITIONAL et al) rather than a problem with CCL::CONDITIONAL-STORE not macroexpanding into the correct primitive.

From what I can tell, pc_luser_xp() does not seem to be involved.

This also failed for me in 32-bit CCL 1.8 on Linux.

comment:3 Changed 20 months ago by jlawrence

Yes, 32-bit Darwin fails for me as well -- I forgot to check it since I have dx86cl64 symlinked to ccl.

Is the Windows story the same -- works for 64-bit but not 32-bit? I have a library that benefits significantly from conditional-store, even if the option must carry the experimental label (at least while conditional-store remains unexported).

comment:4 Changed 20 months ago by gb

There are around a dozen things in CCL that do some flavor of conditional-store; they differ in terms of whether they deal with lisp data (in which case there are GC issues) or not and in how that data is accessed. On x86 (32 and 64-bit), they all involve use of a CMPXCHG instruction (which does a sort of compare-and-swap at the hardware level.) In at least 2 cases, there's a problem with one of the arguments to the CMPXCHG instruction; the code in question was introduced on x8664 and the bug was propagated to x8632. The problem is timing-sensitive: it may be harder to reproduce on x8664, but the same mistake is made there as well.

None of this is really OS-dependent (aside from the fact that different OSes have different schedulers which can have slightly different timing behavior.) The fact that I couldn't reproduce this on x8664 means about as much (as little) as the fact that nothing besides this test case has (obviously, visibly) triggered a bug that's been there for several years.

I fixed the offending CMPXCHG instructions in the x8632 kernel and haven't yet been able to reproduce the problem there when running your test case. I want to take a closer look at the 10 or so other uses of CMPXCHG and repeat the process on x8664 before checking these changes into the trunk, and I think that we'd want to leave them in the trunk for a while before propagating them to the release.

comment:5 Changed 20 months ago by gb

  • Status changed from assigned to closed
  • Resolution set to fixed

(In [15500]) Try to clean up code which uses CMPXCHG: imm0 should contain the expected value (which may or may not be the current value, as of a few cycles before the CMPXCHG ...). In general, we don't need or want to repeat the CMPXCHG in order to do a conditional store (failures aren't transient). In cases where we repeat a CMPXCHG in a loop, ensure that the loop contains a PAUSE instruction to work correctly with hyperthreading.

Change the x86 pc_luser_xp() to account for changes in _SPstore_node_conditional and _SPset_hash_key_conditional.

Introduce a WITH-EXCEPTION-LOCK macro; refactor %LOCK-RECURSIVE-LOCK-OBJECT and friends so that we can lock/unlock a kernel lock (with no lisp LOCK object around it) without having to call into the kernel. RECURSIVE-LOCK-WHOSTATE allows its argument to be a string. (WITH-EXCEPTION-LOCK isn't used anywhere yet; it may be a better alternative to things like WITHOUT-GCING, where (a) it's preferable to delay exception handing in other threads than to let the heap grow and (b) the body is short and doesn't try to grab other locks.)

This is all intended to fix ticket:1030 in the trunk.

Note: See TracTickets for help on using tickets.