Opened 11 years ago

Closed 11 years ago

#410 closed defect (fixed)

interrupting/suspending threads on Windows

Reported by: gb Owned by: gb
Priority: critical Milestone:
Component: Runtime (threads, GC) Version: trunk
Keywords: windows suspend interrupt Cc:

Description

On most Unix platforms, a thread is not interruptible (has all signals masked) while entering and while exiting an exception/asynchronous signal handler. On Darwin, the transitions that occur on entry to and exit from an exception handler are managed by another thread, so there's no point at which a Darwin thread is runnable, interruptible, and transitioning between running lisp code and running an exception handler. Threads can be suspended (for GC or other reasons) or interrupted (for PROCESS-INTERRUPT) by sending asynchronous signals to the target thread, and the handlers for these signals can't run (the signals are masked) when the thread is transitioning into and out of a signal/exception handler.
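To make the Unix side concrete, here is a minimal sketch of a handler that runs with every blockable signal masked. It assumes the handler is installed with POSIX sigaction() and a full sa_mask; the ticket doesn't say how the lisp kernel actually installs its handlers, and the names suspend_handler, install_masked_handler and handler_saw_usr2_blocked are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler: 1 if SIGUSR2 was blocked while the handler ran,
   0 if it was not, -1 if the handler has not run yet. */
static volatile sig_atomic_t usr2_blocked_in_handler = -1;

/* Illustrative handler: it only records the signal mask it runs under. */
static void suspend_handler(int signo)
{
    sigset_t current;
    (void)signo;
    /* With a NULL "set" argument, sigprocmask() only queries the mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &current) == 0)
        usr2_blocked_in_handler = sigismember(&current, SIGUSR2);
}

/* Install the handler so that it is entered with all blockable signals
   masked. The previous mask is restored when the handler returns.
   Returns 0 on success, -1 on failure. */
int install_masked_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = suspend_handler;
    sigfillset(&sa.sa_mask);   /* block everything while the handler runs */
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, NULL);
}

/* Nonzero if the handler ran and observed SIGUSR2 as blocked. */
int handler_saw_usr2_blocked(void)
{
    return usr2_blocked_in_handler == 1;
}
```

After install_masked_handler(SIGUSR1) and raise(SIGUSR1) in a single-threaded program, handler_saw_usr2_blocked() should report that a second asynchronous signal could not have been delivered while the first handler was running.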

Windows has no real concept of maskable signals, and many aspects of the transition between "running lisp code" and "running an exception handler" happen in user-mode code that can be suspended. It's not generally safe to interrupt or suspend a thread while it's in the process of calling a handler/returning from a handler unless the machine context that the handler's being called with (or that the handler's returning to) is visible to other threads (e.g., via the target thread's TCR.)

On Windows, we generally try to emulate the effect of asynchronous signals by suspending the target thread and then (possibly) manipulating its context, possibly forcing it to run a handler. Depending on the state of the thread at the point where it's suspended, this can be complicated.

The TCR has an integer field - tcr.valence - that is set to 0 (TCR_STATE_LISP) on synchronous transitions to lisp code (return from ff-call, entry to callback) and set to a non-zero value on synchronous transitions to foreign code (entry to ff-call, return from callback). On Unix platforms, transitions due to exception/signal processing change tcr.valence in a way that's atomic with respect to asynchronous signals. On Windows, we try to detect the case where a target thread (that we wish to interrupt or suspend) is in the middle of a transition due to an exception by noting that tcr.valence is equal to TCR_STATE_LISP but the PC is not in one of the ranges where lisp code legitimately runs (the lisp heap, subprims, the subprims jump table, the temp stack - as happens briefly when stack-consed closures are called - ...).
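The detection rule can be sketched roughly as follows. This is an illustration only: tcr_sketch, pc_range, target_state and classify_target are hypothetical names, the real TCR has many more fields, and the actual set of "legal" PC ranges is whatever the kernel knows about (heap, subprims, jump table, temp stack, ...).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* From the ticket: valence 0 means "running lisp code". */
#define TCR_STATE_LISP 0

/* Hypothetical address range: start is inclusive, end is exclusive. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} pc_range;

/* Hypothetical stand-in for the one TCR field this check needs. */
typedef struct {
    int valence;
} tcr_sketch;

typedef enum {
    TARGET_IN_FOREIGN_CODE,   /* valence is non-zero */
    TARGET_IN_LISP_CODE,      /* lisp valence, PC somewhere lisp code may run */
    TARGET_IN_TRANSITION      /* lisp valence, PC somewhere else */
} target_state;

/* Returns 1 if pc falls inside any of the n ranges, else 0. */
static int pc_in_ranges(uintptr_t pc, const pc_range *ranges, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (pc >= ranges[i].start && pc < ranges[i].end)
            return 1;
    }
    return 0;
}

/* Classify a suspended target thread from its valence and saved PC. */
target_state classify_target(const tcr_sketch *tcr, uintptr_t pc,
                             const pc_range *lisp_ranges, size_t n)
{
    assert(tcr != NULL);
    if (tcr->valence != TCR_STATE_LISP)
        return TARGET_IN_FOREIGN_CODE;
    return pc_in_ranges(pc, lisp_ranges, n) ? TARGET_IN_LISP_CODE
                                            : TARGET_IN_TRANSITION;
}
```

Only the TARGET_IN_TRANSITION case needs the special handling described in this ticket; the other two cases are the ones where the thread's state is already well defined.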

If the PC is in the range of addresses used to return from an exception handler - in the assembler function restore_windows_context() - then we can generally do some form of pc-lusering to either back out of an attempt to partially restore the context or emulate its full restoration; this depends on precise knowledge of the inner workings of restore_windows_context(). For win64, that "precise knowledge" is probably pretty accurate; for win32, it's known not to be (there's a compiler warning indicating that some aspects of "win32 context setup" are incomplete).

If a thread has lisp valence but the PC isn't in a "legal place for lisp code", then it's assumed that the thread is entering an exception handler. In this case a bit is set in the thread's TCR and the thread is resumed; when it has safely entered its exception handler (saved the context in the TCR, set the valence correctly) it checks this bit and raises a semaphore to indicate acknowledgment of the suspend request, then waits on another semaphore (which the suspending thread raises to indicate that the thread should resume).
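The deferred-suspend handshake might look something like the sketch below. It uses POSIX semaphores and C11 atomics as a portable stand-in; the Windows code would use Win32 primitives instead, and every name here (tcr_sketch, suspend_pending, handler_entry_checkpoint, ...) is hypothetical rather than taken from the kernel sources.

```c
#include <errno.h>
#include <semaphore.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the relevant part of a TCR. */
typedef struct {
    atomic_int suspend_pending;  /* the "bit" set by the suspending thread */
    sem_t suspend_ack;           /* raised by the target: "I am parked" */
    sem_t resume;                /* raised by the suspender: "carry on" */
} tcr_sketch;

/* Wait on a semaphore, retrying if a signal interrupts the wait. */
static void sem_wait_retry(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR) {
        /* interrupted; try again */
    }
}

/* Returns 0 on success, -1 if a semaphore could not be created. */
int tcr_sketch_init(tcr_sketch *tcr)
{
    atomic_init(&tcr->suspend_pending, 0);
    if (sem_init(&tcr->suspend_ack, 0, 0) != 0) return -1;
    if (sem_init(&tcr->resume, 0, 0) != 0) return -1;
    return 0;
}

/* Suspender, step 1: record the request; the caller then lets the
   target thread run again so it can finish entering its handler. */
void request_deferred_suspend(tcr_sketch *tcr)
{
    atomic_store(&tcr->suspend_pending, 1);
}

/* Suspender, step 2: block until the target acknowledges the request. */
void wait_for_suspend_ack(tcr_sketch *tcr)
{
    sem_wait_retry(&tcr->suspend_ack);
}

/* Suspender, step 3: let the parked target continue. */
void resume_target(tcr_sketch *tcr)
{
    sem_post(&tcr->resume);
}

/* Target: called once the handler has saved its context in the TCR and
   fixed the valence. Returns 1 if the thread parked and was resumed,
   0 if no suspend request was pending. */
int handler_entry_checkpoint(tcr_sketch *tcr)
{
    if (atomic_exchange(&tcr->suspend_pending, 0) == 0)
        return 0;
    sem_post(&tcr->suspend_ack);   /* acknowledge the suspend request */
    sem_wait_retry(&tcr->resume);  /* park until the suspender says go */
    return 1;
}
```

Between wait_for_suspend_ack() returning and resume_target() being called, the suspender can treat the target as stopped at a point where its context is visible through the TCR. The sketch doesn't address the open question raised in this ticket of a second suspend request arriving before the thread has resumed from the first.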

On both Windows platforms, the mechanisms are complicated and likely in need of further review and testing. For instance, it -may- be possible for a Windows thread to be suspended when it hasn't yet resumed after a previous suspend request. (I say "may" meaning exactly that: I'm not sure whether this is possible or not.)

It should be possible (e.g., there's no known deep technical issue) to at least bring the win32 code up to the same level of completeness as the win64 code exhibits.

SLIME tries to log/record information about protocol events, and the logged information often involves the printed representations of lisp PROCESS objects (which, by default, includes their WHOSTATE, and it's often necessary to suspend the thread in order to find the value of this thread-local binding.) This exercises the suspend/resume code heavily (heavily enough that I had to avoid printing the WHOSTATE in order to avoid exercising the buggy suspend/resume code.) Being able to run SLIME without disabling WHOSTATE printing would be a good test. (And since SLIME conses excessively and by default spawns threads for expression evaluation, it'll exercise other things as well.)

Change History (2)

comment:1 Changed 11 years ago by gb

  • Status changed from new to assigned

comment:2 Changed 11 years ago by gb

  • Resolution set to fixed
  • Status changed from assigned to closed

There may be other issues, but the specific problem described here should have been fixed a few months ago.
