Opened 8 years ago

Closed 7 years ago

#911 closed defect (fixed)

thread crash in callback from Grand Central Dispatch

Reported by: brooks Owned by:
Priority: normal Milestone:
Component: Runtime (threads, GC) Version: 1.7
Keywords: Cc:

Description (last modified by gb)

Using cl-dispatch, I perform the elementary operation of adding a callback function to the asynchronous dispatch queue. See code below. Thread crash results when testing by repeated execution, typically after tens of executions. The error details are slightly different each time, and occasionally I am thrown into the kernel debugger. I have attached the OS X (10.7) crash report for one crash.

(load "lisp:quicklisp;setup")
(ql:quickload "dispatch")

(defcallback foo (:address x) (declare (ignore x)) )

(defun test (v)
  (let ((q (#_dispatch_get_global_queue 0 0))) 
    (dotimes (i v)
        (print i) 
       (#_dispatch_async_f q (%null-ptr) foo))))

(test 100)   ; this never completes without an error.

Attachments (5)

bug report.txt (5.6 KB) - added by brooks 8 years ago.
OS X error report
test-dispatch.c (342 bytes) - added by pfeilgm 8 years ago.
test-dispatch-sync.lisp (253 bytes) - added by pfeilgm 8 years ago.
test-dispatch-async.lisp (260 bytes) - added by pfeilgm 8 years ago.
test-gcd.lisp (1.6 KB) - added by pfeilgm 8 years ago.

Download all attachments as: .zip

Change History (18)

Changed 8 years ago by brooks

OS X error report

comment:1 Changed 8 years ago by gb

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from new to closed
  • Version changed from trunk to 1.7

I couldn't reproduce this in either 32-bit or 64-bit versions of the trunk CCL, but I'm not really surprised if it doesn't work reliably. dispatch_async_f's last argument is a C "block", and we don't offer any way of creating a C block thru the FFI. (I don't think that a 0 arg callback is just a degenerate case of a C block, and if I'm correct in thinking that it might not be surprising if this sort-of-kind-of-works but doesn't work reliably.)

When CCL supports C blocks, this should work.

comment:2 Changed 8 years ago by brooks

Thanks for your reply.

Apple distinguishes between adding a function, versus adding a block, to the queue. I am using dispatch_asych_f, which adds a function to the queue. (One uses dispatch_asynch to add a block.) Therefore, I am confused that the reliability issue might be related to blocks.

Also, I am curious that you could not reproduce the error, since it is 100% for me -- although the iteration at which it happens varies. Is the version on which you tested it the same as my version: Common Lisp Version 1.7-r14925M (DarwinX8664) ?

One possible clue to the problem might be that although the Apple documentation states that when a function (i.e. not a block) is added to the queue, then "the first parameter passed to the function is the value of the context parameter" -- from which I implied that the callback function must take an argument. However, when I add a function that takes no arguments, the behaviour is exactly the same. For example, using (defcallback foo ()), I get the following crash. Does this state of affairs shed any light on the possible source of the problem?

thanks again

  • martin

Thread 1 Crashed: 0 com.clozure.Clozure CL-x8664 0x0000000000026440 find_foreign_rsp + 0 1 com.clozure.Clozure CL-x8664 0x0000000000026a1b setup_signal_frame + 75 2 com.clozure.Clozure CL-x8664 0x00000000000272c5 catch_exception_raise_state + 469 3 libsystem_kernel.dylib 0x00007fff87c6165f _Xexception_raise_state + 324 4 libsystem_kernel.dylib 0x00007fff87c613cb exc_server + 146 5 libsystem_kernel.dylib 0x00007fff87c7428b mach_msg_server + 496 6 com.clozure.Clozure CL-x8664 0x0000000000026b44 exception_handler_proc + 52 7 libsystem_c.dylib 0x00007fff894e38bf _pthread_start + 335 8 libsystem_c.dylib 0x00007fff894e6b75 thread_start + 13

comment:3 Changed 8 years ago by pfeilgm

I ran this on CCL Version 1.8-dev-r15190M-trunk (DarwinX8664), and I had to up the count a bit to make it crash. (test 100000) should be more than enough. Of course, brooks' test is not actually using the cl-dispatch library, so that doesn't need to be loaded.

Brooks is right about there being non-blocks versions of the calls, and he is using those. I ran similar tests – one written in pure C using async queues, and one in lisp, but using sync queues. Both happy ran to 100,000. I'm attaching those to the ticket.

It looks like Ruby ran into the same issue at some point and ostensibly fixed it: http://www.macruby.org/trac/ticket/447 .

Changed 8 years ago by pfeilgm

Changed 8 years ago by pfeilgm

comment:4 Changed 8 years ago by gb

You're (both) right that the test case doesn't involve C blocks. The test case does call dispatch_async_f with a NULL context parameter, and the documentation says that that parameter "cannot be NULL." Comments in /usr/include/dispatch/queue.h phrase that as "the results are undefined if that parameter is NULL", or use words to that effect.

I found that I could run the original case reliably in a command-line CCL (where nothing else is using GCD) but that it failed pretty quickly in the IDE (where Cocoa uses GCD under the hood.) After changing the code to pass a non-NULL context argument, I haven't seen it fail in either environment.

I certainly could be wrong, but it's a little hard to see where CCL would be involved in this. It makes a couple of foreign function calls, one of which results in a little bit of Lisp code (the callback) running on a thread that lisp didn't create. The second part of that - the foreign callback - is indeed very complicated and there have been times when it hasn't worked at all and it's not exercised very much, but if it works at all it tends to work reliably.

Since the test case is documented not to work, it's not too surprising that it fails. If some other case (where the second argument to dispatch_async_f is a non-null pointer) didn't work, that'd be more interesting.

Note that dispatch_sync_f also requires its context arg to be non-null. Saying that "the results are undefined" if that's passed as a null pointer is probably more accurate than saying "this must not be NULL", since the latter almost suggests that either (a) something bothers to check or (b) the failure would be more reliable than it seems to be.

comment:5 Changed 8 years ago by brooks

I mis-parsed the Apple spec. When I read "The application-defined function to invoke on the target queue. The first parameter passed to this function is the value of the context parameter. This parameter cannot be NULL.", I interpreted is as saying that the function pointer could not be null.

What did you use for the context argument? I tried using the function pointer to foo, and I tried using (%int-to-ptr 10); both of then resulted in the same crash. By the way, one of the crashes was snagged by kernel debugger; here is the message that appeared in AltConsole?:

thanks

  • martin

Unhandled exception 11 at 0x7fff8b81b310, context->regs at #x7fff5fbfb080 Exception occurred while executing foreign code

at objc_msgSend_vtable12 + 16

received signal 11; faulting address: 0xffffff8114553d20 ? for help [507] Clozure CL kernel debugger:

comment:6 Changed 8 years ago by gb

  • Resolution fixed deleted
  • Status changed from closed to reopened

I was eventually able to get (test 10000) to crash and looked at it under GDB. In the case that I saw, it appeared that the Mach kernel had sent an exception message to an invalid exception port.

This certainly could have something to do with CCL, though it's hard to know what that would be. I don't have a very high opinion of Mach and that tends to color my thinking in cases like this; the invalid exception port could have been valid at some point in the past, and I suppose that it's possible that CCL could have destroyed and disestablished it while Mach was trying to send a message on it, and that kind of scenario might have something to do with CCL (though at second glance it's not at all clear how that could happen.) It's equally plausible that Mach just fell down under moderate stress; my low opinion of it has some historical basis.

Since I don't know that this isn't a CCL bug, I'll reopen it; I'd note that just trying to determine what's going on and whether CCL's at fault is a significant undertaking and I'm personally not that enthused or motivated to spend lots of time on it. If there were more/clearer reasons to suspect CCL here, that might change.

(Note that the fact that simple C programs don't crash doesn't in and of itself provide such motivation. A C program that does the kinds of things that CCL does might, but that would be a non-trivial C program. Whether those other things (exception handling, GC, ...) are done incorrectly in some way or whether GCD just doesn't scale to non-trivial usage is unknown; all that I can say is that the crashes that I've seen don't (at this point) implicate CCL in any obvious way, but that doesn't mean much.

comment:7 follow-up: Changed 8 years ago by pfeilgm

I already wrote all this up, then saw Gary's comment, so I'm just going to add it anyway …

I think brooks' parsing here is the correct one – the queue and work parameters to dispatch_async_f can't be null, but the context parameter can be. It's confusing with the two different referents for "parameter" in that description, but "This parameter cannot be NULL." is consistently used to refer to the parameter named by the heading.

I modified brooks' original test (attached) to pass a non-null pointer for the context and to not load Cocoa. This still fails for me when run from the command line. Here are some of the errors I see:

  1. no xp here!
  2. bus error ./dx86cl64 --no-init --load
  3. segmentation fault ./dx86cl64 --no-init --load
  4. Fatal error: Mach error Mach error while setting thread FP state : 268435459

When running under gdb:

gdb ./dx86cl64
source lisp-kernel/darwinx8664/.gdbinit
run --no-init --load ~/Documents/Lisp-community/cl-dispatch/test-dispatch-async.lisp

I usually end up with this:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000000000000010
[Switching to process ----- thread 0x1803]
create_thread_context_frame (thread=-----, new_stack_top=0xb0034de0, info_ptr=0xb0034dd8, tcr=0x------, ts=0x-----) at x86-exceptions.c:1531
1531	  if (((BytePtr)rsp < foreign_area->low) ||
1: x/i $pc  0x274c9 <create_thread_context_frame+25>:	cmp    %rbx,0x10(%rax)

I replaced the parts that differed between runs with dashes. Here are the values for each run:

processthreadtcrts
9763553259102880c7038
9767205007d0c7038
9770701021e0c7038
977240504d80c8038
977730504eb0c8038

notice that the thread and ts values are pretty consistent. Here are two other errors I saw:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000b029eb10
[Switching to process 97690 thread 0x1803]
0x00007fff92f062d7 in memmove$VARIANT$sse3x ()
1: x/i $pc  0x7fff92f062d7 <memmove$VARIANT$sse3x+1050>:	movapd %xmm0,(%rdi,%rcx,1)
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xfffffffffffffb30
[Switching to process 97793 thread 0x1803]
0x00007fff92f062d7 in memmove$VARIANT$sse3x ()
1: x/i $pc  0x7fff92f062d7 <memmove$VARIANT$sse3x+1050>:	movapd %xmm0,(%rdi,%rcx,1)

They also seem pretty similar to each other.

Changed 8 years ago by pfeilgm

comment:8 in reply to: ↑ 7 Changed 8 years ago by gb

Replying to pfeilgm: These are all (or at least mostly) symptoms of Mach calling an exception handler with a bogus exception port argument.

In CCL, a thread's exception port shares its address with the thread's TCR ("Thread Context Record"), which is a data structure that (among other things) describes where the thread's stacks are; the thread that receives the exception message tries to push information about the exception on the stack of the thread that caused the exception, and crashes in one of several ways if the TCR/exception port ID isn't valid.

I don't want to say "CCL couldn't possibly do anything that could cause Mach to pass a bogus port argument", but I wouldn't know how to do that if I wanted to ...

Changed 8 years ago by pfeilgm

comment:9 Changed 8 years ago by pfeilgm

I've just attached a new file which defines eight test functions one each for sync and async dispatch on various queues – concurrent, serial, global (concurrent), and main (serial).

Running each of these for 100,000 iterations:

  • all of the synchronous tests work,
  • TEST-ASYNC-MAIN works, and
  • the other asynchronous tests fail.

This makes sense. TEST-*-MAIN executes on your application's main thread and TEST-SYNC-* block and likely execute on the current (blocked) thread (*CURRENT-PROCESS* in the callback bears that out). IE, all the passing tests run on lisp threads. TEST-ASYNC-* don't block, and therefore can't execute on the current thread, and (other than TEST-ASYNC-MAIN) they don't execute on the main thread, but rather some thread that was created outside of Lisp, and that we may not be good at tracking the lifetime or state of.

This lines up with the theory gb mentioned to me earlier, with it being some problem with foreign threads that call into lisp (and rme's comment that perhaps both GCD and CCL muck with POSIX thread attributes in an incompatible way).

comment:10 Changed 7 years ago by gb

CCL (in the trunk for the last several months and in 1.9) doesn't use Mach exception handling anymore, so it's a little easier to see why this crashes.

Short answer: deargodosxsucks.

Longer answer: Any thread that can run Lisp code has to have a thread-private data structure called a "Thread Context Record", which describes many attributes of the thread (where its stacks are, what exceptions are pending, underlying thread identifiers, thread-private dynamic binding info, etc.) A thread created by CCL has its TCR created when it starts up; a thread created by foreign code has a TCR created for it if/when it first calls into lisp. Active TCRs are kept in a doubly-linked list (so given any TCR it's possible to find all other TCRs.)

When the GC runs, it suspends all other threads that have TCRs (and can therefore reference lisp objects) by traversing this list and using pthread_kill() to send a special signal to each other thread. pthread_kill is documented to return 0 on success or one of EINVAL (the signal number is invalid) or ESRCH (invalid thread id). If pthread_kill() returns nonzero, the GC assumes that the thread has died, removes its TCR from the linked list, and tries to do at least some cleanup after it.

The actual implementation of pthread_kill() on OSX can return a third error code which still doesn't seem to be documented: it can return ENOTSUPP (operation not supported) if the thread is a "workqueue thread", since it would make far too much sense and be too useful to allow this. The GC will remove the thread's TCR from the global list in this case but the thread will continue to run during the GC (at least until the GC moves or frees some object that the thread references ...)

There's an undocumented function in the OSX threads library that will allow a calling workqueue thread to be signaled via pthread_kill (just like on real OSes!) Calling that function when a foreign thread establishes its TCR -might- make it possible to reliably use GCD (in a way that involves "workqueue threads") from CCL, but it would have the usual risks associated with using undocumented/internal functions.

We could check for an unknown error return from the GC's use of pthread_kill(), but it's not clear what the next step would be: if we can't suspend the thread, it isn't safe for the GC to run. It -might- be clearer to die at that point than it would be to die mysteriously as we now do.

I lean towards not wanting to fix this, mostly because I don't trust Apple to either provide a sane public interface or not change the current private interface to the functionality that would allow pthread_kill() to be used.

comment:11 Changed 7 years ago by rme

Well. Isn't that special.

This feature of disallowing pthread_kill for workqueue threads has apparently caused trouble elsewhere. For example: http://opensource.apple.com/source/Libc/Libc-763.12/stdlib/FreeBSD/abort.c.patch?txt

comment:12 Changed 7 years ago by rme

Some possibly-useful messages.

https://lists.macosforge.org/pipermail/libdispatch-dev/2010-December/000443.html https://lists.macosforge.org/pipermail/libdispatch-dev/2011-July/000561.html https://lists.macosforge.org/pipermail/libdispatch-dev/2011-July/000576.html

These seem to be saying we could set a thread's signal mask around lisp callbacks.

Finally (before I stop looking at this and go back to what I'm supposed to be doing), I suppose the question is whether we want to open a techincal support incident with Apple about this.

comment:13 Changed 7 years ago by gb

  • Resolution set to fixed
  • Status changed from reopened to closed

(In [15738]) When adding a new foreign TCR on Darwin, call the undocumemted function pthread_workqueue_setkill(1) if we can find that function. If the thread is a "workqueue thread" (created by libdispatch in some cases), this will make it possible to signal the thread via pthread_kill().

It's obviously not good to do this, and obviously not good to have to.

This fixes ticket:911 in the trunk.

Note: See TracTickets for help on using tickets.