Ticket #1123 (new defect)

Opened 11 months ago

Last modified 11 months ago

win64 threads crash

Reported by: vi1 Owned by:
Priority: normal Milestone:
Component: Runtime (threads, GC) Version: trunk
Keywords: win64 threads Cc:

Description

I've been trying for a long time to create a simple reproducer for a problem that shows up regularly in my win64 OpenGL program.

I finally have some code that crashes reliably and almost immediately, at least on my Windows 7 machine.

As it turned out, the problem is easiest to reproduce with no locks and no data sharing: create two threads that cons data, one consing in a Lisp function and the other consing in a Lisp callback function that a C function calls asynchronously.

I hope the small amount of CFFI use is not a problem.

CCL was built from trunk using a fresh Cygwin mingw gcc 4.8.1.

libfoo is compiled as follows:

gcc -shared -o libfoo.dll -fPIC libfoo.c

Attachments

libfoo.c Download (248 bytes) - added by vi1 11 months ago.
test2.lisp Download (833 bytes) - added by vi1 11 months ago.
test2-no-cffi.lisp Download (639 bytes) - added by gb 11 months ago.

Change History

comment:1 Changed 11 months ago by gb

What symptom am I supposed to see?

I've seen some long pauses but the listener loop eventually resumes. Welcome to Windows!

I've run this without CFFI; I have no specific reason to suspect that CFFI causes whatever problem you're having, but it just adds some complexity and makes things harder to debug than they need to be.

If you can reproduce the problem without using CFFI, that'd be interesting; it'd also be interesting if you can't. It'd be interesting in either case to know something about the problem: its symptoms, any error messages, etc.

comment:2 Changed 11 months ago by vi1

It just freezes after the first output:

--- 1
#<PROCESS lisp-func(3) [Active] #x2101694F0D>
#<PROCESS lisp-callback-func(2) [Active] #x21016909ED>
#<TTY-LISTENER listener(1) [Active] #x21004073DD>
#<PROCESS Initial(0) [Reset] #x21000B3B1D>

That's it.

The listener stops responding, Ctrl-C does not work, nothing.

How long is your pause? I waited several minutes with no effect.

Your version with ccl's ffi behaves the same way.

If I run it from SLIME, the SLIME REPL stops responding and M-x slime-list-threads shows nothing.

Weird that yours works. OS: Windows 7 Professional, SP1.

comment:3 Changed 11 months ago by gb

I was eventually able to get it to hang after the listener loop had counted to a few hundred.

I was running on a VM that (IIRC) only has one CPU configured; timing could be different on a native machine and/or with more real or virtual cores involved.

comment:4 Changed 11 months ago by vi1

Mine is a native machine with an Intel Core i7-860 CPU, 4 cores, 8 with HT.

It freezes "stably", within 5 seconds.

The same thing happens if we eliminate the use of rnd() in libfoo, like this:

void run_callback(void (*f)(int))
{
        int count=1;
        while (1) {
                int i;
                for(i=0; i<count; i++) {}
                f(count);
                ++count;
        }
}

comment:5 Changed 11 months ago by gb

I'll try to look at this when I have time; that'll likely be at least a few days.

I think that I know very generally where the problem is.

comment:6 Changed 11 months ago by gb

This seems to be a case where C optimization causes things to happen in an unexpected order; the code needs to do whatever it needs to do to defeat that.

The function prepare_to_wait_for_exception_lock does:

  int old_valence = tcr->valence;

  tcr->pending_exception_context = context;
  tcr->valence = TCR_STATE_EXCEPTION_WAIT;

#ifdef WINDOWS
  if (tcr->flags & (1<<TCR_FLAG_BIT_PENDING_SUSPEND)) {
    CLR_TCR_FLAG(tcr, TCR_FLAG_BIT_PENDING_SUSPEND);
    SEM_RAISE(TCR_AUX(tcr)->suspend);
    SEM_WAIT_FOREVER(TCR_AUX(tcr)->resume);
  }
#else
  ALLOW_EXCEPTIONS(context);
#endif
  return old_valence;

and the generated code is:

0000000000000ee0 <prepare_to_wait_for_exception_lock>:
     ee0:	56                   	push   %rsi
     ee1:	53                   	push   %rbx
     ee2:	48 83 ec 28          	sub    $0x28,%rsp
     ee6:	f6 81 21 01 00 00 04 	testb  $0x4,0x121(%rcx)
     eed:	8b b1 b0 00 00 00    	mov    0xb0(%rcx),%esi
     ef3:	48 89 cb             	mov    %rcx,%rbx
     ef6:	48 89 91 08 01 00 00 	mov    %rdx,0x108(%rcx)
     efd:	48 c7 81 b0 00 00 00 	movq   $0x2,0xb0(%rcx)
     f04:	02 00 00 00 
     f08:	74 34                	je     f3e <prepare_to_wait_for_exceptio

i.e., it tests the pending_suspend bit before setting tcr->valence or tcr->pending_exception_context.

Getting things to happen in the intended order is undoubtedly a simple matter of Ring TFCM (perhaps paying special attention to the "volatile" keyword), but a simple workaround is to disable C optimization and rebuild the CCL kernel:

$ cd ccl/lisp-kernel/win64
### edit Makefile, changing
# COPT = -O2
### to
# COPT = #-O2
$ make clean
$ make

It's not entirely conclusive since the deadlock often took a long time to happen for me, but I haven't seen it recur since making this change.
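The ordering requirement described above (publish the new valence and exception context first, only then look at the pending-suspend bit) can be expressed directly in the source rather than relying on an unoptimized build. The following is a minimal sketch using C11 atomics and a full fence; it is not the CCL kernel code or its eventual fix. The struct `tcr_sketch`, the function `prepare_to_wait_sketch`, the `must_suspend` out-parameter, and both constants are hypothetical stand-ins for illustration. A full fence is used on the assumption that a plain compiler barrier may not be enough, since x86 hardware is itself allowed to let a later load pass an earlier store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical placeholder values; the real ones are defined in the CCL kernel. */
enum { SKETCH_STATE_EXCEPTION_WAIT = 2 };
enum { SKETCH_FLAG_BIT_PENDING_SUSPEND = 10 };

/* Hypothetical, simplified stand-in for the kernel's TCR structure. */
struct tcr_sketch {
    atomic_long   valence;
    void *_Atomic pending_exception_context;
    atomic_ulong  flags;
};

/*
 * Publishes the new state, then checks for a pending suspend request.
 * Returns the previous valence; *must_suspend is set to 1 if a suspend
 * request was pending (and has been cleared), otherwise 0.
 */
long prepare_to_wait_sketch(struct tcr_sketch *tcr, void *context,
                            int *must_suspend)
{
    const unsigned long suspend_mask = 1UL << SKETCH_FLAG_BIT_PENDING_SUSPEND;

    assert(tcr != NULL && must_suspend != NULL);

    long old_valence = atomic_load_explicit(&tcr->valence,
                                            memory_order_relaxed);

    /* Step 1: make the new state visible. */
    atomic_store_explicit(&tcr->pending_exception_context, context,
                          memory_order_relaxed);
    atomic_store_explicit(&tcr->valence, SKETCH_STATE_EXCEPTION_WAIT,
                          memory_order_relaxed);

    /* Step 2: full fence, so neither the compiler nor the CPU may move
       the load of flags below to before the two stores above. */
    atomic_thread_fence(memory_order_seq_cst);

    /* Step 3: only now test the pending-suspend bit. */
    unsigned long flags = atomic_load_explicit(&tcr->flags,
                                               memory_order_relaxed);
    *must_suspend = (flags & suspend_mask) != 0;
    if (*must_suspend) {
        atomic_fetch_and_explicit(&tcr->flags, ~suspend_mask,
                                  memory_order_relaxed);
        /* The real function would raise the suspend semaphore and wait
           on the resume semaphore at this point. */
    }
    return old_valence;
}
```

For this pattern to be sound, the thread that sets the pending-suspend bit and then reads the valence would presumably need a matching fence between its own store and load; that side is not shown here.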

comment:7 Changed 11 months ago by vi1

It works! Now running the OpenGL program with fingers crossed.

comment:8 Changed 11 months ago by vi1

Stable as a rock: both the synthetic and OpenGL tests ran fine for 6 hours.

If you decide to make a local fix, I'd be happy to test it.
