Ticket #1123 (new defect)

Opened 2 years ago

Last modified 3 months ago

win64 threads crash

Reported by: vi1 Owned by:
Priority: normal Milestone:
Component: Runtime (threads, GC) Version: trunk
Keywords: win64 threads Cc:


I've been trying to create simple reproducer for a long time for the problem in my win64 opengl program that shows up regularly.

Finally got some code that crashes certainly and almost immediately, at least on my Windows 7 machine.

As it turned out, to reproduce this problem it's better to have no locks (neither data sharing) and create two threads consing data, one conses in lisp function, the other conses in lisp callback function that C function calls asynchronously.

Hope that little cffi use is not a problem.

CCL was built from trunk using fresh cygwin's mingw gcc 4.8.1.

libfoo is compiled as follows:

gcc -shared -o libfoo.dll -fPIC libfoo.c


libfoo.c Download (248 bytes) - added by vi1 2 years ago.
test2.lisp Download (833 bytes) - added by vi1 2 years ago.
test2-no-cffi.lisp Download (639 bytes) - added by gb 2 years ago.

Change History

Changed 2 years ago by vi1

Changed 2 years ago by vi1

Changed 2 years ago by gb

comment:1 Changed 2 years ago by gb

What symptom am I supposed to see ?

I've seen some long pauses but the listener loop eventually resumes. Welcome to Windows!

I've run this without CFFI; I have no specific reason to suspect that CFFI causes whatever problem you're having, but it just adds some complexity and makes things harder to debug than they need to be.

If you can reproduce the problem without using CFFI, that'd be interesting; it'd also be interesting if you can't. It'd be interesting in either case to know something about the problem: its symptoms, any error messages, etc.

comment:2 Changed 2 years ago by vi1

It just freezes after the first output:

--- 1
 #<PROCESS lisp-func(3) [Active] #x2101694F0D>
#<PROCESS lisp-callback-func(2) [Active] #x21016909ED>
#<TTY-LISTENER listener(1) [Active] #x21004073DD>
#<PROCESS Initial(0) [Reset] #x21000B3B1D>

That's it

Listener stops responding, Ctrl-C does not work, nothing.

How long is your pause? Several minutes -- no effect.

Your version with ccl's ffi behaves the same way.

If I run if from slime, slime repl stops responding and M-x slime-list-threads shows nothing.

Weird, why yours works. OS -- Windows 7 Professional, SP1.

comment:3 Changed 2 years ago by gb

I was eventualy able to get it to hang after the listener loop had counted to a few hundred.

I was running on a VM that (IIRC) only has one CPU configured; timing could be different on a native machine and/or with more real or virtual cores involved.

comment:4 Changed 2 years ago by vi1

Mine is native machine with Intel Core i7-860 cpu, 4 cores, 8 with HT.

"Stable" freezes within 5 seconds.

Same thing if we eliminate rnd() use in libfoo like

void run_callback(void (*f)(int))
        int count=1;
        while (1) {
                int i;
                for(i=0; i<count; i++) {}

comment:5 Changed 2 years ago by gb

I'll try to look at this when I have time; that'll likely be at least a few days.

I think that I know very generally where the problem is.

comment:6 Changed 2 years ago by gb

This seems to be a case where C optimization causes things to happen in an unexpected order; the code needs to do whatever it needs to do to defeat that.

The function prepare_to_wait_for_exception_lock does:

  int old_valence = tcr->valence;

  tcr->pending_exception_context = context;
  tcr->valence = TCR_STATE_EXCEPTION_WAIT;

#ifdef WINDOWS
  if (tcr->flags & (1<<TCR_FLAG_BIT_PENDING_SUSPEND)) {
  return old_valence;

and the generated code is:

0000000000000ee0 <prepare_to_wait_for_exception_lock>:
     ee0:	56                   	push   %rsi
     ee1:	53                   	push   %rbx
     ee2:	48 83 ec 28          	sub    $0x28,%rsp
     ee6:	f6 81 21 01 00 00 04 	testb  $0x4,0x121(%rcx)
     eed:	8b b1 b0 00 00 00    	mov    0xb0(%rcx),%esi
     ef3:	48 89 cb             	mov    %rcx,%rbx
     ef6:	48 89 91 08 01 00 00 	mov    %rdx,0x108(%rcx)
     efd:	48 c7 81 b0 00 00 00 	movq   $0x2,0xb0(%rcx)
     f04:	02 00 00 00 
     f08:	74 34                	je     f3e <prepare_to_wait_for_exceptio

e.g., it tests the pending_suspend bit before setting tcr->valence or tcr->pending_exception_context.

Getting things to happen in the intended order is undoubtedly a simple matter of Ring TFCM (perhaps paying special attention to the "volatile" keyword), but a simple workaround is to disable C optimization and rebuild the CCL kernel:

$ cd ccl/lisp-kernel/win64
### edit Makefile, changing
# COPT = -O2
### to
# COPT = #-02
$ make clean
$ make

It's not entirely conclusive since the deadlock often took a long time to happen for me, but I haven't seen it recur since making this change.

comment:7 Changed 2 years ago by vi1

It works! Now running opengl program with fingers crossed.

comment:8 Changed 2 years ago by vi1

Stable as a rock, both synthetic and opengl tests worked fine for 6 hours.

If you decide to make a local fix, I'd be happy to test it.

comment:9 Changed 3 months ago by vi1


1) 1.10 release kernel built with -O0 passes this synthetic test, but fails on more complex one that is too big to report here (includes OpenGL etc).

2) 1.11 pre-release kernel from svn still has this problem -- this synthetic test hangs.

3) 1.11 pre-release kernel built with -O0 passes both synthetic and complex tests.

comment:10 Changed 3 months ago by vi1

3) 1.11 pre-release kernel built with -O0 passes both synthetic and complex tests.

Yeah, right, that was premature :( Still fails.

comment:11 Changed 3 months ago by rme

Something even stranger is going on.

Dump of assembler code for function prepare_to_wait_for_exception_lock:
   0x00000000000309c0 <+0>:     push   %rbx
   0x00000000000309c1 <+1>:     sub    $0x20,%rsp
   0x00000000000309c5 <+5>:     mov    $0xfffffffffffffbff,%rdx
   0x00000000000309cc <+12>:    mov    %rcx,%rbx
   0x00000000000309cf <+15>:    lea    0x120(%rcx),%rcx
   0x00000000000309d6 <+22>:    callq  0x39f41 <atomic_and>
   0x00000000000309db <+27>:    mov    0x110(%rbx),%rcx
   0x00000000000309e2 <+34>:    xor    %r8d,%r8d
   0x00000000000309e5 <+37>:    mov    $0x1,%edx
   0x00000000000309ea <+42>:    callq  *0x19d94(%rip)        # 0x4a784 <__imp_ReleaseSemaphore>
   0x00000000000309f0 <+48>:    mov    0x118(%rbx),%rcx
   0x00000000000309f7 <+55>:    callq  0x34f10 <sem_wait_forever>
   0x00000000000309fc <+60>:    add    $0x20,%rsp
   0x0000000000030a00 <+64>:    pop    %rbx
   0x0000000000030a01 <+65>:    retq

I have no idea what happened to the test of tcr->flags (and the branch if the test fails). It looks like compiler just decided to omit it. Maybe the compiler has concluded that the test is always true, but I don't see what would allow it to do that. (This is x86_64-w64-mingw32-gcc (GCC) 4.9.2 from cygwin).

comment:12 Changed 3 months ago by vi1

You are probably looking at wrong function. There are some versions of prepare_to_wait_for_exception_lock created by gcc.

That's what gcc.EXE (x86_64-posix-seh-rev2, Built by MinGW-W64 project) 4.9.1 created with -O2:

objdump -d x86-exceptions.o


0000000000000ec0 <prepare_to_wait_for_exception_lock>:
     ec0:	53                   	push   %rbx
     ec1:	48 83 ec 20          	sub    $0x20,%rsp
     ec5:	f6 81 21 01 00 00 04 	testb  $0x4,0x121(%rcx)
     ecc:	48 8b 99 b0 00 00 00 	mov    0xb0(%rcx),%rbx
     ed3:	48 89 91 08 01 00 00 	mov    %rdx,0x108(%rcx)
     eda:	48 c7 81 b0 00 00 00 	movq   $0x2,0xb0(%rcx)
     ee1:	02 00 00 00 
     ee5:	74 05                	je     eec <prepare_to_wait_for_exception_lock+0x2c>
     ee7:	e8 94 f1 ff ff       	callq  80 <prepare_to_wait_for_exception_lock.part.0>
     eec:	89 d8                	mov    %ebx,%eax
     eee:	48 83 c4 20          	add    $0x20,%rsp
     ef2:	5b                   	pop    %rbx
     ef3:	c3                   	retq   
     ef4:	66 66 66 2e 0f 1f 84 	data32 data32 nopw %cs:0x0(%rax,%rax,1)
     efb:	00 00 00 00 00 

0000000000000080 <prepare_to_wait_for_exception_lock.part.0>:
      80:	53                   	push   %rbx
      81:	48 83 ec 20          	sub    $0x20,%rsp
      85:	48 c7 c2 ff fb ff ff 	mov    $0xfffffffffffffbff,%rdx
      8c:	48 89 cb             	mov    %rcx,%rbx
      8f:	48 8d 89 20 01 00 00 	lea    0x120(%rcx),%rcx
      96:	e8 00 00 00 00       	callq  9b <prepare_to_wait_for_exception_lock.part.0+0x1b>
      9b:	48 8b 8b 10 01 00 00 	mov    0x110(%rbx),%rcx
      a2:	45 31 c0             	xor    %r8d,%r8d
      a5:	ba 01 00 00 00       	mov    $0x1,%edx
      aa:	ff 15 00 00 00 00    	callq  *0x0(%rip)        # b0 <prepare_to_wait_for_exception_lock.part.0+0x30>
      b0:	48 8b 8b 18 01 00 00 	mov    0x118(%rbx),%rcx
      b7:	e8 00 00 00 00       	callq  bc <prepare_to_wait_for_exception_lock.part.0+0x3c>
      bc:	48 83 c4 20          	add    $0x20,%rsp
      c0:	5b                   	pop    %rbx
      c1:	c3                   	retq   
      c2:	66 66 66 66 66 2e 0f 	data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
      c9:	1f 84 00 00 00 00 00 

So the function you mention is some helper functon automagically created to be called from prepare_to_wait_for_exception_lock.

comment:13 Changed 3 months ago by gb

My understanding is that C code can nake any assumptions that it wants to about whether or when memory locations change unless those locations are declared to be "volatile". That behavior is often useless, but may not be incorrect (as far as C standards compliance goes,)

I do not know what harm might ne caused by declaring some fields in a CCL lock or semaphore to be volatile, but I don't believe that a C compiler can casually decide to omit code in cases where volatile is involved. That's a separate issue from whether or not some C compiler would do so in some cases, but that might be incorrect and not merely useless.

"C semantics" may be an oxymoron, but this behavior may not violate them. (whatever they are.)

comment:14 Changed 3 months ago by vi1

Guys, nothing was ommited, the test is there.

Then I actually don't understand what the harm is in moving the test on tcr->flags before setting

tcr->pending_exception_context = context;

These setting can't affect flags, right?

Note: See TracTickets for help on using tickets.