Ticket #862 (closed defect: fixed)

Opened 3 years ago

Last modified 2 years ago

win64 gc/threads lossage

Reported by: rme
Owned by: gb
Priority: normal
Milestone:
Component: Runtime (threads, GC)
Version: trunk
Keywords: windows
Cc:

Description

Running the code in the attached file, in which 10 threads cons hysterically, pretty reliably elicits heap corruption.

From Matt Lamari; see http://clozure.com/pipermail/openmcl-devel/2011-May/012827.html

Attachments

win64-gc-bug.lisp (1.0 KB) - added by rme 3 years ago.

Change History

Changed 3 years ago by rme

comment:1 Changed 2 years ago by gb

  • Status changed from new to assigned

This will actually crash fairly quickly and reliably with just 2 threads consing hysterically.

Bizarrely, making the GC print things while other threads are running by (e.g.) calling:

(gc-verbose t)

before loading the file seems to make the problem go away.

Running under GDB and setting a breakpoint at resume_other_tcrs() and having that breakpoint do "info threads" and "continue" isn't enough to make the problem go away, but seems to indicate that the thread that gets the error was stopped on a uuo_alloc_trap. We interpret that to mean "the thread is about to execute that trap but has not yet done so", and pc_luser_xp() manipulates the thread's context so that handle_alloc_trap() will do the right thing when the thread resumes and the trap is taken. If it actually means "the thread was last in user mode at this point, has executed the uuo, and has not yet reentered user mode to handle the exception" on Windows, that might explain the problem. (The symptoms are consistent with what would happen if the thread ignored our calls to set its context and just continued with exception processing in that case.) If this is true, it's even more of a mystery that printing things would somehow cause Windows to notice that it's behaving inconsistently.

A hack would be to have suspend_tcr() treat a thread that was suspended on a UUO (or other intentional-exception-causing instruction) as if it were in the process of entering the exception handler. We might still have trouble recovering from memory faults or other hardware exceptions, but that'd be better than "having trouble recovering from CONSing."

It might be less of a hack if we could reliably distinguish between the "about to execute a UUO" and "executed a UUO but haven't yet reentered user mode" cases.

I'm by no means certain that this theory is correct, but it's stood up longer than anything else that I've thought of.

comment:2 Changed 2 years ago by gb

There seems to be reason to believe that the theory outlined in the last message is correct.

comment:3 Changed 2 years ago by gb

  • Status changed from assigned to closed
  • Resolution set to fixed

(In [15148]) Try to fix ticket:862 and some related problems.

If GetThreadContext() claims that we've suspended a thread at a UUO, don't interpret that as "about to execute the UUO"; we may have executed the UUO and not have reentered user mode, and there seems to be reason to believe that setting the thread's context while it's suspended won't have the desired effect. Set the tcr's "pending suspend" bit and resume the thread in this case. Note that there are other situations where we may be in the process of taking an exception but it's much harder to tell (e.g., if the instruction causes a memory fault or arithmetic exception or ...); it'd be nicer if there was a reliable way to detect that. Win32 doesn't seem to have this problem at all, AFAICT.

If we let a thread continue until it reaches the exception handler and checks the pending suspend bit, don't set its tcr.suspend_context slot. When resuming such a thread, just raise its resume semaphore.
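
For readers who don't have the kernel source open, here is a rough, portable C model of the decision described in the two paragraphs above. It is only a sketch and not the code from [15148]: the struct, the enum and every function name in it (model_tcr, model_suspend, model_resume and so on) are hypothetical, and the OS-level suspend and resume calls are replaced by a boolean flag so the logic can be followed without Windows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model only; these are NOT the real Clozure CL kernel types. */

/* Where the thread's reported PC points when we suspend it. */
typedef enum { PC_ORDINARY, PC_AT_UUO } pc_kind;

typedef struct {
    bool pending_suspend;     /* models the tcr's "pending suspend" bit      */
    bool has_suspend_context; /* models whether tcr.suspend_context is set   */
    bool os_suspended;        /* stands in for the OS-level suspended state  */
    int  resume_semaphore;    /* models the count on the resume semaphore    */
} model_tcr;

typedef enum { SUSPENDED_WITH_CONTEXT, DEFERRED_TO_HANDLER } suspend_result;

/* Suspend a thread. If its PC is reported to be at a UUO we cannot tell
   "about to execute" from "executed, not yet back in user mode", so we
   do not touch its context: set the pending-suspend bit and let it run
   until the exception handler sees the bit. */
suspend_result model_suspend(model_tcr *t, pc_kind where)
{
    assert(t != NULL);
    t->os_suspended = true;           /* model of suspending the thread */
    if (where == PC_AT_UUO) {
        t->pending_suspend = true;
        t->os_suspended = false;      /* model of resuming it right away */
        return DEFERRED_TO_HANDLER;   /* suspend_context is left unset   */
    }
    t->has_suspend_context = true;    /* ordinary case: context recorded */
    return SUSPENDED_WITH_CONTEXT;
}

/* Resume a thread. One that was deferred to its exception handler has
   no suspend_context, so just raise its resume semaphore. */
void model_resume(model_tcr *t)
{
    assert(t != NULL);
    if (t->has_suspend_context) {
        t->has_suspend_context = false;
        t->os_suspended = false;      /* model of an OS-level resume */
    } else {
        t->pending_suspend = false;   /* assumption: bit is cleared here */
        t->resume_semaphore++;
    }
}
```

In this model, calling model_suspend() with PC_AT_UUO leaves the thread running with only its pending-suspend bit set, and the later model_resume() raises the semaphore without any context having been recorded. Where exactly the real code clears the bit is not stated in the commit message, so that detail is a guess.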

Do a little more pc-lusering when suspending Windows threads, e.g., if suspended in the first few instructions of restore_windows_context(), emulate the effect of those instructions.
