Ticket #407 (closed defect: wontfix)

Opened 5 years ago

Last modified 4 years ago

lx86cl hangs on Linux Xen domU running on Amazon EC2

Reported by: sctb
Owned by: gb
Priority: normal
Milestone: IA-32 port
Component: Runtime (threads, GC)
Version: trunk
Keywords: ec2 xen linux x86
Cc:

Description

I'm experiencing strange behavior from CCL on Linux x86. Whether I invoke it the normal way (ccl/scripts/ccl) or directly (./lx86cl lx86cl.image), I get no output from the application. It does not respond to C-c or C-d, and I cannot evaluate any Lisp forms.

Linux domU-12-31-39-02-6A-15 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:39:36 EST 2008 i686 athlon i386 GNU/Linux

Notes from openmcl-devel, R. Matthew Emerson:

"I tried to run the lisp on an EC2 image myself.

What seems to happen is that the lisp kernel starts up, maps in the heap image, and starts running lisp code, which pretty much immediately tries to allocate memory by taking a trap (int $0xc5 on x86 systems).

For some reason, instead of generating a SIGSEGV like we expect, the system just sort of sits there, dumbly staring at the alloc_trap in a puzzled manner.

I don't know if this is a symptom of running on Xen or whether there is some other peculiarity of Amazon's EC2 set-up. I have run Clozure CL on a Linux Xen instance before, though it was a 64-bit version."

Attachments

small-test.c (648 bytes) - added by gb 5 years ago.

Change History

Changed 5 years ago by gb

comment:1 Changed 5 years ago by gb

  • Status changed from new to assigned

The enclosed C program does most of the things that CCL does leading up to the point of the hang (at least those things that seem at all relevant to the hang.) If you compile and run it on a Linux x86[64] box:

shell> cc -m32 -D_GNU_SOURCE small-test.c -o small-test
shell> ./small-test

it should exit almost immediately. If the OS can't deliver signals to application-defined signal handlers - which would be a pretty bad OS bug - it will hang indefinitely. In that case, please close this bug report and report the bug to Amazon; you're welcome to use the code in the 'small-test.c' file to demonstrate the problem.

If it doesn't hang, please report that here and I'll try to come up with a slightly larger test case that does. (We use sigaltstack() and the test program doesn't; we have a lot more memory mapped and are linked against libpthread; there may be a few other differences.) The simpler the test case, the more likely it is to be useful in isolating the bug, so it seems wise to start small, and this is about as small a test case as I can think of that might demonstrate the bug.

comment:2 Changed 5 years ago by sctb

  • Status changed from assigned to closed
  • Resolution set to invalid

small-test.c works on a standard x86 Linux install (it returns immediately), but hangs indefinitely on EC2. A ticket has been opened with Amazon.

comment:3 Changed 4 years ago by james.anderson

  • Status changed from closed to reopened
  • Resolution invalid deleted

Has there been any progress on this?

I tried the test program, but the results were not conclusive. If one starts the example program, it does hang, but if one then sends the process a SIGSEGV from a terminal, the registered signal handler does run. As a variation, if one causes the program to dereference 0, it also hangs. In the hung state, if some other process sends it a SIGSEGV, it gets the bus error instead. In the hung state, if some other process sends it a SIGPWR, it gets the SIGPWR.

It would be nice to shed some light on this, as it would be good to be able to run CCL on EC2. So far, I have tried most of the other Lisps that have Linux runtimes; all of them start in an EC2 instance and receive signals. There must be more to this than just an Amazon problem.

comment:4 Changed 4 years ago by sctb

Under a more recent kernel from Canonical's Ubuntu AMI, 2.6.31-302-ec2 #7-Ubuntu, the issue is no longer present.

comment:5 Changed 4 years ago by gz

comment:6 Changed 4 years ago by james.anderson

  • Status changed from reopened to closed
  • Resolution set to wontfix

Thank you for the Ubuntu pointer.

I tried several of the ones readily available to start an instance. The kernel version is not prominent in the selection criteria, and none matched the designation "Canonical's Ubuntu AMI, 2.6.31-302-ec2" exactly, but there was one on which CCL does, in fact, start: ami-52be5d3b.

Is there a clear kernel watermark? Is it just kernels >= 2.6.31, or can one go further back?

comment:7 Changed 4 years ago by sctb

The AMI I was using successfully was ami-bb709dd2.

The images from Canonical are the only other ones that I know of that don't use Amazon's FC8 AKI/ARIs, but I think it would be possible to test earlier kernels (other than the one I reported) by using the Hardy and Jaunty images:  http://uec-images.ubuntu.com/
