Opened 11 years ago

Closed 11 years ago

#300 closed defect (wontfix)

CCL 1.2 rc1 does not run on FreeBSD-6.3/amd64 under VMWare on AMD hardware

Reported by: hans Owned by: gb
Priority: major Milestone:
Component: Runtime (threads, GC) Version:
Keywords: Cc:

Description

I fetched clozurecl-1.2-rc1-freebsdx8664.tar.gz from ftp.clozure.com. Extracted it. It does not start.

FreeBSD alka-seltzer.headcraft.de 6.3-RELEASE FreeBSD 6.3-RELEASE #0: Wed Jan 16 01:31:10 UTC 2008     root@palmer.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
alka-seltzer 104_> ./fx86cl64
exception in foreign context
Exception occurred while executing foreign code
? for help
[2829] OpenMCL kernel debugger: b

Framepointer [#x7FFFFFFFDCC0] in unknown area.

Change History (7)

comment:1 Changed 11 years ago by gb

  • Status changed from new to assigned
  • Summary changed from CCL 1.2 rc1 does not run on FreeBSD-6.3/amd64 to CCL 1.2 rc1 does not run on FreeBSD-6.3/amd64 for at least one user

Evidence to the contrary:

[src/ccl] gb@jerk> uname -a
FreeBSD jerk.abq.clozure.com 6.3-RELEASE FreeBSD 6.3-RELEASE #0: Wed Jan 16 01:43:02 UTC 2008     root@palmer.cse.buffalo.edu:/usr/obj/usr/src/sys/SMP  amd64
[src/ccl] gb@jerk> ./fx86cl64
Welcome to Clozure Common Lisp Version 1.2-r9226-RC1  (FreebsdX8664)!
?

That's actually running in a virtual machine under VMWare; I've run it under 6.3 (and 7.0rc1) on real hardware. I'm not aware of any substantive differences between 6.3 and previous 6.x releases that would cause it to just not run.

Trying to do a lisp backtrace (b) when an exception occurred while running foreign code doesn't work (not too surprisingly.) It may or may not be useful to see the output of the r command in the kernel debugger; it is not at all useful to not see that output.

I can't tell exactly how many times the 1.2 release for FreeBSD has been downloaded, but I'd be very surprised if other people had hit problems running it under 6.3 and simply not bothered to report them, and I haven't heard of any such problems. I have no idea what your problem is, and I don't have much information to go on. Seeing register values would at least be some information.

comment:2 Changed 11 years ago by hans

I can't speak for anyone but me, but it does not run for me. This is VMware Workstation 6.0.3 build 80004 running on an AMD Athlon X2, Windows Vista as host. Fresh installation of VMware and FreeBSD as of yesterday.

rax = 0x0000000000000000      r8  = 0x0000000000000000
rcx = 0x0000000000000000      r9  = 0x0000000000000000
rdx = 0x000000000000000A      r10 = 0x00000008005A1FE0
rbx = 0xFFFFFFFFFFFFFFBD      r11 = 0x0000000000000246
rsp = 0x00007FFFFFFFDC38      r12 = 0x0000000000000000
rbp = 0x00007FFFFFFFDC90      r13 = 0x0000300040B5595F
rsi = 0x0000000800CDEF90      r14 = 0x0000000000000000
rdi = 0x00007FFFFFFFE1D0      r15 = 0x0000000000000000
rip = 0x00000008005A1FE0   rflags = 0x0000000000010286

comment:3 Changed 11 years ago by gb

  • Summary changed from CCL 1.2 rc1 does not run on FreeBSD-6.3/amd64 for at least one user to CCL 1.2 rc1 does not run on FreeBSD-6.3/amd64 under VMWare on AMD hardware

Thanks.

Andreas Bogk found a bug in VMWare on AMD hardware that kept CCL from running when Linux was the guest OS: the information passed to a signal handler was bogus. One of the first things that a lisp image does after it's been mapped into memory is to start consing; the first attempt to cons causes an exception, the handler gets bad information about where the exception occurred and crashes. He reported it to VMWare and they acknowledged it as their bug, but I haven't heard any news about it getting fixed.
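To make that failure mode concrete, here is a toy model in Python. It is a sketch under stated assumptions, not CCL's actual handler: the `Area` class, the `handle_fault` function, and the address ranges are all hypothetical and exist only for illustration. It shows how a handler that looks up the reported fault address in a table of known memory areas can proceed when the report is accurate and has nothing to work with when the report is bogus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Area:
    """A contiguous address range the runtime knows about (hypothetical)."""
    name: str
    start: int
    end: int  # exclusive upper bound

    def contains(self, address):
        return self.start <= address < self.end


class UnknownFaultAddress(Exception):
    """Raised when the reported address lies in no known area."""


def handle_fault(reported_address, areas):
    """Return the name of the area containing the reported address, or raise."""
    for area in areas:
        if area.contains(reported_address):
            return area.name
    raise UnknownFaultAddress(hex(reported_address))


# Made-up layout for illustration only; not CCL's real memory map.
AREAS = [
    Area("heap", 0x1000, 0x2000),
    Area("stack", 0x8000, 0x9000),
]

if __name__ == "__main__":
    # Accurate report: the handler can tell which area faulted and carry on.
    print(handle_fault(0x1800, AREAS))
    try:
        # Bogus report: the address matches nothing the handler knows about.
        handle_fault(0xDEAD0000, AREAS)
    except UnknownFaultAddress as err:
        print("cannot handle fault at", err)
```

In the real case the handler has no graceful fallback, which would be consistent with ending up in the kernel debugger as in the original report.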

The same bug affected users at our largest customer and is in their bug database (to which you have access; I don't remember the bug number but spent some time verifying it.) In that case, we found that Intel hardware was not affected, so the problem appears to be something at a low enough level as to be sensitive to differences between Intel and AMD machines.

My best guess is that the same bug confuses a FreeBSD guest in different ways; I suppose that it's possible that 6.3 gets confused when earlier versions did not.

On the 6.3 virtual machine that I have here, the address of the instruction pointer (%rip) in the register output wasn't mapped. I can't really see exactly what's going on here, but the facts that:

  • CCL 1.2-rc1 seems to run fine on FreeBSD 6.3 on native hardware
  • It also seems to run under VMWare Fusion on a C2D MacBook Pro
  • There are known bugs in VMWare on AMD hardware

make me think that this is a VMWare bug, similar or identical to the one that Andreas found.

I don't have a VMWare license for Vista, but do have an AMD64 machine running Vista and might be able to get a clearer idea of the problem if I install an evaluation copy. I can't say with absolute confidence that this is a VMWare bug, but at this point I'd say it with very high confidence.

comment:4 Changed 11 years ago by hans

If you want, I can upload my virtual machine for you so that you can test this using VMware Player. The problem should be the same. Please let me know, but I'll only be able to do this on Monday.

comment:5 Changed 11 years ago by gb

  • Resolution set to wontfix
  • Status changed from assigned to closed

I installed an evaluation copy of VMWare Workstation on an AMD64 running Vista, installed a FreeBSD 6.3 virtual machine under it, and installed 1.2 on that VM.

The VMWare bug seems to be exactly the same one that prevents running under a Linux VM on AMD64: the information in a signal context is subtly wrong and nonsensical, and a signal handler dies trying to make sense of it.

I think that Andreas reported this back in January or February; I have no idea if or when a fix will be made available.

comment:6 Changed 11 years ago by stassats

  • Resolution wontfix deleted
  • Status changed from closed to reopened

I have the same symptoms, but I'm running FreeBSD-8.0-CURRENT-200805 on real hardware (Pentium 4 Prescott).

% ./fx86cl64
exception in foreign context
Exception occurred while executing foreign code
? for help
[41491] OpenMCL kernel debugger: b

Framepointer [#x7FFFFFFFE040] in unknown area.
[41491] OpenMCL kernel debugger: r
rax = 0x0000000000000000      r8  = 0x0000000000000001
rcx = 0x0000000000000000      r9  = 0x0000000000000000
rdx = 0x000000000000000A      r10 = 0x00000008005A1FE0
rbx = 0x0000000000000001      r11 = 0x0000000000000246
rsp = 0x00007FFFFFFFDFE8      r12 = 0x00003000400CE943
rbp = 0x00007FFFFFFFE040      r13 = 0x000030004007C81F
rsi = 0x0000000800CF8EF8      r14 = 0x0000300040CDE76D
rdi = 0x00007FFFFFFFE5B0      r15 = 0x0000000000000000
rip = 0x00000008005A1FE0   rflags = 0x0000000000010286
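
For comparing this dump with the one in comment:2, a rough Python sketch follows. It assumes the `name = 0x...` layout printed by the kernel debugger's `r` command above; `parse_registers` and `matching_registers` are hypothetical helper names, not part of CCL.

```python
import re

# Matches "name = 0x..." pairs as printed by the kernel debugger's r command.
REGISTER_RE = re.compile(r"(\w+)\s*=\s*0x([0-9A-Fa-f]+)")


def parse_registers(dump):
    """Return {register name: integer value} for every pair found in dump."""
    return {name: int(value, 16) for name, value in REGISTER_RE.findall(dump)}


def matching_registers(first, second):
    """Return the sorted names of registers with the same value in both dumps."""
    return sorted(
        name for name in first if name in second and first[name] == second[name]
    )


# Dump from comment:2 (FreeBSD 6.3 guest under VMWare on AMD hardware).
DUMP_COMMENT_2 = """
rax = 0x0000000000000000      r8  = 0x0000000000000000
rcx = 0x0000000000000000      r9  = 0x0000000000000000
rdx = 0x000000000000000A      r10 = 0x00000008005A1FE0
rbx = 0xFFFFFFFFFFFFFFBD      r11 = 0x0000000000000246
rsp = 0x00007FFFFFFFDC38      r12 = 0x0000000000000000
rbp = 0x00007FFFFFFFDC90      r13 = 0x0000300040B5595F
rsi = 0x0000000800CDEF90      r14 = 0x0000000000000000
rdi = 0x00007FFFFFFFE1D0      r15 = 0x0000000000000000
rip = 0x00000008005A1FE0   rflags = 0x0000000000010286
"""

# Dump from this comment (FreeBSD 8.0-CURRENT on a Pentium 4).
DUMP_COMMENT_6 = """
rax = 0x0000000000000000      r8  = 0x0000000000000001
rcx = 0x0000000000000000      r9  = 0x0000000000000000
rdx = 0x000000000000000A      r10 = 0x00000008005A1FE0
rbx = 0x0000000000000001      r11 = 0x0000000000000246
rsp = 0x00007FFFFFFFDFE8      r12 = 0x00003000400CE943
rbp = 0x00007FFFFFFFE040      r13 = 0x000030004007C81F
rsi = 0x0000000800CF8EF8      r14 = 0x0000300040CDE76D
rdi = 0x00007FFFFFFFE5B0      r15 = 0x0000000000000000
rip = 0x00000008005A1FE0   rflags = 0x0000000000010286
"""

if __name__ == "__main__":
    first = parse_registers(DUMP_COMMENT_2)
    second = parse_registers(DUMP_COMMENT_6)
    print("same in both:", ", ".join(matching_registers(first, second)))
    print("rip == r10 in comment:2 dump:", first["rip"] == first["r10"])
    print("rip == r10 in this dump:", second["rip"] == second["r10"])
```

Matching values only show that both processes stopped at the same address; they don't by themselves say whether the cause is the same.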

comment:7 Changed 11 years ago by gb

  • Resolution set to wontfix
  • Status changed from reopened to closed

It actually works for me, on a Core 2 Duo running VMWare:

[src/ccl] gb@bozo> uname -a
FreeBSD bozo.abq.clozure.com 8.0-CURRENT-200805 FreeBSD 8.0-CURRENT-200805 #0: Mon May 12 12:30:36 UTC 2008     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
[src/ccl] gb@bozo> ./fx86cl64
Welcome to Clozure Common Lisp Version 1.2-r9226-RC1  (FreebsdX8664)!
?

I don't know whether the hardware difference is significant.

The VMWare/AMD64 bug had to do with information in the context passed to signal handlers being wrong. If you're seeing the same bug, there isn't much that I can do about it.

If you're still seeing this in a year (or whenever FreeBSD 8.0 is released), it'd be worth looking at again. In the 7.0 release cycle, every other monthly snapshot was broken and the situation didn't really improve until the 7.0 betas.

(Note that it's necessary to install the compat6x package in order to run a CCL built under 6.x on a later OS release.)

In any case, whatever the problem is here it seems very unlikely that it's in the lisp. If there's some reason to believe otherwise, someone can reopen this.
