Opened 11 years ago

Closed 11 years ago

#385 closed defect (fixed)

CCL fails to start on FreeBSD-6.3/i386 in VMware

Reported by: hans
Owned by: gb
Priority: normal
Milestone:
Component: Runtime (threads, GC)
Version:
Keywords:
Cc:

Description

CCL does not start for me on 32bit FreeBSD, running in VMware 6.5.1-build126130 hosted on Windows Vista/x64.

dentinox 8_> uname -a
FreeBSD dentinox.huebner.org 6.3-RELEASE FreeBSD 6.3-RELEASE #0: Wed Jan 16 04:18:52 UTC 2008     root@dessler.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  i386
dentinox 9_> svn info .
Path: .
URL: http://svn.clozure.com/publicsvn/openmcl/trunk/freebsdx86/ccl
Repository Root: http://svn.clozure.com/publicsvn/openmcl
Repository UUID: 8bdb8f1d-59ce-db11-9e41-0016172a54ae
Revision: 11462
Node Kind: directory
Schedule: normal
Last Changed Author: gb
Last Changed Rev: 11404
Last Changed Date: 2008-11-18 21:55:55 +0000 (Tue, 18 Nov 2008)

dentinox 10_> ./fx86cl
Unhandled exception 11 at 0x805613d, context->regs at #xbfbfe410
? for help
[2652] OpenMCL kernel debugger: b
current thread: tcr = 0x48098240, native thread ID = 0x806b000, interrupts disabled

Change History (6)

comment:1 Changed 11 years ago by gb

  • Owner changed from rme to gb
  • Status changed from new to assigned

I actually develop and run the 32-bit FreeBSD version on VMWare Fusion (Version 2.0 (116369)).

This isn't the same bug that's affected running under VMWare in the past. My best bet is that the lisp side of this is ticket:331, and the VMWare side of it is "can't map the heap where we want it to go", but that's just a guess. Seeing the value of the registers (R in the kernel debugger) might help to test that hypothesis.

If "VMWare Tools" are installed (which isn't always easy to do under FreeBSD), it'd be interesting to know whether turning them off affects this.

I don't know whether or not this is just random VMWare lossage; if it is, there wouldn't be too much that we could do about it.

If you have the FreeBSD /proc file system mounted (via "sudo mount -t procfs proc /proc"), then at the point where the crash occurs:

  1. Note the PID of the fx86cl process. It appears in the kernel debugger prompt and was 2652 in your report.

  2. In another shell, do:

shell> cat /proc/<PID>/map

This will show some information about memory regions in the process. The regions that are most interesting should look like:

0x30000000 0x3043c000 0 3576 0xc320e630 r-x 3 1 0x0 COW NC vnode /usr/local/src/ccl-dev/fx86cl.image
0x3043c000 0x34000000 0 0 0xc2b6fe70 --- 4 0 0x2000 NCOW NNC default -
0x34000000 0x349b3000 0 3576 0xc320e630 rwx 3 1 0x0 COW NC vnode /usr/local/src/ccl-dev/fx86cl.image
0x349b3000 0x359c0000 0 0 0xc2b6fbdc rwx 1 0 0x2000 NCOW NNC default -
0x359c0000 0x6e000000 0 0 0xc2b6fe70 --- 4 0 0x2000 NCOW NNC default -

i.e., a region at 0x30000000 mapped to the image file with read and execute permissions, some allocated but unmapped memory, another writable file-mapped region at 0x34000000, some mapped memory, and then lots of unmapped memory.

If the file-mapped regions exist but are at different addresses, we might be able to work around that by fixing ticket:331. (If there's a shared library or something occupying 0x30000000, it might be interesting to know what that is.)

If the file-mapped regions exist at the addresses that they're supposed to, then this is more likely to be random VMWare lossage.
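The check described above can be sketched in a few lines of Python. This is only an illustration, not part of CCL: the function names are made up, and the field layout (start and end addresses first, permissions in the sixth column, backing path last) is assumed from the sample lines in this comment.

```python
# Sketch: locate the file-mapped image regions in FreeBSD procfs map output.
# Assumption: each line starts with start/end addresses, the permissions are
# the sixth field, and the backing path is the last field (as in the sample).

PREFERRED_BASE = 0x30000000  # where the image is mapped in the sample above


def image_regions(map_text, image_suffix=".image"):
    """Return (start, end, perms) for every region backed by the image file."""
    regions = []
    for line in map_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[-1].endswith(image_suffix):
            continue
        regions.append((int(fields[0], 16), int(fields[1], 16), fields[5]))
    return regions


def executable_image_base(map_text):
    """Start of the read/execute image mapping, or None if there is none."""
    for start, _end, perms in image_regions(map_text):
        if perms == "r-x":
            return start
    return None


SAMPLE = """\
0x30000000 0x3043c000 0 3576 0xc320e630 r-x 3 1 0x0 COW NC vnode /usr/local/src/ccl-dev/fx86cl.image
0x3043c000 0x34000000 0 0 0xc2b6fe70 --- 4 0 0x2000 NCOW NNC default -
0x34000000 0x349b3000 0 3576 0xc320e630 rwx 3 1 0x0 COW NC vnode /usr/local/src/ccl-dev/fx86cl.image
"""

if __name__ == "__main__":
    base = executable_image_base(SAMPLE)
    print(hex(base), "preferred" if base == PREFERRED_BASE else "displaced")
```

To use it on a live process, pass the contents of /proc/<PID>/map in place of SAMPLE.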

comment:2 Changed 11 years ago by hans

Here are the registers:

dentinox 18_> ./fx86cl
Unhandled exception 11 at 0x805613d, context->regs at #xbfbfe410
? for help
[4261] OpenMCL kernel debugger: R
%eax = 0xffffff00
%ecx = 0x4cb59e46
%edx = 0x00000000
%ebx = 0x4c1b013e
%esp = 0x882f2ff4
%ebp = 0x882f2ff8
%esi = 0x4cb59e46
%edi = 0x349a9e46
%eip = 0x0805613d
%eflags = 0x00010246

and here is the complete map:

0x12000 0x14000 2 0 0xc4bcaa50 rwx 1 0 0x2180 COW NNC vnode /usr/home/hans/ccl/fx86cl.image
0x15000 0x16000 1 0 0xc4c5039c rwx 1 0 0x2180 NCOW NNC default -
0x8048000 0x8069000 33 0 0xc4addbdc r-x 1 0 0x0 COW NC vnode /usr/home/hans/ccl/fx86cl
0x8069000 0x806a000 1 0 0xc4c2ee70 rw- 2 0 0x2180 NCOW NNC default -
0x806a000 0x8073000 9 0 0xc4c2ee70 rwx 2 0 0x2180 NCOW NNC default -
0x48069000 0x4808b000 31 0 0xc104e210 r-x 71 34 0x4 COW NC vnode /libexec/ld-elf.so.1
0x4808b000 0x4808d000 2 0 0xc4c3edec rw- 1 0 0x2180 COW NNC vnode /libexec/ld-elf.so.1
0x4808d000 0x48092000 4 0 0xc4ae8948 rw- 2 0 0x2180 NCOW NNC default -
0x48092000 0x4809a000 8 0 0xc4ae8948 rwx 2 0 0x2180 NCOW NNC default -
0x4809a000 0x480ae000 20 0 0xc104f8c4 r-x 22 14 0x0 COW NC vnode /lib/libm.so.4
0x480ae000 0x480af000 1 0 0xc4c3c840 r-x 1 0 0x2180 COW NNC vnode /lib/libm.so.4
0x480af000 0x480b0000 1 0 0xc765d528 rwx 1 0 0x2180 COW NNC vnode /lib/libm.so.4
0x480b0000 0x480bf000 15 0 0xc102ea50 r-x 6 4 0x0 COW NC vnode /usr/lib/libthr.so.2
0x480bf000 0x480c0000 1 0 0xc4b0339c r-x 1 0 0x2180 COW NNC vnode /usr/lib/libthr.so.2
0x480c0000 0x480c1000 1 0 0xc4c3ead4 rwx 1 0 0x2180 COW NNC vnode /usr/lib/libthr.so.2
0x480c1000 0x480c2000 1 0 0xc4c2e738 rwx 1 0 0x2180 NCOW NNC default -
0x480c2000 0x4818c000 153 0 0xc104ebdc r-x 105 68 0x4 COW NC vnode /lib/libc.so.6
0x4818c000 0x4818d000 1 0 0xc4c1b318 r-x 1 0 0x2180 COW NNC vnode /lib/libc.so.6
0x4818d000 0x48193000 6 0 0xc1029318 rwx 1 0 0x2180 COW NNC vnode /lib/libc.so.6
0x48193000 0x481aa000 5 0 0xc4c2ec60 rwx 1 0 0x2180 NCOW NNC default -
0x481b0000 0x485ec000 2 0 0xc4c50dec r-x 3 2 0x0 COW NC vnode /usr/home/hans/ccl/fx86cl.image
0x485ec000 0x4c1b0000 0 0 0xc4c3f000 --- 8 0 0x2180 NCOW NNC default -
0x4c1b0000 0x4cb63000 28 0 0xc4bfa948 rwx 1 0 0x2180 COW NNC vnode /usr/home/hans/ccl/fx86cl.image
0x4cb63000 0x4db70000 0 0 0xc4acddec rwx 1 0 0x2000 NCOW NNC default -
0x4db70000 0x861b0000 0 0 0xc4c3f000 --- 8 0 0x2180 NCOW NNC default -
0x861b0000 0x86218000 0 0 0 rwx 0 0 0x0 NCOW NNC none -
0x86218000 0x871b0000 0 0 0xc4c3f000 --- 8 0 0x2180 NCOW NNC default -
0x871b0000 0x87218000 39 0 0xc4c3cb58 rwx 1 0 0x2180 NCOW NNC default -
0x87218000 0x881b0000 0 0 0xc4c3f000 --- 8 0 0x2180 NCOW NNC default -
0x881b0000 0x881c1000 1 0 0xc4c3f000 r-x 8 0 0x2180 NCOW NNC default -
0x881c1000 0x882f4000 1 0 0xc4c3f000 rwx 8 0 0x2180 NCOW NNC default -
0x882f4000 0x88304000 1 0 0xc4c3f000 r-x 8 0 0x2180 NCOW NNC default -
0x88304000 0x88345000 1 0 0xc4c3f000 rwx 8 0 0x2180 NCOW NNC default -
0xbf9ff000 0xbfa00000 0 0 0 --- 0 0 0x0 NCOW NNC none -
0xbfa9b000 0xbfadf000 1 0 0xc4c1f210 rwx 1 0 0x2180 NCOW NNC default -
0xbfbe0000 0xbfc00000 3 0 0xc4c3ec60 rwx 1 0 0x2180 NCOW NNC default -

To me, it seems as if the files got mapped at the wrong address, so there should be some hope.
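A quick arithmetic check on the numbers above supports this. The snippet below only uses values copied from the register dump and the two maps in this ticket; reading %edi as a pointer that was never relocated is an interpretation, not something the kernel debugger reports.

```python
# Values copied from the register dump and the two maps in this ticket.
EXPECTED_RX, EXPECTED_RW = 0x30000000, 0x34000000   # sample map in comment:1
EXPECTED_RW_END = 0x349b3000                        # end of the writable region there
ACTUAL_RX, ACTUAL_RW = 0x481b0000, 0x4c1b0000       # map in this comment
EDI, ESI = 0x349a9e46, 0x4cb59e46                   # register dump in this comment

# How far the image landed from where it was expected.
bias = ACTUAL_RX - EXPECTED_RX

print(hex(bias))                             # 0x181b0000
print(ACTUAL_RW - EXPECTED_RW == bias)       # True: both regions moved together
print(ESI - EDI == bias)                     # True: %esi is %edi plus the bias
print(EXPECTED_RW <= EDI < EXPECTED_RW_END)  # True: %edi points at the expected address
```

So %edi holds an address inside the range where the writable image region was expected, and nothing is mapped there in this process, which fits the "image not relocated" explanation.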

comment:3 Changed 11 years ago by gb

The lisp expects there to be 1GB free starting at #x30000000. The map above indicates that the shared libraries get mapped at #x48069000 - a little nearer to the middle of the address space than one would like - and the lisp heap got put at #x481b0000. The code that loads the image is supposed to notice that it's not occupying the preferred address range and relocate it (basically, find every pointer that's pointing into the preferred address range and make it point into the actual address range), but this is broken on x8632 (ticket:331).
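As a toy illustration of that relocation step (not the actual lisp kernel code, which is written in C and has to tell real pointers apart from immediate data), the idea looks roughly like this in Python. The function name and the flat list-of-words model are assumptions made for the example.

```python
def relocate(words, preferred_base, preferred_end, actual_base):
    """Return a copy of words with every in-range value rebased.

    Toy model: any word that falls inside the preferred range is treated
    as a pointer. A real implementation must also check tag bits so that
    raw data that merely looks like an address is left alone.
    """
    bias = actual_base - preferred_base
    return [w + bias if preferred_base <= w < preferred_end else w
            for w in words]


if __name__ == "__main__":
    GB = 1 << 30
    image = [0x349a9e46, 0x2a, 0x30000010]
    moved = relocate(image, 0x30000000, 0x30000000 + GB, 0x481b0000)
    print([hex(w) for w in moved])
    # ['0x4cb59e46', '0x2a', '0x481b0010']
```

The small value 0x2a is outside the preferred range and is left untouched; the other two words move by the same displacement.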

If it tried to map less than 1GB at #x30000000 and the libraries are consistently where they are for you, that might work. You could try calling the lisp with

shell> fx86cl --heap-reserve 384M

which might get a smaller chunk of address space to wind up at the "right" address.

You'd likely wind up with > 300MB in that case (64MB is reserved for "pure" objects, and a few % are used for GC overhead). That's basically an upper bound on heap size; it's what shows up as "reserved for heap expansion" in ROOM's output.
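The address arithmetic behind the 384M suggestion can be checked with a short Python sketch. The library address comes from the map in comment:2; the 3% GC overhead is a placeholder for "a few %", and the helper names are invented for the example.

```python
MB = 1 << 20
PREFERRED_BASE = 0x30000000
FIRST_LIBRARY = 0x48069000   # /libexec/ld-elf.so.1 in the map from comment:2


def fits_below_libraries(reserve_bytes):
    """True if a reservation at the preferred base ends before the libraries."""
    return PREFERRED_BASE + reserve_bytes <= FIRST_LIBRARY


def usable_mb(reserve_mb, pure_mb=64, gc_overhead=0.03):
    """Rough heap estimate; gc_overhead=0.03 is an assumed value."""
    return (reserve_mb - pure_mb) * (1 - gc_overhead)


if __name__ == "__main__":
    print(fits_below_libraries(1024 * MB))  # False: 1GB runs into the libraries
    print(fits_below_libraries(384 * MB))   # True: ends at 0x48000000
    print(300 < usable_mb(384) < 320)       # True
```

This only shows that the smaller reservation would not collide with the libraries as mapped in comment:2; whether the OS actually grants the mapping at that address is a separate question.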

If the library load addresses are an artifact of vmware ... well, it'd be good to get them to fix that. Some platforms try to randomize shared library load addresses in the hope that this will make applications less vulnerable to security exploits. That's a noble goal, but unless it's done carefully it limits the application's ability to allocate large contiguous chunks of its own address space, which CCL certainly wants to do.

Even if the relocation code worked (ticket:331), it's undesirable to have to run it: it increases startup time and largely defeats the purpose of file-mapping the image.

comment:4 Changed 11 years ago by hans

shell> fx86cl --heap-reserve 384M

did not work for me, even though the shared libraries have been mapped in the same address range as before.

comment:5 Changed 11 years ago by gb

That suggests that FreeBSD running under the version of VMWare that you're using just refuses to let a process map regions of its address space that it is able to map when running under other versions of VMWare (and on real hardware: I have x86 FreeBSD installed on a real machine here, and the lisp starts up without incident.)

r11468 purports to fix ticket:331, which might work around this: if an image can't be loaded at the address from which it was saved, we can (expensively) walk the image and adjust every pointer that we find.

When you can, please update to r11468 or later, rebuild the 32-bit FreeBSD lisp kernel via:

shell> cd ccl/lisp-kernel/freebsdx8632
shell> make

and see if the lisp starts up. (It's hard to say whether there'd be a visible startup delay while the image relocates itself; it's fairly easy to say that there'd be nothing that we could do about that, if so.)

comment:6 Changed 11 years ago by hans

  • Resolution set to fixed
  • Status changed from assigned to closed

Awesome, Gary! The fix works fine, and the startup delay is not worth mentioning (for me, that is).

dentinox 13_> time ./fx86cl -n -e '(ccl::quit)'
0.000u 0.371s 0:00.37 100.0%    178+211k 0+0io 0pf+0w