Ticket #868 (closed defect: fixed)

Opened 3 years ago

Last modified 3 years ago

Random whacky behaviour (memory corruption?)

Reported by: dfindlay
Owned by: gb
Priority: normal
Milestone:
Component: Runtime (threads, GC)
Version: 1.6
Keywords:
Cc:

Description

On moving a Lisp application to a newer Linux box, we get random, whacky behaviour. This is demonstrated by the following file:

=== process-test.lisp ===
(in-package :cl-user)

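;; Sum the integers below N, starting from N. For the arguments used in TEST
;; the total outgrows the 32-bit fixnum range, so INCF allocates bignums.
(defun big-sum (n)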
  (let ((total n))
    (dotimes (i n total)
      (incf total i))))

(defun test ()
  (ccl:process-run-function "25" #'big-sum 250000000)
  (ccl:process-run-function "26" #'big-sum 260000000)
  (ccl:process-run-function "27" #'big-sum 270000000)
  (ccl:process-run-function "28" #'big-sum 280000000))

and session transcript:

$ uname -a
Linux startle 2.6.32-30-generic #59-Ubuntu SMP Tue Mar 1 21:30:21 UTC 2011 i686 GNU/Linux
$ ccl
Welcome to Clozure Common Lisp Version 1.6  (LinuxX8632)!
? (load "process-test.lisp")
#P"/home/startle/startle/process-test.lisp"
? (test)
#<PROCESS 28(5) [Reset] #x1828FA76>
? 
> Error: Fault during read of memory address #x634
> While executing: CCL::*-2, in process 26(3).


;;;
;;; #<PROCESS 26(3) [Active] #x182901B6> requires access to Shared Terminal Input
;;; Type (:y 3) to yield control to this thread.
;;;
> Error: value #<BOGUS object @ #x284E4729> is not of the expected type NUMBER.
> While executing: CCL::+-2, in process 27(4).


;;;
;;; #<PROCESS 27(4) [Active] #x1828FE16> requires access to Shared Terminal Input
;;; Type (:y 4) to yield control to this thread.
;;;
> Error: value #<Unprintable CCL::IMMEDIATE : #x34D0C3> is not of the expected type NUMBER.
> While executing: CCL::*-2, in process 28(5).


;;;
;;; #<PROCESS 28(5) [Active] #x1828FA76> requires access to Shared Terminal Input
;;; Type (:y 5) to yield control to this thread.
;;;

The exact error messages vary from run to run. The transcript above is from an x86 Ubuntu 10.04 box; an x86 CentOS 5.6 box behaves similarly. However, x86 CentOS 5.3 (and earlier) does not show this, nor does Darwin/PPC (CCL 1.4).

Change History

comment:1 Changed 3 years ago by gb

  • Owner set to gb
  • Status changed from new to assigned

When you say "x86", I'm not sure whether you mean "32-bit x86", "64-bit x86", or both.

In limited testing, I was able to get the x8632 version of CCL to crash like this on a kernel that identified itself as "2.6.38-8-generic #42-Ubuntu SMP" but haven't been able to get the same lisp to crash on "2.6.34.8-68.fc13.x86_64" and the x8664 CCL seems to run fine on both machines. ("runs fine" means "your test ran to completion the one time that I tried it"; of course that isn't really conclusive.)

Based on this small sample size and one interpretation of your report, I'd say "the bug seems to affect the 32-bit x86 CCL under some Linux kernels and not others, and has not been observed to affect the 64-bit x86 CCL." Does this sound correct?

If that's correct, then it's certainly possible that this is a Linux kernel bug. If that's true, that'd be especially disturbing since it would seemingly affect lots of different kernel versions. Before worrying too much about the implications of that, I'll try harder to get it to fail on the platform where it seemed to work.

If you could clarify the point that I'm confused about, that'd be helpful.

comment:2 Changed 3 years ago by dfindlay

I was in fact running the 32-bit x86 version. Sorry I didn't make that clear initially.

comment:3 Changed 3 years ago by gb

I've been able to run your test case more than 50 times in a row on the machine where it didn't crash, and haven't seen any problem.

On the machine where I could reproduce the problem, I've been able to run a (vaguely) similar test case a few times so far without seeing a problem yet:

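;; Recurse until the counter in (CAR C) reaches zero, allocating a fresh
;; cons cell on every call, so the thread spends its time consing.
(defun test2-fn (c)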
  (let* ((n (car c)))
    (unless (zerop n)
      (test2-fn (cons (1- n) nil)))))

(defun test2 ()
  (process-run-function "a" #'test2-fn (cons most-positive-fixnum nil))
  (process-run-function "b" #'test2-fn (cons most-positive-fixnum nil))
  (process-run-function "c" #'test2-fn (cons most-positive-fixnum nil))
  (process-run-function "d" #'test2-fn (cons most-positive-fixnum nil)))

; (test2)

[This may not be interesting, but to think aloud:]

One difference between these cases is that in the first, threads spend much of their time allocating small bignums, and in the second they spend much of their time allocating cons cells. When one thread triggers the GC, it stops the other threads and tries to determine whether each stopped thread was in the middle of a memory allocation operation. The GC wants to treat those operations as if they happen atomically, so that an allocation either appears to have completed or appears not to have started yet; the thread entering the GC manipulates the state of the stopped thread to ensure that this is true.
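
To make the "appears atomic" idea concrete, here is a minimal C sketch of the decision the GC thread has to make for each stopped thread. It is a simplified model, not CCL's actual code: the phase names, the `suspended_thread` struct, and `settle_allocation` are all hypothetical, and the model assumes the allocation pointer moves downward.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical phases of one allocation sequence (not CCL's real layout). */
typedef enum {
    ALLOC_NOT_STARTED,    /* nothing reserved yet */
    ALLOC_POINTER_BUMPED, /* space reserved, limit check not yet done */
    ALLOC_CHECK_PASSED,   /* limit check passed, object not yet initialized */
    ALLOC_DONE            /* object is complete and visible */
} alloc_phase;

/* Hypothetical snapshot of a suspended thread's allocation state. */
typedef struct {
    alloc_phase phase;
    uintptr_t   alloc_ptr;     /* allocation pointer; this model allocates downward */
    size_t      pending_bytes; /* bytes reserved by the in-flight allocation */
} suspended_thread;

/* Make the in-flight allocation look atomic to the collector.
   Returns 1 if the object exists afterwards, 0 if the allocation was
   undone or had never started. */
static int settle_allocation(suspended_thread *t)
{
    assert(t != NULL);
    switch (t->phase) {
    case ALLOC_POINTER_BUMPED:
        /* Too early to finish: give the reserved bytes back. */
        t->alloc_ptr += t->pending_bytes;
        t->pending_bytes = 0;
        t->phase = ALLOC_NOT_STARTED;
        return 0;
    case ALLOC_CHECK_PASSED:
        /* Late enough to finish: a real runtime would complete the
           remaining initialization here on the thread's behalf. */
        t->pending_bytes = 0;
        t->phase = ALLOC_DONE;
        return 1;
    case ALLOC_DONE:
        return 1;
    case ALLOC_NOT_STARTED:
    default:
        return 0;
    }
}
```

With a thread stopped in `ALLOC_POINTER_BUMPED`, the call gives the reserved bytes back and reports that no object exists; with one stopped in `ALLOC_CHECK_PASSED`, it marks the allocation complete.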

If any object other than a CONS is being allocated, part of the sequence of instructions involved in finishing that allocation involves storing a header word (describing the object's type and size) in the first word of the newly-allocated object. On x8632, we expect an MMX register to contain that header word, and the GC thread emulates the instruction that would store that MMX register in the object by accessing and manipulating the state of the suspended thread.

The memory corruption that occurs is consistent with what would happen if the newly-allocated object had an invalid header, so one theory is that some Linux kernels don't represent the MMX state of the suspended thread correctly (or don't put that information where CCL expects to find it).

On machines that aren't as register-starved as x8632, we can keep the header word in a general-purpose register and don't have to worry about where the MMX register is in the suspended thread's context; if the suspended thread is trying to create a CONS, there's no header involved.

That all seems to make this a good theory, but I don't yet know if it's a correct theory or not.

comment:4 Changed 3 years ago by gb

  • Status changed from assigned to closed
  • Resolution set to fixed

(In [14825]) If pc_luser_xp() finds that the target thread was interrupted while consing at the branch around the alloc_trap, the branch would have been taken, and a uvector was being allocated, we need to (as the comment says) "slap the header on the new uvector". On x8632, the header's in mm0 (an MMX register), not imm0.
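
As an illustration of why the register choice matters, the following C sketch models the header store with a made-up saved context. The `saved_context` struct and both functions are hypothetical stand-ins, not the code in pc_luser_xp(). The sketch assumes the pending header sits in the low 32 bits of the MMX register while imm0 holds unrelated data, and the values used with it are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical saved register state of a suspended 32-bit thread. */
typedef struct {
    uint32_t imm0; /* general-purpose register; may hold unrelated data */
    uint64_t mm0;  /* MMX register; low 32 bits assumed to hold the header */
} saved_context;

/* Models the broken behaviour: the header is taken from the
   general-purpose register, so the object gets whatever was there. */
static void store_header_from_gpr(uint32_t *object, const saved_context *ctx)
{
    assert(object != NULL && ctx != NULL);
    object[0] = ctx->imm0;
}

/* Models the corrected behaviour: the header is taken from the low
   32 bits of the MMX register, where the interrupted code left it. */
static void store_header_from_mmx(uint32_t *object, const saved_context *ctx)
{
    assert(object != NULL && ctx != NULL);
    object[0] = (uint32_t)(ctx->mm0 & 0xFFFFFFFFu);
}
```

If the context holds, say, 0xDEADBEEF in `imm0` and 0x207 in `mm0`, the first function leaves 0xDEADBEEF in the object's first word, which anything that later walks the heap would misread as a type and size. The second leaves the intended 0x207.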

(This has been broken forever; apparently, it's very hard to reproduce.)

Fixes ticket:868.

comment:5 Changed 3 years ago by gb

(In [14826]) propagate r14825 to 1.6 branch; fixes ticket:868 in 1.6

comment:6 (follow-up: comment:7) Changed 3 years ago by gb

FWIW, this was a CCL bug, not a Linux kernel bug.

It did indeed have to do with the code that tries to finish interrupted memory allocation in a suspended thread. The case where a thread was interrupted while at a certain point in the allocation sequence was never handled correctly on x8632.

OS version, hardware differences, and other factors can all influence the precise point at which a suspended thread is ... suspended, but these are ultimately just scheduling artifacts. The fact that the test seemed to work reliably on some OS/hardware combinations wasn't really conclusive.

(The case that failed involved the thread being suspended just before taking a conditional branch; the behavior of the branch prediction hardware could conceivably make that case impossible on some platforms and fairly common on others. I saw this fail fairly quickly on an Intel Atom processor and never saw it fail on a Core 2 Duo machine.)

In any case, there's no reason to believe that a Linux kernel bug was involved in this; the bug was ultimately in CCL, and had been there forever.

comment:7 (in reply to: comment:6) Changed 3 years ago by rme

Replying to gb:

> (The case that failed involved the thread being suspended just before taking a conditional branch; the behavior of the branch prediction hardware could conceivably make that case impossible on some platforms and fairly common on others. I saw this fail fairly quickly on an Intel Atom processor and never saw it fail on a Core 2 Duo machine.)

Some x86 hardware uses a technique called macro-op fusion, which turns certain compare-and-branch instruction pairs into a single internal micro-op. The Core 2 Duo does this, so I guess that's why we never saw this bug there: as far as that processor is concerned, the compare-and-branch is a single instruction.
