Opened 5 years ago

Last modified 11 months ago

#1257 new defect

Bogus-object on multi-core ARM

Reported by: wws Owned by: rme
Priority: major Milestone:
Component: Runtime (threads, GC) Version: trunk
Keywords: Cc: gb, gz, rme

Description

In my application, we're sending serialized lisp objects over TCP. I occasionally see bogus-objects or other unexpected errors when running it on a multi-core ARM processor. It's very stable on my iMac or an x8664 machine running Linux.

The attached file reproduces the error occasionally.

"Version 1.11-dev-r16335M-trunk (LinuxARM32)"

Attachments (3)

read-serialized-string-bug.lisp (3.6 KB) - added by wws 5 years ago.
File to reproduce Ticket #1257
read-serialized-string-bug.2.lisp (4.0 KB) - added by wws 5 years ago.
Adding (process-allow-schedule) to advice on ccl::%read-list-expression greatly increases the frequency of errors
read-serialized-string-bug.3.lisp (3.7 KB) - added by wws 5 years ago.
New attachment identifies the bug as a garbage collection problem

Download all attachments as: .zip

Change History (21)

Changed 5 years ago by wws

File to reproduce Ticket #1257

comment:1 Changed 5 years ago by wws

I'm running this at home on a two-core Wandboard. The bug doesn't happen on a single-core Raspberry Pi.

$ uname -a Linux arm 3.11.0-rc7-armv7-x12.6 #2 SMP Sat Aug 31 13:49:28 UTC 2013 armv7l GNU/Linux $ cat /etc/issue Debian GNU/Linux 7 \n \l

Changed 5 years ago by wws

Adding (process-allow-schedule) to advice on ccl::%read-list-expression greatly increases the frequency of errors

comment:2 Changed 5 years ago by joswig

I see this also on my quadcore Odroid with an ARMv7 architecture. Takes a minute or so to trigger the error.

Changed 5 years ago by wws

New attachment identifies the bug as a garbage collection problem

comment:3 Changed 5 years ago by wws

read-serialized-string-bug.3.lisp identifies the bug as a failure of the garbage collector to relocate a local copy of the string-input-stream.

comment:4 Changed 5 years ago by wws

I tried reproducing the bug with EGC disabled, (egc nil). It takes longer to happen, especially with only 3 threads, but it still happens. The distance between the local STREAM and global *STREAM* is bigger when the bug happens (full GC recovers more garbage). It also seems more likely to get a BOGUS-OBJECT as the argument to READ-LIST-EXPRESSION-ADVICE, and is more likely to crash into the kernel debugger.

comment:5 Changed 5 years ago by wws

I got an ODROID C1. Running their version of Ubuntu 14.04. The bug happens there, very quickly if :thread-count is >= 3. It also happens if I run it with "taskset 1 ccl", meaning that the lisp has access to only one core. It takes quite a bit longer to happen in that environment, but it does.

comment:6 Changed 5 years ago by gb

  • Resolution set to fixed
  • Status changed from new to closed

(In [16341]) pc_luser_xp(): if we interrupt a conditional store, detect its success/failure correctly.

_SPeabi_ff_call_simple: store the vsp in tcr.save_vsp when exiting lisp. This seems to fix ticket:1257 in the trunk.

comment:7 Changed 5 years ago by svspire

I'm still seeing this -- albeit more rarely now -- on a Phytec SOM.

comment:8 Changed 5 years ago by svspire

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:9 Changed 5 years ago by svspire

  • Priority changed from blocker to major

comment:10 Changed 5 years ago by alms

This bug may still exist when running on the Freescale. Client hasn't seen it in a couple of months, but did see it when running WWS's code after the latest fixes were checked in.

comment:11 Changed 5 years ago by gb

Whatever "the freescale" is, I don't have one anymore. When I can. I'll look at this on some machine that I do have.

comment:12 Changed 4 years ago by rme

  • Owner set to rme
  • Status changed from reopened to new

comment:13 Changed 4 years ago by wws

I reproduced this today on my odroid, using the latest trunk CCL version

odroid@odroid:~/lisp$ ccl
Welcome to Clozure Common Lisp Version 1.11-dev-r16477M-trunk  (LinuxARM32)!
? (load "read-serialized-string-bug.3.lisp")
#P"/home/odroid/lisp/read-serialized-string-bug.3.lisp"
? (read-from-serialized-string)
> Error: Too few arguments in call to #<Compiled-function READ-INSTANCE #x144B6326>:
>        0 arguments provided, at least 1 required.
> While executing: READ-INSTANCE, in process Reader 0: 505(2).
;;;
;;; #<PROCESS Reader 0: 505(2) [Active] #x144BBB56> requires access to Shared Terminal Input
;;; Type (:y 2) to yield control to this thread.
;;;
  C-c C-c> Break: interrupt signal
> While executing: CCL::CHECK-SEQUENCE-BOUNDS, in process listener(1).
> Type :GO to continue, :POP to abort, :R for a list of available restarts.
> If continued: Return from BREAK.
> Type :? for other options.
1 > (all-processes)
(#<PROCESS Reader 0: 505(2) [semaphore wait] #x144BBB56> #<TTY-LISTENER listener(1) [Active] #x144AFDAE> #<PROCESS Initial(0) [\
Sleep] #x14161A46>)
1 > (:y 2)

;;;
;;; Shared Terminal Input is now owned by #<PROCESS Reader 0: 505(2) [Active] #x144BBB56>
;;;

> Type :POP to abort, :R for a list of available restarts.
> Type :? for other options.
1 > :b
*(B67E2D50) : 0 (READ-INSTANCE) 8
 (B67E2D50) : 1 (CALL-CHECK-REGS READ-INSTANCE) 112
 (B67E2D80) : 2 (CHEAP-EVAL (#<BOGUS object @ #xB4FE2D3D> #\))) 64
 (B67E2DA0) : 3 (READ-DISPATCH #<STRING-INPUT-STREAM  #x14539F66> #\#) 1008
 (B67E2E08) : 4 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\# NIL) 452
 (B67E2E28) : 5 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 192
 (B67E2E48) : 6 (READ-LIST #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 580
 (B67E2E88) : 7 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\( NIL) 352
 (B67E2EA8) : 8 (%READ-FORM #<STRING-INPUT-STREAM  #x14539F66> 0 NIL) 452
 (B67E2EC8) : 9 (FUNCALL #'#<CCL::|'-reader|> #<STRING-INPUT-STREAM  #x14539F66> #\') 64
 (B67E2EE8) : 10 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\' (NIL)) 352
 (B67E2F08) : 11 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> (NIL) #\)) 192
 (B67E2F28) : 12 (READ-LIST #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 444
 (B67E2F68) : 13 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\( NIL) 352
 (B67E2F88) : 14 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 192
  (B67E2F88) : 14 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 192
 (B67E2F98) : 15 (READ-EVAL #<STRING-INPUT-STREAM  #x14539F66> #\. NIL) 96
 (B67E2FB8) : 16 (READ-DISPATCH #<STRING-INPUT-STREAM  #x14539F66> #\#) 1008
 (B67E3020) : 17 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\# NIL) 452
 (B67E3040) : 18 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 192
 (B67E3060) : 19 (READ-LIST #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 580
 (B67E30A0) : 20 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\( NIL) 352
 (B67E30C0) : 21 (%READ-FORM #<STRING-INPUT-STREAM  #x14539F66> 0 NIL) 452
 (B67E30E0) : 22 (FUNCALL #'#<CCL::|'-reader|> #<STRING-INPUT-STREAM  #x14539F66> #\') 64
 (B67E3100) : 23 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\' (NIL)) 352
 (B67E3120) : 24 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> (NIL) #\)) 192
 (B67E3140) : 25 (READ-LIST #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 444
 (B67E3180) : 26 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\( NIL) 352
 (B67E31A0) : 27 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 192
 (B67E31B0) : 28 (READ-EVAL #<STRING-INPUT-STREAM  #x14539F66> #\. NIL) 96
 (B67E31D0) : 29 (READ-DISPATCH #<STRING-INPUT-STREAM  #x14539F66> #\#) 1008
 (B67E3238) : 30 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\# NIL) 452
 (B67E3258) : 31 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 192
 (B67E3278) : 32 (READ-LIST #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 580
 (B67E32B8) : 33 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\( NIL) 352
 (B67E32D8) : 34 (%READ-FORM #<STRING-INPUT-STREAM  #x14539F66> 0 NIL) 452
 (B67E32F8) : 35 (FUNCALL #'#<CCL::|'-reader|> #<STRING-INPUT-STREAM  #x14539F66> #\') 64
 (B67E3318) : 36 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\' (NIL)) 352
 (B67E3338) : 37 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> (NIL) #\)) 192
 (B67E3358) : 38 (READ-LIST #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 444
 (B67E3398) : 39 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\( NIL) 352
 (B67E33B8) : 40 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 192
 (B67E33C8) : 41 (READ-EVAL #<STRING-INPUT-STREAM  #x14539F66> #\. NIL) 96
 (B67E33E8) : 42 (READ-DISPATCH #<STRING-INPUT-STREAM  #x14539F66> #\#) 1008
 (B67E3450) : 43 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\# NIL) 452
 (B67E3470) : 44 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 192
 (B67E3490) : 45 (READ-LIST #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 580
 (B67E34D0) : 46 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\( NIL) 352
 (B67E34F0) : 47 (%READ-FORM #<STRING-INPUT-STREAM  #x14539F66> 0 NIL) 452
 (B67E3510) : 48 (FUNCALL #'#<CCL::|'-reader|> #<STRING-INPUT-STREAM  #x14539F66> #\') 64
 (B67E3530) : 49 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\' (NIL)) 352
 (B67E3550) : 50 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> (NIL) #\)) 192
 (B67E3570) : 51 (READ-LIST #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 444
 (B67E35B0) : 52 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\( NIL) 352
 (B67E35D0) : 53 (FUNCALL #'#<CCL::%READ-LIST-EXPRESSION> #<STRING-INPUT-STREAM  #x14539F66> NIL #\)) 192
 (B67E35E0) : 54 (READ-EVAL #<STRING-INPUT-STREAM  #x14539F66> #\. NIL) 96
 (B67E3600) : 55 (READ-DISPATCH #<STRING-INPUT-STREAM  #x14539F66> #\#) 1008
 (B67E3668) : 56 (%PARSE-EXPRESSION #<STRING-INPUT-STREAM  #x14539F66> #\# NIL) 452
 (B67E3688) : 57 (%READ-FORM #<STRING-INPUT-STREAM  #x14539F66> 0 NIL) 452
 (B67E36A8) : 58 (READ-INTERNAL #<STRING-INPUT-STREAM  #x14539F66> T NIL NIL) 204
 (B67E36B8) : 59 (READ #<STRING-INPUT-STREAM  #x14539F66> T NIL NIL) 124
 (B67E36C8) : 60 (FUNCALL #'#<(:INTERNAL READ-FROM-SERIALIZED-STRING)> 0) 392
 (B67E3748) : 61 (RUN-PROCESS-INITIAL-FORM #<PROCESS Reader 0: 505(2) [Active] #x144BBB56> (#<COMPILED-LEXICAL-CLOSURE # #x144BB6F6>)) 412
 (B67E3860) : 62 (FUNCALL #'#<(:INTERNAL (CCL::%PROCESS-PRESET-INTERNAL (PROCESS)))> #<PROCESS Reader 0: 505(2) [Active] #x144B\
BB56> (#<COMPILED-LEXICAL-CLOSURE # #x144BB6F6>)) 372
 (B67E3B48) : 63 (FUNCALL #'#<(:INTERNAL CCL::THREAD-MAKE-STARTUP-FUNCTION)>) 216

comment:14 Changed 4 years ago by rme

This is quite hard for me to reproduce on the dual-core Samsung Chromebook that I have. I've so far succeeded only once today, so the bug is there, but coming up with a more easily repeatable test case will probably be key to making progress.

comment:15 Changed 4 years ago by gb

I checked to see if the bug is still present with (EGC NIL) in the current trunk,

 (dotimes (i 1000) (print i) (read-from-serialized-string)) 

got a bogus object after 11 iterations, so yes. it is still present.

comment:16 Changed 4 years ago by gb

One failure that I sometimes see is:

> Error: Too few arguments in call to #<Compiled-function READ-INSTANCE #x144FC73E>:
>        0 arguments provided, at least 1 required. 

pc)_luser_xp() rarely modifies the imm registers in a sigcontext, but something somewhere seems to clobber nargs (imm2). Sometimes.

comment:17 Changed 21 months ago by rme

Not surprisingly, this bug is still present using a current GitHub? ccl (tag v1.12-dev.3).

This is on a Raspberry Pi 3.

Clozure Common Lisp Version 1.12-dev/v1.12-dev.3-2-g73a0450  (LinuxARM32)
? (cl-user::read-from-serialized-string :thread-count 4)
> Error: Too few arguments in call to #<Compiled-function COMMON-LISP-USER::READ-INSTANCE #x144E7A96>:
>        0 arguments provided, at least 1 required. 
> While executing: COMMON-LISP-USER::READ-INSTANCE, in process Reader 0: 415(6).


;;;
;;; #<PROCESS Reader 0: 415(6) [Active] #x14536B2E> requires access to Shared Terminal Input
;;; Type (:y 6) to yield control to this thread.
;;;
Unhandled exception 4 at 0x1984c, context->regs at #x75fe2a60
? for help
[2978] Clozure CL kernel debugger: 

comment:18 Changed 11 months ago by rme

[from openml-devel]

I found that

  • a simple thread test with a bit of garbage generation in threads fails with Unhandled Exception 4 on armcl 1.12-dev (v1.12-dev.4-3-gdd5622e9), but does not fail in release version v1.11.5
  • If I put a lock around the garbage generation (just a make-string), to exercise locking in threads, in v1.11.5 it fails with “Current process #<PROCESS Reader 4(26) [Active] #x150D335E> does not own lock #<RECURSIVE-LOCK "glock" [ptr @ #x76105C60] #x150BB0D6> in CCL::%UNLOCK-RECURSIVE-LOCK-OBJECT"

Is it possible that imperfect locking (race condition? non atomicity?) in the armcl implementation causes bug 1257, mimicking a GC bug?

Regards, -John

;; (THREADTEST ..) launches some threads that generate garbage.
;; It crashes on latest rpi deb build  with:
;;   > Unhandled exception 4 at 0x1512acec, context->regs at #x74d64490

;; specifically, it
;; crashes on raspberri pi armcl dev version:
;;        * "Version 1.12-dev (v1.12-dev.4-3-gdd5622e9) LinuxARM32"
;; but does NOT crash on release version:
;;        * "Version 1.11.5/v1.11.5  (LinuxARM32)"


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; It also seems to expose a locking problem in Version 1.11.5 when
;; run as (threadtest :loop-count 10000 :exercise-locking) to force
;; allocations inside the threads to take place inside a
;; with-lock-grabbed, the error:
;;
;;  Current process #<PROCESS Reader 4(26) [Active] #x150D335E> does not own
;;  lock #<RECURSIVE-LOCK "glock" [ptr @ #x76105C60] #x150BB0D6>
;;
;; occurs in CCL::%UNLOCK-RECURSIVE-LOCK-OBJECT
;;
;; could this locking problem be the source of the threading bug here?
;; https://trac.clozure.com/ccl/ticket/1257 


(defparameter *the-glock* nil) ;; allow us to look at lock while running

(defun threadtest (&key
		     (thread-count 7)
		     (loop-count 1000)
		     (exercise-locking nil)
		     (garbage-count 1000))
  (let ((done-flags (make-array thread-count :initial-element nil))
	(lock (ccl:make-lock "done-flags-lock"))
	(glock (when exercise-locking
		 (ccl:make-lock "glock"))))
    (setf *the-glock* glock)
    (dotimes (i thread-count)
      (process-run-function
       (format nil "Reader ~d" i)
       (lambda (i)
	 (unwind-protect
	      (dotimes (j loop-count)
		#+nil ;; temporarily remove this additonal complication
		(setf (ccl:process-name ccl:*current-process*)
		      (format nil "Reader ~d: ~d" i j))
		(garbagemaker garbage-count glock)
		))
	 (ccl:with-lock-grabbed (lock) (setf (elt done-flags i) t)))
       i))
    (loop do
         (unless
	     (ccl:with-lock-grabbed (lock)
	       (position nil done-flags))
	   (return)))))


;; make some garbage, optionally grab a lock to exercise
;; locking in threads
(defun garbagemaker (garbage-count lock)
  (loop for i below garbage-count
     collect (if lock
	       (ccl:with-lock-grabbed (lock)
		 (make-string 10))
	       (make-string 10))))
Note: See TracTickets for help on using tickets.