Preemption Failures on Linux

davidtaille

Joined: 7 Oct 07
Posts: 2
Credit: 1,470,348
RAC: 0
Message 47887 - Posted: 20 Oct 2007, 22:18:30 UTC - in response to Message 47841.  

Following up on my experience with stalls, here is some more data.

I set up another machine: it's a Dell D610 laptop with Ubuntu 7.04. The BOINC messages say:
Starting BOINC client version 5.4.11 for i686-pc-linux-gnu
libcurl/7.16.0 OpenSSL/0.9.8d zlib/1.2.3
...
Processor: 1 GenuineIntel Intel(R) Pentium(R) M processor 2.00GHz
Memory: 1.98 GB physical, 2.50 GB virtual
Disk: 8.86 GB total, 1.43 GB free
The machine is not overclocked.
This machine was "boinc'ed" this morning (default prefs), attached only to Rosetta.
I let it work on 6 Rosetta WUs: 4 completed successfully, 2 died with computation errors; all 6 WUs ended one way or another and were reported. That is: no stalls.
Then I attached the machine to SETI and let BOINC switch between SETI and Rosetta every hour. 1 SETI WU completed, 1 Rosetta WU completed, and then the second Rosetta WU stalled after a 2-hour run.

Just to make sure the first machine I mentioned two posts above was not having hardware problems, I paused LHC and SETI to let Rosetta run uninterrupted for some time. It successfully completed a Rosetta WU and seems to be happily crunching another. This has only occurred once before since Oct 7...
(By the way, this computer underwent a hardware test on Sept 30.)

So, it seems that whether on Pentium M or VIA Esther, Ubuntu 7.04 or CentOS 5:
preempting => stalls.
not preempting => no stalls.
Until you fix the bug, I think I'll turn off project switching and let each WU run to its end without being paused.

David
Harrison Neal

Joined: 30 Jul 07
Posts: 8
Credit: 133,501
RAC: 0
Message 48044 - Posted: 26 Oct 2007, 19:18:36 UTC
Last modified: 26 Oct 2007, 19:21:41 UTC

Since I goofed and was apparently supposed to post here instead of starting another thread ( https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3684 ), here's some information.

Since following the advice in this thread and turning on the option to leave applications in memory, I haven't seen any problems. However, what is strange is that, in contrast to DJStarfox's post, Xubuntu 7.04 with BOINC 5.4 (<5.8.16) worked without any problems, but Xubuntu 7.10 with BOINC 5.10 (>5.8.16) seemed to cause the problems.

I've updated these computers on a daily basis thus far (on the days when updates were available), so I'd probably expect that they were fully updated or updated as of the previous day at any one time.

In the thread I started, I showed which computers seemed to run fine before upgrading to Xubuntu 7.10 along with BOINC 5.10, and I showed which tasks had errors besides exit code 193.

Tasks that began with "mcr1__BOINC_RG_FULLWEIGHT_SYMM_FOLD_AND_DOCK_RELAX-mcr1_-mfr__2128" suffered from exit code 139 on two different computers, and both produced a stack trace. The first task mentioned the errors "ERROR:: Exit from: fragments.cc line: 465 FILE_LOCK::unlock(): close failed.: Bad file descriptor" and "*** glibc detected *** corrupted double-linked list: 0x09821b78 ***". These tasks are:

https://boinc.bakerlab.org/rosetta/result.php?resultid=114287567

https://boinc.bakerlab.org/rosetta/result.php?resultid=113947212

Tasks that contained "LARS_ABRELAX_40_GOODRMS_FITTED_SAVE_ALL_OUT" suffered from exit code 1 on two different computers. Both mentioned the error "ERROR:: Exit from: fragments.cc line: 465 FILE_LOCK::unlock(): close failed.: Bad file descriptor". These tasks are:

https://boinc.bakerlab.org/rosetta/result.php?resultid=115098768

https://boinc.bakerlab.org/rosetta/result.php?resultid=115032344

Tasks that contained "LARS_ABRELAX_40_NATIVE_FITTED_SAVE_ALL_OUT" suffered from exit code 1 on three different computers, and both produced a stack trace. All of them mentioned the error "ERROR:: Exit from: fragments.cc line: 465 FILE_LOCK::unlock(): close failed.: Bad file descriptor", and the second task mentioned the error "*** glibc detected *** corrupted double-linked list: 0x092845c0 ***". These tasks are:

https://boinc.bakerlab.org/rosetta/result.php?resultid=115061214

https://boinc.bakerlab.org/rosetta/result.php?resultid=114999850

https://boinc.bakerlab.org/rosetta/result.php?resultid=115068071

The following task was stuck at 100% for several days. Its status was "Waiting to run", but BOINC seemed to refuse to run it. After I aborted the task, BOINC briefly reported it as having worked over 50 CPU hours (before I clicked Abort, it reported ~3 hours).

https://boinc.bakerlab.org/rosetta/result.php?resultid=114100598

And, finally, the following tasks encountered exit code 193. Their names contained either "LARS_ABRELAX_40_GOODRMS_FITTED_SAVE_ALL_OUT", "LARS_ABRELAX_40_NATIVE_FITTED_SAVE_ALL_OUT", "CNTRL_01ABRELAX_SAVE_ALL_OUT" or "BOINC_SYMM_FOLD_AND_DOCK_RELAX", and they all contained a stack trace. The fifth, sixth and ninth tasks mentioned the error "ERROR:: Exit from: fragments.cc line: 465 FILE_LOCK::unlock(): close failed.: Bad file descriptor", and the eighth task mentioned the error "*** glibc detected *** corrupted double-linked list: 0x097a1070 ***". These tasks are:

https://boinc.bakerlab.org/rosetta/result.php?resultid=115146613

https://boinc.bakerlab.org/rosetta/result.php?resultid=115144619

https://boinc.bakerlab.org/rosetta/result.php?resultid=115140556

https://boinc.bakerlab.org/rosetta/result.php?resultid=115136485

https://boinc.bakerlab.org/rosetta/result.php?resultid=115123778

https://boinc.bakerlab.org/rosetta/result.php?resultid=115121065

https://boinc.bakerlab.org/rosetta/result.php?resultid=115110213

https://boinc.bakerlab.org/rosetta/result.php?resultid=115099640

https://boinc.bakerlab.org/rosetta/result.php?resultid=115075271

https://boinc.bakerlab.org/rosetta/result.php?resultid=115023948

https://boinc.bakerlab.org/rosetta/result.php?resultid=114208827

https://boinc.bakerlab.org/rosetta/result.php?resultid=114080616

https://boinc.bakerlab.org/rosetta/result.php?resultid=114065860

https://boinc.bakerlab.org/rosetta/result.php?resultid=113973418

https://boinc.bakerlab.org/rosetta/result.php?resultid=113958341

https://boinc.bakerlab.org/rosetta/result.php?resultid=113910476

https://boinc.bakerlab.org/rosetta/result.php?resultid=113901480
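
For anyone trying to read the exit codes listed above (139, 1 and 193), here is a rough sketch of how they might be interpreted. It assumes the reported number follows the common Unix shell convention of 128 + signal number for a process killed by a signal; nothing in this thread confirms that BOINC reports codes that way, and the function name is made up for illustration.

```python
import signal


def describe_exit_code(code):
    """Give a rough reading of an exit code.

    Assumption: codes above 128 follow the shell convention 128 + N,
    where N is the number of the signal that killed the process.
    """
    if code == 0:
        return "success"
    if code > 128:
        try:
            # signal.Signals raises ValueError for numbers that are not
            # known signals on this platform.
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            return "application-defined code (not 128 + a known signal)"
    return "application-defined error"


for code in (139, 1, 193):
    print(code, "->", describe_exit_code(code))
```

Under that assumption, 139 would correspond to signal 11 (a segmentation fault), while 1 and 193 do not map to a signal and would have to be looked up in the application's or client's own documentation.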

In case you're wondering what pathetic excuses for hardware these tasks are running on, here they are:

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=625082
Pentium III 450MHz, 512MB RAM

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=625047
Pentium II 350MHz, 512MB RAM

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=617847
AMD K6 350MHz, 192MB RAM

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=622444
Transmeta Crusoe TM5800 1GHz, 768MB RAM

They've all had badblocks run in write mode without any problems, and MemTest86+ comes up clean as well. They are allowed to use all of the physical memory and swap space (1GB), most services that I don't use and would consume excess memory have been disabled (GDM, WPA, etc.), and they were all using less than 150MB RAM and no swap space without BOINC running. They have NOT been overclocked (quite frankly, it's amazing half of these things will even turn on...).

Once again, they worked under BOINC 5.4 with applications not left in memory, and under BOINC 5.10 with applications left in memory. All the above failed tasks occurred under BOINC 5.10 with applications not left in memory.

-Harrison N.
Harrison Neal

Joined: 30 Jul 07
Posts: 8
Credit: 133,501
RAC: 0
Message 48063 - Posted: 27 Oct 2007, 18:12:25 UTC

I apologize for the previous overkill post, but here's something else -

The AMD K6 350MHz computer with 192MB RAM has stalled on a Rosetta@Home task both with and without the "Leave Applications in Memory" setting enabled. It gets stuck at 100% and says it is waiting to run, but will neither run nor send the results. Both tasks start with "mcr1__BOINC_RG_FULLWEIGHT_SYMM_FOLD_AND_DOCK_RELAX-mcr1_-mfr__2128_", if that helps. Once again, I believe this has never happened with Xubuntu 7.04 and BOINC 5.4 on this computer; it has only happened with Xubuntu 7.10 and BOINC 5.10.

Computer: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=617847
Task with Leave Applications in Memory disabled: https://boinc.bakerlab.org/rosetta/result.php?resultid=114100598
Task with Leave Applications in Memory enabled: https://boinc.bakerlab.org/rosetta/result.php?resultid=115401178
j2satx

Joined: 17 Sep 05
Posts: 97
Credit: 3,670,592
RAC: 0
Message 48068 - Posted: 27 Oct 2007, 22:50:34 UTC - in response to Message 48063.  

The minimum RAM required for Rosetta is 256 MB. Even if you've run with less before, you should add RAM.

TCU Computer Science

Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 48184 - Posted: 31 Oct 2007, 16:25:22 UTC

Here is a RALPH WU that has been stuck for 8 hours:

<active_task>
<project_master_url>http://ralph.bakerlab.org/</project_master_url>
<result_name>2dlb__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2dlb_-native__2480_49_1</result_name>
<app_version_num>581</app_version_num>
<slot>1</slot>
<checkpoint_cpu_time>14379.437992</checkpoint_cpu_time>
<fraction_done>1.000000</fraction_done>
<current_cpu_time>14421.431608</current_cpu_time>
<swap_size>410079232.000000</swap_size>
<working_set_size>231837696.000000</working_set_size>
<working_set_size_smoothed>231837696.000000</working_set_size_smoothed>
<page_fault_rate>0.000000</page_fault_rate>
</active_task>


Intel P4 3.0 GHz HT
RAM: 512 MB
CentOS 4.5 (Linux 2.6.9-55.0.9.ELsmp)
BOINC 5.10.21
preferences are the defaults
running Rosetta & RALPH
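
A stuck WU like this one can be spotted by looking for active tasks that already report a fraction_done of 1.0. The snippet below is only a minimal sketch: it assumes the `<active_task>` fragments have been copied out of the client's state as plain XML text (as in the excerpt above), and the function name and the abridged sample are illustrative, not part of BOINC.

```python
import xml.etree.ElementTree as ET


def find_stuck_tasks(xml_text):
    """Return result names of active tasks that report 100% done.

    Assumption: xml_text holds one or more bare <active_task> elements
    without an XML declaration, so it can be wrapped in a dummy root.
    """
    root = ET.fromstring("<root>" + xml_text + "</root>")
    stuck = []
    for task in root.iter("active_task"):
        fraction = float(task.findtext("fraction_done", default="0"))
        if fraction >= 1.0:
            stuck.append(task.findtext("result_name"))
    return stuck


# Abridged copy of the fragment quoted in this post.
SAMPLE = """
<active_task>
<result_name>2dlb__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2dlb_-native__2480_49_1</result_name>
<checkpoint_cpu_time>14379.437992</checkpoint_cpu_time>
<fraction_done>1.000000</fraction_done>
<current_cpu_time>14421.431608</current_cpu_time>
</active_task>
"""

print(find_stuck_tasks(SAMPLE))
```

A task that has only just reached 100% would be flagged as well, so in practice one would compare two snapshots taken some time apart and check whether current_cpu_time has moved.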
dasy2k1

Joined: 18 Feb 07
Posts: 2
Credit: 747,661
RAC: 0
Message 48443 - Posted: 7 Nov 2007, 17:05:39 UTC

I have been having similar problems with Rosetta on Kubuntu 7.04 with BOINC 5.4.11.

If a Rosetta WU is preempted, it hangs and won't die gracefully. Rather, it still reports as running, but uses no CPU and fills the slot in BOINC.

I have updated my prefs to switch every 2 hours now, with a 1-hour target time, hoping that Rosetta will finish all WUs before switching...
I can't disable switching though, as I run CPDN and those goliath WUs would get in the way of everything else.
dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,677,569
RAC: 10,479
Message 48446 - Posted: 7 Nov 2007, 18:54:15 UTC

I'm a Linux newb but have a hunch that my errors (which I believe follow the trend of this thread) are due to Rosetta unnecessarily trying to access OpenGL files. I know the Windows version gives a computation error if glu32.dll and opengl32.dll aren't in the system32 folder. Can anyone check what files Rosetta is trying to access?
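
One way to check this on Linux is to read the /proc entries of the running process. The following is a rough sketch rather than a tested diagnostic: it assumes a Linux /proc filesystem and permission to read the target process's entries, the function names are made up for illustration, and the library-name pattern for OpenGL is my own guess at what to look for.

```python
import os
import re


def files_in_use(pid):
    """Return sorted paths a Linux process currently has open or mapped."""
    paths = set()
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to the open file, pipe or socket.
            paths.add(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            pass  # descriptor was closed between listdir and readlink
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            # Fields: address, perms, offset, dev, inode, optional pathname.
            fields = line.split(None, 5)
            if len(fields) == 6 and fields[5].startswith("/"):
                paths.add(fields[5].strip())
    return sorted(paths)


def opengl_related(paths):
    """Keep paths whose file name looks like libGL, libGLU or libglut."""
    pattern = re.compile(r"^lib(gl|glu|glut)\.so", re.IGNORECASE)
    return [p for p in paths if pattern.match(os.path.basename(p))]
```

Calling opengl_related(files_in_use(pid)) with the PID of the Rosetta process would list any such libraries it has loaded. This only shows files that are open or mapped at that moment; it will not show attempts to open files that failed, for which a system-call tracer such as strace would be needed.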