Message boards : Number crunching : Preemption Failures on Linux
Author | Message |
---|---|
davidtaille Send message Joined: 7 Oct 07 Posts: 2 Credit: 1,470,348 RAC: 0 |
Following up on my stalls experience... some more data. I set up another machine: it's a Dell D610 laptop with Ubuntu 7.04. BOINC messages say: Starting BOINC client version 5.4.11 for i686-pc-linux-gnu libcurl/7.16.0 OpenSSL/0.9.8d zlib/1.2.3 ... Processor: 1 GenuineIntel Intel(R) Pentium(R) M processor 2.00GHz Memory: 1.98 GB physical, 2.50 GB virtual Disk: 8.86 GB total, 1.43 GB free The machine is not overclocked. This machine was "boinc'ed" this morning (default prefs), attached only to Rosetta. I let it work on 6 Rosetta WUs: 4 completed successfully, 2 died with computation errors; all 6 WUs ended in some way and were reported. That is: no stalls. Then I attached the machine to Seti and let BOINC switch between Seti and Rosetta every hour. 1 Seti completed, 1 Rosetta completed, and then the second Rosetta WU stalled after a 2-hour run. Just to make sure the 1st machine I mentioned 2 posts above was not having hardware problems, I paused LHC and Seti to let Rosetta run uninterrupted for some time. It successfully completed a Rosetta WU and seems to be happily crunching on another. This has only occurred once before since Oct 7... (by the way, this computer underwent a hardware test on Sept 30) So, it seems that whether Pentium M or VIA Esther, Ubuntu 7.04 or CentOS 5: preempting => stalls. not preempting => no stalls. Until you fix the bug, I think I'll turn off project switching and let each WU run to its end w/o being paused. David |
Harrison Neal Send message Joined: 30 Jul 07 Posts: 8 Credit: 133,501 RAC: 0 |
Since I goofed and was apparently supposed to post here instead of starting another thread ( https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3684 ), here's some information. Since following the advice in this thread and turning on the option to leave applications in memory, I haven't seen any problems. However, what is strange is that, in contrast to DJStarfox's post, Xubuntu 7.04 with BOINC 5.4 (<5.8.16) worked without any problems, but Xubuntu 7.10 with BOINC 5.10 (>5.8.16) seemed to cause the problems. I've updated these computers on a daily basis thus far (on the days when updates were available), so I'd probably expect that they were fully updated or updated as of the previous day at any one time. In the thread I started, I showed which computers seemed to run fine before upgrading to Xubuntu 7.10 along with BOINC 5.10, and I showed which tasks had errors besides exit code 193. Tasks that began with "mcr1__BOINC_RG_FULLWEIGHT_SYMM_FOLD_AND_DOCK_RELAX-mcr1_-mfr__2128" suffered from exit code 139 on two different computers, and both produced a stack trace. The first task mentioned the errors "ERROR:: Exit from: fragments.cc line: 465 FILE_LOCK::unlock(): close failed.: Bad file descriptor" and "*** glibc detected *** corrupted double-linked list: 0x09821b78 ***". These tasks are: https://boinc.bakerlab.org/rosetta/result.php?resultid=114287567 https://boinc.bakerlab.org/rosetta/result.php?resultid=113947212 Tasks that contained "LARS_ABRELAX_40_GOODRMS_FITTED_SAVE_ALL_OUT" suffered from exit code 1 on two different computers. Both mentioned the error "ERROR:: Exit from: fragments.cc line: 465 FILE_LOCK::unlock(): close failed.: Bad file descriptor". These tasks are: https://boinc.bakerlab.org/rosetta/result.php?resultid=115098768 https://boinc.bakerlab.org/rosetta/result.php?resultid=115032344 Tasks that contained "LARS_ABRELAX_40_NATIVE_FITTED_SAVE_ALL_OUT" suffered from exit code 1 on three different computers, and both produced a stack trace. 
All of them mentioned the error "ERROR:: Exit from: fragments.cc line: 465 FILE_LOCK::unlock(): close failed.: Bad file descriptor", and the second task mentioned the error "*** glibc detected *** corrupted double-linked list: 0x092845c0 ***". These tasks are: https://boinc.bakerlab.org/rosetta/result.php?resultid=115061214 https://boinc.bakerlab.org/rosetta/result.php?resultid=114999850 https://boinc.bakerlab.org/rosetta/result.php?resultid=115068071 The following task was stuck at 100% for several days. Its status was "Waiting to run", but BOINC seemed to refuse to run it. After aborting the task, BOINC briefly reported the task as having worked over 50 CPU Hours (before clicking Abort, it reported ~3 hours). https://boinc.bakerlab.org/rosetta/result.php?resultid=114100598 And, finally, the following tasks encountered exit code 193. They contained either "LARS_ABRELAX_40_GOODRMS_FITTED_SAVE_ALL_OUT", "LARS_ABRELAX_40_NATIVE_FITTED_SAVE_ALL_OUT", "CNTRL_01ABRELAX_SAVE_ALL_OUT" or "BOINC_SYMM_FOLD_AND_DOCK_RELAX" in the task name, and they all contained a stack trace. The fifth, sixth and ninth tasks mentioned the error "ERROR:: Exit from: fragments.cc line: 465 FILE_LOCK::unlock(): close failed.: Bad file descriptor", and the eighth task mentioned the error "*** glibc detected *** corrupted double-linked list: 0x097a1070 ***". 
These tasks are: https://boinc.bakerlab.org/rosetta/result.php?resultid=115146613 https://boinc.bakerlab.org/rosetta/result.php?resultid=115144619 https://boinc.bakerlab.org/rosetta/result.php?resultid=115140556 https://boinc.bakerlab.org/rosetta/result.php?resultid=115136485 https://boinc.bakerlab.org/rosetta/result.php?resultid=115123778 https://boinc.bakerlab.org/rosetta/result.php?resultid=115121065 https://boinc.bakerlab.org/rosetta/result.php?resultid=115110213 https://boinc.bakerlab.org/rosetta/result.php?resultid=115099640 https://boinc.bakerlab.org/rosetta/result.php?resultid=115075271 https://boinc.bakerlab.org/rosetta/result.php?resultid=115023948 https://boinc.bakerlab.org/rosetta/result.php?resultid=114208827 https://boinc.bakerlab.org/rosetta/result.php?resultid=114080616 https://boinc.bakerlab.org/rosetta/result.php?resultid=114065860 https://boinc.bakerlab.org/rosetta/result.php?resultid=113973418 https://boinc.bakerlab.org/rosetta/result.php?resultid=113958341 https://boinc.bakerlab.org/rosetta/result.php?resultid=113910476 https://boinc.bakerlab.org/rosetta/result.php?resultid=113901480 In case you're wondering what pathetic excuses for hardware these tasks are running on, here they are: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=625082 Pentium III 450MHz, 512MB RAM https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=625047 Pentium II 350MHz, 512MB RAM https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=617847 AMD K6 350MHz, 192MB RAM https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=622444 Transmeta Crusoe TM5800 1GHz, 768MB RAM They've all had badblocks run in write mode without any problems, and MemTest86+ comes up clean as well. They are allowed to use all of the physical memory and swap space (1GB), most services that I don't use and would consume excess memory have been disabled (GDM, WPA, etc.), and they were all using less than 150MB RAM and no swap space without BOINC running. 
They have NOT been overclocked (quite frankly, it's amazing half of these things will even turn on...). Once again, they worked under BOINC 5.4 with applications not left in memory, and BOINC 5.10 with applications left in memory. All the above failed tasks occurred under BOINC 5.10 with applications not left in memory. -Harrison N. |
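For anyone trying to read the exit codes quoted in the post above, here is a minimal Python sketch. It assumes the common Unix shell convention that an exit status above 128 means "killed by signal (status - 128)". Whether BOINC reports every code this way is an assumption, and the helper name `signal_from_exit_code` is made up for illustration.

```python
import signal


def signal_from_exit_code(code):
    """Return the signal name if `code` fits the 128+N shell convention, else None."""
    if code <= 128:
        return None  # ordinary exit status, not a signal
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform


if __name__ == "__main__":
    # The three exit codes mentioned in the post above.
    for code in (1, 139, 193):
        print(code, signal_from_exit_code(code))
```

Under this convention 139 decodes to SIGSEGV (128 + 11), which would fit the stack traces and the glibc heap-corruption messages. Exit code 1 is a plain error exit, and 193 does not map to a signal this way, so its meaning has to come from BOINC itself and is not decoded here.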
Harrison Neal Send message Joined: 30 Jul 07 Posts: 8 Credit: 133,501 RAC: 0 |
I apologize for the previous overkill post, but here's something else - The AMD K6 350MHz computer with 192MB RAM has stalled on a Rosetta@Home task both with and without the "Leave Applications in Memory" setting enabled. It gets stuck at 100% and says it is waiting to run, but will neither run nor send the results. Both tasks start with "mcr1__BOINC_RG_FULLWEIGHT_SYMM_FOLD_AND_DOCK_RELAX-mcr1_-mfr__2128_", if that helps. Once again, I believe this has never happened with Xubuntu 7.04 and BOINC 5.4 on this computer; it has only happened with Xubuntu 7.10 and BOINC 5.10. Computer: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=617847 Task with Leave Applications in Memory disabled: https://boinc.bakerlab.org/rosetta/result.php?resultid=114100598 Task with Leave Applications in Memory enabled: https://boinc.bakerlab.org/rosetta/result.php?resultid=115401178 |
j2satx Send message Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
"I apologize for the previous overkill post, but here's something else -" The minimum RAM required for Rosetta is 256 MB. Even if you've run with less before, you should add RAM. |
TCU Computer Science Send message Joined: 7 Dec 05 Posts: 28 Credit: 12,861,977 RAC: 0 |
Here is a RALPH WU that has been stuck for 8 hours: <active_task> <project_master_url>http://ralph.bakerlab.org/</project_master_url> <result_name>2dlb__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2dlb_-native__2480_49_1</result_name> <app_version_num>581</app_version_num> <slot>1</slot> <checkpoint_cpu_time>14379.437992</checkpoint_cpu_time> <fraction_done>1.000000</fraction_done> <current_cpu_time>14421.431608</current_cpu_time> <swap_size>410079232.000000</swap_size> <working_set_size>231837696.000000</working_set_size> <working_set_size_smoothed>231837696.000000</working_set_size_smoothed> <page_fault_rate>0.000000</page_fault_rate> </active_task> Host: Intel P4 3.0 GHz HT, RAM: 512 MB, CentOS 4.5 (Linux 2.6.9-55.0.9.ELsmp), BOINC 5.10.21, preferences are the defaults, running Rosetta & RALPH |
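The `<active_task>` fragment quoted above can be checked mechanically. The following is only a sketch: it parses a copy of that fragment (trimmed to the fields it uses) with Python's standard library and flags the pattern described in this thread, a task that reports `fraction_done` of 1.0 but is still listed as active. The function name `summarize_active_task` and the "finished but active" heuristic are assumptions for illustration, not anything BOINC provides.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the post above, trimmed to the fields used here.
SAMPLE = """
<active_task>
  <result_name>2dlb__BOINC_SYMM_FOLD_AND_DOCK_RELAX-2dlb_-native__2480_49_1</result_name>
  <slot>1</slot>
  <checkpoint_cpu_time>14379.437992</checkpoint_cpu_time>
  <fraction_done>1.000000</fraction_done>
  <current_cpu_time>14421.431608</current_cpu_time>
</active_task>
"""


def summarize_active_task(task_xml):
    """Pull out the fields that hint at a 'stuck at 100%' task."""
    task = ET.fromstring(task_xml)
    fraction_done = float(task.findtext("fraction_done"))
    cpu_now = float(task.findtext("current_cpu_time"))
    cpu_checkpoint = float(task.findtext("checkpoint_cpu_time"))
    return {
        "result_name": task.findtext("result_name"),
        # Heuristic (assumption): done, yet still present as an active task.
        "finished_but_active": fraction_done >= 1.0,
        "cpu_since_checkpoint": cpu_now - cpu_checkpoint,
    }


if __name__ == "__main__":
    for key, value in summarize_active_task(SAMPLE).items():
        print(f"{key}: {value}")
```

For the quoted task this reports about 42 CPU seconds since the last checkpoint, so by these numbers the work unit finished its computation and then never exited.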
dasy2k1 Send message Joined: 18 Feb 07 Posts: 2 Credit: 747,661 RAC: 0 |
I have been having similar problems with Rosetta on Kubuntu 7.04 with BOINC 5.4.11. If a Rosetta WU is preempted, it hangs and won't die gracefully; rather, it still reports as running but uses no CPU and fills the slot in BOINC. I have updated my prefs to switch every 2 hours now, with a 1 hour target time, hoping that Rosetta will finish all WUs before switching... I can't disable switching though, as I run CPDN and those goliath WUs would get in the way of everything else. |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,821,902 RAC: 15,180 |
I'm a Linux newb but have a hunch that my errors (which I believe follow the trend of this thread) are due to Rosetta unnecessarily trying to access OpenGL files. I know the Windows version gives a computation error if glu32.dll and opengl32.dll aren't in the system32 folder. Can anyone check what files Rosetta is trying to access? |
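One rough way to answer the question above on Linux is to look at the descriptors a running process holds under `/proc/<pid>/fd`. The sketch below does that in Python. The helper name `open_files` is hypothetical, and you would pass the PID of the Rosetta process (found with `ps`, for example) instead of the script's own PID used in the demo.

```python
import os


def open_files(pid):
    """Return the sorted set of paths a Linux process currently has open."""
    fd_dir = f"/proc/{pid}/fd"
    paths = set()
    for fd in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to the file, socket or pipe behind it.
            paths.add(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
    return sorted(paths)


if __name__ == "__main__":
    # Demo on this script's own process; substitute the Rosetta PID.
    for path in open_files(os.getpid()):
        print(path)
```

This only shows files that are open at the moment you look, and you need permission to read the other process's `/proc` entry. A failed attempt to open a missing OpenGL library would not show up here; tracing the process with something like `strace -f -e trace=open -p <pid>` is the usual way to see attempted opens, including the ones that fail.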
©2024 University of Washington
https://www.bakerlab.org