Questions and Answers : Unix/Linux : Rosetta WU's stall on RedHat Fedora
Author | Message |
---|---|
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
Most of my systems run Windows 2K or XP, and handle Rosetta without a problem. However, I have one box running RedHat Fedora Core 4, on a Dell Dimension 4700. This system occasionally gets a WU that just stalls. Boincmgr shows it as running, but if I use ps to look at it, there are three copies of rosetta_4.78_i686-pc-linux-gnu running, all of which are sleeping: the STAT column shows S for all of them, and the CPU time (both in ps and boincmgr) does not increase. Any ideas? Is there anything else I can do to help with this? |
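The check described in the post above (state S and a CPU time that never grows) can be automated by sampling a process twice and comparing. The following is a minimal sketch, assuming a Linux `/proc` filesystem; the helper names `parse_stat`, `is_stalled`, `sample`, and `check`, and the 30-second interval, are hypothetical choices for illustration and are not part of BOINC or Rosetta.

```python
import time


def parse_stat(stat_line):
    """Return (state, cpu_ticks) from the text of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces, so split on the last ')' before splitting on whitespace.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    state = rest[0]                            # field 3: R, S, D, Z, T, ...
    cpu_ticks = int(rest[11]) + int(rest[12])  # fields 14 + 15: utime + stime
    return state, cpu_ticks


def is_stalled(before, after):
    """True if the process is sleeping and used no CPU between samples."""
    state, ticks_after = after
    _, ticks_before = before
    return state == "S" and ticks_after == ticks_before


def sample(pid):
    """Read one (state, cpu_ticks) sample for a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_stat(f.read())


def check(pid, interval=30):
    """Take two samples `interval` seconds apart and report a stall."""
    first = sample(pid)
    time.sleep(interval)
    return is_stalled(first, sample(pid))
```

A healthy process can also sit in state S for a moment while it waits on I/O, so a longer interval gives a more trustworthy answer than a short one.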
Andrew Joined: 19 Sep 05 Posts: 162 Credit: 105,512 RAC: 0 |
What boinc client version are you using on the linux side? |
dgnuff Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
"What boinc client version are you using on the linux side?" 5.2.6 |
tmr0 Joined: 9 Nov 05 Posts: 2 Credit: 3,012,594 RAC: 0 |
"What boinc client version are you using on the linux side?" I've had this happen twice now with FC4; ps aux shows boinc as S+. Here is the output (at 16:42 I started poking around in Boinc Manager):
2005-11-15 19:26:29 [rosetta@home] Requesting 646 seconds of new work, and reporting 1 results
2005-11-15 19:26:34 [rosetta@home] Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
2005-11-15 19:26:34 [rosetta@home] Message from server: No work sent
2005-11-15 19:26:34 [rosetta@home] Message from server: No work sent
2005-11-15 19:26:34 [rosetta@home] Message from server: (there was work for other platforms)
2005-11-15 19:26:34 [rosetta@home] Message from server: (there was work for other platforms)
2005-11-15 19:26:34 [rosetta@home] No work from project
2005-11-15 19:26:34 [rosetta@home] No work from project
2005-11-15 20:59:26 [---] request_reschedule_cpus: process exited
2005-11-15 20:59:26 [rosetta@home] Computation for result 1n0u__abrelaxmode_random_length20_jitter02_omega_32991_0 finished
2005-11-15 20:59:29 [rosetta@home] Started upload of 1n0u__abrelaxmode_random_length20_jitter02_omega_32991_0_0
2005-11-15 20:59:34 [rosetta@home] Finished upload of 1n0u__abrelaxmode_random_length20_jitter02_omega_32991_0_0
2005-11-15 20:59:34 [rosetta@home] Throughput 14967 bytes/sec
2005-11-16 16:42:29 [---] request_reschedule_cpus: project op |
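A long silence in the client log, like the one between 20:59:34 and 16:42:29 the next day in the post above, is one sign that the client has stopped doing anything. A minimal sketch for finding the longest such gap follows. It assumes every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in the excerpt; the function name `largest_gap` is hypothetical, and the sample lines are copied from the post, including its truncated last entry.

```python
import re
from datetime import datetime

# Matches the leading "YYYY-MM-DD HH:MM:SS" timestamp of a client log line.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")


def largest_gap(lines):
    """Return (gap, previous_line, next_line) for the longest silence.

    Lines without a leading timestamp are skipped. Returns None when
    fewer than two timestamped lines are present.
    """
    best = None
    prev_time = prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if prev_time is not None:
            gap = when - prev_time
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = when, line
    return best


log = [
    "2005-11-15 20:59:34 [rosetta@home] Finished upload of 1n0u__abrelaxmode_random_length20_jitter02_omega_32991_0_0",
    "2005-11-15 20:59:34 [rosetta@home] Throughput 14967 bytes/sec",
    "2005-11-16 16:42:29 [---] request_reschedule_cpus: project op",
]
gap, before, after = largest_gap(log)
print(gap)  # prints 19:42:55
```

A gap alone does not prove a stall, since a long-running WU also produces no log lines while it computes; it is only a pointer to where to look.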
tmr0 Joined: 9 Nov 05 Posts: 2 Credit: 3,012,594 RAC: 0 |
My Boinc version is 5.2.7. In Boinc Manager the status is shown as "Communication Deferred" |
paul.g Joined: 5 Jan 06 Posts: 1 Credit: 174,359 RAC: 0 |
I don't know if this helps, but I'm running boinc ver. 5.2.13 and rosetta ver. 4.80. I've also got two other projects on the go, seti@home and predictor@home, neither of which has given me any problems to date (for about 4 weeks now). I'm running on a somewhat upgraded slackware 7.0 which is now closer to 10.0. I just subscribed to rosetta, and my first WU got to about 20% and then also stalled. The boinc manager shows rosetta as running, and the processes are in memory; however, they are all sleeping and are not actually taking any CPU time. I didn't try exiting the manager altogether and restarting it. Instead I aborted the WU; the client fetched a new one and started processing it. If it stalls again, I'll try restarting the manager, but I somehow doubt it will solve the problem. I'm wondering whether there is a problem with unloading the app from memory before the WU is complete when another project is scheduled, or some kind of race condition within the rosetta app. |
Pepo Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
I have had this problem occasionally for ages, on Red Hat EL 4.1. I am now using BCC 5.4.9, attached to 7 projects; Rosetta's share is ~20%. The symptom is that the Rosetta app seems to be running, but its CPU time does not increase. Recently I've noticed that when this happens, even BCC is not able to run benchmarks. IIRC, previously, if BCC was able to switch to another app, that app got 0 CPU cycles (Rosetta was consuming all of them) and did not increment its time. Usually the only way to overcome this problem was to manually restart BCC. This time I've made a few snapshots and suspended Rosetta in memory so that I can test something, if anyone is interested. About the snapshots: sorry for the inconvenience, they go from bottom to top and are pretty wide; the [ code ] formatting seems ugly, and it's even worse without it. [edit] I'll read through the Rosetta 5.22 problems reporting thread...[/edit] Peter

[pepo@orc ~]$ top -U pepo
top - 21:21:52 up 65 days, 7:27, 1 user, load average: 0.89, 0.42, 0.20
Tasks: 102 total, 1 running, 101 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3% us, 0.7% sy, 99.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 515744k total, 510592k used, 5152k free, 139564k buffers
Swap: 1048568k total, 180k used, 1048388k free, 103004k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31420 pepo 17 0 2212 980 764 R 0.7 0.2 0:00.60 top -U pepo
14747 pepo 16 0 5684 4104 1680 S 0.0 0.8 3693:33 ./boinc -allow_remote_gui_rpc
22383 pepo 34 19 78472 51m 4792 S 0.0 10.3 59:40.05 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o.....
22384 pepo 34 19 78472 51m 4792 S 0.0 10.3 0:00.02 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o.....
22385 pepo 35 19 78472 51m 4792 S 0.0 10.3 0:00.05 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
22386 pepo 34 19 78472 51m 4792 S 0.0 10.3 0:00.14 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
31382 pepo 34 19 30004 6672 2056 S 0.0 1.3 0:00.01 albert_4.58_i686-pc-linux-gnu @conf --IFO=LHO --Freq=287.667431193 --FreqBand=0.00114678899083 --startTime=7951
31390 pepo 16 0 8128 2212 1808 S 0.0 0.4 0:00.10 sshd: pepo@pts/1
31391 pepo 15 0 6308 1456 1176 S 0.0 0.3 0:00.35 -bash

XXXX|Rosetta manually suspended (but still in memory), Einstein now increments time.
Einstein@Home|22 Jun 2006 21:21:15|Resuming task r1_0287.5__542_S4R2a_3 using albert version 458
rosetta@home|22 Jun 2006 21:21:15|Pausing task t312__CASP7_JUMPRELAX_SAVE_ALL_OUT_BARCODE_hom010__711_1635_0 (removed from memory)
---|22 Jun 2006 21:21:15|Rescheduling CPU: project suspended by user

[pepo@orc ~]$ top -U pepo
top - 21:20:37 up 65 days, 7:25, 1 user, load average: 0.16, 0.25, 0.13
Tasks: 101 total, 2 running, 99 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5% us, 0.1% sy, 86.8% ni, 12.5% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 515744k total, 512748k used, 2996k free, 140196k buffers
Swap: 1048568k total, 180k used, 1048388k free, 101104k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31420 pepo 16 0 2208 900 696 R 3.8 0.2 0:00.04 top -U pepo
14747 pepo 16 0 5684 4104 1680 S 1.9 0.8 3693:32 ./boinc -allow_remote_gui_rpc
22383 pepo 34 19 78472 51m 4792 S 0.0 10.3 59:40.05 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
22384 pepo 34 19 78472 51m 4792 S 0.0 10.3 0:00.02 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
22385 pepo 35 19 78472 51m 4792 S 0.0 10.3 0:00.05 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
22386 pepo 34 19 78472 51m 4792 S 0.0 10.3 0:00.14 rosetta_5.22_i686-pc-linux-gnu xx t312 _ -output_silent_gz -silent -increase_cycles 10 -new_centroid_packing -o
31382 pepo 34 19 30004 6672 2056 S 0.0 1.3 0:00.01 albert_4.58_i686-pc-linux-gnu @conf --IFO=LHO --Freq=287.667431193 --FreqBand=0.00114678899083 --startTime=7951
31390 pepo 15 0 8128 2212 1808 S 0.0 0.4 0:00.03 sshd: pepo@pts/1
31391 pepo 15 0 6308 1456 1176 S 0.0 0.3 0:00.35 -bash

Project|Time|Messages
XXXX|Rosetta unsuspended, now active & "running", but not incrementing time, is consuming CPU cycles
Einstein@Home|22 Jun 2006 21:16:48|Pausing task r1_0287.5__542_S4R2a_3 (removed from memory)
---|22 Jun 2006 21:16:48|Rescheduling CPU: project resumed by user
XXXX|Rosetta manually suspended (but still in memory), Einstein now increments time.
Einstein@Home|22 Jun 2006 21:14:25|Starting task r1_0287.5__542_S4R2a_3 using albert version 458
rosetta@home|22 Jun 2006 21:14:25|Pausing task t312__CASP7_JUMPRELAX_SAVE_ALL_OUT_BARCODE_hom010__711_1635_0 (removed from memory)
---|22 Jun 2006 21:14:25|Rescheduling CPU: project suspended by user
---|22 Jun 2006 3:00:00|Resuming network activity
---|22 Jun 2006 2:00:00|Suspending network activity - time of day
XXXX|22 Jun 2006 21:12:00|still the same, Rosetta seems to be active, but idle and not incrementing time.
---|22 Jun 2006 3:00:00|Resuming network activity
---|22 Jun 2006 2:00:00|Suspending network activity - time of day
---|22 Jun 2006 0:53:34|Process 26964 not found
---|22 Jun 2006 0:53:34|Resuming network activity
---|22 Jun 2006 0:53:34|Rescheduling CPU: Resuming computation
---|22 Jun 2006 0:53:34|Resuming computation
---|22 Jun 2006 0:53:33|Failed to stop applications; aborting CPU benchmarks
---|22 Jun 2006 0:53:25|Running CPU benchmarks
---|22 Jun 2006 0:53:23|Suspending network activity - running CPU benchmarks
rosetta@home|22 Jun 2006 0:53:23|Pausing task t312__CASP7_JUMPRELAX_SAVE_ALL_OUT_BARCODE_hom010__711_1635_0 (removed from memory)
---|22 Jun 2006 0:53:23|Suspending computation - running CPU benchmarks
Einstein@Home|21 Jun 2006 18:35:25|Scheduler request succeeded
Einstein@Home|21 Jun 2006 18:35:20|Reporting 1 tasks |
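Comparing the TIME+ column between the two top snapshots in the post above shows which processes made no progress. The sketch below is only an illustration: the function names `time_plus_to_seconds` and `unchanged_pids` are hypothetical, and the parser assumes the two TIME+ forms that appear in this thread (`minutes:seconds.hundredths` and `minutes:seconds`). The sample values are copied from the 21:20:37 and 21:21:52 snapshots.

```python
def time_plus_to_seconds(value):
    """Convert a top TIME+ value such as '59:40.05' or '3693:33' to seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + float(seconds)


def unchanged_pids(earlier, later):
    """Return PIDs whose TIME+ did not advance between two snapshots.

    Both arguments map PID -> TIME+ string; PIDs missing from either
    snapshot are ignored.
    """
    return sorted(
        pid
        for pid in earlier.keys() & later.keys()
        if time_plus_to_seconds(later[pid]) <= time_plus_to_seconds(earlier[pid])
    )


# TIME+ values copied from the two snapshots in the post.
snap_2120 = {14747: "3693:32", 22383: "59:40.05", 31390: "0:00.03"}
snap_2121 = {14747: "3693:33", 22383: "59:40.05", 31390: "0:00.10"}
print(unchanged_pids(snap_2120, snap_2121))  # prints [22383]
```

Note that Rosetta was manually suspended again at 21:21:15, between these two snapshots, so this particular pair is not proof of the stall on its own; two snapshots taken while the task is nominally running would be the more convincing comparison.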
Bob Bowen Joined: 22 Mar 06 Posts: 14 Credit: 6,155,148 RAC: 9 |
The way I keep this to a minimum on my Fedora boxes is to leave my rosetta preference "Target CPU run time" set to "not selected", which defaults to 3 hours. That way the WUs that hang get cleaned out faster. Join our Great Team at Team-SciFi |
©2024 University of Washington
https://www.bakerlab.org