Report long-running models here

Hubington

Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56195 - Posted: 3 Oct 2008, 14:11:40 UTC - in response to Message 56194.  

You think you have it bad? What a waste of time and energy this one was:

194952967 26 Sep 2008 18:51:19 UTC Over Client error Compute error 61,849.00 562.03 ---


It's all these hombench tasks. It's like I said in the thread where they announced it (https://boinc.bakerlab.org/forum_thread.php?id=4388): this sort of thing should really be part of RALPH, not Rosetta.

As I understand it, Rosetta was set up so all us grunts can do the monkey work with the tested and proven applications, while RALPH was for R&D for Rosetta, so they could test new ideas and get them working right before we all grind away at processing it all. If you go to the RALPH home page, the first thing on the page says:

"RALPH@home is the official alpha test project for Rosetta@home. New application versions, work units, and updates in general will be tested here before being used for production. The goal for RALPH@home is to improve Rosetta@home."

sswilson
Joined: 9 May 08
Posts: 6
Credit: 1,519,259
RAC: 0
Message 56787 - Posted: 9 Nov 2008, 16:38:20 UTC

Rosetta mini 1.40
Boinc 5.10.45
Win XP (fully updated)

1hzh_1s69_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_113_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=206158706

1hzh_2jc7_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_73_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=206146511

1hzh_1o4r_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_74_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=206098744

1hzh_1r6n_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_21_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=206025047

All of these took much longer than normal and returned very poor granted credit vs. claimed credit.

sswilson
Joined: 9 May 08
Posts: 6
Credit: 1,519,259
RAC: 0
Message 56789 - Posted: 9 Nov 2008, 16:58:58 UTC

BTW....

If this is an ongoing issue, this thread should be stickied so that folks know it exists. I only came across it accidentally through a link from another thread.

googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,722,686
RAC: 3,377
Message 56790 - Posted: 9 Nov 2008, 18:00:11 UTC

205632563 ran long. I suspect it was because of a number of these:

recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_U9X3X_00000001 with id stage_1
recovering checkpoint of tag S_U9X3X_00000001 with id stage_2
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_1
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_2
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_3
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_4
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_5
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_6
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_7
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_8
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_9
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_10
recovering checkpoint of tag S_U9X3X_00000001 with id stage4_kk_1
recovering checkpoint of tag S_U9X3X_00000001 with id stage4_kk_2
recovering checkpoint of tag S_U9X3X_00000001 with id stage4_kk_3
recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_relax

Eight of them, actually, which equals the number of decoys it ran.
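
A count like this can be checked without scrolling through the whole log. The sketch below is only an illustration: the function name `count_recoveries` and the sample text are made up for the example, and it assumes each restart shows up as one more repeat of the first checkpoint id (`abrelax_rg_state` in the log above).

```python
import re
from collections import Counter

# Matches stderr lines such as:
#   recovering checkpoint of tag S_U9X3X_00000001 with id stage_1
RECOVERY_RE = re.compile(r"recovering checkpoint of tag (\S+) with id (\S+)")


def count_recoveries(stderr_text):
    """Return a Counter mapping checkpoint id -> number of recoveries seen."""
    counts = Counter()
    for line in stderr_text.splitlines():
        match = RECOVERY_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Hypothetical excerpt: the same two checkpoints recovered twice.
sample = """\
recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_U9X3X_00000001 with id stage_1
recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_U9X3X_00000001 with id stage_1
"""

counts = count_recoveries(sample)
print(counts["abrelax_rg_state"])  # 2
```

Pasting the full stderr text of a result into `sample` should give the number of recovery sequences, which can then be compared with the number of decoys.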

Inikurmoma
Joined: 12 Oct 08
Posts: 1
Credit: 606,772
RAC: 0
Message 56796 - Posted: 9 Nov 2008, 22:20:44 UTC - in response to Message 56790.  

I have to report a long task:

09/11/2008 12:01:27 PM|rosetta@home|Starting 1hzh_1wrm_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_172_0

It has been running for 5 hours now and is stuck at 96.779%.

Hubington
Joined: 3 Feb 06
Posts: 24
Credit: 127,236
RAC: 0
Message 56823 - Posted: 11 Nov 2008, 11:26:27 UTC

11/11/2008 06:11:44|rosetta@home|Computation for task 1hzh_2fi9_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_20_0 finished

Run time is supposed to be 3 hours or thereabouts; it took 8 hours 11 mins.

Given the similarity in name to the tasks reported above in this thread, I assume it too went to a snail's pace in the last few percent.

Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 56827 - Posted: 11 Nov 2008, 13:07:59 UTC

On a slow Duron CPU, Mini 1.39 task 1g73A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1g73A-_4653_6486_0 was interrupted after nearly 10 hours because of "going too long". The preference was increased from 1 to 2 hours during the run, as an attempt to see how the model would cope with such a slow machine. (It is probably not able to finish a decoy at all on such a slow host.) It was checkpointing.

During the run, the progress went very fast to some 80-90% and then advanced 0.1% at a time over hours...

Peter

Path7
Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 56835 - Posted: 11 Nov 2008, 15:00:29 UTC

Hello all,

Not sure where to post: the Minirosetta v1.40 bug thread or this thread.
With a runtime preference of 6 hours, this WU:
1hzh_1mve_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_147_0
has already been running for over 13 hours. From the graphics I can see it is at Model: 1 Step: 52500 and running.
I'm glad it is running on a (2-core) machine with RAM: 2813.69 MB and swap space: 5849.91 MB, since the WU uses 400 MB of RAM (peak: 437 MB) and 393 MB (peak: 429 MB) of virtual memory.

Have a nice day,
Path7.

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 1,150
Message 56844 - Posted: 11 Nov 2008, 18:04:37 UTC

I already reported this one a few times in the Minirosetta v1.40 bug thread, but that thread may be getting overloaded since it lost my last message.

1hzh_1o9g_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_155_1

It finally completed in about 19.25 CPU hours, due to that being over three times the preference of 6 CPU hours.

Also, it was very memory hungry compared to the other workunits I run for other projects - a peak of perhaps 296 MB, and I haven't found a way to check how much virtual memory it used.

A poor ratio of requested to granted credit - 200/80; but that seems common among workunits with 4704 in their names.

I suspect a problem in its debt calculations also, which could explain why it won't yield the CPU core to a workunit from another project at the end of a 2 hour timeslot, even with leave in memory set.

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 28
Message 56846 - Posted: 11 Nov 2008, 19:12:48 UTC

I put mine in the Mini Rosetta 1.30 thread.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 28
Message 56847 - Posted: 11 Nov 2008, 19:13:21 UTC

I put mine in the Mini Rosetta 1.30 thread. Was called...

1hzh_2he4_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_262
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

sarha1
Joined: 23 Sep 05
Posts: 5
Credit: 6,339,735
RAC: 0
Message 56853 - Posted: 11 Nov 2008, 21:03:56 UTC

robertmiles: Granted credit 80 means you were awarded a flat-rate credit after the watchdog shut down the task due to the time consumed.

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57073 - Posted: 19 Nov 2008, 19:11:16 UTC

Default CPU time is 21600 seconds; this one ran 3146.078 seconds.
https://boinc.bakerlab.org/rosetta/result.php?resultid=207892330
h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-11-S3-8--h001b-_4769_556_0
Client state Compute error
Exit status 1 (0x1)
Computer ID 871217
Report deadline 26 Nov 2008 22:35:22 UTC
CPU time 3146.078
stderr out

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
recovering checkpoint of tag S_U11X8X_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_U11X8X_00000001 with id stage_1
recovering checkpoint of tag S_U11X8X_00000001 with id stage_2
# cpu_run_time_pref: 21600
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_1
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_2
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_3
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_4
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_5
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_6
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_7
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_8
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_9
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_10

and this repeats

then this stderr:
ERROR: NANs occured in hbonding!
ERROR:: Exit from: ..\..\src\core\scoring\hbonds\hbonds_geom.cc line: 763
called boinc_finish

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 21.0970375448934
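
To pull the key facts out of a stderr dump like this one, a small script can help. The sketch below is an illustration only: the helper `summarize_stderr` is a hypothetical name, not part of BOINC or Rosetta. It assumes the run-time preference appears on a `# cpu_run_time_pref:` line and that failures are reported on lines starting with `ERROR`, as in the output above.

```python
import re

# Matches the preference line seen in the stderr above:
#   # cpu_run_time_pref: 21600
PREF_RE = re.compile(r"#\s*cpu_run_time_pref:\s*(\d+)")


def summarize_stderr(stderr_text):
    """Return (run-time preference in seconds or None, list of ERROR lines)."""
    pref = None
    errors = []
    for raw_line in stderr_text.splitlines():
        line = raw_line.strip()
        match = PREF_RE.match(line)
        if match:
            pref = int(match.group(1))
        elif line.startswith("ERROR"):
            errors.append(line)
    return pref, errors


# Short excerpt copied from the stderr output above.
sample = """\
recovering checkpoint of tag S_U11X8X_00000001 with id stage_2
# cpu_run_time_pref: 21600
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_1
ERROR: NANs occured in hbonding!
called boinc_finish
"""

pref, errors = summarize_stderr(sample)
print(pref)    # 21600
print(errors)  # ['ERROR: NANs occured in hbonding!']
```

Here the task stopped on the error after 3146.078 seconds, well short of the 21600-second preference, so this is a crash rather than an overrun.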

Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 57092 - Posted: 20 Nov 2008, 8:17:27 UTC

I've got notifications for 7 more messages between 19:14 and 21:43 UTC, which are missing here. Was the thread spammed, or did a server hiccup happen?

Peter

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57095 - Posted: 20 Nov 2008, 11:02:20 UTC - in response to Message 57092.  

I've got notifications for 7 more messages between 19:14 and 21:43 UTC, which are missing here. Was the thread spammed, or did a server hiccup happen?

Peter


A mod hid some double posts by me and some rhetorical fighting as well.
You didn't miss anything. The last current post is the one below.

Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 57097 - Posted: 20 Nov 2008, 12:39:37 UTC - in response to Message 57095.  

...and some rhetorical fighting...

:-)

Peter

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57177 - Posted: 23 Nov 2008, 8:49:38 UTC
Last modified: 23 Nov 2008, 8:51:53 UTC

Sorry, wrong area - no actual overrun.

paulcsteiner
Joined: 15 Oct 05
Posts: 19
Credit: 3,144,322
RAC: 0
Message 57239 - Posted: 26 Nov 2008, 2:58:48 UTC

This one went 263,622.40 seconds, which is a bit longer than the 24 hour RT I have set. Also no credit...


https://boinc.bakerlab.org/rosetta/workunit.php?wuid=188279776
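
To put that figure next to the run-time preference, the arithmetic can be sketched as below. This assumes the number quoted is CPU time in seconds, as shown on the result pages; the function name `overrun_ratio` is made up for the example.

```python
SECONDS_PER_HOUR = 3600


def overrun_ratio(cpu_seconds, target_hours):
    """Return how many times longer the task ran than the run-time preference."""
    return cpu_seconds / (target_hours * SECONDS_PER_HOUR)


cpu_seconds = 263622.40   # CPU time reported for the task, assumed to be seconds
target_hours = 24         # run-time preference in hours

hours_run = cpu_seconds / SECONDS_PER_HOUR
ratio = overrun_ratio(cpu_seconds, target_hours)

print(round(hours_run, 1))  # 73.2
print(round(ratio, 2))      # 3.05
```

So the task ran for roughly 73 hours, about three times the 24-hour target.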

Dennis
Joined: 30 May 06
Posts: 2
Credit: 3,619,161
RAC: 0
Message 57535 - Posted: 3 Dec 2008, 5:56:17 UTC

I believe my long-running tasks started when I loaded v6.2.19, which I am running on 5 machines with XP and Vista. I have single-CPU and 2-CPU machines.
I have seen the overrun on all of them and have aborted the runs; after 50-plus hours I gave up. I am running 24 hr models[?], so now I will wait only until the 30 hr mark of CPU time.

There were 5 or 6 occasions of overrun.
On two occasions, after aborting, the CPU time shown changed from 30-plus and 40-plus hours to the mid-20s, which would be normal.

Don't know if anyone has seen this.

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57536 - Posted: 3 Dec 2008, 8:44:32 UTC - in response to Message 57535.  

I believe my long-running tasks started when I loaded v6.2.19, which I am running on 5 machines with XP and Vista. I have single-CPU and 2-CPU machines.
I have seen the overrun on all of them and have aborted the runs; after 50-plus hours I gave up. I am running 24 hr models[?], so now I will wait only until the 30 hr mark of CPU time.

There were 5 or 6 occasions of overrun.
On two occasions, after aborting, the CPU time shown changed from 30-plus and 40-plus hours to the mid-20s, which would be normal.

Don't know if anyone has seen this.



I had a look at your computers and their tasks. I see only 2 instances where the tasks went over their time: 120,000+ seconds when your preference is for 87,000+ seconds.

There were some tasks that looked like memory access errors (similar to what I get when my OC speed is set too high), and then you had some tasks that had too many restarts for whatever reason.
There were some tasks where the message says to keep the program in memory. Have you set your preferences in BOINC Manager, on the memory tab, to 'leave tasks in memory while suspended'? That will help with the error message about keeping tasks in memory.

Be sure to use the BOINC Manager Activity > Suspend function if you are going to have multiple reboots and BOINC Manager is set to start automatically on boot-up.

This should help you get more steady run times.

©2024 University of Washington
https://www.bakerlab.org