Report long-running models here

Nothing But Idle Time

Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 58309 - Posted: 31 Dec 2008, 20:24:15 UTC

As you can see my chosen runtime is 12 hours but this task ran for 20 hours.
Stealing the sentiment of greg_be "This sucks".

1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_97328_0
Claimed credit 153
Granted credit 76 (benevolent of you)

# cpu_run_time_pref: 43200
======================================================
DONE :: 1 starting structures 71773.1 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 58310 - Posted: 31 Dec 2008, 21:24:46 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=218071526
This WU is 5 hours and 18 minutes into a 12-hour runtime and is still on the first decoy.
Running BOINC 6.4.5
Rosetta Mini 1.47
Running Vista with AMD quad 9500 Phenom, 3 gigs RAM
I have been getting lots of WUs where I get maybe 1 to 3 decoys out of 12 hours of CPU time and very little credit for the invested time. The last 3 or 4 days have been so bad credit-wise that my RAC is dropping.

Wissi
Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 58329 - Posted: 1 Jan 2009, 18:20:01 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=217748451
Workunit 198419747

Name is 1nkuA_BOINC_MPZN_vanilla_abrelax_5901_156441_0 for Rosetta Mini 1.47.

It has now been sitting at about 96% for a very long time; the last estimated 15 minutes usually take at least another 2 hours (or more).

All my recent results showed that the watchdog ended the runs because the time used was more than 3 times the preferred time.

BTW: this unit is still running in model 1.

The second task, running right now with the same symptoms, is https://boinc.bakerlab.org/rosetta/result.php?resultid=218092186, Workunit 198571904. Its name is 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_110537_1 (this unit already has one client error as a result).

So the estimates do not seem to be good, especially since I have the impression that the last 10% of the estimated time uses 90% of the real time. I don't think that this is a problem with my computer, since the benchmark results are known to the schedulers...

Wissi
Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 58330 - Posted: 1 Jan 2009, 19:01:22 UTC - in response to Message 58329.  

https://boinc.bakerlab.org/rosetta/result.php?resultid=217748451
Workunit 198419747


Interesting effect: after shutting down and restarting my computer (due to a software installation), the CPU time used by the two processes went down from about 4:30 hours to about 2:45, and the completion percentage went down from about 96% to about 94%.

I forgot to mention that I use BOINC Manager 6.4.5.

Has anyone else seen this effect?

Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 58332 - Posted: 1 Jan 2009, 20:45:48 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=218071526
The same task mentioned in my last post is now 28 hours in, still on the 1st decoy, and stuck at 99 percent with 10 minutes to go. I now have the choice of aborting it for 0 credit or letting the watchdog abort it in another 10 hours for very little credit.

Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 58333 - Posted: 1 Jan 2009, 21:37:06 UTC - in response to Message 58332.  

https://boinc.bakerlab.org/rosetta/result.php?resultid=218071526
The same task mentioned in my last post is now 28 hours in, still on the 1st decoy, and stuck at 99 percent with 10 minutes to go. I now have the choice of aborting it for 0 credit or letting the watchdog abort it in another 10 hours for very little credit.

Task ID 218071526
Name 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_113811_1
Workunit 198586483
Created 31 Dec 2008 15:42:51 UTC
Sent 31 Dec 2008 16:03:51 UTC
Received 1 Jan 2009 21:18:15 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 948562
Report deadline 10 Jan 2009 16:03:51 UTC
CPU time 101439.9
stderr out <core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 43200
======================================================
DONE :: 1 starting structures 101440 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state Valid
Claimed credit 392.681268578232
Granted credit 37.8765655196635
application version 1.47
Thanks for the credit.

LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 58335 - Posted: 1 Jan 2009, 21:50:08 UTC - in response to Message 58330.  

Workunit 198419747

Interesting effect: after shutting down and restarting my computer (due to a software installation), the CPU time used by the two processes went down from about 4:30 hours to about 2:45, and the completion percentage went down from about 96% to about 94%.

I forgot to mention that I use BOINC Manager 6.4.5.

Has anyone else seen this effect?

This is quite normal. On restarting it's gone back to your last checkpoint, which appears to have been 2h 45m and begins again from there. That's about 94% of the default 3 hour run time.

Once the time to completion gets to 10 minutes it stops reducing, then just shows (runtime/(runtime+10mins)) as a percentage of work done until completion.
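
To put numbers on that, here is a minimal sketch of the pinned-progress formula. It assumes the display really is just runtime / (runtime + 10 minutes) once the estimate stops shrinking; the function name and the constant are made up for illustration and are not taken from the BOINC or Rosetta source.

```python
# Hypothetical helper, not actual BOINC/Rosetta code.
PIN_SECONDS = 10 * 60  # the 10-minute floor on "time to completion"


def pinned_progress(cpu_seconds: float, pin_seconds: float = PIN_SECONDS) -> float:
    """Fraction shown as 'work done' once the remaining estimate is pinned."""
    return cpu_seconds / (cpu_seconds + pin_seconds)


if __name__ == "__main__":
    for hours in (3, 12, 28):
        shown = pinned_progress(hours * 3600)
        print(f"{hours:>2} h of CPU time -> {shown:.1%} shown")
```

Under that assumption the percentage keeps creeping toward 100% but never gets there, which would be why a model that runs for many hours looks stuck in the high 90s with about 10 minutes to go.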

Claimed credit 392.681268578232
Granted credit 37.8765655196635
application version 1.47
Thanks for the credit.

I just saw that. Ouch! :(

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 58345 - Posted: 2 Jan 2009, 3:18:15 UTC


Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58354 - Posted: 2 Jan 2009, 10:32:29 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=218058455
cc2_1_8_mammoth_mix_cen_cst_hb_t342__IGNORE_THE_REST_2G0QA_2_5889_234_0
CPU time 15509.59
cpu_run_time_pref: 14400


credit was great even with the over run

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 58357 - Posted: 2 Jan 2009, 11:27:43 UTC - in response to Message 58333.  

Claimed credit 392.681268578232
Granted credit 37.8765655196635
Thanks for the credit.

People are experiencing similar results. So how can the credit system be working right if there is such a large disparity between claim and grant? I thought the grant was a running average of claims? Apparently too few decoys are being produced. Something smells; and if the smell gets strong enough I will, er... "have to leave the room".

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58362 - Posted: 2 Jan 2009, 12:18:08 UTC - in response to Message 58357.  

Claimed credit 392.681268578232
Granted credit 37.8765655196635
Thanks for the credit.

People are experiencing similar results. So how can the credit system be working right if there is such a large disparity between claim and grant? I thought the grant was a running average of claims? Apparently too few decoys are being produced. Something smells; and if the smell gets strong enough I will, er... "have to leave the room".



can you post what task that was? with large over runs the credit system gets messed up from what i can tell. i've had stuff that goes 2 hours over limit and returns a lousy credit.

Wissi
Joined: 19 Nov 08
Posts: 14
Credit: 485,807
RAC: 0
Message 58364 - Posted: 2 Jan 2009, 13:34:54 UTC - in response to Message 58329.  

https://boinc.bakerlab.org/rosetta/result.php?resultid=217748451
Workunit 198419747

https://boinc.bakerlab.org/rosetta/result.php?resultid=218092186, Workunit 198571904. Its name is 1nkuA_BOINC_MPZN_with_zinc_abrelax_6130_110537_1


Okay, here again is what I saw: yesterday, when I shut down my computer, workunit 198419747 had a processor time of 6 hrs 17 min and 40 seconds. BOINC told me that 97.418% of the work was done. The estimated time to the end of the job was 9 mins 40 seconds.

Workunit 198571904 had 6 hrs, 11 min and 17 seconds processor time, and 97.376% of the work was done. 9 mins 30 seconds was the estimated time till the end of the job.

A few minutes ago I restarted my computer, and so the BOINC Manager restarted the computation. Now the values are as follows:

Workunit 198419747: 4 hrs 8 mins 36 seconds of processor time, 96.131% work done, and 15 minutes and 1 seconds as an estimation until the end.

Workunit 198571904: 3 hrs 52 mins 33 seconds of processor time, 95.877% work done, 15 minutes 33 seconds as the estimation.

Now if this continues in THIS way (in fact, this is now the 2nd time that several hours of processor time have been "stolen" due to a restart), I will also have to leave the room and set Rosetta to "don't get any more tasks". The credits are only a symbolic value, but as someone else has already written here: something smells, and not very good. I will wait until the watchdog stops both jobs, because I don't want to lose the symbolic credits (which I will if I stop the jobs myself).

There is a bigger problem somewhere in the system and it has to be investigated. Having read all the posts in this thread, I think there is enough evidence.

There is another task waiting for execution:
https://boinc.bakerlab.org/rosetta/result.php?resultid=218251155, workunit 198873439. It has an expected time to run of 14 hours, 27 minutes and 27 seconds. I thought the preferred runtime would be something between 3 and 6 hours. So maybe the benchmark results of this computer are being totally misinterpreted by the scheduler that sends me those tasks.

By the way: happy new year.

jay
Joined: 12 Jan 08
Posts: 20
Credit: 195,801
RAC: 0
Message 58366 - Posted: 2 Jan 2009, 14:32:32 UTC
Last modified: 2 Jan 2009, 15:01:59 UTC

Hi,

Another long running task...


Full WU name (you can copy the BOINC message from when the task completes).
wuid=197237636
cc_nonideal_1_3_nocst4_hb_t303__IGNORE_THE_REST_2AH5A_6_5991_15

Type of operating system (version of Windows, Linux distribution, or Mac info.)
Windows XP SP3 & updates

BOINC version (see BOINC Manager "About" page).
6.4.5 wxWidgets version 2.8.7

Rosetta version (see BOINC Manager "tasks" page).
Rosetta mini 1.47

A link to the task's results page.
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=197237636

If a specific model took longer than the rest of them, then what model # was shown in the graphic?

model: 2 step: 1879201,...1880095,...,1881207,...,1882051,
The graphic RMSD to Accepted energy plot is very scattered, with no dense concentration. A blue strand and a green strand are waving at each other. They extend from a clump of the rest of the material.


My notes.

Task gets stuck at 97%.
This is running on a laptop.
I had to power off last night to drive home. But before I suspended all tasks, I noticed that this task had run over 3 hours and was stuck at 97%.

I also noted over 3,000,000 page faults on the Rosetta task.
I have 2 Gig of memory. Task Manager says 1.2 gig of physical mem are available.

When I got home last night, I restarted BOINC, running Rosetta, WCG, and Spinhenge.
I noticed that the task lost its work and started over.
This morning - after 9 hours real time the task was stuck again.
I just (around 9:00AM EST) suspended all other tasks - and set 'no new tasks' so there will be no swapping out of the rosetta task and possible loss after checkpoint restart.

After 1/2 hour of a single BOINC task running on an Intel duo, it still shows:
Progress 97.520 %
To complete: 15.04 minutes
CPU time: 06:34:44

The 'to complete' time changes every 6 to 10 seconds and either increases or decreases by a second. Oops, it just stayed on 15.01 seconds for over a minute.

I'll try to get a 10 minute cpu time interval...


Progress 97.592 %
To complete: 14.55 minutes
CPU time: 06:45:21

I'll post this and see what is suggested.

Thanks in advance!!
Jay

PS
More data:
Wall Clock: 10:00AM
Progress 97.750%
To complete: 00:14:37 minutes
CPU time: 07:14:17

jay
Joined: 12 Jan 08
Posts: 20
Credit: 195,801
RAC: 0
Message 58368 - Posted: 2 Jan 2009, 16:39:15 UTC - in response to Message 58366.  

Hi,

Another long running task...

------ snip snip snip ------




The task did complete at wall clock time 11:04AM
Jay

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 58370 - Posted: 2 Jan 2009, 17:54:37 UTC - in response to Message 58362.  

Claimed credit 392.681268578232
Granted credit 37.8765655196635
Thanks for the credit.

People are experiencing similar results. So how can the credit system be working right if there is such a large disparity between claim and grant? I thought the grant was a running average of claims? Apparently too few decoys are being produced. Something smells; and if the smell gets strong enough I will, er... "have to leave the room".



can you post what task that was? with large over runs the credit system gets messed up from what i can tell. i've had stuff that goes 2 hours over limit and returns a lousy credit.

See Rifleman's posts 58332 and 58333 in this thread. I also posted one earlier.

quadro
Joined: 22 Oct 08
Posts: 3
Credit: 10,085,084
RAC: 0
Message 58374 - Posted: 2 Jan 2009, 19:08:31 UTC

Task ID:
218231328
218171409
218110882
218110207
and so on

page with all my tasks

About 1 in every 10 or more tasks gets more granted than claimed credit.
If someone can explain it I'll be grateful.
Or is it normal?

best

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58383 - Posted: 2 Jan 2009, 20:19:41 UTC

...for those concerned about credit and fairness and what smells and what doesn't... please read the original post of this thread. Models that are taking significantly longer than average will receive significantly less credit than claimed. That's how an average works. The large claim goes into the average, but only for one model, as compared to 1000s of others. So, it ups the average, but how much depends on how commonly the models run long.

That is one of many good reasons to work to eliminate these long-running models. Another is that if long-running models can be eliminated, then the estimated runtimes will be more reliable and work fetch more predictable.

The new approaches being used to study the proteins seem to have a higher variability between models than we are all used to. The team has reviewed the information in this thread and is working on some approaches to addressing the long-running models, and to studying them further.
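
As a rough illustration of the averaging described above, here is a minimal sketch. It assumes that each completed model is granted the running average claim per model and that a task's claim is folded into that average per model; the real credit code may differ. The function names and the count of 1000 prior models are hypothetical, and the credit figures are the rounded values from the task quoted earlier in this thread.

```python
# Hypothetical model of per-model credit averaging, not the project's actual code.

def granted_credit(avg_claim_per_model: float, models_completed: int) -> float:
    """Assumed rule: every completed model earns the current average claim."""
    return avg_claim_per_model * models_completed


def updated_average(avg_claim_per_model: float, models_in_average: int,
                    new_claim: float, new_models: int = 1) -> float:
    """Fold one task's total claim into the running per-model average."""
    total = avg_claim_per_model * models_in_average + new_claim
    return total / (models_in_average + new_models)


if __name__ == "__main__":
    avg = 37.88      # per-model average implied by the granted credit (assumption)
    claim = 392.68   # claimed credit for a task that produced a single decoy
    print("granted for 1 decoy:", granted_credit(avg, 1))
    print("average after folding in the claim:",
          round(updated_average(avg, 1000, claim), 2))
```

With these assumed numbers, one very long model moves the average only slightly, while the task that produced it is still paid at roughly the old average.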
Rosetta Moderator: Mod.Sense

Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 58384 - Posted: 2 Jan 2009, 20:33:48 UTC - in response to Message 58383.  

...for those concerned about credit and fairness and what smells and what doesn't... please read the original post of this thread. Models that are taking significantly longer than average will receive significantly less credit than claimed. That's how an average works. The large claim goes into the average, but only for one model, as compared to 1000s of others. So, it ups the average, but how much depends on how commonly the models run long.

That is one of many good reasons to work to eliminate these long-running models. Another is that if long-running models can be eliminated, then the estimated runtimes will be more reliable and work fetch more predictable.

The new approaches being used to study the proteins seem to have a higher variability between models than we are all used to. The team has reviewed the information in this thread and is working on some approaches to addressing the long-running models, and to studying them further.


If a fast CPU runs flat out for 28 hours and generates one decoy, there must have been a hell of a lot of work done to figure the decoy out? I have had over a week of these difficult units and credit for them is abysmal compared to what earlier WUs were awarding. It's almost like folks with long runtime preferences are being penalized for it.
I am new to the project and to distributed computing in general, but since I am increasing my hydro bill by significant amounts, there should be closer attention to the way these credits are awarded.

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58391 - Posted: 2 Jan 2009, 22:18:34 UTC

mod,

how do you explain the weirdness of this user's tasks that were running at 6 hrs and still had a long way to go to completion? then upon reboot of the computer nearly the same amount of work is shown completed for less time used.

kind of some odd stuff going on with his tasks.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58393 - Posted: 2 Jan 2009, 23:09:49 UTC - in response to Message 58391.  

mod,

how do you explain the weirdness of this user's tasks that were running at 6 hrs and still had a long way to go to completion? then upon reboot of the computer nearly the same amount of work is shown completed for less time used.

kind of some odd stuff going on with his tasks.


It all relates to the time to completion estimate. As was stated earlier, when the machine rebooted, the task reverted to its last checkpoint. And at that point, the % complete is going to be based on the runtime preference. In short, the task should be about to proceed down the same path: running for too long, showing about 10 minutes to go the whole time.

There's no exceptional weirdness described there. It is simply how the symptoms appear when you have a long-running model.
Rosetta Moderator: Mod.Sense



©2024 University of Washington
https://www.bakerlab.org