Why does this still happen.

P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 56450 - Posted: 23 Oct 2008, 21:58:52 UTC

I guess I'm not the only one this happens to. Why can't the tasks be canceled by the project if they haven't been started? Other projects do this; it saves wasting time.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=181582962

=====================================================
DONE :: 1 starting structures 21135.6 cpu seconds
This process generated 42 decoys from 42 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>

Validate state - Workunit error - check skipped
Claimed credit - 148.221875622837
Granted credit - 0
Application version - 1.34

pete

ID: 56450 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 56454 - Posted: 24 Oct 2008, 3:16:39 UTC

Two words, Peter:

BOINC Bug
http://boinc.berkeley.edu/trac/ticket/276
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 56454 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 56469 - Posted: 25 Oct 2008, 5:21:12 UTC

My thanks to whoever fixed this one up.

pete.

ID: 56469 · Rating: 0
FoldingSolutions
Joined: 2 Apr 06
Posts: 129
Credit: 3,506,690
RAC: 0
Message 56515 - Posted: 29 Oct 2008, 19:53:52 UTC

Task ID - 202592794
Work unit ID - 185058479
Sent - 27 Oct 2008 20:17:33 UTC
Time reported or deadline - 29 Oct 2008 19:33:36 UTC
Server state - Over
Outcome - Client error
Client state - Compute error
CPU time (sec) - 70,590.59
Claimed credit - 329.12
Granted credit - ---

Shouldn't there be some kind of credit compensation for 20 hours of wasted CPU time??
ID: 56515 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 56674 - Posted: 3 Nov 2008, 18:20:24 UTC - in response to Message 56515.  

You should post this info in the 1.34 thread in case the team didn't see it.

Be sure to tell them you had an exit code 255, as that will help them narrow down the issue.
ID: 56674 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 60301 - Posted: 24 Mar 2009, 20:52:02 UTC

Hi.

Looks like this is not fixed yet; I wasted 6 hrs on it. Why are tasks getting sent out when others are still not past their deadlines?

I could have been doing something else.

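Validate state - Workunit error - check skipped
Server state - Over
Outcome - Success
Client state - Done
CPU time (sec) - 21,377.59
Claimed credit - 154.48
Granted credit - 0.00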

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=214522002

pete.



ID: 60301 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60302 - Posted: 24 Mar 2009, 20:55:26 UTC
Last modified: 24 Mar 2009, 20:56:35 UTC

The timestamps are a bit misleading. The deadline is always 10 days. If you look at it again, you will see that the 10-day deadline was indeed crossed, and this caused the task to be reissued. Then, after that, a result came in.
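
The deadline check described above can be sketched in a few lines of Python. This is only an illustration: the timestamps are hypothetical placeholders, not the actual values from the workunit page, and the helper names are made up. Only the 10-day deadline comes from the post.

```python
from datetime import datetime, timedelta, timezone

# Fixed deadline as stated in the post.
DEADLINE = timedelta(days=10)


def deadline_for(sent: datetime) -> datetime:
    """Return the moment a task sent at `sent` becomes overdue."""
    return sent + DEADLINE


def is_late(sent: datetime, reported: datetime) -> bool:
    """True if the result came back after the deadline had passed."""
    return reported > deadline_for(sent)


# Hypothetical timestamps, chosen only to show a result arriving late.
sent = datetime(2009, 3, 10, 12, 0, tzinfo=timezone.utc)
reported = datetime(2009, 3, 21, 9, 30, tzinfo=timezone.utc)

print(deadline_for(sent).isoformat())
print(is_late(sent, reported))
```

Comparing the "Sent" time plus 10 days against the time the first result came in, rather than against the displayed "Time reported or deadline" column, is what shows whether the reissue was triggered by a missed deadline.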
Rosetta Moderator: Mod.Sense
ID: 60302 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 60306 - Posted: 25 Mar 2009, 4:25:47 UTC

Hi there Mod Sense.

Well, that doesn't make me feel all warm & fuzzy.

If they can't return the work on time, then I see that as a waste of time.

I'm just going to have to abort all the tasks that are only sent out because they're overdue, then; I don't like wasting the time.

More work for me, but so be it.

I'm guessing my result for that one won't be used at all, though it might be a better answer!

pete.


ID: 60306 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,178,827
RAC: 3,166
Message 60309 - Posted: 25 Mar 2009, 9:16:45 UTC - in response to Message 60306.  

It just means you need to lower the cache for this project. If you have special reasons why you can't do that, then, as you suggested, this may not be the project for you. A 10-day cache is pretty long and, unless you have a very slow PC, would result in a ton of workunits. My computer is taking roughly 2 to 2 1/2 hours per workunit. That is, say, 9 units per day; times 10 days, that is 90 workunits, just for this project alone!

I just looked at your 2 PCs and both seem to have a very short cache already. One PC has one workunit and the other has 2 workunits that haven't been returned yet. I wonder if BOINC is having problems? It should be able to tell that a unit is near its deadline and switch to high-priority crunching for that unit, so that it gets returned on time. Do you crunch 24/7?
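
The workunit estimate above is easy to reproduce. Here is a minimal sketch, assuming one task running at a time, around the clock; the function name and the `cores` parameter are illustrative and not part of BOINC.

```python
def workunits_in_cache(cache_days: float, hours_per_workunit: float, cores: int = 1) -> float:
    """Estimate how many workunits a host crunching 24/7 finishes in `cache_days`."""
    units_per_day = 24.0 / hours_per_workunit * cores
    return units_per_day * cache_days


# Runtimes quoted in the post: roughly 2 to 2.5 hours per workunit.
low = workunits_in_cache(10, 2.5)
high = workunits_in_cache(10, 2.0)
print(f"{low:.0f} to {high:.0f} workunits for a 10-day cache")
```

At 2.5 hours per workunit this works out to 9.6 units per day, so the figure of 90 in the post is a slightly conservative rounding; at 2 hours it would be 120.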
ID: 60309 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60316 - Posted: 25 Mar 2009, 13:30:09 UTC

mikey, it wasn't peter that was late with the results, so a smaller cache doesn't help what he's talking about. You could also increase your Rosetta preference for runtime and have a cleaner task list if you are crunching all the time anyway. The default runtime is 3 hours, but you can set it as high as 24 hours. If you make changes to the target runtime, make them gradually: BOINC will still request enough work units to fill the cache based on the old preference until it sees that they have begun running longer. So it is best to make changes when you are requesting only a small cache, and to change by just a notch or two per day.
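
A rough illustration of why a sudden jump in target runtime can overfill the queue. This is a simplified model, not BOINC's actual work-fetch logic: the function names, the 24-hour cache, and the 6-hour new preference are assumptions, and only the 3-hour default comes from the post.

```python
import math


def tasks_requested(cache_hours: float, estimated_runtime_hours: float) -> int:
    """Tasks needed to fill the cache if each one takes the estimated runtime."""
    return math.ceil(cache_hours / estimated_runtime_hours)


def queued_hours(tasks: int, actual_runtime_hours: float) -> float:
    """Hours of work actually sitting in the queue once tasks run at the new length."""
    return tasks * actual_runtime_hours


CACHE_HOURS = 24.0        # hypothetical one-day cache
OLD_RUNTIME_HOURS = 3.0   # default target runtime mentioned in the post
NEW_RUNTIME_HOURS = 6.0   # hypothetical new preference

tasks = tasks_requested(CACHE_HOURS, OLD_RUNTIME_HOURS)
print(tasks, queued_hours(tasks, NEW_RUNTIME_HOURS))
```

With these numbers, work fetched as 8 three-hour tasks turns into 48 hours of queued work, twice the intended cache. A smaller cache or a smaller step in runtime shrinks that overshoot.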

peter, I hear ya. I would just point out that not every late task results in a credit problem. It only seems to happen if one task fails, a second is late, a third is then issued, and then the second is reported back.

So what I'm saying is, don't just go by the last digit on the WU name to judge. Also keep in mind that chances are that the late result will not come back in time to conflict with you. Although with a larger cache, you would have time to go look and see if it came in.

Perhaps you could add to the trac item, and post about this issue on other project boards as well. It's gone unfixed for a long time.
Rosetta Moderator: Mod.Sense
ID: 60316 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,178,827
RAC: 3,166
Message 60360 - Posted: 28 Mar 2009, 13:13:24 UTC - in response to Message 60316.  

Mod.Sense wrote: "mikey, it wasn't peter that was late with the results, so a smaller cache doesn't help what he's talking about."

Whoops, sorry

This is a long-standing BOINC thing, if I understand it this time. If person A gets a unit but doesn't return it before the deadline, the project reissues the unit, sending it to person B. But then if person A returns the unit before person B, person A gets credit and person B gets the "too many results" error message. Dr. A, and others, knew about this long, long ago and decided it was not a big deal since it only happened rarely.

The way I see solving the problem is to not allow person A to return the unit once it has been reissued, giving them an error message if they do try to return it. If person B returns the unit before person A, then person A does get an error message. That is why at SETI they toyed with the idea of only sending reissued units to computers that could return the unit within 24 hours or less. This would also clear the database of 'hanging' units quicker.
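
The sequence described above, and the proposed fix, can be put into a toy model. This is a sketch of the behavior as described in this post, not BOINC server code; the function, the host labels "A" and "B", and the result strings are all made up for illustration.

```python
from typing import Dict, List


def outcome(report_order: List[str], reject_late_original: bool) -> Dict[str, str]:
    """Toy model of one reissued workunit.

    "A" is the original host that missed the deadline; "B" received the reissue.
    `report_order` lists the hosts in the order their results arrive.
    """
    results: Dict[str, str] = {}
    credited = None
    for host in report_order:
        if host == "A" and reject_late_original:
            # Proposed fix: the overdue host can no longer report after a reissue.
            results[host] = "rejected: unit was reissued"
        elif credited is None:
            credited = host
            results[host] = "credit"
        else:
            results[host] = "error: too many results"
    return results


# Behavior as described: whoever reports first gets the credit.
print(outcome(["A", "B"], reject_late_original=False))
# Proposed fix: the late original is turned away, so B is credited.
print(outcome(["A", "B"], reject_late_original=True))
```

Under the proposed rule, person B's work is never wasted by a late return from person A, at the cost of always discarding person A's overdue result.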
ID: 60360 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org