Compute and Client Error on a whole lot of work units with exit status -185 (0xffffff47)

Message boards : Number crunching : Compute and Client Error on a whole lot of work units with exit status -185 (0xffffff47)


Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 49006 - Posted: 24 Nov 2007, 18:42:13 UTC

Server state Over
Outcome Client error
Client state Compute error
Exit status -185 (0xffffff47)

<core_client_version>5.10.28</core_client_version>
<![CDATA[
<message>
Can't link input file
</message>
]]>
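
For reference, the two forms of the exit status shown above are the same number: 0xffffff47 is simply -185 read as an unsigned 32-bit value (two's complement). A minimal Python sketch of the conversion, assuming a 32-bit status word; the helper names are made up for illustration and are not part of BOINC:

```python
MASK32 = 0xFFFFFFFF  # assumption: the status is stored as a 32-bit word


def to_unsigned32_hex(status: int) -> str:
    """Render a signed exit status the way the task page shows it in hex."""
    return f"0x{status & MASK32:08x}"


def from_unsigned32(value: int) -> int:
    """Turn the unsigned 32-bit form back into the signed exit status."""
    return value - 0x100000000 if value & 0x80000000 else value


print(to_unsigned32_hex(-185))
print(from_unsigned32(0xFFFFFF47))
```

This only explains the notation; what the code -185 means has to be looked up in the BOINC client's own error list.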


This happened with tasks from versions 5.81, 5.82 and 5.85.
I lost about 19 tasks to this error and they all contain the same message in the log. This is just one sample.

11/23/2007 4:35:55 PM|rosetta@home|Computation for task w007_1_MolecularRep_1_w007_1_ffas03-1-2b0v_StructuralGenomics_a_2325_7744_0 finished
11/23/2007 4:35:55 PM|rosetta@home|Output file w007_1_MolecularRep_1_w007_1_ffas03-1-2b0v_StructuralGenomics_a_2325_7744_0_0 for task w007_1_MolecularRep_1_w007_1_ffas03-1-2b0v_StructuralGenomics_a_2325_7744_0 absent
11/23/2007 4:39:55 PM|rosetta@home|Starting 1uis__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1uis_-crystal_foldanddock__2318_61717_0
11/23/2007 4:43:56 PM|rosetta@home|[error] Can't link projects/boinc.bakerlab.org_rosetta/rosetta_beta_5.85_windows_intelx86.exe to slots/1/rosetta_beta_5.85_windows_intelx86.exe
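
With 19 or so failed tasks, it can be easier to pull the "Can't link" lines out of a saved copy of the message log than to read through it by hand. A rough Python sketch, assuming every line has the `timestamp|project|message` layout seen above; the function names and the regular expression are illustrative and not a BOINC tool:

```python
import re
from datetime import datetime

# Assumption: link failures always read "[error] Can't link <source> to <slot path>".
LINK_ERROR = re.compile(r"\[error\] Can't link (?P<src>\S+) to (?P<dst>\S+)")


def parse_line(line):
    """Split 'timestamp|project|message' into its three fields."""
    stamp, project, message = line.split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message


def find_link_errors(lines):
    """Return (time, source file, slot path) for every link failure."""
    hits = []
    for line in lines:
        when, _project, message = parse_line(line.strip())
        match = LINK_ERROR.search(message)
        if match:
            hits.append((when, match.group("src"), match.group("dst")))
    return hits


# Two lines copied from the log excerpt above.
sample = [
    "11/23/2007 4:39:55 PM|rosetta@home|Starting 1uis__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1uis_-crystal_foldanddock__2318_61717_0",
    "11/23/2007 4:43:56 PM|rosetta@home|[error] Can't link projects/boinc.bakerlab.org_rosetta/rosetta_beta_5.85_windows_intelx86.exe to slots/1/rosetta_beta_5.85_windows_intelx86.exe",
]

errors = find_link_errors(sample)
for when, src, dst in errors:
    print(when.isoformat(), src, "->", dst)
```

Grouping the hits by source file would show at a glance which application versions were affected.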

Later one task shows this:
11/23/2007 4:43:56 PM|rosetta@home|Computation for task 1uis__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1uis_-crystal_foldanddock__2318_61717_0 finished
11/23/2007 4:43:56 PM|rosetta@home|Output file 1uis__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1uis_-crystal_foldanddock__2318_61717_0_0 for task 1uis__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1uis_-crystal_foldanddock__2318_61717_0 absent
11/23/2007 4:43:58 PM|rosetta@home|Finished upload of 4ubpA_FRAGPRED_ABRELAX_SAVE_ALL_OUT-4ubpA-__2309_17581_0_0

Later it shows this, after the firewall found something it did not like and I had to allow BOINC through the firewall. This seems odd, because everything was OK when I restarted BOINC after the install and put the firewall on auto-learn.

11/23/2007 4:43:56 PM|rosetta@home|Computation for task 1uis__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1uis_-crystal_foldanddock__2318_61717_0 finished
11/23/2007 4:43:56 PM|rosetta@home|Output file 1uis__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1uis_-crystal_foldanddock__2318_61717_0_0 for task 1uis__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1uis_-crystal_foldanddock__2318_61717_0 absent
11/23/2007 4:43:58 PM|rosetta@home|Finished upload of 4ubpA_FRAGPRED_ABRELAX_SAVE_ALL_OUT-4ubpA-__2309_17581_0_0
11/24/2007 3:26:21 PM|rosetta@home|Fetching scheduler list
11/24/2007 3:26:26 PM|rosetta@home|Master file download succeeded
11/24/2007 3:26:32 PM|rosetta@home|Sending scheduler request: Requested by user. Requesting 669600 seconds of work, reporting 45 completed tasks
11/24/2007 3:26:52 PM|rosetta@home|Scheduler request succeeded: got 36 new tasks
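
As a side note on the numbers in the scheduler request above: 669600 seconds works out to 7.75 days of requested work. A quick Python sketch of the arithmetic (the helper name is hypothetical):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def seconds_to_days(seconds: int) -> float:
    """Convert a work request expressed in seconds into days."""
    return seconds / SECONDS_PER_DAY


print(seconds_to_days(669600))
```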

Now it is running ok and shows this message:
11/24/2007 4:59:59 PM|rosetta@home|Sending scheduler request: To fetch work. Requesting 19 seconds of work, reporting 0 completed tasks
11/24/2007 5:00:04 PM|rosetta@home|Scheduler request succeeded: got 1 new tasks
11/24/2007 6:29:09 PM|rosetta@home|Sending scheduler request: To fetch work. Requesting 138 seconds of work, reporting 0 completed tasks
11/24/2007 6:29:14 PM|rosetta@home|Scheduler request succeeded: got 1 new tasks

The firewall now knows all the functions of BOINC.
Is the error related to the firewall, or does it have something to do with BOINC and the error messages I saw earlier in other threads about too much information or wrong filenames on the server?

I took a hit of over 20 credits due to these errors.
That makes me very mad! It will take a long time to build those back and get back to climbing to where I was before I had some other problems.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 49011 - Posted: 24 Nov 2007, 19:43:26 UTC

Greg, it sounds as though your firewall saw Rosetta itself trying to use the internet (rather than the usual BOINC client). This occurs when a failure is captured and Rosetta tries to report it back. The report needs the program symbol tables and so on, so it can't be sent via BOINC the way all normal communication is.
Rosetta Moderator: Mod.Sense
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 49013 - Posted: 24 Nov 2007, 20:41:00 UTC - in response to Message 49011.  

OK, that fits, because the firewall also 'learned' Rosetta 5.85, but not the 5.82 tasks I have queued. So if I am reading this right, when Rosie gets an error, it is that Rosetta version that tries to access the internet, not the BOINC manager? And if 5.82 gets any errors, the firewall will block it because it has not learned that version yet?

But what I don't get is this: did I lose credit the normal way, because all 19 or so work units errored out, or did I lose credit because Rosie was not able to get through the firewall?

Some of these were the crystal fold tasks in 5.81 and a whole lot of others in 5.82 and 5.85

take for instance

Task ID 122381870
Name 1i8f__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1i8f_-crystal_foldanddock__2318_54614_0
Workunit 111250295

It got the client and compute errors and all the rest of the stuff I posted from the stderr text, in addition to the error message from the BOINC manager.

also:
Task ID 122177508
Name 2a43__BOINC_RHO_OMEGA1_OMEGA2_HALFBACKBONEHB_RNA_ABINITIO-2a43_-_2322_27_0
Workunit 111064109

same problem.

But losing nearly half of the work units I crunched really irritates me.
So someone should review my load of failed tasks, some of which I got second-hand due to the same error. The majority were 5.81 Crystal Fold, with a mix of others.

Final question: if the firewall learns all the versions of Rosie, and it already knows the BOINC manager, what will happen if there is another error with a task? Am I going to get the same client and compute errors, or will it figure out what it needs to do to correct the issue, finish computing, and report as a success?


Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 49020 - Posted: 24 Nov 2007, 21:31:12 UTC - in response to Message 49013.  

OK, that fits, because the firewall also 'learned' Rosetta 5.85, but not the 5.82 tasks I have queued. So if I am reading this right, when Rosie gets an error, it is that Rosetta version that tries to access the internet, not the BOINC manager? And if 5.82 gets any errors, the firewall will block it because it has not learned that version yet?


Exactly.

But what I don't get is this: did I lose credit the normal way, because all 19 or so work units errored out, or did I lose credit because Rosie was not able to get through the firewall?


The tasks errored out, which is what prompted your firewall to ask whether Rosetta may access the internet. Any credit lost, issued, or granted by the nightly script is due to the task error, not to the inability to send in all the diagnostics.

Rosetta Moderator: Mod.Sense
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 49023 - Posted: 24 Nov 2007, 22:49:41 UTC - in response to Message 49020.  


Thanks for clearing that up. As for all the errors, is it worth posting them or just leaving them be? It's around 19 or so different tasks across 3 different versions, mostly that troublesome crystal_fold task.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 49029 - Posted: 25 Nov 2007, 3:28:28 UTC

There have been a few different issues around lately: some .out files getting overly full, some missing, and some runs consuming large amounts of swap space and/or memory. If your issues are in line with those, I would say they have already been posted. Otherwise, the "Problems with..." thread for the release of the task would be the place to post.
Rosetta Moderator: Mod.Sense




©2024 University of Washington
https://www.bakerlab.org