Can a 'malformed' Workunit crash BOINC?

Message boards : Number crunching : Can a 'malformed' Workunit crash BOINC?

Gerhard Klünger

Joined: 24 Dec 05
Posts: 21
Credit: 936,103
RAC: 0
Message 33820 - Posted: 31 Dec 2006, 11:14:02 UTC
Last modified: 31 Dec 2006, 11:19:31 UTC

On a PC that runs 24/7, BOINC Manager suddenly lost contact with the project.

Additional info:
I have been running BOINC/Rosetta since 2006-12-22 on an additional workstation (computer 380897, WinXP SP2), and over the last few days I took a closer look at how it performs. This PC runs 24/7. For the first 6 days everything ran smoothly, averaging 16 WUs a day. Near midnight on 2006-12-29 I noticed a dramatic drop in RAC in BOINCstats and found that this machine had not uploaded a single crunched WU that day.

Opening BOINC Manager from the taskbar with a right-click, I found that there was no longer a connection to the project and no tasks were running. I shut BOINC Manager down via the File menu and started it again from the autostart menu. Crunching immediately resumed on both cores of the dual-core processor, as far as I remember in the middle of the work. I have already posted this problem on the BOINC Manager board (see http://boinc.berkeley.edu/dev/forum_thread.php?id=1415).

A closer look at the results in BOINCstats showed a strange situation (I have copied the table here; hopefully it is readable. Otherwise, see https://boinc.bakerlab.org/rosetta/results.php?hostid=380897&offset=20 for the moment).

As you can see, a WU is usually returned within 6 hours of download. Nothing was downloaded or uploaded during the whole of 2006-12-29, and then in the first hour of 2006-12-30, about 1 hour after my intervention, WU 48254005 was uploaded with an ordinary claimed credit of 32.68 but a granted credit of only 2.61!!

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=48254005

Result ID  Work unit ID  Sent (UTC)            Reported (UTC)        Server state  Outcome  Client state  CPU time (sec)  Claimed credit  Granted credit
54543304   48457268      30 Dec 2006 2:23:58   30 Dec 2006 10:05:38  Over          Success  Done          11,168.83       35.20           33.81
54534284   48449120      30 Dec 2006 0:56:07   30 Dec 2006 6:58:02   Over          Success  Done          11,013.47       34.71           32.58
54357676   48289270      28 Dec 2006 21:58:52  30 Dec 2006 6:58:02   Over          Success  Done          11,002.20       34.67           36.01
54347534   48245293      28 Dec 2006 20:32:06  30 Dec 2006 3:48:46   Over          Success  Done          10,552.27       33.26           33.92
54337359   48271052      28 Dec 2006 19:03:57  30 Dec 2006 3:48:46   Over          Success  Done          10,453.41       32.94           33.56
54326952   48261842      28 Dec 2006 17:31:20  30 Dec 2006 0:56:07   Over          Success  Done          10,452.58       32.94           32.35
54318035   48254005      28 Dec 2006 16:11:48  30 Dec 2006 0:56:07   Over          Success  Done          10,368.66       32.68           2.61
54306891   48243978      28 Dec 2006 14:35:06  28 Dec 2006 20:32:06  Over          Success  Done          10,504.67       33.11           41.02
54297829   48235776      28 Dec 2006 13:16:10  28 Dec 2006 20:32:06  Over          Success  Done          10,317.33       32.52           30.44
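
The odd row in the table can also be picked out mechanically. Here is a minimal Python sketch, assuming the nine rows are typed in by hand as (work unit ID, claimed credit, granted credit) tuples. The function name `flag_credit_outliers` and the 50% threshold are illustrative choices, not something BOINC provides, and a low ratio only marks a result as unusual; it does not show what went wrong.

```python
# Rows copied from the table: (work unit ID, claimed credit, granted credit).
RESULTS = [
    (48457268, 35.20, 33.81),
    (48449120, 34.71, 32.58),
    (48289270, 34.67, 36.01),
    (48245293, 33.26, 33.92),
    (48271052, 32.94, 33.56),
    (48261842, 32.94, 32.35),
    (48254005, 32.68, 2.61),
    (48243978, 33.11, 41.02),
    (48235776, 32.52, 30.44),
]


def flag_credit_outliers(results, min_ratio=0.5):
    """Return (wu_id, ratio) for results granted less than min_ratio of the claim.

    min_ratio is an arbitrary illustrative threshold, not a BOINC setting.
    """
    flagged = []
    for wu_id, claimed, granted in results:
        ratio = granted / claimed
        if ratio < min_ratio:
            flagged.append((wu_id, round(ratio, 2)))
    return flagged


if __name__ == "__main__":
    for wu_id, ratio in flag_credit_outliers(RESULTS):
        print(f"WU {wu_id}: granted/claimed = {ratio}")
```

With these rows, only WU 48254005 falls below the threshold; every other result was granted between roughly 94% and 124% of its claim.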

I suspect this WU might have a destructive effect on BOINC.
Regards, Gerhard
Feet1st

Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 33848 - Posted: 31 Dec 2006, 21:51:18 UTC

Gerhard, it sounds like you are seeing the mysterious BOINC dropped communications error. See if you agree your symptoms are the same as described here.

In a nutshell, no, a given work unit shouldn't be able to bring down BOINC. The discussion in that thread is along the lines that BOINC has some kind of problem when accessing the internet to get more work, and it loses contact with the project threads that it is running.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
Gerhard Klünger

Joined: 24 Dec 05
Posts: 21
Credit: 936,103
RAC: 0
Message 33851 - Posted: 31 Dec 2006, 22:37:50 UTC - in response to Message 33848.  
Last modified: 31 Dec 2006, 22:43:48 UTC

Gerhard, it sounds like you are seeing the mysterious BOINC dropped communications error. See if you agree your symptoms are the same as described here.

In a nutshell, no, a given work unit shouldn't be able to bring down BOINC. The discussion in that thread is along the lines that BOINC has some kind of problem when accessing the internet to get more work, and it loses contact with the project threads that it is running.


Hi Feet1st,

I will have a look at this thread - but for the moment:
From the data at BOINCstats and in my original post, it doesn't look as if BOINC was unable to get new work: after the last WU was uploaded, BOINC downloaded two more WUs for processing ... but then nothing happened any more.

And look at the difference between claimed and granted credit for the first WU uploaded the next day: it is huge and might be a hint that the problem is related to this WU.

What I haven't mentioned yet: BOINC Manager did not just lose communication with the client. When I open the running Manager from the taskbar and choose "Extras > Kommunikation wiederholen" (retry communications), nothing happens. The same goes for "Extras > Computer auswählen > localhost" (select computer): no effect at all. In Task Manager no client process is visible, and the CPU is 99% idle. In this situation I have to terminate BOINC Manager and restart it from the autostart menu, and processing resumes immediately. In other words: there ARE enough WUs available for crunching.

Previously I didn't pay much attention to this, because I expected BOINC to start when my system boots, and I shut my system down every day when I am done. So it may have happened many times in the past that BOINC was not running without my noticing, because I didn't check. But now that I have installed BOINC on a 24/7 PC for the first time, I took a closer look at how things developed over the first couple of days. On another machine I had the impression that a similar situation is not that rare, but I thought it might be a boot-time problem: many drivers, programs and services are loading, and BOINC might fail to start properly. What puzzled me was that such a 'lost communication' happened on a 24/7 system after nearly a week of running without any problems.

A Happy New Year also to you ... here in 22 minutes :)
Regards, Gerhard
Gerhard Klünger

Joined: 24 Dec 05
Posts: 21
Credit: 936,103
RAC: 0
Message 54118 - Posted: 1 Jul 2008, 18:47:34 UTC

Again I get the impression that some kinds of workunits crash BOINC or parts of it:
During my monthly inspection of how my PCs are doing, I found that computer 380897 had not been working since 2008-06-16. A closer look showed that 2 workunits at 100% were waiting to be uploaded, but nothing was happening.
After I killed BOINC and the 2 Rosetta tasks with Task Manager (WinXP) and restarted BOINC, the two tasks were uploaded and a new one was downloaded; processing went on.

Investigating the statistics for computer 380897 showed that the problem started with task ID 171973562 (created 16 Jun 2008 4:30:35 UTC), WU 154457940. This WU terminated with an error. The same workunit had previously terminated with an error on another computer as well, and the next host that tried to crunch it has been "not responding" so far.

The next 2 tasks on computer 380897 also terminated with an error:
Task ID 172077892, created 17 Jun 2008 20:50:46 UTC
Task ID 157067871, created 18 Jun 2008 9:54:13 UTC
Since then (18 Jun 2008) nothing more has happened on this machine with respect to BOINC/Rosetta, while other tasks (other kinds of software) ran without problems on this PC, which runs 24/7.

Regards, Gerhard
Path7

Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 54122 - Posted: 1 Jul 2008, 21:33:28 UTC

Hello Gerhard,

When I look at the tasks for your computer 380897, I see that 2 of the failing WUs are of the “t405” type.
You may have overlooked this: the batch has been flagged on the news / homepage since June 24 as being a big problem.
Good to see you have completed 2 valid WUs already.

Regards,
Path7.
Gerhard Klünger

Joined: 24 Dec 05
Posts: 21
Credit: 936,103
RAC: 0
Message 54126 - Posted: 1 Jul 2008, 22:43:59 UTC

Thanks for your reply. Indeed, I never look at the news, since the systems have been running for months and years and, well, I have other things to do than regularly check the news to see whether something has happened. My two other machines running BOINC had no problems, so I didn't think it was a general problem.

By the way: if there is a problem, why were all the participants (or at least those who received t405-type WUs) not informed by email? Our email addresses are known to BOINC, and the database can easily answer the question of who received such a WU. Something like: "Please check your BOINC at least once a day, there might be trouble with some WUs."

Regards, Gerhard
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 54128 - Posted: 1 Jul 2008, 23:39:27 UTC - in response to Message 54126.  

Thanks for your reply. Indeed, I never look at the news, since the systems have been running for months and years and, well, I have other things to do than regularly check the news to see whether something has happened. My two other machines running BOINC had no problems, so I didn't think it was a general problem.

By the way: if there is a problem, why were all the participants (or at least those who received t405-type WUs) not informed by email? Our email addresses are known to BOINC, and the database can easily answer the question of who received such a WU. Something like: "Please check your BOINC at least once a day, there might be trouble with some WUs."

That's a good point. I don't like doing mass emails but it may be worth doing one in this case.
dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,675,695
RAC: 11,002
Message 54140 - Posted: 2 Jul 2008, 11:32:19 UTC - in response to Message 54128.  

That's a good point. I don't like doing mass emails but it may be worth doing one in this case.

You could filter out any computers which are still returning results - it should reduce the number of emails massively.
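
A minimal sketch of that filter in Python, assuming the project can export, per host, the owner's email address, whether the host was sent a task from the bad batch, and the time of its last returned result. The record layout, the field names, the two-day cut-off and the example addresses are all hypothetical and do not reflect the real BOINC database schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Host:
    # Hypothetical record layout, not the real BOINC database schema.
    host_id: int
    owner_email: str
    got_bad_batch: bool              # host was sent at least one task from the bad batch
    last_result_returned: datetime   # time of the most recent returned result


def emails_to_notify(hosts, now, quiet_for=timedelta(days=2)):
    """Owners of hosts that got the bad batch and have returned nothing recently.

    quiet_for is an illustrative cut-off; hosts still returning results are skipped.
    """
    stuck = {
        h.owner_email
        for h in hosts
        if h.got_bad_batch and now - h.last_result_returned > quiet_for
    }
    return sorted(stuck)  # one email per owner, even with several stuck hosts


if __name__ == "__main__":
    now = datetime(2008, 7, 2, 12, 0)
    hosts = [
        Host(1, "a@example.org", True, datetime(2008, 6, 18, 10, 0)),   # stuck
        Host(2, "b@example.org", True, datetime(2008, 7, 2, 9, 0)),     # still returning
        Host(3, "c@example.org", False, datetime(2008, 6, 1, 0, 0)),    # never got the batch
    ]
    print(emails_to_notify(hosts, now))
```

Collecting addresses in a set means an owner with several stuck machines would get a single email rather than one per host.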
Ingleside

Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 54144 - Posted: 2 Jul 2008, 13:38:43 UTC - in response to Message 54128.  

That's a good point. I don't like doing mass emails but it may be worth doing one in this case.

Many emails will bounce, get lost in spam folders and so on, and don't forget that some users have opted out of email from the project. Also, even if a user does read the email, he won't always have ready access to all his computers to abort the corrupt WUs.

But BOINC does have an option to abort WUs on the client when they are cancelled server-side. For this to work, you'll need:
1; Enable <send_result_abort> on the server.
Note, this will increase database load.
2; Users must run BOINC client v5.8.17 or later for auto-aborting to work.
The unconditional task abort will possibly work with v5.5.1 and later, but to be on the safe side, use v5.8.17 or later.

There's one additional problem: for clients to get the abort message, they need to contact the scheduling server, something they won't necessarily do if stuck on a particular corrupt WU. To ensure they contact the scheduling server:
3; Use, for example, <next_rpc_delay>86400</next_rpc_delay> server-side.
(users need to run v5.5.1 or later)

This means that, except for computers that are connected manually, they'll connect at least once per day. If there's another batch of corrupt WUs, the majority of computers will cancel them within 24 hours of their being cancelled server-side.

Now, #3 won't help in the current situation, where the corrupt WUs were released 14 days ago, but it will help the next time Rosetta@home releases a batch of "bad" WUs, or another bad application...
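
The version requirements and the 24-hour window can be expressed as a small check. This is only a sketch in Python, assuming client versions are reported as dotted strings such as "5.8.17"; the helper names are made up for illustration, and the thresholds are simply the ones quoted in this post.

```python
def parse_version(text):
    """Turn a dotted version string such as '5.8.17' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Thresholds as quoted in this post.
MIN_AUTO_ABORT = parse_version("5.8.17")      # client acts on server-side cancels
MIN_NEXT_RPC_DELAY = parse_version("5.5.1")   # client obeys <next_rpc_delay>

NEXT_RPC_DELAY_SECONDS = 86400  # example value from step 3


def client_capabilities(version_text):
    """Report which of the two features a client version should support."""
    version = parse_version(version_text)
    return {
        "auto_abort": version >= MIN_AUTO_ABORT,
        "next_rpc_delay": version >= MIN_NEXT_RPC_DELAY,
    }


def max_hours_between_contacts(delay_seconds=NEXT_RPC_DELAY_SECONDS):
    """Longest gap between scheduler contacts for a client that obeys the delay."""
    return delay_seconds / 3600


if __name__ == "__main__":
    for v in ("5.4.11", "5.5.1", "5.8.17", "5.10.45"):
        print(v, client_capabilities(v))
    print("hours between contacts:", max_hours_between_contacts())
```

Comparing tuples of integers rather than raw strings matters here: as strings, "5.10.45" would sort before "5.8.17", although it is the newer version.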

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."

©2024 University of Washington
https://www.bakerlab.org