No "finished" file

Message boards : Number crunching : No "finished" file

Ed Machak

Joined: 10 Nov 16
Posts: 7
Credit: 17,339,411
RAC: 0
Message 91675 - Posted: 12 Feb 2020, 1:33:20 UTC

For some time now the following types of message have shown up in the event log.

2/11/2020 7:47:32 PM | Rosetta@home | Task ennist_dLHa_0001_0001_0008_loop_0001_0001_fold_SAVE_ALL_OUT_891273_110_0 exited with zero status but no 'finished' file
2/11/2020 7:47:32 PM | Rosetta@home | If this happens repeatedly you may need to reset the project.


I have reset the project but these messages persist.

Can this error report be ignored or is it indicating I'll get no credit for the work?

Ed Machak
ID: 91675 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 25,992,337
RAC: 13,353
Message 91679 - Posted: 12 Feb 2020, 9:09:18 UTC
Last modified: 12 Feb 2020, 9:28:24 UTC

This error has been around for years now.
It happens from time to time, and there is no clear way to fix it.

There is no need to do a full reset of the project.
A simple BOINC restart (a full restart of the client, not just the Manager, i.e. the GUI) or a computer reboot fixes it too. But it will return again after some time.
ID: 91679 · Rating: 0
James W

Joined: 25 Nov 12
Posts: 130
Credit: 1,766,254
RAC: 0
Message 91721 - Posted: 17 Feb 2020, 3:39:13 UTC - in response to Message 91679.  

This error has been around for years now.
It happens from time to time, and there is no clear way to fix it.

Agreed, this error is a pain, and to me it means the task will take longer to process. The frequency of the error seems to depend on the type of file being crunched. I only see it when using the Rosetta Mini v3.78 application.
ID: 91721 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 91723 - Posted: 17 Feb 2020, 9:38:58 UTC - in response to Message 91675.  

2/11/2020 7:47:32 PM | Rosetta@home | Task ennist_dLHa_0001_0001_0008_loop_0001_0001_fold_SAVE_ALL_OUT_891273_110_0 exited with zero status but no 'finished' file

I seem to recall that this is the error where the disk drive cannot write the results fast enough for them to be available when needed (or that may be a different error).
A write cache or an SSD could help.
ID: 91723 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 25,992,337
RAC: 13,353
Message 91744 - Posted: 19 Feb 2020, 12:38:48 UTC
Last modified: 19 Feb 2020, 12:41:14 UTC

Yes, it is somehow related to disk speed. It occurs much less frequently on SSDs, but it still occurs sometimes even there.
On an HDD with a lot of concurrent R@H WUs running, it happens much more often.

It looks like the root of the problem is a really old bug somewhere in the Rosetta software that causes the app to crash if it cannot write to disk immediately, instead of just waiting a few seconds while the disk is busy handling other requests.
But the devs do not bother to track it down and fix it, so it has kept crashing the app and wasting the generated results for years now.
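
To illustrate the "wait a few seconds instead of crashing" behaviour described above, here is a minimal Python sketch of a write that retries with a growing delay. It is not Rosetta's code, and whether the application really fails this way is only the guess made in this post. The function name `write_with_retry`, the `.tmp` suffix, and the default retry counts are all made up for the example.

```python
import os
import time


def write_with_retry(path, data, attempts=5, base_delay=0.5):
    """Write bytes to path, retrying with a growing delay if the write fails.

    Returns the number of attempts used. Re-raises the last OSError if
    every attempt fails. (Hypothetical helper, for illustration only.)
    """
    tmp_path = path + ".tmp"
    for attempt in range(1, attempts + 1):
        try:
            with open(tmp_path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # ask the OS to push the data to disk
            os.replace(tmp_path, path)  # rename over the target in one step
            return attempt
        except OSError:
            if attempt == attempts:
                raise
            # Back off before trying again: 0.5 s, 1 s, 2 s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Writing to a temporary file and renaming it means a reader never sees a half-written result file; the retry loop only gives up after the last attempt.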

Moving data to SSDs, enabling the disk write cache, reducing max_concurrent running tasks, etc. are all just partial workarounds: they help mitigate the problem, but not 100%, and they do not fix the problem itself.
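
For the "reduce concurrent tasks" workaround, one common approach is an `app_config.xml` file in the project's folder inside the BOINC data directory. The sketch below generates such a file in Python. It assumes a reasonably recent BOINC client that understands the `project_max_concurrent` element; the helper name `build_app_config` and the limit of 4 are only examples, and the right number depends on your disk and core count.

```python
def build_app_config(max_tasks):
    """Return app_config.xml text that caps running tasks for one project."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    return (
        "<app_config>\n"
        f"    <project_max_concurrent>{max_tasks}</project_max_concurrent>\n"
        "</app_config>\n"
    )


if __name__ == "__main__":
    # Example value only: allow at most 4 tasks of this project at once.
    print(build_app_config(4), end="")
```

Save the output as `app_config.xml` in the Rosetta@home project folder, then restart the client or have it re-read its config files so the limit takes effect.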
ID: 91744 · Rating: 0
Trotador

Joined: 30 May 09
Posts: 108
Credit: 291,214,977
RAC: 1
Message 91746 - Posted: 19 Feb 2020, 13:20:02 UTC - in response to Message 91744.  

Yes, it is somehow related to disk speed. It occurs much less frequently on SSDs, but it still occurs sometimes even there.
On an HDD with a lot of concurrent R@H WUs running, it happens much more often.

It looks like the root of the problem is a really old bug somewhere in the Rosetta software that causes the app to crash if it cannot write to disk immediately, instead of just waiting a few seconds while the disk is busy handling other requests.
But the devs do not bother to track it down and fix it, so it has kept crashing the app and wasting the generated results for years now.

Moving data to SSDs, enabling the disk write cache, reducing max_concurrent running tasks, etc. are all just partial workarounds: they help mitigate the problem, but not 100%, and they do not fix the problem itself.


I had in mind that the above recommendations were related to another classic BOINC error, "finish file present too long", which is more prone to occur on hosts with many cores/threads.
ID: 91746 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 91751 - Posted: 19 Feb 2020, 14:23:42 UTC - in response to Message 91746.  
Last modified: 19 Feb 2020, 14:29:22 UTC

I had in mind that the above recommendations were related to another classic BOINC error, "finish file present too long", which is more prone to occur on hosts with many cores/threads.

That could be it, or it may be both.

SSDs are not always fast. They usually write quickly, but sometimes a write occurs while they are trying to do garbage collection or consolidate blocks or whatever they do. In those cases, the writes can be delayed for a half-second or so (you sometimes can see them pause a desktop app). But I use a write-cache on all of my machines. That was originally for protecting the SSDs from the high write-rates of some projects, but it happens to solve a variety of other problems too.

In Windows, I use the Samsung Magician utility that includes a small cache (about 1 GB), or else I use PrimoCache for larger ones with longer write-delays.
Linux includes its own cache; I just use the commands to increase its size and duration, usually to at least 4 GB and 30 minutes.

I don't recall seeing either of those errors for a while.
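
The post does not say which Linux commands were used. One plausible way to get "at least 4 GB and 30 minutes" is through the kernel's `vm.dirty_*` sysctl settings, and the Python sketch below just works out the numbers. The choice of keys, the helper names, and setting the background threshold to half the limit are assumptions for illustration, not the poster's actual configuration.

```python
def writeback_settings(cache_gib=4, delay_minutes=30):
    """Return sysctl keys and values for a larger, slower-flushing write cache."""
    cache_bytes = cache_gib * 1024 ** 3
    return {
        # Upper limit on dirty (not yet written) data before writers must wait.
        "vm.dirty_bytes": cache_bytes,
        # Background flushing starts at this level; half is an arbitrary choice.
        "vm.dirty_background_bytes": cache_bytes // 2,
        # Dirty data older than this (in 1/100 s) becomes eligible for writing.
        "vm.dirty_expire_centisecs": delay_minutes * 60 * 100,
    }


def to_sysctl_conf(settings):
    """Format the settings as lines for a sysctl configuration file."""
    return "".join(f"{key} = {value}\n" for key, value in settings.items())


if __name__ == "__main__":
    print(to_sysctl_conf(writeback_settings()), end="")
```

Keep in mind that data held in the cache for a long time is lost if the machine loses power before it is written out, so settings like these suit hosts where losing a few minutes of work is acceptable.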
ID: 91751 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org