All tasks failed : finish file present too long

Message boards : Number crunching : All tasks failed : finish file present too long

Huvecraft

Joined: 17 Mar 20
Posts: 1
Credit: 1,079,557
RAC: 27
Message 95300 - Posted: 24 Apr 2020, 12:39:26 UTC

Hello,

I have Rosetta running on my VPS. For the past few days, almost all of my tasks have failed with the error "finish file present too long"...
I do not understand what is the cause, and how can I hope to fix it...

Exit code : 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT

Here is some stderr output:

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
finish file present too long
</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.15_x86_64-pc-linux-gnu -score:weights hugh2020_HHH_rd4_0636_E18Y__HH_run17_weights_403.wts @hugh2020_HHH_rd4_0636_E18Y__HH_run17_flags_403 -frag3 00001.200.3mers -frag9 00001.200.9mers -abinitio::increase_cycles 10 -mute all -abinitio::fastrelax -relax::default_repeats 5 -abinitio::rsd_wt_helix 0.5 -abinitio::rsd_wt_loop 0.5 -abinitio::use_filters false -ex1 -ex2aro -in:file:boinc_wu_zip hugh2020_HHH_rd4_0636_E18Y__HH_run17_step403_fragments_fold_data.zip -out:file:silent default.out -silent_gz -mute all -in:file:fasta 00001.fasta -out:file:silent_struct_type binary -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 10000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2610974
Starting watchdog...
Watchdog active.
======================================================
DONE :: 1 starting structures 28753.3 cpu seconds
This process generated 131 decoys from 131 attempts
======================================================
BOINC :: WS_max 3.53542e+08

BOINC :: Watchdog shutting down...
10:22:19 (2462): called boinc_finish(0)

</stderr_txt>
]]>

Thanks !
ID: 95300 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 95312 - Posted: 24 Apr 2020, 15:58:18 UTC - in response to Message 95300.  

Your hosts are hidden. What version of BOINC are you running?
Rosetta Moderator: Mod.Sense
ID: 95312 · Rating: 0
JohnDK
Joined: 6 Apr 20
Posts: 33
Credit: 2,390,240
RAC: 0
Message 95315 - Posted: 24 Apr 2020, 17:53:05 UTC

The stderr output says 7.6.33, so he should upgrade to the latest version.
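
A minimal sketch of pulling the client version out of a task's stderr output, assuming Python is at hand. The function name `client_version` is made up for illustration, and which release counts as "latest" is left to the reader.

```python
import re

def client_version(stderr_text):
    """Return the BOINC client version as a tuple of ints, or None if the tag is absent."""
    match = re.search(
        r"<core_client_version>(\d+(?:\.\d+)*)</core_client_version>",
        stderr_text,
    )
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))

# The tag as it appears in the stderr output posted above.
sample = "<core_client_version>7.6.33</core_client_version>"
print(client_version(sample))
```

Comparing tuples of ints rather than raw strings avoids the trap where "7.6.33" sorts after "7.16.6" as text.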
ID: 95315 · Rating: 0
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1684
Credit: 17,950,321
RAC: 23,118
Message 95321 - Posted: 24 Apr 2020, 21:05:01 UTC - in response to Message 95300.  
Last modified: 24 Apr 2020, 21:05:32 UTC

I do not understand what is the cause,
Lots of disk I/O contention. When a Task finishes, the data it produced needs to be written to the result file to be uploaded to Rosetta & then all of its files need to be cleaned up & removed so the next Task can start. If this doesn't happen fast enough for the BOINC Manager's liking, it will just clobber things.
And even if the result produced was a Valid one, it will still end up being an Error.



and how can I hope to fix it...
Faster storage makes it less likely to occur, but with more & more cores & threads in CPUs these days it can still happen.
As others have posted, upgrading to the latest version of BOINC Manager should sort it out (it's meant to have a fix for this particular issue).
Grant
Darwin NT
ID: 95321 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 95335 - Posted: 25 Apr 2020, 4:31:38 UTC

Changes are also underway to share the large database directory structure across all R@h tasks on the same machine. One benefit of this will be that there is less in the directory structure of the WU that is completing that needs to be cleaned up.

But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end.
Rosetta Moderator: Mod.Sense
ID: 95335 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95351 - Posted: 25 Apr 2020, 15:04:20 UTC - in response to Message 95335.  
Last modified: 25 Apr 2020, 15:04:50 UTC

But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end.


It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet. If all 24 threads finish WU's within a very short time of each other I still get the errors mentioned above even with the latest version of BOINC. I have to manually start/stop some WU's to spread them out a bit so at most 5-7 finish at the same time.
ID: 95351 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 95354 - Posted: 25 Apr 2020, 15:51:27 UTC - in response to Message 95351.  
Last modified: 25 Apr 2020, 15:52:33 UTC

It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet.

You can use a write-cache (don't bother wasting memory on a read-cache).

Windows: If you have a Samsung SSD, their Magician utility includes the "Rapid Mode cache" if you enable it.
The Crucial drives have "Storage Executive" with a cache.
For a larger cache, you can buy PrimoCache from Romex Software

Linux has its own built-in cache, you just need to set the size. 1 GB of cache and 1/2 hour write-delay should work wonders;
probably half that amount or even less would fix this problem; 5 minutes should be more than enough.
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
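
As a rough sketch of what "set the size" could look like: the Linux `vm.dirty_bytes`, `vm.dirty_background_bytes` and `vm.dirty_expire_centisecs` sysctls control how much unwritten data may pile up in RAM and how old it may get before it is written out. Mapping "1 GB of cache and 1/2 hour write-delay" onto those three settings is an assumption, as are the helper name `writeback_sysctls` and the 50% background fraction.

```python
def writeback_sysctls(cache_bytes, write_delay_seconds, background_fraction=0.5):
    """Translate a cache size and a write delay into Linux vm.dirty_* values.

    background_fraction is an illustrative choice: the point (as a share of
    cache_bytes) at which background writeback starts.
    """
    if cache_bytes <= 0 or write_delay_seconds <= 0:
        raise ValueError("cache size and write delay must be positive")
    if not 0 < background_fraction < 1:
        raise ValueError("background_fraction must be between 0 and 1")
    return {
        "vm.dirty_bytes": int(cache_bytes),
        "vm.dirty_background_bytes": int(cache_bytes * background_fraction),
        # The kernel expresses this age limit in hundredths of a second.
        "vm.dirty_expire_centisecs": int(write_delay_seconds * 100),
    }

# 1 GB of cache and a half-hour write delay, as suggested above.
settings = writeback_sysctls(1 * 1024**3, 30 * 60)
for key, value in settings.items():
    print(f"{key} = {value}")
```

The printed lines are in the `key = value` form used by sysctl configuration files. Keep in mind that anything still sitting in the write cache is lost if the machine loses power.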
ID: 95354 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95357 - Posted: 25 Apr 2020, 16:48:45 UTC - in response to Message 95354.  
Last modified: 25 Apr 2020, 16:50:52 UTC


You can use a write-cache (don't bother wasting memory on a read-cache).


My dedicated 24 core machines are OS X. I do have one Win10 box that's a Xeon 24 core also but I have it set to 33% CPU as it's used for other tasks and is in a room with minimal ventilation, so I can't have it cranking out heat at 100% load all day every day.

/edit. I guess technically all of them are 12c/24t.
ID: 95357 · Rating: 0
Raistmer

Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 95362 - Posted: 25 Apr 2020, 19:47:40 UTC
Last modified: 25 Apr 2020, 19:56:27 UTC

Has anyone tried using a RAM drive for BOINC with Rosetta? Of course, quite a lot of RAM would have to be allocated to it, but because this project does quite a lot of I/O during processing, replacing the HDD with a RAM drive could speed things up noticeably.

EDIT: for example

Run time 1 day 7 hours 29 min 57 sec (elapsed)
CPU time 1 day 0 hours 0 min 43 sec (CPU)

This is a dedicated cruncher running 24/7 under Linux without any other tasks.
Computation time is set to ~1 day to minimize startup overhead.
And still there is a 7-hour (!!!) difference between elapsed and CPU time.
Currently this host has a quite slow 5400 RPM HDD as storage, so it's a good example of how I/O influences performance here.
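
To put numbers on the gap, here is a small sketch using the run time and CPU time quoted above. The helper name `cpu_efficiency` is made up, and treating the whole gap as time spent waiting on I/O is an assumption, even on a dedicated host.

```python
from datetime import timedelta

# Figures from the task quoted above.
elapsed = timedelta(days=1, hours=7, minutes=29, seconds=57)
cpu = timedelta(days=1, hours=0, minutes=0, seconds=43)

def cpu_efficiency(elapsed, cpu):
    """Fraction of wall-clock time the task actually spent on the CPU."""
    return cpu / elapsed  # dividing two timedeltas yields a float

waiting = elapsed - cpu
print(waiting)                                # 7:29:14
print(f"{cpu_efficiency(elapsed, cpu):.1%}")  # 76.2%
```

So roughly a quarter of the wall-clock time on that task was not spent computing.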
ID: 95362 · Rating: 0
manalog

Joined: 8 Apr 15
Posts: 24
Credit: 233,155
RAC: 0
Message 95367 - Posted: 25 Apr 2020, 21:52:30 UTC - in response to Message 95362.  
Last modified: 25 Apr 2020, 22:01:35 UTC

I am facing the same problem as Raistmer: a "warming up" lasting several minutes (30-40) before tasks start.
I dedicated a Xeon L5420 to Rosetta 24/7, but I do not have a hard disk to use, so its only storage unit is a 16GB thumb drive (USB 2.0). I was thinking about a RAM disk too: which files does Rosetta need to access most frequently? If they are, let's say, 400MB big, then we could move them on a ram disk without problems.
total 727M
-rwxr-xr-x 1 boinc boinc 485M apr 23 19:23 database_357d5d93529_n_methyl.zip
-rw-r--r-- 1 boinc boinc  701 apr 25 21:17 flags_il6r3
-rw-r--r-- 1 boinc boinc  343 apr 25 22:11 jgHE_b02_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_2qo6vn6b.flags
-rw-r--r-- 1 boinc boinc 151K apr 25 22:11 jgHE_b02_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_2qo6vn6b.zip
-rw-r--r-- 1 boinc boinc   12 apr 25 22:11 jgHE_b02.flags
-rwxr-xr-x 1 boinc boinc 345K apr 23 19:13 LiberationSans-Regular.ttf
-rw-r--r-- 1 boinc boinc 3,9K apr 25 22:11 Mini_Protein_binds_IL6R_COVID-19_1p9m_v2_0_SAVE_ALL_OUT_IGNORE_THE_REST_9iu9zq9t.flags
-rw-r--r-- 1 boinc boinc 1,9M apr 25 22:11 Mini_Protein_binds_IL6R_COVID-19_1p9m_v2_0_SAVE_ALL_OUT_IGNORE_THE_REST_9iu9zq9t.zip
-rw-r--r-- 1 boinc boinc 3,0K apr 25 21:21 Mini_Protein_binds_IL6R_COVID-19_1p9m_v2_4_SAVE_ALL_OUT_IGNORE_THE_REST_0pa6ea9e.flags
-rw-r--r-- 1 boinc boinc 1,8M apr 25 21:20 Mini_Protein_binds_IL6R_COVID-19_1p9m_v2_4_SAVE_ALL_OUT_IGNORE_THE_REST_0pa6ea9e.zip
-rw-r--r-- 1 boinc boinc 170K apr 25 21:21 r3x_3934_data.zip
-rw-r--r-- 1 boinc boinc 181K apr 25 22:10 r4k_11675_data.zip
-rw-r--r-- 1 boinc boinc 177K apr 25 22:11 r4k_13116_data.zip
-rwxr-xr-x 1 boinc boinc 120M apr 23 19:15 rosetta_4.15_x86_64-pc-linux-gnu
-rwxr-xr-x 1 boinc boinc 118M apr 25 21:24 rosetta_graphics_4.15_x86_64-pc-linux-gnu


Is the database the one that causes trouble? How often is the database file updated?
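
A small sketch for sizing a RAM disk: it lists the biggest files in a directory so you can see what would have to fit. It only looks at file size, not at how often a file is read, and the function name `largest_files` is made up for illustration. Point it at the project directory instead of "." to reproduce a listing like the one above.

```python
import os

def largest_files(directory, limit=5):
    """Return up to `limit` (size_in_bytes, name) pairs, largest first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            # Only regular files; symlinks are not followed.
            if entry.is_file(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                entries.append((size, entry.name))
    entries.sort(reverse=True)
    return entries[:limit]

for size, name in largest_files("."):
    print(f"{size / 1024**2:8.1f} MiB  {name}")
```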
ID: 95367 · Rating: 0
Admin
Project administrator

Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 95368 - Posted: 25 Apr 2020, 22:11:31 UTC

I've been working the last couple days on a couple improvements.

1. Improving storage/access of the database, moving it into the projects directory. This is already finished and ready to test.
2. It's taking a bit longer because I am taking this opportunity to also add more frequent checkpoints to one of our protocols which has been causing long run time issues for some cases.

I'll have a version out soon to test on Ralph@h once I confirm the checkpointing is working as it should.

Thanks for everyone's feedback and patience, and sorry these types of improvements haven't come sooner.
ID: 95368 · Rating: 0
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1684
Credit: 17,950,321
RAC: 23,118
Message 95373 - Posted: 25 Apr 2020, 23:10:17 UTC - in response to Message 95367.  
Last modified: 25 Apr 2020, 23:21:30 UTC

If they are, let's say, 400MB big, then we could move them on a ram disk without problems.
They can be over 1GB in size.
With the present Tasks, 6c/12t running, I've got 13GB of storage space in use.

The work the project is doing at the moment should bring this down to around 1GB (maybe a bit more). Certainly a lot less than 13GB.
Grant
Darwin NT
ID: 95373 · Rating: 0
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1684
Credit: 17,950,321
RAC: 23,118
Message 95374 - Posted: 25 Apr 2020, 23:12:21 UTC - in response to Message 95354.  
Last modified: 25 Apr 2020, 23:20:47 UTC

It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet.
You can use a write-cache (don't bother wasting memory on a read-cache).
If you've got enough RAM for the extra cache.
Quite a few systems often have barely enough RAM for the number of Tasks they are running.
Grant
Darwin NT
ID: 95374 · Rating: 0
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1684
Credit: 17,950,321
RAC: 23,118
Message 95375 - Posted: 25 Apr 2020, 23:20:32 UTC - in response to Message 95351.  

But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end.
It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet. If all 24 threads finish WU's within a very short time of each other I still get the errors mentioned above even with the latest version of BOINC.
If you could post that with a copy of the Stderr output of a couple of those Tasks over at the BOINC forums, it would let them know there is still an issue & to check that the fix was actually included in the latest released version. And if so, that it needs further investigation.
Grant
Darwin NT
ID: 95375 · Rating: 0
Raistmer

Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 95377 - Posted: 25 Apr 2020, 23:26:11 UTC - in response to Message 95374.  
Last modified: 25 Apr 2020, 23:33:43 UTC

Moving the DB to the project dir, besides all the other performance improvements, will make it easier for the OS to cache drive accesses. Currently the OS can't tell that the same data in different slots really is the same, so it re-caches it.
With the files physically in one place on the drive, they will really be cached, so reading the DB from one task will speed up reading the DB from another.

EDIT: but we still have to see how this will address the heavy I/O issue.
As I said, the 7-hour overhead was on a day-long task, not an 8h or 2h one. So startup time (which will definitely be improved) plays a diminishing role here; it's mostly I/O throughout task processing.
And if the DB is accessed often, there will be a speedup here too (far fewer cache evictions!). But if the heavy I/O lies somewhere else, the effect will not be so strong. We will see. It may be worth using the Sysinternals tools to study Rosetta's file access pattern in more detail.
ID: 95377 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95383 - Posted: 26 Apr 2020, 1:34:12 UTC - in response to Message 95375.  

But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end.
It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet. If all 24 threads finish WU's within a very short time of each other I still get the errors mentioned above even with the latest version of BOINC.
If you could post that with a copy of the Stderr output of a couple of those Tasks over at the BOINC forums, it would let them know there is still an issue & to check that the fix was actually included in the latest released version. And if so, that it needs further investigation.


I'll re-create the problem Monday or Tuesday coming up when I'm back in my office as it's easy to replicate. I run next to no cache, and typically when I get in first thing I'll set my primary work machine to "no new tasks", so over time as it finishes up WU's, more and more CPU becomes available for other things I was doing outside BOINC. At the end of my work day it would be mostly cleared, and as I leave I resume fetching tasks. This brings in 20-24 new ones simultaneously to crunch overnight, which in turn will typically all finish about the same time, which is when the error will happen. Over the weekend they tend to naturally space out enough so they don't all finish at once and the problem is far less frequent. It's a rare issue that only happens really when you initiate a lot of tasks simultaneously, such as when initially starting onto Rosetta the first time with lots of cores, and if all the work you fetch is basically the same style of job, that all finishes very close to each other.

When you get jobs of different task types they tend to have less consistent endings, so you don't see it as much. I do not see the issue on my 8 core machines, the new version of BOINC seems to have fixed it.

I got a 96GB RAM upgrade in the mail today for my primary machine, so that might change things though. Currently it's only running 32GB.
ID: 95383 · Rating: 0
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1684
Credit: 17,950,321
RAC: 23,118
Message 95384 - Posted: 26 Apr 2020, 2:34:15 UTC - in response to Message 95383.  

I got a 96GB RAM upgrade in the mail today for my primary machine, so that might change things though. Currently it's only running 32GB.
It wouldn't surprise me if it does. Disk caches are often a percentage of available RAM, the more RAM in use by applications & other system functions then the less there is for caching. So even if the cache has a maximum size, that extra RAM should allow for a larger write cache.
Grant
Darwin NT
ID: 95384 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95405 - Posted: 26 Apr 2020, 17:32:43 UTC - in response to Message 95384.  

It wouldn't surprise me if it does.


Actually I suspect it would make the problem worse. Presently with only 32GB I get frequent "waiting for memory" messages on tasks, which stops progress on them and actually helps with the spacing out of finish times. Having more memory will eliminate this and make more tasks run straight through to the end without stopping, causing them to finish simultaneously. Having 12+ tasks finish within 2 minutes of each other is where the problem arises.
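
A quick sketch for checking how tightly a batch of tasks is bunched up, assuming you have their finish times as seconds (the sample values below are invented). It reports the largest number of tasks finishing inside any 2-minute window, which can be compared against the 12+ figure above. The function name `max_finishes_in_window` is made up for illustration.

```python
def max_finishes_in_window(finish_times, window_seconds=120):
    """Largest number of tasks whose finish times fall within one window."""
    times = sorted(finish_times)
    best = 0
    start = 0
    for end, finish in enumerate(times):
        # Slide the left edge forward until the window fits again.
        while finish - times[start] > window_seconds:
            start += 1
        best = max(best, end - start + 1)
    return best

# Invented finish times in seconds: five tasks land within two minutes.
finish_times = [0, 30, 45, 100, 119, 500, 560, 2000]
print(max_finishes_in_window(finish_times))  # 5
```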
ID: 95405 · Rating: 0
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 1684
Credit: 17,950,321
RAC: 23,118
Message 95413 - Posted: 27 Apr 2020, 5:16:50 UTC - in response to Message 95405.  

It wouldn't surprise me if it does.
Actually I suspect it would make the problem worse. Presently with only 32GB I get frequent "waiting for memory" messages on tasks, which stops progress on them and actually helps with the spacing out of finish times. Having more memory will eliminate this and make more tasks run straight through to the end without stopping, causing them to finish simultaneously. Having 12+ tasks finish within 2 minutes of each other is where the problem arises.
Or the extra RAM will add to the write cache.
Will be interesting to see which way it falls.
Grant
Darwin NT
ID: 95413 · Rating: 0
CIA

Joined: 3 May 07
Posts: 100
Credit: 21,059,812
RAC: 0
Message 95421 - Posted: 27 Apr 2020, 13:15:30 UTC - in response to Message 95413.  

Separate unrelated issue, anyone know what would cause this?: https://boinc.bakerlab.org/rosetta/result.php?resultid=1162348400

Seems to be a one-off. The machine in question is running OS X, with 0 cache, dedicated, 24/7 operation. It's the first time I've seen this error, and looking around on the web for the error code provided a few answers that didn't seem to apply to me.
ID: 95421 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org