Message boards : Number crunching : All tasks failed : finish file present too long
Author | Message |
---|---|
Huvecraft Joined: 17 Mar 20 Posts: 1 Credit: 1,079,557 RAC: 27 |
Hello, I have Rosetta running on my VPS. For the past few days, almost all of my tasks have failed with the error "finish file present too long". I do not understand the cause or how to fix it. Exit code: 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT Here is some stderr output: <core_client_version>7.6.33</core_client_version> <![CDATA[ <message> finish file present too long </message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.15_x86_64-pc-linux-gnu -score:weights hugh2020_HHH_rd4_0636_E18Y__HH_run17_weights_403.wts @hugh2020_HHH_rd4_0636_E18Y__HH_run17_flags_403 -frag3 00001.200.3mers -frag9 00001.200.9mers -abinitio::increase_cycles 10 -mute all -abinitio::fastrelax -relax::default_repeats 5 -abinitio::rsd_wt_helix 0.5 -abinitio::rsd_wt_loop 0.5 -abinitio::use_filters false -ex1 -ex2aro -in:file:boinc_wu_zip hugh2020_HHH_rd4_0636_E18Y__HH_run17_step403_fragments_fold_data.zip -out:file:silent default.out -silent_gz -mute all -in:file:fasta 00001.fasta -out:file:silent_struct_type binary -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 10000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2610974 Starting watchdog... Watchdog active. ====================================================== DONE :: 1 starting structures 28753.3 cpu seconds This process generated 131 decoys from 131 attempts ====================================================== BOINC :: WS_max 3.53542e+08 BOINC :: Watchdog shutting down... 10:22:19 (2462): called boinc_finish(0) </stderr_txt> ]]> Thanks! |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Your hosts are hidden. What version of BOINC are you running? Rosetta Moderator: Mod.Sense |
JohnDK Joined: 6 Apr 20 Posts: 33 Credit: 2,390,240 RAC: 0 |
The stderr output says 7.6.33, so he should upgrade to the latest version. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,950,321 RAC: 23,118 |
"I do not understand what is the cause," Lots of disk I/O contention. When a Task finishes, the data produced needs to be written to the result file to be uploaded to Rosetta, and then all of its files need to be cleaned up and removed so the next Task can start. If this doesn't happen fast enough for the BOINC Manager's liking, it will just clobber things. And even if the result produced was a Valid one, it will still end up being an Error. "and how can I hope to fix it..." Faster storage makes it less likely to occur, but with more & more cores & threads in CPUs these days it can still happen. As others have posted, upgrading to the latest version of BOINC Manager should sort it out (it's meant to have a fix for this particular issue). Grant Darwin NT |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Changes are also underway to share the large database directory structure across all R@h tasks on the same machine. One benefit of this will be that there is less in the directory structure of the WU that is completing that needs to be cleaned up. But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end. Rosetta Moderator: Mod.Sense |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
"But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end." It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet. If all 24 threads finish WU's within a very short time of each other, I still get the errors mentioned above even with the latest version of BOINC. I have to manually start/stop some WU's to spread them out a bit so that at most 5-7 finish at the same time. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet." You can use a write-cache (don't bother wasting memory on a read-cache). Windows: If you have a Samsung SSD, their Magician utility includes the "Rapid Mode cache" if you enable it. The Crucial drives have "Storage Executive" with a cache. For a larger cache, you can buy PrimoCache from Romex Software. Linux has its own built-in cache; you just need to set the size. 1 GB of cache and a 1/2 hour write-delay should work wonders; probably half that amount or even less would fix this problem; 5 minutes should be more than enough. https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
My dedicated 24 core machines are OS X. I do have one Win10 box that's a Xeon 24 core also but I have it set to 33% CPU as it's used for other tasks and is in a room with minimal ventilation, so I can't have it cranking out heat at 100% load all day every day. /edit. I guess technically all of them are 12c/24t. |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
Has anyone tried using a RAM drive for BOINC with Rosetta? Of course quite a lot of RAM would have to be allocated for it, but because this project does quite a lot of I/O during processing, replacing the HDD with a RAM drive could speed things up noticeably. EDIT: for example Run time 1 day 7 hours 29 min 57 sec (elapsed) CPU time 1 day 0 hours 0 min 43 sec (CPU) It's a dedicated 24/7 cruncher under Linux without any other tasks. Computation time is set to ~1 day to minimize startup overhead. And still there is a difference of more than 7 hours (!!!) between elapsed and CPU time. Currently this host has a quite slow 5400RPM HDD as storage. So it's a good example of how I/O influences performance here. |
manalog Joined: 8 Apr 15 Posts: 24 Credit: 233,155 RAC: 0 |
I am facing the same problem as Raistmer: a "warming up" lasting several minutes (30-40) before tasks start. I dedicated a Xeon L5420 to Rosetta 24/24, but I do not have a hard disk to use, so its only storage unit is a 16GB thumb drive (USB 2.0). I was thinking about the RAM disk too: which files does Rosetta need to access most frequently? If they are, let's say, 400MB big, then we could move them to a RAM disk without problems. totale 727M -rwxr-xr-x 1 boinc boinc 485M apr 23 19:23 database_357d5d93529_n_methyl.zip -rw-r--r-- 1 boinc boinc 701 apr 25 21:17 flags_il6r3 -rw-r--r-- 1 boinc boinc 343 apr 25 22:11 jgHE_b02_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_2qo6vn6b.flags -rw-r--r-- 1 boinc boinc 151K apr 25 22:11 jgHE_b02_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_2qo6vn6b.zip -rw-r--r-- 1 boinc boinc 12 apr 25 22:11 jgHE_b02.flags -rwxr-xr-x 1 boinc boinc 345K apr 23 19:13 LiberationSans-Regular.ttf -rw-r--r-- 1 boinc boinc 3,9K apr 25 22:11 Mini_Protein_binds_IL6R_COVID-19_1p9m_v2_0_SAVE_ALL_OUT_IGNORE_THE_REST_9iu9zq9t.flags -rw-r--r-- 1 boinc boinc 1,9M apr 25 22:11 Mini_Protein_binds_IL6R_COVID-19_1p9m_v2_0_SAVE_ALL_OUT_IGNORE_THE_REST_9iu9zq9t.zip -rw-r--r-- 1 boinc boinc 3,0K apr 25 21:21 Mini_Protein_binds_IL6R_COVID-19_1p9m_v2_4_SAVE_ALL_OUT_IGNORE_THE_REST_0pa6ea9e.flags -rw-r--r-- 1 boinc boinc 1,8M apr 25 21:20 Mini_Protein_binds_IL6R_COVID-19_1p9m_v2_4_SAVE_ALL_OUT_IGNORE_THE_REST_0pa6ea9e.zip -rw-r--r-- 1 boinc boinc 170K apr 25 21:21 r3x_3934_data.zip -rw-r--r-- 1 boinc boinc 181K apr 25 22:10 r4k_11675_data.zip -rw-r--r-- 1 boinc boinc 177K apr 25 22:11 r4k_13116_data.zip -rwxr-xr-x 1 boinc boinc 120M apr 23 19:15 rosetta_4.15_x86_64-pc-linux-gnu -rwxr-xr-x 1 boinc boinc 118M apr 25 21:24 rosetta_graphics_4.15_x86_64-pc-linux-gnu Is the database the one that causes trouble? How often is the database file updated? |
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0 |
I've been working the last couple days on a couple improvements. 1. Improving storage/access of the database, moving it into the projects directory. This is already finished and ready to test. 2. It's taking a bit longer because I am taking this opportunity to also add more frequent checkpoints to one of our protocols which has been causing long run time issues for some cases. I'll have a version out soon to test on Ralph@h once I confirm the checkpointing is working as it should. Thanks for everyone's feedback and patience, and sorry these types of improvements haven't come sooner. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,950,321 RAC: 23,118 |
"If they are, let's say, 400MB big, then we could move them on a ram disk without problems." They can be over 1GB in size. With the present Tasks, 6c/12t running, I've got 13GB of storage space in use. The work the project is doing at the moment should bring this down to around 1GB (maybe a bit more). Certainly a lot less than 13GB. Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,950,321 RAC: 23,118 |
"It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet." "You can use a write-cache (don't bother wasting memory on a read-cache)." If you've got enough RAM for the extra cache. Quite a few systems often have barely enough RAM for the number of Tasks they are running. Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,950,321 RAC: 23,118 |
"But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end." "It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet. If all 24 threads finish WU's within a very short time of each other I still get the errors mentioned above even with the latest version of BOINC." If you could post that with a copy of the Stderr output of a couple of those Tasks over at the BOINC forums, it would let them know there is still an issue & to check that the fix was actually included in the latest released version. And if so, that it needs further investigation. Grant Darwin NT |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
Moving the DB to the project dir, besides all the other performance improvements, will make it easier for the OS to cache drive accesses. Currently the OS can't tell that the same data in different slots really is the same, so it caches it again. With the files physically in one place on the drive they will really be cached, so reading the DB from one task will speed up reading the DB from another. EDIT: but how this will address the heavy I/O issue we still have to see. As I said, the 7-hour overhead was on a day-long task, not an 8h or 2h one. So startup time (which will definitely be improved) plays a diminishing role here; it's mostly I/O during task processing. And here, if the DB is accessed often, there will be a speedup too (far fewer cache evictions!). But if the heavy I/O lies somewhere else, the effect will not be as strong. We'll see. It may be worth using the Sysinternals tools to study Rosetta's file access pattern in more detail. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
"But the newer versions of BOINC have corrections to wait longer for clean up to complete as tasks end." "It's still a problem for large core-count machines. My 24 core Xeon machines have SSD storage and gigabit internet. If all 24 threads finish WU's within a very short time of each other I still get the errors mentioned above even with the latest version of BOINC." "If you could post that with a copy of the Stderr output of a couple of those Tasks over at the BOINC forums, it would let them know there is still an issue & to check that the fix was actually included in the latest released version. And if so, that it needs further investigation." I'll re-create the problem this coming Monday or Tuesday when I'm back in my office, as it's easy to replicate. I run next to no cache, and typically when I get in first thing I'll set my primary work machine to "no new tasks", so over time as it finishes up WU's, more and more CPU becomes available for other things I am doing outside BOINC. At the end of my work day it is mostly cleared, and as I leave I resume fetching tasks. This brings in 20-24 new ones simultaneously to crunch overnight, which in turn will typically all finish at about the same time, which is when the error happens. Over the weekend they tend to naturally space out enough that they don't all finish at once, and the problem is far less frequent. It's a rare issue that really only happens when you initiate a lot of tasks simultaneously, such as when first starting on Rosetta with lots of cores, and when all the work you fetch is basically the same style of job, so that it all finishes very close together. When you get jobs of different task types they tend to have less consistent endings, so you don't see it as much. I do not see the issue on my 8 core machines; the new version of BOINC seems to have fixed it. I got a 96GB RAM upgrade in the mail today for my primary machine, so that might change things though. Currently it's only running 32GB. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,950,321 RAC: 23,118 |
"I got a 96GB RAM upgrade in the mail today for my primary machine, so that might change things though. Currently it's only running 32GB." It wouldn't surprise me if it does. Disk caches are often a percentage of available RAM, the more RAM in use by applications & other system functions then the less there is for caching. So even if the cache has a maximum size, that extra RAM should allow for a larger write cache. Grant Darwin NT |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
"It wouldn't surprise me if it does." Actually I suspect it would make the problem worse. Presently with only 32GB I get frequent "waiting for memory" messages on tasks, which stops progress on them and actually helps with the spacing out of finish times. Having more memory will eliminate this and make more tasks run straight through to the end without stopping, causing them to finish simultaneously. Having 12+ tasks finish within 2 minutes of each other is where the problem arises. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1684 Credit: 17,950,321 RAC: 23,118 |
"It wouldn't surprise me if it does." "Actually I suspect it would make the problem worse. Presently with only 32GB I get frequent "waiting for memory" messages on tasks, which stops progress on them and actually helps with the spacing out of finish times. Having more memory will eliminate this and make more tasks run straight through to the end without stopping, causing them to finish simultaneously. Having 12+ tasks finish within 2 minutes of each other is where the problem arises." Or the extra RAM will add to the write cache. Will be interesting to see which way it falls. Grant Darwin NT |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
Separate unrelated issue, anyone know what would cause this?: https://boinc.bakerlab.org/rosetta/result.php?resultid=1162348400 Seems to be a one-off. The machine in question is running OS X, with 0 cache, dedicated, 24/7 operation. It's the first time I've seen this error, and looking around on the web for the error code provided a few answers that didn't seem to apply to me. |
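
On the Linux write-cache suggestion earlier in the thread (1 GB of cache with a half-hour write delay, or about half that size with a 5-minute delay): a rough sketch of how those two numbers might translate into the kernel's `vm.dirty_*` settings. The sysctl names are real kernel tunables, but the mapping is an assumption made for illustration, not settings anyone in the thread posted; `write_cache_sysctls` and `as_sysctl_conf` are hypothetical helper names.

```python
# Sketch: translate "cache size + write delay" into Linux vm.dirty_* values.
# The mapping is an assumption for illustration, not settings from this thread.


def write_cache_sysctls(cache_mib, delay_minutes):
    """Return candidate vm.dirty_* values for a cache size and write delay."""
    cache_bytes = cache_mib * 1024 * 1024
    delay_centisecs = delay_minutes * 60 * 100  # the kernel counts in 1/100 s
    return {
        # Hard limit: a writing process is throttled once this much is dirty.
        "vm.dirty_bytes": cache_bytes,
        # Background flushing starts here (assumed: half of the hard limit).
        "vm.dirty_background_bytes": cache_bytes // 2,
        # Dirty data older than this becomes eligible for writeback.
        "vm.dirty_expire_centisecs": delay_centisecs,
    }


def as_sysctl_conf(settings):
    """Format the settings as 'key = value' lines for a sysctl config file."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


if __name__ == "__main__":
    # 1 GB cache with a 30-minute delay, then the smaller 5-minute variant.
    print(as_sysctl_conf(write_cache_sysctls(1024, 30)))
    print()
    print(as_sysctl_conf(write_cache_sysctls(512, 5)))
```

Setting `vm.dirty_bytes` takes the place of `vm.dirty_ratio` (the kernel uses whichever was written last), and anything still held in the write cache is lost if the machine loses power, so the shorter delay may be the safer starting point.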
©2024 University of Washington
https://www.bakerlab.org