Message boards : Number crunching : too many instances of rosetta
Author | Message |
---|---|
Mugurel Joined: 17 Jan 07 Posts: 3 Credit: 27,424,126 RAC: 0 |
Hi, Has anyone else had difficulties with Rosetta when too many instances run at the same time? I have problems on 48-core, 64-bit Linux platforms. The same setup runs fine on 4- and 8-core systems. With more than 8 cores I get into trouble: the nodes become less responsive, and many instances of Rosetta are started and then stopped, then others get started, and so on. There is nothing useful in the logs: 29-Mar-2011 18:15:20 [rosetta@home] Restarting task mem_widd_run03_centroid_A_2kdc_SAVE_ALL_OUT_IGNORE_THE_REST_22158_915157_0 using minirosetta version 217 29-Mar-2011 18:15:20 [rosetta@home] Task mem_widd_run03_centroid_A_1zll_SAVE_ALL_OUT_IGNORE_THE_REST_22158_914948_0 exited with zero status but no 'finished' file 29-Mar-2011 18:15:20 [rosetta@home] If this happens repeatedly you may need to reset the project. 29-Mar-2011 18:15:23 [rosetta@home] Restarting task mem_widd_run03_centroid_A_1zll_SAVE_ALL_OUT_IGNORE_THE_REST_22158_914948_0 using minirosetta version 217 29-Mar-2011 18:15:23 [rosetta@home] Task mem_widd_run03_centroid_A_2hac_SAVE_ALL_OUT_IGNORE_THE_REST_22158_603074_0 exited with zero status but no 'finished' file 29-Mar-2011 18:15:23 [rosetta@home] If this happens repeatedly you may need to reset the project. A reset would not help. The BOINC Manager is not really able to get in touch with the BOINC client; it gets stuck on the small window: "Please wait, communicating with the client." It is only fine when Rosetta runs just a few instances and the rest of the CPUs are occupied by other processes. I have no similar issues with any other project (I am attached to almost 40 projects). Does anyone have a clue about this? Thank you. Ionel |
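To get a feel for how many work units are affected, the client log can be scanned for the message quoted above. The following is only a minimal sketch: the function name `count_unfinished_exits` is made up for illustration, and the pattern assumes the message wording stays exactly as in the excerpt from this post.

```python
import re
from collections import Counter

# Matches the BOINC client message quoted in the post above.
# The wording is taken from the log excerpt; other client versions may differ.
NO_FINISHED = re.compile(
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)

def count_unfinished_exits(log_text):
    """Count, per task name, how often the 'no finished file' message appears."""
    return Counter(m.group("task") for m in NO_FINISHED.finditer(log_text))

# Three lines copied from the log excerpt in the post.
sample = (
    "29-Mar-2011 18:15:20 [rosetta@home] Restarting task mem_widd_run03_centroid_A_2kdc_SAVE_ALL_OUT_IGNORE_THE_REST_22158_915157_0 using minirosetta version 217\n"
    "29-Mar-2011 18:15:20 [rosetta@home] Task mem_widd_run03_centroid_A_1zll_SAVE_ALL_OUT_IGNORE_THE_REST_22158_914948_0 exited with zero status but no 'finished' file\n"
    "29-Mar-2011 18:15:23 [rosetta@home] Task mem_widd_run03_centroid_A_2hac_SAVE_ALL_OUT_IGNORE_THE_REST_22158_603074_0 exited with zero status but no 'finished' file\n"
)

counts = count_unfinished_exits(sample)
print(len(counts), "tasks exited without a 'finished' file")
```

Lines that only say "Restarting task" are ignored, so the count reflects failed exits rather than restarts.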
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Rosetta uses more memory per active WU than most other BOINC projects. But it looks like you've got about 2.5GB per core on this machine, which should be plenty. Is that the machine you are talking about? Oh, now I see you've got SEVERAL machines with 48 cores, in a variety of configurations... I'm guessing the machine you are seeing delayed responsiveness on has less memory. Basically, when you get all 48 cores fired up, the machine is probably doing considerable page swapping, which slows it down. My prior comments and suggestions on helping a machine run well with less memory should pertain to your situation as well. Rosetta Moderator: Mod.Sense |
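The "memory per core" and "is it swapping" questions can be checked from `/proc/meminfo` on Linux. This is a rough sketch, not a diagnostic tool: the helper names are invented here, and the sample numbers are hypothetical values chosen to match the roughly 2.5GB per core mentioned above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info

def memory_summary(meminfo_text, cores):
    """Return (GiB of RAM per core, kB of swap currently in use)."""
    info = parse_meminfo(meminfo_text)
    gib_per_core = info["MemTotal"] / (1024 * 1024) / cores
    swap_used_kb = info["SwapTotal"] - info["SwapFree"]
    return gib_per_core, swap_used_kb

# Hypothetical machine: 120 GiB of RAM, no swap configured, 48 cores.
sample = (
    "MemTotal:       125829120 kB\n"
    "MemFree:         70000000 kB\n"
    "SwapTotal:              0 kB\n"
    "SwapFree:               0 kB\n"
)

gib_per_core, swap_used_kb = memory_summary(sample, cores=48)
print(f"{gib_per_core:.2f} GiB per core, {swap_used_kb} kB of swap in use")
```

On a real node, the text would come from reading `/proc/meminfo` and the core count from `os.cpu_count()`.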
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
I could be wrong, but I think another consideration with memory on systems that crunch for multiple projects is this: while you may size your memory configuration for the maximum number of tasks expected (for example, 1 GB per core), if you are crunching for more than one project **AND** you have the "keep tasks in memory while suspended" option set, your actual memory usage will climb through the roof as you toggle between tasks. For example, I don't think that memory used for tasks executing for "Project A" is released when the system switches over to execute tasks for "Project B" as defined in your BOINC Processor Usage preferences. Mod.Sense, do you have any insight into this? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The BOINC setting to keep tasks "in memory" is rather poorly worded. The way operating systems manage memory, any task that is not getting CPU will gradually have its memory pages "stolen" and swapped out as memory is needed to run active work (assuming there is some active work running to demand memory). So, the BOINC setting actually refers to VIRTUAL memory in your page file. It is basically just asking "should I throw away the work since the last checkpoint, or hang on to it to complete when the task gets scheduled again?" So I would expect the usage of your swap file may increase based on that setting, but not physical memory usage. When the task is resumed, if it has been kept "in memory", then it just picks up running where it left off (perhaps hours before). It will page-fault the memory it was using back in from the page file as it hits code needing various pieces of data and pieces of the program. Rosetta Moderator: Mod.Sense |
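The distinction between physical and swapped-out memory can be observed per process on Linux, where `/proc/<pid>/status` reports `VmRSS` (resident pages) and `VmSwap` (pages moved to swap). A small sketch follows, with an invented helper name and made-up sample values for a suspended task:

```python
def resident_and_swapped_kb(status_text):
    """Return (VmRSS, VmSwap) in kB from /proc/<pid>/status-style text.

    Fields that are absent are reported as 0.
    """
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmRSS", "VmSwap"):
            fields[key] = int(rest.split()[0])
    return fields.get("VmRSS", 0), fields.get("VmSwap", 0)

# Hypothetical status excerpt for a task that has been suspended for a while:
# most of its pages have been moved out to swap.
sample = (
    "Name:\tminirosetta\n"
    "VmRSS:\t  204800 kB\n"
    "VmSwap:\t  307200 kB\n"
)

rss_kb, swap_kb = resident_and_swapped_kb(sample)
print(f"resident: {rss_kb} kB, swapped out: {swap_kb} kB")
```

If the description above holds, a task left "in memory" while suspended would show `VmRSS` shrinking and `VmSwap` growing on a machine under memory pressure, and neither changing much on a machine with RAM to spare.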
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Thanks Mod - I had often wondered about that |
Mugurel Joined: 17 Jan 07 Posts: 3 Credit: 27,424,126 RAC: 0 |
Mod.Sense wrote: "Rosetta uses more memory per active WU than most other BOINC projects. But it looks like you've got about 2.5GB per core on this machine, which should be plenty. Is that the machine you are talking about?" Yes, there are a few machines with 48 cores. None of them uses its entire RAM; less than half is used. No swapping at all. All applications are kept in memory. The problem happens when BOINC decides to switch from whatever it is running to Rosetta: if more than 8 instances of Rosetta start at the same time, they get killed or hang after running for about 3 minutes. If other instances were already active, they also get killed, and the WU is no longer usable. Again, this doesn't happen on machines with 8 or fewer cores, only on those with 16 or more (like 48). Ionel |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Well, it looks like your failing tasks show a "process got signal 11" message. I had assumed these were tasks you may have killed as you were studying the situation... but your latest comments indicate this is happening on its own. The BOINC FAQ indicates this could be faulty memory or page file... but it can ALSO be an issue when running 32-bit applications on a 64-bit machine. Rosetta currently just wraps the 32-bit app for the executable sent to 64-bit machines, so that description would seem to fit. The FAQ indicates that installing ia32 will help the situation. I am not familiar with ia32. Do you know if you have it already? Would you be OK with installing it on a machine to see if it helps? Rosetta Moderator: Mod.Sense |
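Two of the points above can be checked mechanically: which signal number 11 is, and whether a given executable is a 32-bit or 64-bit ELF binary. The sketch below uses only the Python standard library and the ELF header layout (byte 4, `EI_CLASS`, is 1 for 32-bit and 2 for 64-bit); the helper names are made up for illustration.

```python
import signal

def describe_signal(signum):
    """Return the symbolic name of a signal number, e.g. 11 -> 'SIGSEGV'."""
    return signal.Signals(signum).name

def elf_class(header):
    """Return 32 or 64 based on the first bytes of an ELF file, else None."""
    if len(header) < 5 or header[:4] != b"\x7fELF":
        return None
    # EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64
    return {1: 32, 2: 64}.get(header[4])

print(describe_signal(11))           # signal 11 is a segmentation fault
print(elf_class(b"\x7fELF\x01"))     # header bytes of a 32-bit binary
print(elf_class(b"\x7fELF\x02"))     # header bytes of a 64-bit binary
```

To apply this to a real file, read its first five bytes with `open(path, "rb").read(5)` and pass them to `elf_class`. A result of 32 for the application in the project directory would be consistent with the 32-bit wrapping described above, though it would not by itself explain the crashes.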
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I should also point out, just in case you are not aware, that BOINC has settings where you can designate how much memory BOINC is allowed to use during processing. So even if the machine has plenty of memory, and there are no page faults... it is possible that BOINC is set up to try to live within some smaller memory footprint. Rosetta Moderator: Mod.Sense |
Mugurel Joined: 17 Jan 07 Posts: 3 Credit: 27,424,126 RAC: 0 |
Mod.Sense wrote: "Well, it looks like your failing tasks show a process got signal 11 message. I had assumed these were tasks you may have killed as you were studying the situation... but your latest comments indicate this is happening on its own." I have ia32-libs installed, so that is OK, I think. I tried to run ldd on the executable, but it is a static executable: `beo-39:~/.boinc/boinc.beo-39/projects/boinc.bakerlab.org_rosetta$ ldd minirosetta_2.17_x86_64-pc-linux-gnu` prints `not a dynamic executable`. I did kill a few of the running processes to get BOINC going, but that did not help. The rest of the time the processes got killed by themselves. If I keep "top" open, I see that they start, run for 3 minutes or less, then are killed (and show as zombies); then the WU gets corrupted, BOINC starts other instances of Rosetta, and it all happens again, until BOINC decides to start fewer instances of Rosetta alongside some other projects. That is usually after about 300-500 WUs have been wasted. Then it is fine. As for memory, the settings are to use at most 90% of the system's memory when idle and at most 75% when in use. As I said earlier, none of the 48-core systems comes close to using 50% of its RAM. No swap at all is used. At the moment I make sure Rosetta doesn't use 100% of the CPUs by running many other projects in parallel. This did not solve the problem, it just avoids it... :-( Ionel |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,178,442 RAC: 3,202 |
Try upping the 75% to 85% and see if the units start back up and finish. I had this happen to me on another project: it would run fine while using the 90% setting, i.e. when I was away, but when dropping to the 75% setting it did what you are seeing. You could even raise it to 90%, but the machine will be extremely slow to do anything else. |
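For a sense of scale, the memory percentages discussed in this thread can be turned into a per-task budget. This is back-of-the-envelope arithmetic only: the 120 GB total is an assumption (48 cores at roughly 2.5GB each, as estimated earlier in the thread), and the function names are invented.

```python
def ram_budget_gb(total_gb, pct):
    """RAM that BOINC may use when limited to pct percent of total_gb."""
    return total_gb * pct / 100.0

def per_task_gb(total_gb, pct, tasks):
    """Even share of the BOINC RAM budget across a number of running tasks."""
    return ram_budget_gb(total_gb, pct) / tasks

TOTAL_GB = 120   # assumption: 48 cores x ~2.5 GB per core
TASKS = 48       # one task per core

for pct in (75, 85, 90):
    share = per_task_gb(TOTAL_GB, pct, TASKS)
    print(f"{pct}% limit -> {share:.3f} GB per task")
```

Under these assumptions, even the 75% setting leaves close to 2 GB per task, which fits the report that the machines never use more than half of their RAM.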
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
So, I guess my next questions would be: "Have you run intensive memory tests on the machine?" and, since you appear to have more than one machine with more than 8 cores, "Are you seeing the same behavior on more than one machine, where tasks start crashing once more than 8 R@h tasks are actively running at the same time?" Rosetta Moderator: Mod.Sense |
©2024 University of Washington
https://www.bakerlab.org