Discussion:
[OMPI users] Performance Issues on SMP Workstation
Andy Witzig
2017-02-01 20:52:36 UTC
Hi all,

I’m testing my application on an SMP workstation (dual Intel Xeon E5-2697 V4 Broadwell processors, 2.3 GHz base / 2.8-3.1 GHz boost, 128 GB RAM) and am seeing a 4x performance drop compared to a cluster system with 2.6 GHz Intel Haswell nodes (20 cores/node, 128 GB RAM/node). The application was built with OpenMPI 1.6.4 on both systems. I have tried running:

mpirun -np 20 $EXECUTABLE $INPUT_FILE
mpirun -np 20 --mca btl self,sm $EXECUTABLE $INPUT_FILE

and other variations, but cannot achieve the same performance on the workstation as on the cluster. The workstation outperforms the cluster on other multi-threaded (non-MPI) applications, so I don’t think it’s a hardware issue.

Any help you can provide would be appreciated.

Thanks,
cap79
Andrew Witzig
2017-02-01 21:36:57 UTC
By the way, the workstation has a total of 36 cores / 72 threads, so using mpirun -np 20 is possible (and should be equivalent) on both platforms.

Thanks,
cap79
Elken, Tom
2017-02-01 22:10:43 UTC
For this case: "a cluster system with 2.6GHz Intel Haswell with 20 cores / node and 128GB RAM/node."

are you running 5 ranks per node on 4 nodes?
What interconnect are you using for the cluster?

-Tom
Andy Witzig
2017-02-01 22:25:08 UTC
Hi Tom,

The cluster uses an InfiniBand interconnect. On the cluster I’m requesting: #PBS -l walltime=24:00:00,nodes=1:ppn=20. So technically, the run on the cluster should be SMP on a single node, since there are 20 cores/node. On the workstation I’m just using the command: mpirun -np 20 …. I haven’t finished setting up Torque/PBS there yet.

Best regards,
Andy

Douglas L Reeder
2017-02-01 22:32:18 UTC
Andy,

What allocation scheme are you using on the cluster? For some codes we see noticeable differences between fill-up and round-robin placement, though not 4x. Fill-up makes more use of shared memory, while round-robin uses more InfiniBand.
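As a sketch only (assuming a multi-node allocation, with $EXECUTABLE and $INPUT_FILE as in your runs), the two placements correspond roughly to Open MPI 1.6's mapping options:

mpirun -np 20 -byslot $EXECUTABLE $INPUT_FILE   # fill-up: pack ranks onto each node's slots first
mpirun -np 20 -bynode $EXECUTABLE $INPUT_FILE   # round-robin: spread ranks one per node, then wrap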

Doug
Andy Witzig
2017-02-01 22:46:48 UTC
Honestly, I’m not exactly sure what scheme is being used. I am using the default template from Penguin Computing for job submission. It looks like:

#PBS -S /bin/bash
#PBS -q T30
#PBS -l walltime=24:00:00,nodes=1:ppn=20
#PBS -j oe
#PBS -N test
#PBS -r n

mpirun $EXECUTABLE $INPUT_FILE

I’m not configuring OpenMPI anywhere else. It is possible the Penguin Computing folks have pre-configured my MPI environment. I’ll see what I can find.

Best regards,
Andy

r***@open-mpi.org
2017-02-01 23:04:06 UTC
Simple test: replace your executable with “hostname”. If you see multiple hosts come out on your cluster, then you know why the performance is different.
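A minimal sketch, keeping the same launch options and only swapping in the executable:

mpirun -np 20 hostname | sort | uniq -c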
Bennet Fauber
2017-02-01 23:15:01 UTC
You may want to run this by Penguin support, too.

I believe that Penguin on Demand uses Torque, in which case the

nodes=1:ppn=20

is requesting 20 cores on a single node.

If this is Torque, then you should get a host list with counts by inserting

uniq -c $PBS_NODEFILE

after the last #PBS directive. That should print the host name and the
number 20. MPI should then fall back to whatever transport it uses when all
ranks are on the same node.
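For example, relative to the script Andy posted earlier, the check would sit something like this (abbreviated sketch):

#PBS -l walltime=24:00:00,nodes=1:ppn=20
#PBS -r n
uniq -c $PBS_NODEFILE
mpirun $EXECUTABLE $INPUT_FILE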

-- bennet
Andy Witzig
2017-02-01 23:36:13 UTC
Thanks, Bennet. I made the modification to the Torque submission file and got “20 n388”, which confirms (like you said) that for my cluster runs I am requesting 20 cores on a single node.

Best regards,
Andy

Andy Witzig
2017-02-01 23:36:34 UTC
Thanks for the idea. I did the test and only got a single host.

Thanks,
Andy

Bennet Fauber
2017-02-02 00:15:18 UTC
How do they compare if you run a much smaller number of ranks, say -np 2 or 4?

Is the workstation shared and doing any other work?

You could insert some diagnostics into your script, for example uptime and
free both before and after running your MPI program, and compare the output.

You could also run top in batch mode in the background for your own
username, then run your MPI program, and compare the results from top.
We've seen instances where the MPI ranks only get distributed to a
small number of processors, which you see if they all have small
percentages of CPU.
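A rough sketch of what those checks could look like around the mpirun line (log file name and sampling interval are just placeholders):

uptime; free -m
top -b -d 10 -u $USER > top_during_run.log &   # periodic snapshots while the job runs
TOP_PID=$!
mpirun -np 20 $EXECUTABLE $INPUT_FILE
kill $TOP_PID
uptime; free -m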

Just flailing in the dark...

-- bennet
Andy Witzig
2017-02-02 03:09:28 UTC
Thank you, Bennet. From my testing, I’ve seen that the application usually performs better with much smaller rank counts on the workstation. I don’t see the same behavior on the cluster (i.e., there it performs better at -np 15 or 20). The workstation is not shared and is not doing any other work. I ran the application on the workstation with top and confirmed that 20 processes were fully loaded.

I’ll look into the diagnostics you mentioned and get back with you.

Best regards,
Andy

n***@hlrs.de
2017-02-02 11:08:43 UTC
Hello Andy,

You can also use the --report-bindings option of mpirun to check which cores
your program will use and to which cores the processes are bound.
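For example, simply adding the option to the existing command line:

mpirun --report-bindings -np 20 $EXECUTABLE $INPUT_FILE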

Are you using the same backend compiler on both systems?

Do you have performance tools available on the systems so you can see in
which part of the program the time is lost? Common tools would be Score-P/
Vampir/CUBE, TAU, and Extrae/Paraver.

Best
Christoph
Gilles Gouaillardet
2017-02-02 11:28:43 UTC
I cannot remember what the default binding (if any) is on Open MPI 1.6,
nor whether the default is the same with or without PBS.

You can simply run
mpirun --tag-output grep Cpus_allowed_list /proc/self/status
and see whether you note any discrepancy between your systems.

You might also consider upgrading to the latest Open MPI 2.0.2, and see how
things go.

Cheers,

Gilles
Andy Witzig
2017-02-06 16:25:01 UTC
Hi all,

My apologies for not replying sooner on this issue - I’ve been swamped with other tasks. Here’s my latest:

1.) I have looked deeply into bindings on both systems (using the --report-bindings option) and nothing came to light. I’ve tried multiple variations of binding settings, with only minor improvements on the workstation.

2.) I used the mpirun --tag-output grep Cpus_allowed_list /proc/self/status command and everything was in order on both systems.

3.) I used ompi_info -c (per recommendation of Penguin Computing support staff) and looked at the differences in configuration. I’m pasting the output below for reference. The only settings in the cluster configuration that were not present in the workstation configuration were: --enable-__cxa_atexit, --disable-libunwind-exceptions, and --disable-dssi. There were several settings present in the workstation configuration that were not set in the cluster configuration. Any reason why the same version of OpenMPI would have such different settings?

4.) I used hwloc and lstopo to compare system hardware and confirmed that the workstation has either equivalent or superior specs to the cluster node setup.

5.) Primary differences I can see right now are:
a.) OpenMPI 1.6.4 was compiled using gcc 4.4.7 on the cluster, and I am compiling with gcc 5.4.0 on the workstation;
b.) the OpenMPI compile configurations are different;
c.) the cluster uses Torque/PBS to submit the jobs;
d.) the workstation is hyper-threaded and the cluster is not;
e.) the workstation runs Ubuntu while the cluster runs CentOS.

My next steps will be to compile/install gcc 4.4.7 on the workstation and recompile OpenMPI 1.6.4 to ensure the software configuration is equivalent, and to do my best to replicate the cluster configuration settings. I will also look into the profiling tools that Christoph mentioned and see if any details come to light.
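A sketch of what that rebuild might look like once gcc 4.4.7 is installed (the install prefix and compiler names are placeholders to adjust):

tar xjf openmpi-1.6.4.tar.bz2 && cd openmpi-1.6.4
./configure --prefix=$HOME/opt/openmpi-1.6.4 CC=gcc-4.4.7 CXX=g++-4.4.7 F77=gfortran-4.4.7 FC=gfortran-4.4.7
make -j8 && make install
export PATH=$HOME/opt/openmpi-1.6.4/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/openmpi-1.6.4/lib:$LD_LIBRARY_PATH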

Thanks much,
Andy

---------------------------WORKSTATION OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
COLLECT_GCC=/usr/bin/gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v
--with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4'
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++
--prefix=/usr
--program-suffix=-5
--enable-shared
--enable-linker-build-id
--libexecdir=/usr/lib
--without-included-gettext
--enable-threads=posix
--libdir=/usr/lib
--enable-nls
--with-sysroot=/
--enable-clocale=gnu
--enable-libstdcxx-debug
--enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new
--enable-gnu-unique-object
--disable-vtable-verify
--enable-libmpx
--enable-plugin
--with-system-zlib
--disable-browser-plugin
--enable-java-awt=gtk
--enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre
--enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--enable-objc-gc
--enable-multiarch
--disable-werror
--with-arch-32=i686
--with-abi=m64
--with-multilib-list=m32,m64,mx32
--enable-multilib
--with-tune=generic
--enable-checking=release
--build=x86_64-linux-gnu
--host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

---------------------------CLUSTER OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ./configure

--prefix=/public/apps/gcc/4.4.7
--enable-shared
--enable-threads=posix
--enable-checking=release
--with-system-zlib
--enable-__cxa_atexit
--disable-libunwind-exceptions
--enable-gnu-unique-object
--disable-dssi
--with-arch_32=i686
--build=x86_64-redhat-linux build_alias=x86_64-redhat-linux
--enable-languages=c,c++,fortran,objc,obj-c++
Thread model: posix
gcc version 4.4.7 (GCC)


Elken, Tom
2017-02-06 16:44:27 UTC
“d.) the workstation is hyper-threaded and the cluster is not”

You might turn off hyper-threading (HT) on the workstation and re-run.
I’ve seen some OSes on some systems get confused and assign multiple OS “cpus” to the same HW core/thread.

In any case, if you turn HT off, and top shows you that tasks are running on different ‘cpus’, you can be sure they are running on different cores, and less likely to interfere with each other.
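If a BIOS change is inconvenient, one way to check and experiment from Linux (a sketch only; it needs root, and you should verify the sibling CPU numbers on your own machine first):

lscpu | grep -i 'thread(s) per core'                        # 2 means HT is enabled
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
echo 0 > /sys/devices/system/cpu/cpu36/online               # offline cpu36 if it is the HT sibling of cpu0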

-Tom

From: users [mailto:users-***@lists.open-mpi.org] On Behalf Of Andy Witzig
Sent: Monday, February 06, 2017 8:25 AM
To: Open MPI Users <***@lists.open-mpi.org>
Subject: Re: [OMPI users] Performance Issues on SMP Workstation

Hi all,

My apologies for not replying sooner on this issue - I’ve been swamped with other tasking. Here’s my latest:

1.) I have looked deep into bindings on both systems (used --report-bindings option) and nothing came to light. I’ve tried multiple variations on bindings settings and only minor improvements were made on the workstation.

2.) I used the mpirun --tag-output grep Cpus_allowed_list /proc/self/status command and everything was in order on both systems.

3.) I used ompi_info -c (per recommendation of Penguin Computing support staff) and looked at the differences in configuration. I’m pasting the output below for reference. The only settings in the cluster configuration that were not present in the workstation configuration were: --enable-__cxa_atexit, --disable-libunwind-exceptions, and --disable-dssi. There were several settings present in the workstation configuration that were not set in the cluster configuration. Any reason why the same version of OpenMPI would have such different settings?

4.) I used hwloc and lstopo to compare system hardware and confirmed that the workstation has either equivalent or superior specs to the cluster node setup.

5.) Primary differences I can see right now are:
a.) OpenMPI 1.6.4 was compiled using gcc 4.4.7 on the cluster and I am compiling with gcc 5.4.0 on the workstation;
b.) the OpenMPI compile configurations are different;
c.) the cluster uses Torque/PBS to submit the jobs;
d.) the workstation is hyper-threaded and the cluster is not; and
e.) the workstation runs on Ubuntu while the cluster runs on CentOS.

My next steps will be to compile/install gcc 4.4.7 on the Workstation and recompile OpenMPI 1.6.4 to ensure the software configuration is equivalent, and do my best to replicate the cluster configuration settings. I will also look into the profiling tools that Christoph mentioned and see if any details come to light.
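
Roughly, the rebuild would be along these lines (the install prefixes are placeholders; the configure flags will be copied from the cluster's install):

./configure CC=/opt/gcc-4.4.7/bin/gcc CXX=/opt/gcc-4.4.7/bin/g++ \
            F77=/opt/gcc-4.4.7/bin/gfortran FC=/opt/gcc-4.4.7/bin/gfortran \
            --prefix=/opt/openmpi-1.6.4-gcc447
make -j 8 && make install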

Thanks much,
Andy

---------------------------WORKSTATION OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
COLLECT_GCC=/usr/bin/gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v
--with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4'
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++
--prefix=/usr
--program-suffix=-5
--enable-shared
--enable-linker-build-id
--libexecdir=/usr/lib
--without-included-gettext
--enable-threads=posix
--libdir=/usr/lib
--enable-nls
--with-sysroot=/
--enable-clocale=gnu
--enable-libstdcxx-debug
--enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new
--enable-gnu-unique-object
--disable-vtable-verify
--enable-libmpx
--enable-plugin
--with-system-zlib
--disable-browser-plugin
--enable-java-awt=gtk
--enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre
--enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--enable-objc-gc
--enable-multiarch
--disable-werror
--with-arch-32=i686
--with-abi=m64
--with-multilib-list=m32,m64,mx32
--enable-multilib
--with-tune=generic
--enable-checking=release
--build=x86_64-linux-gnu
--host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

---------------------------CLUSTER OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ./configure

--prefix=/public/apps/gcc/4.4.7
--enable-shared
--enable-threads=posix
--enable-checking=release
--with-system-zlib
--enable-__cxa_atexit
--disable-libunwind-exceptions
--enable-gnu-unique-object
--disable-dssi
--with-arch_32=i686
--build=x86_64-redhat-linux build_alias=x86_64-redhat-linux
--enable-languages=c,c++,fortran,objc,obj-c++
Thread model: posix
gcc version 4.4.7 (GCC)

On Feb 2, 2017, at 5:28 AM, Gilles Gouaillardet <***@gmail.com> wrote:

I cannot remember what the default binding (if any) is on Open MPI 1.6
nor whether the default is the same with or without PBS

you can simply
mpirun --tag-output grep Cpus_allowed_list /proc/self/status
and see if you note any discrepancy between your systems

you might also consider upgrading to the latest Open MPI 2.0.2, and see how things go

Cheers,

Gilles

On Thursday, February 2, 2017, <***@hlrs.de> wrote:
Hello Andy,

You can also use the --report-bindings option of mpirun to check which cores
your program will use and to which cores the processes are bound.


Are you using the same backend compiler on both systems?

Do you have performance tools available on the systems where you can see in
which part of the program the time is lost? Common tools would be Score-P/
Vampir/CUBE, TAU, and Extrae/Paraver.

Best
Christoph
Thank you, Bennet. From my testing, I’ve seen that the application usually
performs better at much smaller rank counts on the workstation. I’ve tested on
the cluster and do not see the same response (i.e. see better performance
with ranks of -np 15 or 20). The workstation is not shared and is not
doing any other work. I ran the application on the Workstation with top
and confirmed that 20 procs were fully loaded.
I’ll look into the diagnostics you mentioned and get back with you.
Best regards,
Andy
How do they compare if you run a much smaller number of ranks, say -np 2 or 4?
Is the workstation shared and doing any other work?
You could insert some diagnostics into your script, for example,
uptime and free, both before and after running your MPI program and
compare.
You could also run top in batch mode in the background for your own
username, then run your MPI program, and compare the results from top.
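For example (a hypothetical invocation; adjust the delay and user as needed):
top -b -d 5 -u $USER > top_during_run.log &
mpirun -np 20 $EXECUTABLE $INPUT_FILE
kill %1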
We've seen instances where the MPI ranks only get distributed to a
small number of processors, which you see if they all have small
percentages of CPU.
Just flailing in the dark...
-- bennet
Post by Andy Witzig
Thanks for the idea. I did the test and only got a single host.
Thanks,
Andy
Simple test: replace your executable with ‘hostname’. If you see multiple
hosts come out on your cluster, then you know why the performance is
different.
Honestly, I’m not exactly sure what scheme is being used. I am using the following:
#PBS -S /bin/bash
#PBS -q T30
#PBS -l walltime=24:00:00,nodes=1:ppn=20
#PBS -j oe
#PBS -N test
#PBS -r n
mpirun $EXECUTABLE $INPUT_FILE
I’m not configuring OpenMPI anywhere else. It is possible the Penguin
Computing folks have pre-configured my MPI environment. I’ll see what I
can find.
Best regards,
Andy
Andy,
What allocation scheme are you using on the cluster? For some codes we see
noticeable differences using fill-up vs. round-robin, though not 4x. Fill-up
uses more shared memory, while round-robin uses more InfiniBand.
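With Open MPI 1.6 the two schemes correspond roughly to the --byslot (fill-up, the default) and --bynode (round-robin) mapping options, e.g.:
mpirun -np 20 --byslot $EXECUTABLE $INPUT_FILE
mpirun -np 20 --bynode $EXECUTABLE $INPUT_FILE
On a single node the distinction matters less, but across nodes it changes which ranks share a node's memory.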
Doug
Hi Tom,
The cluster uses an InfiniBand interconnect. On the cluster I’m
requesting: #PBS -l walltime=24:00:00,nodes=1:ppn=20. So technically,
the run on the cluster should be SMP on the node, since there are 20
cores/node. On the workstation I’m just using the command: mpirun -np 20
…. I haven’t finished setting Torque/PBS up yet.
Best regards,
Andy
For this case: " a cluster system with 2.6GHz Intel Haswell with 20 cores
/ node and 128GB RAM/node. "
are you running 5 ranks per node on 4 nodes?
What interconnect are you using for the cluster?
-Tom
-----Original Message-----
Witzig
Sent: Wednesday, February 01, 2017 1:37 PM
To: Open MPI Users
Subject: Re: [OMPI users] Performance Issues on SMP Workstation
By the way, the workstation has a total of 36 cores / 72 threads, so using mpirun
-np 20 is possible (and should be equivalent) on both platforms.
Thanks,
cap79
Hi all,
I’m testing my application on a SMP workstation (dual Intel Xeon E5-2697
V4
2.3 GHz Intel Broadwell (boost 2.8-3.1GHz) processors 128GB RAM) and am
seeing a 4x performance drop compared to a cluster system with 2.6GHz Intel
Haswell with 20 cores / node and 128GB RAM/node. Both applications have
mpirun -np 20 $EXECUTABLE $INPUT_FILE
mpirun -np 20 --mca btl self,sm $EXECUTABLE $INPUT_FILE
and others, but cannot achieve the same performance on the workstation as is
seen on the cluster. The workstation outperforms on other non-MPI but multi-
threaded applications, so I don’t think it’s a hardware issue.
Any help you can provide would be appreciated.
Thanks,
cap79
Andy Witzig
2017-02-06 16:56:59 UTC
Permalink
Thanks, Tom. I did try using the mpirun --bind-to-core option and confirmed that individual MPI processes were placed on unique cores (also without other interfering MPI runs); however, it did not make a significant difference. That said, I do agree that turning off hyper-threading is an important test to rule out any fundamental differences that may be at play. I’ll turn off hyper-threading and let you know what I find.
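
As a rough cross-check (a sketch; hwloc-ps is part of the hwloc tools already installed here), something like this shows where the ranks actually sit while a run is in flight:

mpirun -np 20 --bind-to-core --report-bindings $EXECUTABLE $INPUT_FILE &
sleep 30 && hwloc-ps    # lists the bound processes and the cpusets they are bound to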

Best regards,
Andy
