Discussion:
[OMPI users] Performance Issues on SMP Workstation
Andy Witzig
2017-02-01 20:52:36 UTC
Hi all,

I’m testing my application on an SMP workstation (dual Intel Xeon E5-2697 V4 Broadwell processors, 2.3 GHz base / 2.8-3.1 GHz boost, 128 GB RAM) and am seeing a 4x performance drop compared to a cluster system with 2.6 GHz Intel Haswell nodes (20 cores/node, 128 GB RAM/node). The application was built with OpenMPI 1.6.4 on both systems. I have tried running:

mpirun -np 20 $EXECUTABLE $INPUT_FILE
mpirun -np 20 --mca btl self,sm $EXECUTABLE $INPUT_FILE

and other variations, but cannot achieve the same performance on the workstation as on the cluster. The workstation outperforms the cluster on other multi-threaded (non-MPI) applications, so I don’t think it’s a hardware issue.

Any help you can provide would be appreciated.

Thanks,
cap79
Andrew Witzig
2017-02-01 21:36:57 UTC
By the way, the workstation has a total of 36 cores / 72 threads, so using mpirun -np 20 is possible (and should be equivalent) on both platforms.

Thanks,
cap79
Elken, Tom
2017-02-01 22:10:43 UTC
For this case: "a cluster system with 2.6GHz Intel Haswell with 20 cores / node and 128GB RAM/node."

are you running 5 ranks per node on 4 nodes?
What interconnect are you using for the cluster?

-Tom
Andy Witzig
2017-02-01 22:25:08 UTC
Hi Tom,

The cluster uses an InfiniBand interconnect. On the cluster I’m requesting: #PBS -l walltime=24:00:00,nodes=1:ppn=20. So technically, the run on the cluster should be SMP on a single node, since there are 20 cores/node. On the workstation I’m just using the command: mpirun -np 20 …. I haven’t finished setting up Torque/PBS there yet.

Best regards,
Andy

Douglas L Reeder
2017-02-01 22:32:18 UTC
Andy,

What allocation scheme are you using on the cluster? For some codes we see noticeable differences between fill-up and round-robin placement, though not 4x. Fill-up makes more use of shared memory, while round-robin uses more InfiniBand.
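As a sketch only (assuming a multi-node allocation, with $EXECUTABLE and $INPUT_FILE as in your runs), the two placements correspond roughly to Open MPI 1.6's mapping options:

mpirun -np 20 -byslot $EXECUTABLE $INPUT_FILE   # fill-up: pack ranks onto each node's slots first
mpirun -np 20 -bynode $EXECUTABLE $INPUT_FILE   # round-robin: spread ranks one per node, then wrap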

Doug
Andy Witzig
2017-02-01 22:46:48 UTC
Honestly, I’m not exactly sure what scheme is being used. I am using the default template from Penguin Computing for job submission. It looks like:

#PBS -S /bin/bash
#PBS -q T30
#PBS -l walltime=24:00:00,nodes=1:ppn=20
#PBS -j oe
#PBS -N test
#PBS -r n

mpirun $EXECUTABLE $INPUT_FILE

I’m not configuring OpenMPI anywhere else. It is possible the Penguin Computing folks have pre-configured my MPI environment. I’ll see what I can find.

Best regards,
Andy

r***@open-mpi.org
2017-02-01 23:04:06 UTC
Simple test: replace your executable with “hostname”. If you see multiple hosts come out on your cluster, then you know why the performance is different.
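A minimal sketch, keeping the same launch options and only swapping in the executable:

mpirun -np 20 hostname | sort | uniq -c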
Bennet Fauber
2017-02-01 23:15:01 UTC
You may want to run this by Penguin support, too.

I believe that Penguin on Demand uses Torque, in which case the

nodes=1:ppn=20

is requesting 20 cores on a single node.

If this is Torque, then you should get a host list with counts by inserting

uniq -c $PBS_NODEFILE

after the last #PBS directive. That should print the host name and the
number 20. MPI should then fall back to whatever transport it uses when all
ranks are on the same node.
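For example, relative to the script Andy posted earlier, the check would sit something like this (abbreviated sketch):

#PBS -l walltime=24:00:00,nodes=1:ppn=20
#PBS -r n
uniq -c $PBS_NODEFILE
mpirun $EXECUTABLE $INPUT_FILE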

-- bennet
Andy Witzig
2017-02-01 23:36:13 UTC
Thanks, Bennet. I made the modification to the Torque submission file and got “20 n388”, which confirms (like you said) that for my cluster runs I am requesting 20 cores on a single node.

Best regards,
Andy

Andy Witzig
2017-02-01 23:36:34 UTC
Thanks for the idea. I did the test and only got a single host.

Thanks,
Andy

Bennet Fauber
2017-02-02 00:15:18 UTC
How do they compare if you run a much smaller number of ranks, say -np 2 or 4?

Is the workstation shared and doing any other work?

You could insert some diagnostics into your script, for example uptime and
free both before and after running your MPI program, and compare the output.

You could also run top in batch mode in the background for your own
username, then run your MPI program, and compare the results from top.
We've seen instances where the MPI ranks only get distributed to a
small number of processors, which you see if they all have small
percentages of CPU.
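A rough sketch of what those checks could look like around the mpirun line (log file name and sampling interval are just placeholders):

uptime; free -m
top -b -d 10 -u $USER > top_during_run.log &   # periodic snapshots while the job runs
TOP_PID=$!
mpirun -np 20 $EXECUTABLE $INPUT_FILE
kill $TOP_PID
uptime; free -m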

Just flailing in the dark...

-- bennet
Andy Witzig
2017-02-02 03:09:28 UTC
Thank you, Bennet. From my testing, I’ve seen that the application usually performs better with much smaller rank counts on the workstation. I don’t see the same behavior on the cluster (i.e., there it performs better at -np 15 or 20). The workstation is not shared and is not doing any other work. I ran the application on the workstation with top and confirmed that 20 processes were fully loaded.

I’ll look into the diagnostics you mentioned and get back with you.

Best regards,
Andy

n***@hlrs.de
2017-02-02 11:08:43 UTC
Hello Andy,

You can also use the --report-bindings option of mpirun to check which cores
your program will use and to which cores the processes are bound.
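For example, simply adding the option to the existing command line:

mpirun --report-bindings -np 20 $EXECUTABLE $INPUT_FILE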

Are you using the same backend compiler on both systems?

Do you have performance tools available on the systems so you can see in
which part of the program the time is lost? Common tools would be Score-P/
Vampir/CUBE, TAU, and Extrae/Paraver.

Best
Christoph
Gilles Gouaillardet
2017-02-02 11:28:43 UTC
I cannot remember what the default binding (if any) is on Open MPI 1.6,
nor whether the default is the same with or without PBS.

You can simply run
mpirun --tag-output grep Cpus_allowed_list /proc/self/status
and see whether you note any discrepancy between your systems.

You might also consider upgrading to the latest Open MPI 2.0.2, and see how
things go.

Cheers,

Gilles
Andy Witzig
2017-02-06 16:25:01 UTC
Hi all,

My apologies for not replying sooner on this issue - I’ve been swamped with other tasks. Here’s my latest:

1.) I have looked deeply into bindings on both systems (using the --report-bindings option) and nothing came to light. I’ve tried multiple variations of binding settings, with only minor improvements on the workstation.

2.) I used the mpirun --tag-output grep Cpus_allowed_list /proc/self/status command and everything was in order on both systems.

3.) I used ompi_info -c (per recommendation of Penguin Computing support staff) and looked at the differences in configuration. I’m pasting the output below for reference. The only settings in the cluster configuration that were not present in the workstation configuration were: --enable-__cxa_atexit, --disable-libunwind-exceptions, and --disable-dssi. There were several settings present in the workstation configuration that were not set in the cluster configuration. Any reason why the same version of OpenMPI would have such different settings?

4.) I used hwloc and lstopo to compare system hardware and confirmed that the workstation has either equivalent or superior specs to the cluster node setup.

5.) Primary differences I can see right now are:
a.) OpenMPI 1.6.4 was compiled using gcc 4.4.7 on the cluster, and I am compiling with gcc 5.4.0 on the workstation;
b.) the OpenMPI compile configurations are different;
c.) the cluster uses Torque/PBS to submit the jobs;
d.) the workstation is hyper-threaded and the cluster is not;
e.) the workstation runs Ubuntu while the cluster runs CentOS.

My next steps will be to compile/install gcc 4.4.7 on the workstation and recompile OpenMPI 1.6.4 to ensure the software configuration is equivalent, and to do my best to replicate the cluster configuration settings. I will also look into the profiling tools that Christoph mentioned and see if any details come to light.
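A sketch of what that rebuild might look like once gcc 4.4.7 is installed (the install prefix and compiler names are placeholders to adjust):

tar xjf openmpi-1.6.4.tar.bz2 && cd openmpi-1.6.4
./configure --prefix=$HOME/opt/openmpi-1.6.4 CC=gcc-4.4.7 CXX=g++-4.4.7 F77=gfortran-4.4.7 FC=gfortran-4.4.7
make -j8 && make install
export PATH=$HOME/opt/openmpi-1.6.4/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/openmpi-1.6.4/lib:$LD_LIBRARY_PATH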

Thanks much,
Andy

---------------------------WORKSTATION OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
COLLECT_GCC=/usr/bin/gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v
--with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4'
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++
--prefix=/usr
--program-suffix=-5
--enable-shared
--enable-linker-build-id
--libexecdir=/usr/lib
--without-included-gettext
--enable-threads=posix
--libdir=/usr/lib
--enable-nls
--with-sysroot=/
--enable-clocale=gnu
--enable-libstdcxx-debug
--enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new
--enable-gnu-unique-object
--disable-vtable-verify
--enable-libmpx
--enable-plugin
--with-system-zlib
--disable-browser-plugin
--enable-java-awt=gtk
--enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre
--enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--enable-objc-gc
--enable-multiarch
--disable-werror
--with-arch-32=i686
--with-abi=m64
--with-multilib-list=m32,m64,mx32
--enable-multilib
--with-tune=generic
--enable-checking=release
--build=x86_64-linux-gnu
--host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

---------------------------CLUSTER OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ./configure

--prefix=/public/apps/gcc/4.4.7
--enable-shared
--enable-threads=posix
--enable-checking=release
--with-system-zlib
--enable-__cxa_atexit
--disable-libunwind-exceptions
--enable-gnu-unique-object
--disable-dssi
--with-arch_32=i686
--build=x86_64-redhat-linux build_alias=x86_64-redhat-linux
--enable-languages=c,c++,fortran,objc,obj-c++
Thread model: posix
gcc version 4.4.7 (GCC)


Elken, Tom
2017-02-06 16:44:27 UTC
“d.) the workstation is hyper-threaded and the cluster is not”

You might turn off hyper-threading (HT) on the workstation and re-run.
I’ve seen some OSes on some systems get confused and assign multiple OS “cpus” to the same HW core/thread.

In any case, if you turn HT off, and top shows you that tasks are running on different ‘cpus’, you can be sure they are running on different cores, and less likely to interfere with each other.
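If a BIOS change is inconvenient, one way to check and experiment from Linux (a sketch only; it needs root, and you should verify the sibling CPU numbers on your own machine first):

lscpu | grep -i 'thread(s) per core'                        # 2 means HT is enabled
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
echo 0 > /sys/devices/system/cpu/cpu36/online               # offline cpu36 if it is the HT sibling of cpu0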

-Tom

From: users [mailto:users-***@lists.open-mpi.org] On Behalf Of Andy Witzig
Sent: Monday, February 06, 2017 8:25 AM
To: Open MPI Users <***@lists.open-mpi.org>
Subject: Re: [OMPI users] Performance Issues on SMP Workstation

Hi all,

My apologies for not replying sooner on this issue - I’ve been swamped with other tasking. Here’s my latest:

1.) I have looked deep into bindings on both systems (used --report-bindings option) and nothing came to light. I’ve tried multiple variations on bindings settings and only minor improvements were made on the workstation.

2.) I used the mpirun --tag-output grep Cpus_allowed_list /proc/self/status command and everything was in order on both systems.

3.) I used ompi_info -c (per recommendation of Penguin Computing support staff) and looked at the differences in configuration. I’m pasting the output below for reference. The only settings in the cluster configuration that were not present in the workstation configuration were: --enable-__cxa_atexit, --disable-libunwind-exceptions, and --disable-dssi. There were several settings present in the workstation configuration that were not set in the cluster configuration. Any reason why the same version of OpenMPI would have such different settings?

4.) I used hwloc and lstopo to compare system hardware and confirmed that the workstation has either equivalent or superior specs to the cluster node setup.

5.) Primary differences I can see right now are:
a.) OpenMPI 1.6.4 was compiled using gcc 4.4.7 on the cluster and I am compiling with gcc 5.4.0 on the workstation;
b.) the OpenMPI compile configurations are different;
c.) the cluster uses Torque/PBS to submit the jobs;
d.) the workstation is hyper-threaded and the cluster is not; and
e.) the workstation runs on Ubuntu while the cluster runs on CentOS.

My next steps will be to compile/install gcc 4.4.7 on the Workstation and recompile OpenMPI 1.6.4 to ensure the software configuration is equivalent, and do my best to replicate the cluster configuration settings. I will also look into the profiling tools that Christoph mentioned and see if any details come to light.
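
Roughly, the rebuild would be along these lines (the install prefixes are placeholders; the configure flags will be copied from the cluster's install):

./configure CC=/opt/gcc-4.4.7/bin/gcc CXX=/opt/gcc-4.4.7/bin/g++ \
            F77=/opt/gcc-4.4.7/bin/gfortran FC=/opt/gcc-4.4.7/bin/gfortran \
            --prefix=/opt/openmpi-1.6.4-gcc447
make -j 8 && make install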

Thanks much,
Andy

---------------------------WORKSTATION OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
COLLECT_GCC=/usr/bin/gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v
--with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.4'
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++
--prefix=/usr
--program-suffix=-5
--enable-shared
--enable-linker-build-id
--libexecdir=/usr/lib
--without-included-gettext
--enable-threads=posix
--libdir=/usr/lib
--enable-nls
--with-sysroot=/
--enable-clocale=gnu
--enable-libstdcxx-debug
--enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new
--enable-gnu-unique-object
--disable-vtable-verify
--enable-libmpx
--enable-plugin
--with-system-zlib
--disable-browser-plugin
--enable-java-awt=gtk
--enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre
--enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64
--with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar
--enable-objc-gc
--enable-multiarch
--disable-werror
--with-arch-32=i686
--with-abi=m64
--with-multilib-list=m32,m64,mx32
--enable-multilib
--with-tune=generic
--enable-checking=release
--build=x86_64-linux-gnu
--host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)

---------------------------CLUSTER OMPI_INFO -C OUTPUT---------------------------
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ./configure

--prefix=/public/apps/gcc/4.4.7
--enable-shared
--enable-threads=posix
--enable-checking=release
--with-system-zlib
--enable-__cxa_atexit
--disable-libunwind-exceptions
--enable-gnu-unique-object
--disable-dssi
--with-arch_32=i686
--build=x86_64-redhat-linux build_alias=x86_64-redhat-linux
--enable-languages=c,c++,fortran,objc,obj-c++
Thread model: posix
gcc version 4.4.7 (GCC)

On Feb 2, 2017, at 5:28 AM, Gilles Gouaillardet <***@gmail.com> wrote:

I cannot remember what the default binding (if any) is on Open MPI 1.6
nor whether the default is the same with or without PBS

you can simply
mpirun --tag-output grep Cpus_allowed_list /proc/self/status
and see if you note any discrepancy between your systems

you might also consider upgrading to the latest Open MPI 2.0.2, and see how things go

Cheers,

Gilles

On Thursday, February 2, 2017, <***@hlrs.de> wrote:
Hello Andy,

You can also use the --report-bindings option of mpirun to check which cores
your program will use and to which cores the processes are bound.


Are you using the same backend compiler on both systems?

Do you have performance tools available on the systems where you can see in
which part of the program the time is lost? Common tools would be Score-P/
Vampir/CUBE, TAU, and Extrae/Paraver.

Best
Christoph
Thank you, Bennet. From my testing, I’ve seen that the application usually
performs better at much smaller rank counts on the workstation. I’ve tested on
the cluster and do not see the same response (i.e. see better performance
with ranks of -np 15 or 20). The workstation is not shared and is not
doing any other work. I ran the application on the Workstation with top
and confirmed that 20 procs were fully loaded.
I’ll look into the diagnostics you mentioned and get back with you.
Best regards,
Andy
How do they compare if you run a much smaller number of ranks, say -np 2 or 4?
Is the workstation shared and doing any other work?
You could insert some diagnostics into your script, for example,
uptime and free, both before and after running your MPI program and
compare.
You could also run top in batch mode in the background for your own
username, then run your MPI program, and compare the results from top.
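For example (a hypothetical invocation; adjust the delay and user as needed):
top -b -d 5 -u $USER > top_during_run.log &
mpirun -np 20 $EXECUTABLE $INPUT_FILE
kill %1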
We've seen instances where the MPI ranks only get distributed to a
small number of processors, which you see if they all have small
percentages of CPU.
Just flailing in the dark...
-- bennet
Post by Andy Witzig
Thanks for the idea. I did the test and only got a single host.
Thanks,
Andy
Simple test: replace your executable with ‘hostname’. If you see multiple
hosts come out on your cluster, then you know why the performance is
different.
Honestly, I’m not exactly sure what scheme is being used. I am using the following:
#PBS -S /bin/bash
#PBS -q T30
#PBS -l walltime=24:00:00,nodes=1:ppn=20
#PBS -j oe
#PBS -N test
#PBS -r n
mpirun $EXECUTABLE $INPUT_FILE
I’m not configuring OpenMPI anywhere else. It is possible the Penguin
Computing folks have pre-configured my MPI environment. I’ll see what I
can find.
Best regards,
Andy
Andy,
What allocation scheme are you using on the cluster? For some codes we see
noticeable differences using fill-up vs. round-robin, though not 4x. Fill-up
uses more shared memory, while round-robin uses more InfiniBand.
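With Open MPI 1.6 the two schemes correspond roughly to the --byslot (fill-up, the default) and --bynode (round-robin) mapping options, e.g.:
mpirun -np 20 --byslot $EXECUTABLE $INPUT_FILE
mpirun -np 20 --bynode $EXECUTABLE $INPUT_FILE
On a single node the distinction matters less, but across nodes it changes which ranks share a node's memory.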
Doug
Hi Tom,
The cluster uses an InfiniBand interconnect. On the cluster I’m
requesting: #PBS -l walltime=24:00:00,nodes=1:ppn=20. So technically,
the run on the cluster should be SMP on the node, since there are 20
cores/node. On the workstation I’m just using the command: mpirun -np 20
…. I haven’t finished setting Torque/PBS up yet.
Best regards,
Andy
For this case: " a cluster system with 2.6GHz Intel Haswell with 20 cores
/ node and 128GB RAM/node. "
are you running 5 ranks per node on 4 nodes?
What interconnect are you using for the cluster?
-Tom
-----Original Message-----
Witzig
Sent: Wednesday, February 01, 2017 1:37 PM
To: Open MPI Users
Subject: Re: [OMPI users] Performance Issues on SMP Workstation
By the way, the workstation has a total of 36 cores / 72 threads, so using mpirun
-np 20 is possible (and should be equivalent) on both platforms.
Thanks,
cap79
Hi all,
I’m testing my application on a SMP workstation (dual Intel Xeon E5-2697
V4
2.3 GHz Intel Broadwell (boost 2.8-3.1GHz) processors 128GB RAM) and am
seeing a 4x performance drop compared to a cluster system with 2.6GHz Intel
Haswell with 20 cores / node and 128GB RAM/node. Both applications have
mpirun -np 20 $EXECUTABLE $INPUT_FILE
mpirun -np 20 --mca btl self,sm $EXECUTABLE $INPUT_FILE
and others, but cannot achieve the same performance on the workstation as is
seen on the cluster. The workstation outperforms on other non-MPI but multi-
threaded applications, so I don’t think it’s a hardware issue.
Any help you can provide would be appreciated.
Thanks,
cap79
Andy Witzig
2017-02-06 16:56:59 UTC
Permalink
Thanks, Tom. I did try using the mpirun --bind-to-core option and confirmed that individual MPI processes were placed on unique cores (also without other interfering MPI runs); however, it did not make a significant difference. That said, I do agree that turning off hyper-threading is an important test to rule out any fundamental differences that may be at play. I’ll turn off hyper-threading and let you know what I find.
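
As a rough cross-check (a sketch; hwloc-ps is part of the hwloc tools already installed here), something like this shows where the ranks actually sit while a run is in flight:

mpirun -np 20 --bind-to-core --report-bindings $EXECUTABLE $INPUT_FILE &
sleep 30 && hwloc-ps    # lists the bound processes and the cpusets they are bound to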

Best regards,
Andy
