Discussion:
[OMPI users] Missing data with MPI I/O and NFS
Stephen Guzik
2017-10-13 01:26:16 UTC
Hi,

I'm having trouble with parallel I/O to a file system mounted with NFS
over an InfiniBand network.  In my test code, I'm simply writing 1 byte
per process to the same file, with each process writing at its own
offset.  When using two nodes, some bytes are not written (the
unwritten bytes are left as zeros).  Usually at least some data from
each node is written---it appears to be all of the data from one node
and only part of the data from the other.

This used to work fine but broke when the cluster was upgraded from
Debian 8 to Debian 9.  I suspect an issue with NFS rather than with
Open MPI.  However, if anyone can suggest a workaround or ways to get
more information, I would appreciate it.  In the one case where the
file system is exported with 'sync' and mounted with 'hard,intr', I get
the error:
[node1:14823] mca_sharedfp_individual_file_open: Error during datafile
file open
MPI_ERR_FILE: invalid file
[node2:14593] (same)

----------

Some additional info:
- tested versions 1.8.8, 2.1.1, and 3.0.0, using both self-compiled
builds and vendor-supplied packages.  All show the same behavior.
- all write methods (individual or collective) fail similarly.
- exporting the file system to two workstations over Ethernet and
running the job across those two workstations seems to work fine.
- on a single node, everything works as expected in all cases.  In the
case described above where I get an error, the error is only observed
with processes on two nodes.
- code follows.

Thanks,
Stephen Guzik

----------

#include <iostream>

#include <mpi.h>

int main(int argc, const char* argv[])
{
  MPI_File fh;
  MPI_Status status;

  int mpierr;
  char mpistr[MPI_MAX_ERROR_STRING];
  int mpilen;
  int numProc;
  int procID;
  MPI_Init(&argc, const_cast<char***>(&argv));
  MPI_Comm_size(MPI_COMM_WORLD, &numProc);
  MPI_Comm_rank(MPI_COMM_WORLD, &procID);

  const int filesize = numProc;
  const int bufsize = filesize/numProc;   // 1 byte per process
  char *buf = new char[bufsize];
  buf[0] = (char)(48 + procID);           // 48 == '0': each rank writes its rank as an ASCII digit
  int numChars = bufsize/sizeof(char);

  mpierr = MPI_File_open(MPI_COMM_WORLD, "dataio",
                         MPI_MODE_CREATE | MPI_MODE_WRONLY,
                         MPI_INFO_NULL, &fh);
  if (mpierr != MPI_SUCCESS)
    {
      MPI_Error_string(mpierr, mpistr, &mpilen);
      std::cout << "Error: " << mpistr << std::endl;
    }
  mpierr = MPI_File_write_at_all(fh, (MPI_Offset)(procID*bufsize), buf,
                                 numChars, MPI_CHAR, &status);
  if (mpierr != MPI_SUCCESS)
    {
      MPI_Error_string(mpierr, mpistr, &mpilen);
      std::cout << "Error: " << mpistr << std::endl;
    }
  mpierr = MPI_File_close(&fh);
  if (mpierr != MPI_SUCCESS)
    {
      MPI_Error_string(mpierr, mpistr, &mpilen);
      std::cout << "Error: " << mpistr << std::endl;
    }

  delete[] buf;
  MPI_Finalize();
  return 0;
}
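
A minimal read-back check (a separate serial sketch, not part of the
original test) that could be used to spot the unwritten zero bytes in
'dataio':

#include <fstream>
#include <iostream>

int main()
{
  // Scan the file written by the MPI test and report any zero bytes,
  // which correspond to data that was never written.
  std::ifstream in("dataio", std::ios::binary);
  if (!in)
    {
      std::cerr << "Error: cannot open dataio" << std::endl;
      return 1;
    }
  char c;
  long offset = 0;
  while (in.get(c))
    {
      if (c == 0)
        {
          std::cout << "Byte at offset " << offset << " was not written"
                    << std::endl;
        }
      ++offset;
    }
  return 0;
}
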
Edgar Gabriel
2017-10-13 01:36:05 UTC
Try switching to the romio314 component with Open MPI for now. There is
an issue with NFS and OMPIO that I am aware of and working on, which
might trigger this behavior (although collective I/O should actually
still work even in that case).

Try setting something like

mpirun --mca io romio314 ...
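
If it is easier to set in a job script, the same selection can be made
through the MCA environment variable rather than on the mpirun command
line (a sketch, assuming a standard Open MPI installation; the
executable name is just a placeholder):

export OMPI_MCA_io=romio314
mpirun -np 2 ./testio

Running ompi_info and looking at the io framework should show whether
the romio314 component is available in your build.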

Thanks

Edgar
Edgar Gabriel
2017-10-13 01:56:22 UTC
I opened an issue on this and hope to have the fix available next week.

https://github.com/open-mpi/ompi/issues/4334

Thanks
Edgar
Stephen Guzik
2017-10-13 17:42:48 UTC
Thanks for the advice, Edgar.  This appears to help but does not
eliminate the problem.  Here is what I observe (out of maybe 10 trials)
when using '-mca io romio314':

- no failures using 40 processes across 2 nodes (each node has 20 cores)
- no failures if using 'MPI_File_write_at' (see the one-line
substitution after this list)
- same type of failure if using 2 processes across 2 nodes (i.e.,
writing 2 bytes) *and* using 'MPI_File_write_at_all'
- writing to the 'sync/hard,intr' file system (see original email) no
longer reports an error.  I see the same results from these trials as
for an async mount.
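
For reference, the non-collective call that avoids the failures in my
trials is the one-line substitution below in the test code (same
arguments, only the function name changes):

  mpierr = MPI_File_write_at(fh, (MPI_Offset)(procID*bufsize), buf,
                             numChars, MPI_CHAR, &status);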

So it mostly works except in an unusual case.  I'd be happy to help test
a nightly snapshot---let me know.

Stephen