Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for hypre mixedint #1583

Merged
merged 27 commits into from Apr 29, 2021
Merged

Support for hypre mixedint #1583

merged 27 commits into from Apr 29, 2021

Conversation

jandrej
Copy link
Member

@jandrej jandrej commented Jun 26, 2020

This PR should add support for hypre built with the option --enable-mixedint, e.g. allow local sizes to use 32bit integers and global sizes to use 64bit integers.

I added an initial guess of changes and I hope that we get a group of contributors that go through a group of files and check for errors or add changes.

Right now the code branch does not compile without changes. What I did locally was commenting out the complex operator files. With that ex1p runs fine in serial and parallel.

If you participate and want to compile individual files, I recommend using the makefile build system and specifying the file you are working on with

make fem/pfespace.o

this way you only get the errors from the file you are looking at.

I'd especially like to see someone help with the complex stuff, since I'm not comfortable just churning through the files. @dylan-copeland @mlstowell @psocratis

TODO:

  • In INSTALL, document that mixed-int support requires hypre >= 2.20.0
PR Author Editor Reviewers Assignment Approval Merge
#1583 @jandrej @tzanio @v-dobrev + @dylan-copeland + @psocratis 04/20/21 04/24/21 04/28/21

@jandrej jandrej added the WIP Work in Progress label Jun 26, 2020
@tzanio tzanio mentioned this pull request Jun 28, 2020
5 tasks
@mlstowell
Copy link
Member

mlstowell commented Jul 14, 2020

@jandrej , I think I found all of the BigInt changes in complex_operator and complex_fem. If you notice any other issues I'd be glad to help.

Thanks for tackling this!

@jandrej
Copy link
Member Author

jandrej commented Jul 14, 2020

Thanks! Compiles fine, make test shows problems with HypreParMatrixBlocks and

    NURBS miniapp [ mpirun -np 4 nurbs_ex11p ... ]: FAILED  (0.13s 17768kB)
Options used:
   --mesh ../../data/star.mesh
   --refine-serial 2
   --refine-parallel 1
   --order '0'
   --num-eigs 5
   --seed 75
   --no-visualization
Mesh::GeneratePartitioning(...): edgecut = 39
Mesh does not have FEs --> Assume order 1.
Number of unknowns: 1361

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 18902 RUNNING AT dyro.llnl.gov
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault: 11 (signal 11)

This is not testing problems larger than MAX_INT yet.

@mlstowell
Copy link
Member

That looks like something @dylan-copeland will want to take a look at. I'm not familiar with HypreParMatrixBlocks but I see a couple places where row and column offsets are treated as integer arrays. The number of non-zeros might also cause a problem.

@mlstowell
Copy link
Member

This branch doesn't seem to compile with HYPRE version 2.12.0 because HYPRE_BigInt was not defined back in 2017. Is there a plan to support these older versions of HYPRE or do we require a new minimum HYPRE version (the current minimum is 2.10.0b)? Asking for a friend...

@v-dobrev
Copy link
Member

This branch doesn't seem to compile with HYPRE version 2.12.0 because HYPRE_BigInt was not defined back in 2017. Is there a plan to support these older versions of HYPRE or do we require a new minimum HYPRE version (the current minimum is 2.10.0b)? Asking for a friend...

I think this will be easy to add: for older versions of hypre (checked using MFEM_HYPRE_VERSION), we can typedef HYPRE_BigInt to be the same as HYPRE_Int and define HYPRE_MPI_BIG_INT as HYPRE_MPI_INT.

@v-dobrev
Copy link
Member

The issue with HypreParMatrixFromBlocks may be resolved by merging master into this branch -- there was an update to that function recently in #1521.

@dylan-copeland
Copy link
Member

I pulled from master and checked that the unit tests pass, so @v-dobrev is right.

@jandrej
Copy link
Member Author

jandrej commented Jul 14, 2020

Great, lets wait for the merge and then I try some large problems and report back what breaks.

@v-dobrev
Copy link
Member

It turns out --enable-mixedint disables some features in hypre:

  --enable-mixedint       Use long long int for HYPRE_BigInt and int for
                          HYPRE_Int (default is int for both). Note: This
                          option disables Euclid, ParaSails, pilut and CGC
                          coarsening.

Because of this and how the makefiles work, building on Mac with --enable-mixedint (using configure and make) fails with an error like this:

ar -rcu libHYPRE.a 
ar: no archive members specified
usage:  ar -d [-TLsv] archive file ...
	ar -m [-TLsv] archive file ...
	ar -m [-abiTLsv] position archive file ...
	ar -p [-TLsv] archive [file ...]
	ar -q [-cTLsv] archive file ...
	ar -r [-cuTLsv] archive file ...
	ar -r [-abciuTLsv] position archive file ...
	ar -t [-TLsv] archive [file ...]
	ar -x [-ouTLsv] archive [file ...]

One workaround for this is to combine the following lines: https://github.com/hypre-space/hypre/blob/b075d642550c9947b758201f41f5d38dabfe7a51/src/lib/Makefile#L82-L84 with the previous line, so that the list of files is not empty.

Fix a bug affecting HYPRE_MIXEDINT builds -- 'make test' still
shows a lot of failures in this case.
All tests in 'make test' seem to pass now.
number of elements is less than the number of processors.
@v-dobrev
Copy link
Member

For proper mixedint support, we need to use the latest HYPRE master (any commit after hypre-space/hypre@13e2cad). We should document this in INSTALL.

@tzanio
Copy link
Member

tzanio commented Oct 13, 2020

Take 2: How far are we from marking this as ready-for-review? It will be great to include it in the xSDK release at the end of the month...

@jandrej
Copy link
Member Author

jandrej commented Oct 13, 2020

I'd like to do 2 things before we go into review

  1. Run clang-tidy with https://clang.llvm.org/extra/clang-tidy/checks/bugprone-too-small-loop-variable.html https://reviews.llvm.org/D53974. This detects too small loop variables.

  2. Run at least a handful of examples (not just ex1p) so we get more code coverage. This requires us to discuss and pick examples where we know the examples have robust solvers that handle >2B unknowns.

@tzanio
Copy link
Member

tzanio commented Oct 14, 2020

Running clang-tidy and using it in general is an excellent idea.

My personal feeling is that testing ex1p, ex2p, ex3p and ex4p should be sufficient. We can add maybe ex8p, ex11p and ex13p to that, but I don't think we need to run every single parallel example.

@tzanio tzanio added this to the mfem-4.2 milestone Oct 19, 2020
/** This is a compact text representation of the local data of the
HypreParMatrix that can be used to compare matrices from different runs
without the need to save the whole matrix. */
void PrintHash(std::ostream &out) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -212,6 +213,84 @@ class HashTable : public BlockArray<T>
};


/// Hash function for data sequences.
/** Depends on GnuTLS for SHA-256 hashing. */
class HashFunction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be #ifdef MFEM_USE_GNUTLS?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need -- the implementation has the #ifdefs. If you don't have GnuTLS, you'll get a hash string that says you need GnuTLS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. Do you want to mention any of the hash changes in CHANGELOG?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not crucial.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is GnuTLS already installed on LC, or does someone know how to install it? Is it the one at https://github.com/gnutls/gnutls so we just need to clone and follow their build instructions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @v-dobrev, I tried that but apparently need something else:
In file included from general/hash.cpp(15):
/usr/include/gnutls/crypto.h(35): error: identifier "gnutls_cipher_algorithm_t" is undefined
gnutls_cipher_algorithm_t cipher,

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Mac I got it working, using a brew installation of gnutls. Maybe on LC some include directory needs to be specified? Anyway, the hash feature is a really nice addition.

Copy link
Member

@psocratis psocratis Apr 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the mac it works for me too. On LC I get the same error with @dylan-copeland. All paths look ok... so It's not clear to me what's wrong here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing this. I pushed a fix for the issue -- that version of GnuTLS needed <gnutls/gnutls.h> to be included before <gnutls/crypto.h>.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I confirmed it works now on LC.

@tzanio
Copy link
Member

tzanio commented Apr 21, 2021

LGTM

@tzanio
Copy link
Member

tzanio commented Apr 21, 2021

LGTM

@tzanio
Copy link
Member

tzanio commented Apr 22, 2021

@dylan-copeland and @psocratis, can you please take a quick look?

@psocratis
Copy link
Member

@dylan-copeland and @psocratis, can you please take a quick look?

I will need some to test it first, by tomorrow maybe?

Copy link
Member

@tzanio tzanio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jandrej !

HYPRE_Int *j_offd_hi = j_offd;
#else
HYPRE_Int *j_offd_hi = Memory<HYPRE_Int>(offd_cols);
Memory<HYPRE_BigInt>(j_offd, offd_cols, true).Delete();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is j_offd created and then deleted? Can this be simplified?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a trick to delete a host pointer that was allocated via the memory manager (using the currently set host allocator). The respective allocation is done with something like HYPRE_BigInt *j_offd = Memory<HYPRE_BigInt>(size); -- you can check above that j_offd was allocated that way.

Copy link
Member

@dylan-copeland dylan-copeland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not check that the right type is used everywhere in MFEM, but the changes that were made in this PR look correct, and I suppose if something is wrong it will show up when a hypre solve fails. It seems other examples also work (e.g. ex3p), so the CHANGELOG could state that (it would require some more testing to be sure).

Copy link
Member

@psocratis psocratis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though it's really hard to check all the places for the appropriate changes, this looks very good. Some examples that I ran with more than 2 billion dofs seem to work just fine.

@tzanio tzanio added the in-next label Apr 25, 2021
@tzanio
Copy link
Member

tzanio commented Apr 25, 2021

Merged in next for testing...

@tzanio
Copy link
Member

tzanio commented Apr 25, 2021

There are number of errors from tonight's autotest (due to older version of hypre?):

linalg/hypre.cpp: In member function ‘void mfem::HypreParMatrix::PrintHash(std::ostream&) const’:
linalg/hypre.cpp:1642:16: error: ‘struct hypre_CSRMatrix’ has no member named ‘big_j’
       if (csr->big_j == nullptr)
                ^
linalg/hypre.cpp:1648:29: error: ‘struct hypre_CSRMatrix’ has no member named ‘big_j’
          hf.AppendInts(csr->big_j, csr_nnz);
                             ^

See tux426/next/baseline and toss3/next for more details.

@v-dobrev
Copy link
Member

There are number of errors from tonight's autotest (due to older version of hypre?):

linalg/hypre.cpp: In member function ‘void mfem::HypreParMatrix::PrintHash(std::ostream&) const’:
linalg/hypre.cpp:1642:16: error: ‘struct hypre_CSRMatrix’ has no member named ‘big_j’
       if (csr->big_j == nullptr)
                ^
linalg/hypre.cpp:1648:29: error: ‘struct hypre_CSRMatrix’ has no member named ‘big_j’
          hf.AppendInts(csr->big_j, csr_nnz);
                             ^

See tux426/next/baseline and toss3/next for more details.

Ah, yes, forgot about older versions. I'll fix it.

@tzanio
Copy link
Member

tzanio commented Apr 26, 2021

Re-merged in next for testing ...

@dylan-copeland
Copy link
Member

@tzanio I started doing some large tests, to verify this branch for examples other than ex1p, but it is going to take a while to get that done due to the large number of nodes required to exceed the max int without also running out of memory. Let's proceed with the CHANGELOG just claiming to support ex1p.

@tzanio
Copy link
Member

tzanio commented Apr 28, 2021

Thanks @dylan-copeland. I will merge as is.

@tzanio tzanio merged commit 5b894c6 into master Apr 29, 2021
Pull Requests automation moved this from Review Now to Merged Apr 29, 2021
@tzanio tzanio deleted the hypre-mixedint branch April 29, 2021 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Pull Requests
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

8 participants