Handling Errors

    To make Gromacs behave like a proper library, we need to change the way errors and similar conditions are handled. The library should not print anything to stdout/stderr unless the output is part of the API specification, and even then, the user should have a way to suppress it. The library should also not terminate the program without the user having control over this. How an error is handled depends on its type; the cases below are discussed separately, grouped by the way they are handled. These guidelines are approaching their final form, although details may still change.

    UPDATE: The information below has largely been superseded by the decision to use exceptions for all error handling, and needs to be rewritten.

    1) For programming errors, i.e., errors that should never occur if the program is correctly written, it's acceptable to assert and terminate the program. This applies to both errors in the library and errors in user code that calls the library.

    • If it's feasible to recover gracefully and return an error code, this is also acceptable (see the next point), but in such cases the API should be designed so that users do not need to check separately whether they made a programming error.
    • For code that is not performance-sensitive, consider using GMX_RELEASE_ASSERT(), which remains in the code even in release builds, in particular if the assert could fire as a result of incorrect user code.
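
    As a rough illustration of the difference between a release-build assert and a plain assert(), the sketch below uses SKETCH_RELEASE_ASSERT, a hypothetical stand-in written for this example. It is not the actual GROMACS definition of GMX_RELEASE_ASSERT(), and the functions elementAt() and sumFirst() are likewise invented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for GMX_RELEASE_ASSERT(); NOT the GROMACS definition.
// Unlike assert(), it ignores NDEBUG and therefore stays active in release builds.
#define SKETCH_RELEASE_ASSERT(condition, message)                 \
    do                                                            \
    {                                                             \
        if (!(condition))                                         \
        {                                                         \
            std::fprintf(stderr, "Assertion failed: %s\n%s (%s:%d)\n", \
                         #condition, (message), __FILE__, __LINE__);   \
            std::abort();                                         \
        }                                                         \
    } while (false)

// Not performance-sensitive, and the index comes from user code:
// keep the check in release builds as well.
inline int elementAt(const std::vector<int>& values, std::size_t index)
{
    SKETCH_RELEASE_ASSERT(index < values.size(), "index out of range");
    return values[index];
}

// Performance-sensitive loop: the plain assert() is compiled out when NDEBUG is defined.
inline long sumFirst(const std::vector<int>& values, std::size_t count)
{
    assert(count <= values.size());
    long sum = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        sum += values[i];
    }
    return sum;
}
```

    With this sketch, calling elementAt() with an out-of-range index terminates the program in every build type, whereas the check in sumFirst() only fires in builds without NDEBUG.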

    2) When a library routine encounters an error after which it does not make sense to continue processing, it should return an error code and let the caller decide what to do.

    • There is a global list of possible error codes, and the library should return one of these. In addition, it should call an error handler with a more detailed description of the reason for the error. The default error handler could print the error to stderr, but the user can replace the error handler if there is need for it. The global error codes include:
      • Out of memory
      • Other OS I/O error
      • File not found
      • Invalid user input (could not be understood)
      • Inconsistent user input (parsed correctly, but has internal conflicts)
      • Simulation instability
      • Invalid API call/value/internal error (we can also have a policy that the program should assert in such cases)
    • Internally, you can use exceptions for error handling, but avoid propagating them to caller code.
      • Exceptions should only be used for unexpected errors, e.g., out of memory or file system IO errors. As a general guideline, incorrect user input should not result in an exception.
      • Avoid exceptions in threaded code, but if you throw one, make sure that it always gets caught in the same thread/OpenMP section.
      • The same error handler as for return codes should be called before throwing an exception, and it should be arranged such that the exception is translated to the same return code as is passed to the error handler.
    • For common errors, there is a mechanism that can be used for appending a standard explanation for troubleshooting, e.g., errors related to simulation instabilities. Use these when appropriate, and add new explanations if a need arises.
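
    The scheme in 2) could look roughly like the following sketch. All names here (ErrorCode, setErrorHandler(), reportError(), LibraryError, getFrameCount(), and so on) are assumptions made for illustration and do not describe the actual Gromacs API. The sketch shows a replaceable error handler, an internal exception that is translated into the same code that was passed to the handler, and invalid input reported through the return code without an exception.

```cpp
#include <cassert>
#include <cstdio>
#include <exception>
#include <istream>
#include <new>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical codes mirroring the list above; not an actual GROMACS enum.
enum class ErrorCode
{
    Success, OutOfMemory, IoError, FileNotFound, InvalidInput,
    InconsistentInput, Instability, InternalError
};

using ErrorHandler = void (*)(ErrorCode code, const char* detail);

// Default handler: print the detailed description to stderr.
inline void defaultErrorHandler(ErrorCode code, const char* detail)
{
    std::fprintf(stderr, "Error %d: %s\n", static_cast<int>(code), detail);
}

inline ErrorHandler& activeHandler()
{
    static ErrorHandler handler = defaultErrorHandler;
    return handler;
}

// Installs a new handler and returns the previous one; nullptr silences all output.
inline ErrorHandler setErrorHandler(ErrorHandler handler)
{
    return std::exchange(activeHandler(), handler);
}

// Calls the active handler (if any) and hands the code back for returning.
inline ErrorCode reportError(ErrorCode code, const char* detail)
{
    if (ErrorHandler handler = activeHandler())
    {
        handler(code, detail);
    }
    return code;
}

// Internal exception that carries the code already passed to the handler.
class LibraryError : public std::exception
{
public:
    LibraryError(ErrorCode code, std::string detail) : code_(code), detail_(std::move(detail)) {}
    ErrorCode code() const noexcept { return code_; }
    const char* what() const noexcept override { return detail_.c_str(); }

private:
    ErrorCode   code_;
    std::string detail_;
};

// The handler is called before throwing, as the guideline requires.
[[noreturn]] inline void throwError(ErrorCode code, const std::string& detail)
{
    reportError(code, detail.c_str());
    throw LibraryError(code, detail);
}

// Internal routine: throws only for an unexpected stream failure (badbit).
inline int readFrameCount(std::istream& in)
{
    int count = 0;
    in >> count;
    if (in.bad())
    {
        throwError(ErrorCode::IoError, "stream failure while reading frame count");
    }
    return count;
}

// Public entry point: library exceptions are translated to return codes here.
inline ErrorCode getFrameCount(std::istream& in, int* count)
{
    assert(count != nullptr); // passing nullptr is a programming error, see 1)
    try
    {
        const int value = readFrameCount(in);
        if (in.fail())
        {
            // Incorrect user input is reported without an exception.
            return reportError(ErrorCode::InvalidInput, "frame count is not an integer");
        }
        *count = value;
        return ErrorCode::Success;
    }
    catch (const LibraryError& error) { return error.code(); }
    catch (const std::bad_alloc&) { return reportError(ErrorCode::OutOfMemory, "allocation failed"); }
    catch (...) { return reportError(ErrorCode::InternalError, "unexpected exception"); }
}

// Convenience overload that parses from a string.
inline ErrorCode getFrameCount(const std::string& text, int* count)
{
    std::istringstream in(text);
    return getFrameCount(in, count);
}
```

    A caller that wants no output on stderr could call setErrorHandler(nullptr), or install its own function to collect the messages, and then act on the returned code alone.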

    3) There are also cases where a library routine wants to report a warning or a non-fatal error, but is still able to continue processing. An example is what grompp currently does with notes, warnings, and errors.

    • There is a common reporting interface for such cases. Every library function that needs this functionality should take, as an extra parameter, an object that implements this interface, and can then call functions in the interface to report warnings. A default implementation is provided for callers who simply want to write everything to stderr.
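
    A reporting interface of this kind might be sketched as follows. The names IMessageReporter, StderrReporter, CollectingReporter, and sumOfMasses() are hypothetical and chosen only for this example; they are not taken from the Gromacs source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

enum class Severity { Note, Warning, Error };

// Hypothetical reporting interface passed to library functions as an extra parameter.
class IMessageReporter
{
public:
    virtual ~IMessageReporter() = default;
    virtual void report(Severity severity, const std::string& message) = 0;
};

// Default implementation for callers who simply want everything on stderr.
class StderrReporter : public IMessageReporter
{
public:
    void report(Severity severity, const std::string& message) override
    {
        static const char* const labels[] = { "NOTE", "WARNING", "ERROR" };
        const int index = static_cast<int>(severity);
        assert(index >= 0 && index < 3);
        std::fprintf(stderr, "%s: %s\n", labels[index], message.c_str());
    }
};

// Alternative implementation that stores messages, e.g. for display in a GUI.
class CollectingReporter : public IMessageReporter
{
public:
    void report(Severity severity, const std::string& message) override
    {
        messages_.emplace_back(severity, message);
    }
    std::size_t count(Severity severity) const
    {
        std::size_t n = 0;
        for (const auto& entry : messages_)
        {
            if (entry.first == severity)
            {
                ++n;
            }
        }
        return n;
    }

private:
    std::vector<std::pair<Severity, std::string>> messages_;
};

// Hypothetical library function: reports a warning for each suspicious value,
// skips it, and continues processing.
inline double sumOfMasses(const std::vector<double>& masses, IMessageReporter& reporter)
{
    double total = 0.0;
    for (std::size_t i = 0; i < masses.size(); ++i)
    {
        if (masses[i] <= 0.0)
        {
            reporter.report(Severity::Warning,
                            "ignoring non-positive mass at index " + std::to_string(i));
            continue;
        }
        total += masses[i];
    }
    return total;
}
```

    Passing a StderrReporter reproduces the "print everything" behavior, while a caller that must keep stderr clean can pass a CollectingReporter and inspect the stored messages afterwards.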

    Points for discussion:

    • How to handle functions that may fail as part of normal operation? E.g., a function that accesses data, and by design should also be callable when no data is available. These should not call the error handler, but should they return 0 and use another variable for reporting whether the call was successful? Or should we have a designated error code(s), e.g. all negative values, for such cases?
      RS110410: Why would we need that? I think functions should be designed in one of two ways (as they are, e.g., in the STL):
      • If the special condition can be handled correctly, then no error has to be reported. E.g., a container.clear() method would detect that the container is empty and do nothing.
      • If the special condition cannot be handled correctly, then the normal error handler should be called. To avoid the (possibly) slow error handler, the calling code should perform a check before calling the function (e.g., is_empty()).
    • We may not want to riddle performance-sensitive code with a lot of error-checking, but for debugging, it is useful to be able to pinpoint where things start to go wrong instead of observing a crash that may occur much later. Having the error checks in performance-sensitive code as asserts is one way, but there may be others that would be more suitable.
    • How to handle cases when the reason for the error is detected within a relatively deep call graph, but there is not enough information in that context to print an error message that's useful to the user? Five options:
      • Don't do anything, live with cryptic error messages.
      • Signal errors from the inner scopes with return values only and call the error handler only from an outer scope where enough context is known. Can make debugging harder, because the original reason for the error is no longer accessible if one breaks in the error handler. If this becomes a problem, could have a separate macro that is used in the inner scope to call the error handler in development versions, but expands to nothing in release versions. There is support for this in the current implementation.
      • Pass enough information through all the function calls. If this information would not be otherwise needed for anything, it will make the code less modular.
      • Call the error handler from both scopes (with different values so that calls from different scopes can be recognized). Will make the error handler itself more complex.
      • Use the facilities from 3) above in such cases. Can easily result in overly complex code for handling simple errors.
    • Should the error handler be global, or thread-local? Similarly, for 3), should the error reporter object be thread-local, or be passed as a parameter?
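
    The second option for deep call graphs (return values from inner scopes, error handler called from an outer scope, plus a development-only macro) could be sketched as below. The macro SKETCH_DEV_ERROR, the build flag SKETCH_DEVELOPMENT_BUILD, and the functions are hypothetical and do not correspond to the existing implementation; the error handler is replaced by a simple in-memory log to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for the error handler: records messages in a log.
inline std::vector<std::string>& handlerLog()
{
    static std::vector<std::string> log;
    return log;
}

inline void callErrorHandler(const std::string& message)
{
    handlerLog().push_back(message);
}

// Development-only reporting from inner scopes; expands to nothing in release builds.
#ifdef SKETCH_DEVELOPMENT_BUILD
#    define SKETCH_DEV_ERROR(message) callErrorHandler(std::string("[dev] ") + (message))
#else
#    define SKETCH_DEV_ERROR(message) ((void)0)
#endif

enum class Status { Ok, InvalidValue };

// Inner scope: knows what went wrong, but not which file or entry is affected.
inline Status parseCharge(const std::string& token, double* charge)
{
    assert(charge != nullptr);
    char*        end   = nullptr;
    const double value = std::strtod(token.c_str(), &end);
    if (end == token.c_str() || *end != '\0')
    {
        SKETCH_DEV_ERROR("not a number: " + token);
        return Status::InvalidValue; // signal with the return value only
    }
    *charge = value;
    return Status::Ok;
}

// Outer scope: has enough context to produce a message that is useful to the user.
inline Status readCharges(const std::string&              fileName,
                          const std::vector<std::string>& tokens,
                          std::vector<double>*            charges)
{
    assert(charges != nullptr);
    for (std::size_t i = 0; i < tokens.size(); ++i)
    {
        double       charge = 0.0;
        const Status status = parseCharge(tokens[i], &charge);
        if (status != Status::Ok)
        {
            callErrorHandler(fileName + ", entry " + std::to_string(i + 1)
                             + ": invalid charge '" + tokens[i] + "'");
            return status;
        }
        charges->push_back(charge);
    }
    return Status::Ok;
}
```

    In a build with SKETCH_DEVELOPMENT_BUILD defined, a breakpoint in callErrorHandler() is hit at the original point of failure; in other builds only the outer, user-oriented message is produced.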
    Page last modified 09:12, 11 Jan 2016 by hess