Q: Does HPVM require special hardware to run?
A: HPVM implementations run on PCs connected by one of two types of network: Myrinet or Winsock 2 (sockets). The sockets interface can be used atop a variety of hardware, but Myrinet delivers the best performance.
Q: What sort of performance should I expect to get?
A: See the web pages for each of the interfaces. They quote representative performance numbers.
Q: Is the source code for HPVM (or subsystems) available?
A: The source code for FM 1.1 is available, but other HPVM elements are distributed in binary-only versions for research and educational use. We do license source, but you will need to execute a source code license agreement. Contact Professor Andrew A. Chien for more information.
Q: Does FM require special hardware to run?
A: FM runs on PCs and either Myrinet or a sockets interface. The sockets interface can operate over any type of hardware.
Q: Does FM support multiple communicating processes per machine?
A: FM 2.1 supports multiple processes per machine and sharing of network interfaces.
Q: What sort of performance should I expect to get?
A: See the FM and API web pages.
Q: How do I run the programs in the `benchmks' directory?
A: The `bwx' program measures one-way bandwidth, and the `latx' program measures round-trip latency. Run the `---s' version on the sender and the `---r' version on the receiver. The arguments are the node number of the other node (as calculated from the network configuration file), the message size in bytes, and the number of times to repeat the experiment.
Q: For a set of programs to communicate, must they use the same binary?
A: No (although such was the case for previous versions of FM). In FM 2.0, programs can communicate as long as they agree on the indexes into the handler tables.
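One way to keep separately built programs in agreement is to give every handler index a name in a header that all of them include. The following is a minimal sketch in plain C. The handler names, the table layout, and the dispatch() function are hypothetical stand-ins for illustration and are not part of the FM API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared header content: every binary compiles against the
   same index values, even if the handler bodies differ between programs. */
enum handler_index { H_PING = 0, H_DATA = 1, H_COUNT };

typedef int (*handler_fn)(const void *payload, size_t len);

/* Illustrative handlers: ping ignores its input, data reports its length. */
static int ping_handler(const void *payload, size_t len)
{
    (void)payload; (void)len;
    return 0;
}

static int data_handler(const void *payload, size_t len)
{
    (void)payload;
    return (int)len;
}

/* Each program fills its own table, but slot i must mean the same thing
   in every program that communicates. */
static handler_fn handler_table[H_COUNT] = {
    [H_PING] = ping_handler,
    [H_DATA] = data_handler,
};

/* Stand-in for message receipt: the index carried by the message selects
   the handler. Returns -1 for an index outside the agreed table. */
int dispatch(unsigned idx, const void *payload, size_t len)
{
    assert(handler_table[H_PING] && handler_table[H_DATA]); /* every agreed slot is filled */
    if (idx >= H_COUNT)
        return -1;
    return handler_table[idx](payload, len);
}
```

With this layout, two different binaries can exchange messages as long as both were built with the same enum values, which is the agreement the answer above describes.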
Q: Is there a program that automatically generates network configuration files?
A: No, sorry.
Q: Are there any data alignment restrictions for FM_send_piece() or FM_receive()?
A: No. There are some performance considerations, though, due to internals of the FM implementation. In most cases, doubleword-aligned buffers will deliver better performance than word-aligned buffers, which will deliver better performance than halfword- or byte-aligned buffers.
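If you want to favor the faster case, you can allocate buffers on a doubleword boundary and check alignment before sending. The sketch below uses only standard C11 and does not call FM. It assumes that a doubleword is 8 bytes, which the text above does not state, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DOUBLEWORD 8u   /* assumption: a doubleword is 8 bytes */

/* Return nonzero if p sits on a multiple of `alignment` bytes. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Allocate a buffer whose start is doubleword-aligned.
   Release it with free(). Returns NULL on allocation failure. */
void *alloc_doubleword_aligned(size_t nbytes)
{
    /* aligned_alloc (C11) wants a size that is a multiple of the alignment. */
    size_t rounded = (nbytes + DOUBLEWORD - 1) / DOUBLEWORD * DOUBLEWORD;
    if (rounded == 0)
        rounded = DOUBLEWORD;
    void *p = aligned_alloc(DOUBLEWORD, rounded);
    assert(p == NULL || is_aligned(p, DOUBLEWORD));
    return p;
}
```

A buffer obtained this way can be handed to a send or receive call as usual. Sending from an offset such as buffer + 1 still works, but falls into the slower byte-aligned case.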
Q: What happens if two or more threads call FM_begin_message() simultaneously?
A: One call will succeed, and the rest will return NULL.
Q: What happens if two or more threads call FM_extract() simultaneously?
A: One call will extract data (if available), and the rest will return as if there were no data to extract.
Q: If two processes send to each other and neither calls FM_extract(), will deadlock occur?
A: No. FM_send_piece() will automatically call FM_extract() if it's unable to send. This breaks the deadlock.
Q: Why am I getting compiler error messages for code that doesn't appear to be mine?
A: Your handlers are improperly defined. Remember, streamify is not only picky about how handlers are coded, but it's also too stupid to output error messages if it can't completely parse your program. Instead, given unacceptable input, it produces incorrect output. (Think of streamify as a glorified sed script.) Scrutinize your code, and ensure you're following the requirements laid out in section Handlers.
Q: Why, when I link, do I get an `Undefined symbol XXX' message?
A: Make sure you're linking with `libFM.a' or `libFM32.a' (as appropriate), `libLanaiDevice.a', and Myricom's version of `libbfd.a' and `libiberty.a'. We developed FM 2.0 using version 3.0x of Myricom's software. If you're using a significantly older or newer version, it's likely that some symbols we rely on won't exist.
Q: Why does my program hang at the `FM: Synchronizing with other nodes...' message?
A: This is the most difficult question in the list. Synchronization should take at most a few seconds after all the processes are running. Any of a number of failings could cause FM to hang during synchronization. A checklist of some things to look for would include:
• Are the network cables connected?
• Are the LEDs blinking? (They should at least be non-red; this shows that the machines are at least trying to synchronize.)
• Does your network configuration file correctly describe the physical configuration of the network subset your application is using?
• Are you using the right version of the LCP for your network and Myrinet board type?
• Is Myricom's IP driver disabled?
• If you have absolute-addressed switches (unlikely these days), does your code specify FM_set_parameter(FM_ABSOLUTE_SWITCHES) before FM_initialize()?
You should be able to answer "Yes" to all of the above questions. If you still have problems, make sure you can run some of Myricom's test programs, to determine whether the fault lies with FM or your hardware and driver configuration.
Q: What could have caused my program to hang right after synchronizing?
A: It's possible that the machines didn't really synchronize. Rather, stale data in the network made one machine think it had synchronized when it really hadn't. Sometimes, running lload 0 on all the machines on the network will clean things up. (Don't run lload while an FM program is running, though; that'll usually kill it.) Also, try starting up your FM programs in a different order and see if that causes them to die before synchronizing. If not, check for bugs in your code.
Q: What does the message `FM: Received unexpected packet tag XXXX' mean?
A: It means that FM received a packet from the network that doesn't look like an FM packet. This probably means one of three things occurred. Either:
Q: Why does FM tell me that `FM: open_lanai_copy_block() failed'?
A: FM was unable to access the Myrinet device. Make sure the driver is loaded. (In Solaris, /usr/sbin/modinfo | grep -i myri should list a `myri' driver and an `mlanai' driver.)
Q: In one of my applications, one process exits properly, but another hangs. Why?
A: Fortunately, this situation is rare in "real" applications, although it's common in networking benchmarks. The problem occurs only when all of the following conditions are met:
What's happening is that, due to the way that message streams are pipelined across the network, the receiver is exiting before the sender has finished sending the last, large message. As a result, the sender eventually fills the network, and then waits for the receiver to drain the network. Because the receiver is no longer running, the sender waits indefinitely. Probably the easiest solution is for the sender to send a zero-byte "cleanup" message at the end of the benchmark, which should set a completion flag in its handler. The receiver shouldn't exit until that message arrives and the completion flag is set.
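The cleanup-message pattern can be sketched as a small single-process simulation. Nothing below calls FM: mock_send() and mock_extract() are hypothetical stand-ins for sending a message and for running handlers on arrived data, the in-memory queue stands in for the network, and the handler indexes are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { H_DATA = 0, H_CLEANUP = 1 };   /* hypothetical handler indexes */

typedef struct { int handler; size_t len; } message;

#define QCAP 16
static message queue[QCAP];           /* stand-in for the network */
static size_t q_head, q_tail;

static volatile int done;             /* completion flag set by the cleanup handler */
static size_t bytes_received;

/* Stand-in for sending one message of `len` bytes to handler `handler`. */
static void mock_send(int handler, size_t len)
{
    assert(q_tail < QCAP);
    queue[q_tail++] = (message){ handler, len };
}

/* Stand-in for an extract call: run the handler for one pending message.
   Returns 1 if a message was handled, 0 if none was waiting. */
static int mock_extract(void)
{
    if (q_head == q_tail)
        return 0;
    message m = queue[q_head++];
    if (m.handler == H_CLEANUP)
        done = 1;                     /* cleanup handler: mark completion */
    else
        bytes_received += m.len;      /* data handler: consume the payload */
    return 1;
}

/* Sender: the benchmark traffic, then a zero-byte cleanup message. */
void sender_run(void)
{
    mock_send(H_DATA, 1024);
    mock_send(H_DATA, 65536);         /* the last, large message */
    mock_send(H_CLEANUP, 0);
}

/* Receiver: keep extracting until the cleanup handler has run,
   and only then return (that is, exit). */
size_t receiver_run(void)
{
    while (!done)
        mock_extract();
    return bytes_received;
}
```

Because the cleanup message is sent last, the receiver cannot see the flag until everything ahead of it has been drained, so the sender is never left waiting on a receiver that has already exited.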
Q: Is MPI-FM based on another implementation of MPI?
A: Yes, it is based on the Argonne/MSU MPICH code base.
Q: How does the Shmem Put/Get interface differ from the Cray T3D Shmem library?
A: The major difference is that the HPVM interfaces are designed for a 32-bit machine. Other differences include a significantly simplified interface and operation set.
Q: Does Global Arrays also implement the tcgmsg library?
A: Yes, tcgmsg is an embedded interface in the HPVM implementation of GA. If you're an expert user, you can also call it directly.
Q: What platforms can I access the Java front-end from?
A: If your platform has a JVM implementation and can make TCP connections to a daemon on the LSF cluster manager system, you can use the Java front-end. This allows access from Unix, NT, and even MacOS machines through the same interface.
Q: Where do I get LSF?
A: The Load Sharing Facility (LSF) is a commercial product from Platform Computing Corporation.
Last updated August 1997