************************************************************************
* Myricom GM networking software and documentation                     *
* Copyright (c) 2001, 2002 by Myricom, Inc.                            *
* All rights reserved.  See the file `COPYING' for copyright notice.  *
************************************************************************

README-linux for gm-1.5.2.1

README for linux distribution

Supported platforms:
    Linux 2.2 and 2.4 for ia32, ultrasparc, powerpc, alpha.
    Linux 2.4 for ia64 (Itanium).

    - For Alphas, if you have 2 GB or more of memory, we recommend
      kernel version 2.4.18 to install GM.  You must use kernel
      version 2.4.14 or later (2.4.9 also works).
    - On SPARC, GM will only compile and run under sparc64 linux
      (Ultrasparc).

Supported interfaces:
    LANai4 with 1MB or 512K (PCI32), LANai7 (PCI64, PCI64A), and
    LANai9 (PCI64B, PCI64C)

    (If you have LANai4 with 256K, you will need to upgrade your
    interface, or use a previous version of GM (gm-1.2.3).  For
    installation instructions for an earlier GM version, please refer
    to the respective README and README-<os> files.  Please also note
    that Linux 2.4 is not supported on earlier GM versions.)

WARNING: When building/linking GM applications, you must do so on a
linux box that matches the OS version of the machine on which you will
be running.  You cannot compile on a 2.2.x machine and run the
executable on a 2.4.x machine.

Table of Contents:
-----------------
  I. GM Installation
     a. Configuring, compiling, and loading the GM driver
     b. Running the GM Mapper
     c. Testing the GM installation
 II. Verifying the GM performance
III. Running IP over GM
 IV. Improving IP Performance
  V. Fork() Support
 VI. Sample Scripts to automatically load GM and start the Mapper
VII. Operating-system-specific Caveats
     a. Using Compaq Compilers for Alpha Linux (ccc cxx)
     b. Linux 2.2 - 2.4 for Sparc64
     c. APIC IRQ conflict on Supermicro P4DC6 (dual P4-Xeon)
     d. AGP (nVidia and ATI) conflicts
     e. Motherboards with i840 or i860 chipsets

************************************************************************
If difficulties are encountered, please consult the FAQ
    http://www.myri.com/scs/GM_FAQ.html
and direct all technical support questions to help@myri.com.
************************************************************************

===================
I. GM Installation
===================

GM installation is performed in the following three steps.

1. Configure, compile, and load the GM driver:
---------------------------------------------

    gunzip -c gm-1.5.2.1_Linux.tar.gz | tar xvf -
    cd {GM_HOME}
    ./configure
    make
    cd binary
    su root
    ./GM_INSTALL

By default, we assume that the kernel header files for your Linux
installation are located in /usr/src/linux.  If your Linux installation
is not located in /usr/src/linux, you must configure with the following
option:

    ./configure --with-linux=<linux-source-dir>

where <linux-source-dir> specifies the directory for the linux kernel
source.  The kernel header files MUST match the running kernel exactly:
not only should they both be from the same version, but they should
also contain the same kernel configuration options.

By default, we also assume that you have LANai9 or LANai7 interfaces.
If you have LANai4 (with 1MB of memory), you will need to configure
with:

    ./configure --disable-new-features

If you have LANai4 with 512K, you will need to configure with:

    ./configure --disable-new-features --with-min-supported-sram=256

and you will only have 4 GM ports available instead of 8.

(As previously noted, if you have LANai4 with 256K, you cannot install
gm-1.5.1 or later.  You will need to upgrade your interface or use a
previous version of GM (gm-1.2.3).)

Note: If you have a mixture of hosts with LANai4 and LANai7 (or LANai9)
interfaces that need to talk to each other, you must configure with
--disable-new-features on all of the hosts.

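The interface-specific configure options above can be collected in a
small helper.  The following is only a minimal sketch and is not part of
the GM distribution: the function name gm_configure_flags and the
interface labels (lanai7, lanai9, lanai4-1mb, lanai4-512k) are made up
for illustration.  Only the flags themselves come from this README.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with GM): print the ./configure flags
# that this README gives for each supported interface type.
gm_configure_flags() {
    case "$1" in
        lanai7|lanai9)
            # Default build; no extra flags are needed.
            echo ""
            ;;
        lanai4-1mb)
            echo "--disable-new-features"
            ;;
        lanai4-512k)
            echo "--disable-new-features --with-min-supported-sram=256"
            ;;
        *)
            # LANai4 with 256K and anything else is not supported here.
            echo "unsupported interface: $1" >&2
            return 1
            ;;
    esac
}
```

It could then be used as `./configure $(gm_configure_flags lanai4-512k)`.
On a network that mixes LANai4 with LANai7 or LANai9 hosts, pass one of
the LANai4 labels on every host, since --disable-new-features is needed
everywhere in that case.
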
For a complete listing of all options to configure, type:

    ./configure --help

Note: Do not use the configure flag --enable-directcopy.  This flag is
not a valid option in GM 1.5.2.1.  It will be re-enabled in a future
release.

The GM_INSTALL script will unload any existing GM device driver, load
the current device driver, and create the /dev/gm device i-nodes.  It
does not configure the IP device, nor does it set up any scripts to
load the GM driver at boot time.

During the GM_INSTALL phase, GM prints messages to the kernel log
(dmesg).  If the running kernel and the kernel headers used for
compilation are mismatched, GM will print a warning message to the
kernel log.

Please be sure to read the {GM_HOME}/README and the
{GM_HOME}/README-linux for further details of
operating-system-specific caveats.

Note: If the host is rebooted, you must reload the GM driver (and
rerun the GM mapper).  There are sample scripts, contributed by a
customer, in {GM_HOME}/drivers/linux/scripts for loading GM and running
the mapper at reboot.

2. Running the GM Mapper
------------------------

Myrinet is a source-routed network, i.e., each host must know the
route to all other hosts through the switching fabric.  The GM mapper
automatically discovers all of the hosts connected to the Myrinet
network, computes a set of deadlock-free minimum-length routes between
the hosts, and distributes appropriate routes to each host on the
connected network.

Loopback and point-to-point network topologies require that
gm_simpleroute be run instead of the GM Mapper.  (Refer to the GM
README and the FAQ for details.)

For a switch network topology, the GM Mapper must be run before any
communication over Myrinet can be initiated.

Further technical details about the GM mapper can be found in
mt/README.

Depending upon the user's needs, there are three different ways in
which the GM mapper may be used.

MAP_ONCE mapping:
----------------

The first way is by far the most common, and we shall refer to it as
"map_once".
In this method, the mapper is run on one host in the network (any of
the hosts).  It is rerun if a host (re)boots, if a hostname is changed,
or after a change of Myrinet topology (swapping of ports on a switch).
(If the Mapper must be rerun for any of these reasons, it is best to
run it on the same host.)

The command for this method of running the GM mapper is:

    cd {GM_HOME}/binary/sbin/
    su root
    ./mapper map_once.args

STATIC mapping:
--------------

The second way in which the GM mapper may be used is called "static
mapping" or "file mapping".  In this method, an active mapper is run
once when ALL of the hosts are up and running the GM driver.  This
initial active mapper will generate a map file and a host file.  These
files are then copied to all of the hosts in the network, or shared by
NFS.  An entry in the boot scripts will allow each host to read the map
file and the host file and update the routing table on its local
Myrinet interface(s).  This method is particularly appealing as no
human intervention is needed and no traffic is generated at boot time.

The commands for this method of running the GM mapper are:

    cd {GM_HOME}/binary/sbin/
    su root
    ./mapper static.args

Copy the 3 files created by this command (static.map, static.routes,
and static.hosts) to the {GM_HOME}/binary/sbin/ directory on each host
if the gm tree is not mounted by NFS.

Add the following commands to the boot scripts of each host (scripts
in /etc/init.d or /etc/rc.d/init.d):

    cd {GM_HOME}/binary/sbin/
    su root
    ./file_mapper file.args

HA mapping:
-----------

The third way in which the GM mapper may be used is for users who
need High Availability (HA) in an aggressive computing environment.
The command for this method of running the GM Mapper is:

    cd {GM_HOME}/binary/sbin/
    su root
    ./mapper active.args &

This runs the GM mapper continuously in the background to detect and
add any new hosts or remove any non-responding hosts, to detect any
change of topology (change of slots in the switch, change of
innerswitch topology), and to periodically update the routing tables of
the Myrinet cards (by default, every 30 seconds).

You should note that this mapping method is quite intrusive.  Users are
strongly advised to avoid this method of running the GM mapper if
their applications produce heavy network traffic (e.g., MPI
applications), since the GM Mapper uses non-reliable messages that may
be dropped under heavy contention, leading to hosts being marked as
"non-responding" and removed because they appear unreachable.

A few expert customers use this mapping method to satisfy their high
availability constraints for GM applications designed to handle a
dynamic change of configuration (by design, MPI is NOT a fault-tolerant
application).

For the majority of users, the "map_once" GM mapping method is
sufficient.  For users with more production-level constraints,
"static mapping" is the most suitable method.  For fault-tolerant GM
applications, the third method (HA mapping) provides the best
alternative.

3. Testing the GM Installation
------------------------------

A variety of test scripts are available in {GM_HOME}/binary/bin to test
your GM installation.  A README describing each of these tests can be
found in {GM_HOME}/tests/README.

We recommend the following five tests to validate your installation.

    cd {GM_HOME}/binary/bin

1. Test that the Mapper has correctly detected all of the hosts in
   your Myrinet network by typing the following command on several of
   the hosts:

       ./gm_board_info

   Note: In the output of this command, all hosts should be listed in
   the routing table of each node.
   If not all of the hosts are listed, then it is possible that a
   cable is not connected, or that GM is not properly loaded on all
   hosts in the Myrinet network.  A green LED should be lit on the
   switch for each connection that is active.

   If you see

       *** No routes found ***

   in the output, this is an indication that the GM Mapper has not
   been run.  (See README-<os> for details.)

   When ./gm_board_info successfully reports a list of hosts, you can
   then run ./gm_allsize and ./gm_stress to test the network.

2. Test the basic connectivity of GM by typing:

       ./gm_allsize --verbose --geometric

   on one of the hosts in the Myrinet network.

   Note: This loopback test will NOT work in a point-to-point (no
   switch) configuration.

3. To test GM bandwidth between two hosts, type (on the first host)

       ./gm_allsize --slave --size=15

   and then type the following command (on the second host)

       ./gm_allsize --unidirectional --bandwidth --remote-host=<hostname> \
           --size=15 --geometric

   where <hostname> is the name of the first host.

   These one-way tests are performed by running in slave mode on one
   machine and master on the node to be tested.  This is done by adding
   '--slave' on the command line of the slave machine and
   '-h <hostname>' on the command line of the master, where <hostname>
   is the name of the machine running in slave mode.  The name of each
   host is as specified in the output of ./gm_board_info.

   The --size parameter indicates the maximum length of message that
   will be sent, where 2^{size} is the value of that length.  In this
   example, the maximum length of message sent is 2^{15}=32K.

   The --geometric parameter reduces the number of message lengths that
   will be tested.  The default for gm_allsize is to test every length
   from 1 to 2^max_size, incrementing one byte at a time.  These tests
   take a long time to run, and generate data files suitable for input
   to gnuplot.

4. To test GM latency between two hosts, type (on the first host)

       ./gm_allsize --slave --size=15

   and then type the following command (on the second host)

       ./gm_allsize --bidirectional --latency --remote-host=<hostname> \
           --size=15 --geometric

   where <hostname> is the name of the first host.

   The slave/master setup and the --size and --geometric parameters are
   the same as described for the bandwidth test in step 3.

5. Run gm_stress on every host in the cluster to validate GM.
   Complete details on running gm_stress can be found in the FAQ:

       http://www.myri.com/scs/GM_FAQ.html#debug-stress

   This gm_stress command must be run simultaneously on each host,
   using the same list of host names in each case.  It can be run on
   any subset of hosts on the network.

For a list of all possible runtime options for these commands, you can
issue the command with --help as the runtime option, e.g.
./gm_debug --help.

================================
II. Verifying the GM Performance
================================

We recommend the following test to verify the GM performance.

View the results of the hardware benchmark test of the PCI bus with the
DMA engine of the Myrinet adapter.

    cd {GM_HOME}/binary/bin
    ./gm_debug --no-counters

Note: The output of this command gives the maximum sustained bandwidth
that can be obtained from the PCI bus.

Refer to the section entitled "GM Performance" in the {GM_HOME}/README
for complete details on expected GM performance.

=======================
III. Running IP over GM
=======================

The Linux command to enable IP over GM is as follows:

    /sbin/ifconfig myri0 up

where you must replace 'myri0' with the appropriate name (myri1, myri2,
etc.) if you have more than one Myrinet interface per host.

For more information, please refer to the FAQ
(http://www.myri.com/scs/GM_FAQ.html).

============================
IV. Improving IP performance
============================

To get good IP performance over Myrinet:

  * use Linux-2.4 (Linux-2.4.18 is now available)
  * configure GM with --enable-new-features to get a larger 9000-byte
    MTU for IP-over-Myrinet

You definitely want to use Linux 2.4 instead of Linux 2.2, and NFS-v3
over TCP.  Linux 2.4 has vastly better TCP/IP and UDP/IP numbers than
Linux-2.2.  Also, there have been some recent patches to Linux-2.4 that
help UDP performance.

If you are running Linux 2.2 or earlier, you should use the following
tuning options to get good NFS bandwidth.  Otherwise, you are latency
dominated and Myrinet IP and Ethernet IP performance will be about the
same.

  - Increase the TCP windows:

        echo "262144" > /proc/sys/net/core/rmem_max
        echo "262144" > /proc/sys/net/core/wmem_max
        echo "262144" > /proc/sys/net/core/wmem_default
        echo "262144" > /proc/sys/net/core/rmem_default

  - In linux/include/net/tcp.h, replace the value of

        #define MAX_WINDOW 32767

    with the value of your choice (200k~500k might be good).

  - Check that /proc/sys/net/ipv4/tcp_window_scaling is enabled with
    the value 1 (as it should be by default).

  - Play with the buffer sizes of netperf or your favorite net tester.

Note: These tuning options are not required for Linux 2.4.

==================
V. Fork() Support
==================

As of gm-1.5.2, GM has full support for fork() under Linux.  It works
for all processor families.  There are no restrictions; GM can fork()
with or without a GM port open.  However, if the customer has a choice
between using vfork() or fork(), there will be better performance with
vfork() since the time to fork a process with vfork() is much shorter.

================================================================
VI. Sample Scripts to automatically load GM and start the Mapper
================================================================

The directory {GM_HOME}/share contains some sample initialization
scripts, contributed by customers, that can be customized to suit your
system to automatically load the gm driver and start the GM Mapper.

=======================================
VII. Operating-system-specific Caveats
=======================================

---------------------------------------------------
a. Using Compaq Compilers for Alpha Linux (ccc cxx)
---------------------------------------------------

Under the C shell:

    setenv CC ccc
    setenv CXX cxx
    setenv CXXFLAGS \
      "-g -O2 -inline speed -x cxx -noexceptions -nocxxstd -using_std -w2"
    setenv CFLAGS -gcc_messages
    setenv KCC gcc
    rm -f config.cache
    ./configure

or under a Bourne shell or Bash:

    CC=ccc ; export CC
    CXX=cxx ; export CXX
    CXXFLAGS="-g -O2 -inline speed -x cxx -noexceptions -nocxxstd"
    CXXFLAGS="$CXXFLAGS -using_std -w2" ; export CXXFLAGS
    CFLAGS=-gcc_messages ; export CFLAGS
    KCC=gcc ; export KCC
    rm -f config.cache
    ./configure

------------------------------
b. Linux 2.2 - 2.4 for Sparc64
------------------------------

Running GM on the sparc64-linux arch requires patching the kernel to
get ioctls to work from 32-bit user-space.  There are two patches
below.  The first exports init_mm (it would actually be cleaner to have
it even for other archs); the second makes the kernel know about gm
ioctls from sparc 32-bit userland.  The init_mm patch is not needed
after 2.2.10.
You also need to generate the file `linux/include/gm_ioctl_switch.h'
with:

    perl drivers/linux/sparc32.pl < include/gm_io.h \
        > /usr/src/linux/include/gm_ioctl_switch.h

First patch:

--- linux/kernel/ksyms.c.std    Fri Jun  4 18:14:15 1999
+++ linux/kernel/ksyms.c        Fri Jun  4 18:14:17 1999
@@ -107,6 +107,7 @@
 EXPORT_SYMBOL(update_vm_cache);
 EXPORT_SYMBOL(vmtruncate);
 EXPORT_SYMBOL(find_vma);
+EXPORT_SYMBOL(init_mm);
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* filesystem internal functions */

Second patch:

--- linux/arch/sparc64/kernel/ioctl32.c.std     Fri Mar 17 20:02:23 2000
+++ linux/arch/sparc64/kernel/ioctl32.c Fri Mar 17 20:06:42 2000
@@ -2390,6 +2390,8 @@
        case AUTOFS_IOC_CATATONIC:
        case AUTOFS_IOC_PROTOVER:
        case AUTOFS_IOC_EXPIRE:
+
+#include "gm_ioctl_switch.h"
 
        /* Raw devices */
        case _IO(0xac, 0):      /* RAW_SETBIND */

--------------------------------------------------------
c. APIC IRQ conflict on Supermicro P4DC6 (dual P4-Xeon)
--------------------------------------------------------

Here is a description of the behavior that was witnessed.

# We have seen that on the Supermicro P4DC6 (dual P4-Xeon)
# with RH linux 7.1 (2.4 kernel) that the SCSI controller seems to
# get confused on OS boot (hardware probing) when the Myrinet NIC is
# installed.

# This same behavior was experienced on the following machine:
#   IBM Intellistation M Pro 6850
#   1.4 GHz P4 SMP (only one processor installed)
#   RH7.1
#   Kernel 2.4.3 (redhat update)
#   All 7.1 updates (updates.redhat.com)
#
# GM message that IRQ 11 is used, Link lights up, then solid hang
# (no keyboard lights, nothing).

After receiving one of these Supermicro machines on which to test our
theories, one of our developers concluded that:

Linux support for the APIC on this motherboard is broken.  So if you
boot linux using the APIC code, it will map a strange IRQ to the
Myrinet board.  The solution is to boot Linux with "noapic".

When booting Linux with "noapic", the compatibility code is used and
the IRQs are not re-mapped.
Everything is straight from the BIOS.  In this case, the onboard SCSI
and the Myrinet NIC get the same interrupt (it seems that onboard SCSI
and all 64-bit PCI slots get the same IRQ).  People not using onboard
SCSI are happy because everything is fine for them.  But if they use
SCSI, the Linux driver will hang at boot time.  I have checked the code
in the Adaptec Linux driver, and it supports shared IRQs, as does the
Myrinet driver.

What we found when we tested the machine here is that the problem is
not dependent on linux: we saw the problem with FreeBSD and Linux, and
even without Myrinet.  So, that means that the problem is with the
BIOS.  It appears that the Supermicro BIOS does not report the APIC
mapping correctly.

For now we will continue to tell customers to disable APIC, i.e., to
boot Linux with "noapic".  By booting with this kernel flag, the APIC
mapping is not used, SCSI and the 64-bit slots will share the IRQ, and
everything works fine.

At the lilo prompt, type the name of the version and "noapic".  Or in
lilo.conf, add

    append="noapic"

---------------------------------
d. AGP (nVidia and ATI) conflicts
---------------------------------

Two types of problems were reported.

1. If I load the GM module first, and then load the nVidia or ATI
   module, it works.  But if I load the nVidia or ATI module first, GM
   won't load.

The GM_INSTALL error message looks like:

    n03 135# ./GM_INSTALL
    Making device files in /dev.
    ifconfig myri0 down - in case it was up
    myri0: unknown interface: No such device
    Adding new GM driver.
    sbin/gm: init_module: No such device
    Hint: insmod errors can be caused by incorrect module parameters,
    including invalid IO or IRQ parameters
    **** Error installing GM driver module. ****

#Our systems consist of ASUS P3V4X motherboards with 733Mhz PIII's, 1GB
#RAM, 40GB of disk, nVidia GeForce II boards, Intel Ethernet Pro 100
#NICS, and ~40GB of disk.
#They are all running clean installs of RedHat 7.1 with the latest
#RedHat patches and a 2.4.7 kernel built with 4GB high memory and
#module support (no smp support).
#
#We finally found a work around to get the myrinet to work with the
#nVidia cards in our cluster.
#
#We found that if we load the gm kernel module before loading nVidia's
#kernel module, then it works fine.  After the gm module is loaded, we
#can then load the nVidia module.  We did not need to change any of our
#BIOS settings.  Once the gm module has been loaded, it can be unloaded
#and reloaded as needed until a reboot occurs.
#
#We are using nVidia's latest driver from their web site (www.nvidia.com).
#
# n03 kernel: GM: pci_rev2: Could NOT map board into kernel (span = 0x1000000)
# n03 kernel: GM: WARNING: drivers/gm_instance.c:4689:gm_instance_init():kernel:
# n03 kernel: GM: Can't map IO memory to system memory

#ATI has recently released official Linux drivers for its graphics cards.  I
#have discovered that when using the ATI FireGL 8800 drivers with an ATI Radeon
#8500 card in a system with Myrinet installed, the GM module conflicts with the
#ATI module, causing hardware acceleration and DRI to fail.  I saw in the Linux
#README packaged with the GM sources that a similar problem occurs with nVidia
#cards.  I wanted to inform you that the suggested temporary fix for that
#problem (booting the kernel with mem=768m) works with the ATI drivers as well.
#I haven't tried the kernel patch recommended yet.

This is a case of a shortage of virtual memory (used for IO-mapping PCI
memory) in the Linux kernel.  On configurations with a lot of physical
memory, Linux reserves only 128MB of the address space for dynamically
allocated virtual memory.  Unfortunately, the nVidia card seems to eat
as much virtual memory as it can (it occupies at least 128MB in PCI
memory space), so if you load its module before the gm module on such a
configuration, you will get the error reported above.

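The arithmetic behind this shortage can be worked through in the shell.
This is only an illustrative sketch: the 128MB reserve and the 128MB
taken by the graphics card are the figures quoted above, the 16MB board
span is the 0x1000000 value from the kernel log, and the variable names
are invented here.  Other kernel users of the same address space are
ignored, so the real situation can only be tighter.

```shell
#!/bin/sh
# Illustrative arithmetic only; the variable names are hypothetical.
vmalloc_reserve=$((128 << 20))   # address space reserved by the kernel (128MB)
gfx_mapping=$((128 << 20))       # at least this much goes to the graphics card
gm_span=$((0x1000000))           # Myrinet board span from the kernel log (16MB)

# What is left once the graphics module has been loaded first.
remaining=$((vmalloc_reserve - gfx_mapping))

echo "reserve:   $((vmalloc_reserve >> 20)) MB"
echo "remaining: $((remaining >> 20)) MB after the graphics driver loads"
echo "GM needs:  $((gm_span >> 20)) MB"

if [ "$remaining" -lt "$gm_span" ]; then
    echo "not enough room left to map the Myrinet board"
fi
```

Loading the gm module first works for the same reason: its 16MB mapping
is made while the reserve is still free.
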
The fix, for people with more than 768MB of memory and an nVidia or ATI
card, is to apply the following patch to their kernel:

--- arch/i386/kernel/setup.c    Thu Aug  2 17:00:46 2001
+++ arch/i386/kernel/setup.c.2  Thu Oct 11 09:00:59 2001
@@ -815,7 +815,7 @@
 /*
  * 128MB for vmalloc and initrd
  */
-#define VMALLOC_RESERVE        (unsigned long)(128 << 20)
+#define VMALLOC_RESERVE        (unsigned long)(256 << 20)
 #define MAXMEM                 (unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE)
 #define MAXMEM_PFN             PFN_DOWN(MAXMEM)
 #define MAX_NONPAE_PFN         (1 << 20)

They should also make sure the HIGHMEM option is enabled when
configuring the kernel they use.

If they do not mind losing memory, or just want to run a test, they can
try booting their current kernel with mem=768m to see if the problem
disappears.

2. Overlapping of prefetch memory for the AGP and PCI bridges.

SGI Visual Workstation 550 machine.
AGP cards (nVidia Quadro, ATI Mach64 PCI graphics card, ATI Rage AGP).

What we see with them is that the prefetchable memory assigned by the
BIOS for the AGP and PCI bridges is overlapping.  This looks like a
BIOS problem, and we have asked the customer to look into upgrading the
BIOS, or to play with the BIOS settings to attempt to get the BIOS to
do the right thing (things to try: toggling the plug-n-play OS setting,
changing the size of the AGP graphics aperture, reinitializing or
re-detecting the PCI space in the configuration space, etc.)

Specifically, it was seen that:

The memory for the Myrinet card is mapped at exactly the same spot with
the ATI Mach64 PCI graphics card as it is with the ATI Rage AGP
graphics card:

03:01.0 Non-VGA unclassified device: MYRICOM Inc.: Unknown device 8043 (rev 03)
        Region 0: Memory at 82000000 (64-bit, prefetchable) [size=16M]

However, now look at the bridges leading to bus 3 (PCI, where the
Myrinet card is) and bus 1 (AGP) in the ATI Rage AGP config:

00:01.0 PCI bridge: Intel Corporation 82840 840 (Carmel) Chipset AGP
        Bridge (rev 01) (prog-if 00 [Normal decode])
        Bus: primary=00, secondary=01, subordinate=01, sec-latency=64
        Prefetchable memory behind bridge: 82300000-850fffff

00:02.0 PCI bridge: Intel Corporation 82840 840 (Carmel) Chipset PCI
        Bridge (Hub B) (rev 01) (prog-if 00 [Normal decode])
        Bus: primary=00, secondary=02, subordinate=03, sec-latency=0
        Prefetchable memory behind bridge: 81600000-831fffff

See how those prefetchable memory regions overlap?  And, more
importantly, see how the prefetchable memory region of the bridge to
the AGP bus overlaps that of the Myrinet card?  Note that the only
prefetchable memory on the AGP bus is for the Rage card and that this
memory is a small subset of the region the bridge is claiming:

01:00.0 VGA compatible controller: ATI Technologies Inc 3D Rage IIC AGP
        (rev 7a) (prog-if 00 [VGA])
        Region 0: Memory at 84000000 (32-bit, prefetchable) [size=16M]

This issue is now resolved.  You need to download BIOS version A9 from
the SGI website.

-------------------------------------------
e. Motherboards with i840 or i860 chipsets
-------------------------------------------

Several customers have reported IRQ issues (see the entry for APIC
above) and disappointing DMA performance.

A customer received a beta BIOS release from Supermicro that they have
found resolves some IRQ issues for them and increases the performance.
Using this new BIOS, along with the small change to the gm code
(described below), they are now seeing around 300 MBytes/sec.

Here is the change to increase the performance on the 860 (or 840)
chipset.  In the file:

    {GM_HOME}/drivers/linux/gm/gm_arch.c

If you have an i840 chipset, modify the flag to be

    #define GM_INTEL_840 1

If you have an i860 chipset, modify the flag to be

    #define GM_INTEL_860 1

then rebuild and reload the driver, and run

    gm_debug -L

to see if your peak PCI performance is higher.
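
The flag change above can also be scripted.  The following is a hedged
sketch rather than anything shipped with GM: the function name
gm_set_chipset_flag is hypothetical, and it assumes that the flag
already appears in gm_arch.c on a line of the form
"#define GM_INTEL_860 <value>".  Check the file by hand if your copy
differs, and keep a backup before editing.

```shell
#!/bin/sh
# Hypothetical helper: set a chipset flag to 1 in a source file.
# Assumes the flag is on a line of the form "#define <FLAG> <value>";
# lines that do not match are left untouched.
gm_set_chipset_flag() {
    flag=$1
    src=$2
    # Write to a temporary file first so the original is only replaced
    # if sed succeeds.
    sed "s/^\(#define[[:space:]]\{1,\}${flag}\)[[:space:]].*/\1 1/" \
        "$src" > "$src.new" &&
        mv "$src.new" "$src"
}
```

For example, `gm_set_chipset_flag GM_INTEL_860 gm_arch.c` run from
{GM_HOME}/drivers/linux/gm, followed by the rebuild and reload steps
from section I.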