RDMA network profile

This page provides an overview of the RDMA network profile in Google Cloud.

About the RDMA network profile

The RDMA network profile lets you create a Virtual Private Cloud (VPC) network in which you can run AI workloads on VM instances that have NVIDIA ConnectX NICs. These NICs support remote direct memory access (RDMA) connectivity and have the NIC type MRDMA in Google Cloud.

A VPC network with the RDMA network profile supports low-latency, high-bandwidth RDMA communication between the GPUs of VMs that are created in the network, using RDMA over Converged Ethernet v2 (RoCE v2).

For more information about running AI workloads in Google Cloud, see the AI Hypercomputer documentation.

Specifications

VPC networks created with the RDMA network profile have the following specifications:

  • The network accepts attachments only from MRDMA NICs. A3 Ultra VMs, A4 VMs, and A4X VMs are the only VM types that support MRDMA NICs. Other NIC types, such as the GVNICs of an A3 Ultra VM, must be attached to a regular VPC network.
  • The set of features that are supported in the network is pre-configured by Google Cloud to support running AI workloads that require RDMA. VPC networks with the RDMA network profile have more constraints than regular VPC networks. For more information, see Supported and unsupported features.
  • The network is constrained to the zone of the network profile that you specify when you create the network. For example, any instances that you create in the network must be created in the zone of the network profile. Additionally, any subnets that you create in the network must be in the region that corresponds to the zone of the network profile.

    The RDMA network profile is not available in all zones. To view the zones in which the network profile is available, see Supported zones. You can also see the available zone-specific instances of the network profile by listing network profiles.

  • The resource name of the RDMA network profile that you specify when you create the network has the following format: ZONE-vpc-roce. For example: europe-west1-b-vpc-roce. The sketch after this list shows how you might use this name when you create a network.

  • The default MTU in a VPC network created with the RDMA network profile is 8896. This default gives the RDMA driver in the VM's guest OS the flexibility to use an appropriate MTU. The default MTU in regular VPC networks might be too small for some RDMA workloads. For best performance, Google recommends that you don't change the default MTU.
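
To make these constraints concrete, the following sketch shows one way you might list the available zonal network profiles, create a VPC network that uses the europe-west1-b profile, and add a subnet in the corresponding region. This is a minimal sketch, not a verified procedure: the gcloud compute network-profiles command group and the --network-profile flag are assumptions about the current gcloud CLI, and the project, network, and subnet names are placeholders.

# List the available zonal instances of the network profile.
# (Assumes the gcloud compute network-profiles command group.)
gcloud compute network-profiles list --project=PROJECT_ID

# Create a custom mode VPC network that uses the RDMA network profile
# for europe-west1-b. (Assumes the --network-profile flag.)
gcloud compute networks create my-rdma-network \
    --project=PROJECT_ID \
    --subnet-mode=custom \
    --network-profile=europe-west1-b-vpc-roce

# Create an IPv4-only subnet in the region that corresponds to the
# profile's zone (europe-west1 for europe-west1-b).
gcloud compute networks subnets create my-rdma-subnet-0 \
    --project=PROJECT_ID \
    --network=my-rdma-network \
    --region=europe-west1 \
    --range=192.168.0.0/24

Because the network profile blocks auto mode subnet creation, the network in this sketch is created in custom subnet mode.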

Supported zones

The RDMA network profile is available in the following zones:

  • europe-west1-b
  • us-central1-a
  • us-central1-b
  • us-east4-b
  • us-west1-c

Supported and unsupported features

This section describes the supported and unsupported features in VPC networks created with the RDMA network profile.

Features of regular VPC networks are supported unless they are configured to be disabled by the network profile, are dependent on a feature that is disabled by the network profile, or don't apply to traffic from RDMA NICs, as described in this section.

Features configured by the network profile

The following table lists the features that are configured by the network profile resource and indicates whether each one is supported in VPC networks created with the RDMA network profile. It includes the network profile property values set by Google Cloud; the sketch after the table shows how you might view these values for a specific zone's profile.

Feature | Supported | Property name | Property value | Details
------- | --------- | ------------- | -------------- | -------
MRDMA NICs | Yes | interfaceTypes | MRDMA | The network supports only MRDMA NICs. The network doesn't support other NIC types, such as GVNIC or VIRTIO_NET.
Multi-NIC in the same network | Yes | allowMultiNicInSameNetwork | MULTI_NIC_IN_SAME_NETWORK_ALLOWED | The network supports multi-NIC VMs where different NICs of the same VM can attach to the same VPC network. However, the NICs must attach to different subnets in the network. See Performance considerations for multi-NIC in the same VPC network.
IPv4-only subnets | Yes | subnetworkStackTypes | SUBNET_STACK_TYPE_IPV4_ONLY | The network supports IPv4-only subnets, including the same Valid IPv4 ranges as regular VPC networks. The network doesn't support dual-stack or IPv6-only subnets. For more information, see Types of subnets.
PRIVATE subnet purpose | Yes | subnetworkPurposes | SUBNET_PURPOSE_PRIVATE | The network supports regular subnets, which have a purpose of PRIVATE. The network doesn't support Private Service Connect subnets, proxy-only subnets, or Private NAT subnets. For more information, see Purposes of subnets.
GCE_ENDPOINT address purpose | Yes | addressPurposes | GCE_ENDPOINT | The network supports IP addresses with a purpose of GCE_ENDPOINT, which is used for internal IP addresses assigned to VM instances. The network doesn't support special purpose IP addresses, such as the SHARED_LOADBALANCER_VIP purpose used in Cloud Load Balancing. For more information, see the address resource reference.
Attachments from nic0 | No | allowDefaultNicAttachment | DEFAULT_NIC_ATTACHMENT_BLOCKED | The network doesn't support attachments from the nic0 interface of a VM, also referred to as the default NIC.
External IP addresses for VMs | No | allowExternalIpAccess | EXTERNAL_IP_ACCESS_BLOCKED | The network doesn't support assigning external IP addresses to VMs. NICs connected to the network can't reach the public internet.
Dynamic Network Interfaces | No | allowSubInterfaces | SUBINTERFACES_BLOCKED | The network doesn't support Dynamic NICs.
Alias IP ranges | No | allowAliasIpRanges | ALIAS_IP_RANGE_BLOCKED | The network doesn't support alias IP ranges, including secondary IPv4 address ranges, which can only be used by alias IP ranges.
IP forwarding | No | allowIpForwarding | IP_FORWARDING_BLOCKED | The network doesn't support IP forwarding.
VM network migration | No | allowNetworkMigration | NETWORK_MIGRATION_BLOCKED | The network doesn't support migrating VMs between networks.
Auto mode | No | allowAutoModeSubnet | AUTO_MODE_SUBNET_BLOCKED | The subnet creation mode of the VPC network can't be set to auto mode.
VPC Network Peering | No | allowVpcPeering | VPC_PEERING_BLOCKED | The network doesn't support VPC Network Peering. Additionally, the network doesn't support private services access, which relies on VPC Network Peering.
Static routes | No | allowStaticRoutes | STATIC_ROUTES_BLOCKED | The network doesn't support static routes.
Packet Mirroring | No | allowPacketMirroring | PACKET_MIRRORING_BLOCKED | The network doesn't support Packet Mirroring.
Cloud NAT | No | allowCloudNat | CLOUD_NAT_BLOCKED | The network doesn't support Cloud NAT.
Cloud Router | No | allowCloudRouter | CLOUD_ROUTER_BLOCKED | The network doesn't support creating Cloud Routers.
Cloud Interconnect | No | allowInterconnect | INTERCONNECT_BLOCKED | The network doesn't support Cloud Interconnect.
Cloud VPN | No | allowVpn | VPN_BLOCKED | The network doesn't support Cloud VPN.
Network Connectivity Center | No | allowNcc | NCC_BLOCKED | The network doesn't support Network Connectivity Center. You can't add the network as a spoke to a Network Connectivity Center hub.
Cloud Load Balancing | No | allowLoadBalancing | LOAD_BALANCING_BLOCKED | The network doesn't support Cloud Load Balancing. You can't create load balancers in the network. Additionally, you can't use Google Cloud Armor in the network, because Google Cloud Armor security policies apply only to load balancers and VMs with external IP addresses.
Private Google Access | No | allowPrivateGoogleAccess | PRIVATE_GOOGLE_ACCESS_BLOCKED | The network doesn't support Private Google Access.
Private Service Connect | No | allowPsc | PSC_BLOCKED | The network doesn't support any Private Service Connect configurations.
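
If you want to check these values for a particular zone, you can inspect the network profile resource itself. The following is a minimal sketch that assumes the gcloud compute network-profiles describe command; the project ID is a placeholder.

# View the properties that Google Cloud sets on the RDMA network profile
# for a given zone, such as interfaceTypes and the allow* fields.
# (Assumes the gcloud compute network-profiles describe command.)
gcloud compute network-profiles describe europe-west1-b-vpc-roce \
    --project=PROJECT_ID \
    --format=yaml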

Additional features that don't apply to traffic from RDMA NICs

Some features of regular VPC networks that are available for traffic of other protocols don't apply to RDMA traffic in a network created with the RDMA network profile. While Google Cloud doesn't prevent you from configuring these features, they aren't effective in VPC networks with the RDMA network profile.

Performance considerations for multi-NIC in the same VPC network

To support workloads that benefit from cross-rail GPU-to-GPU communication, the RDMA network profile lets you create VMs that have multiple MRDMA NICs attached to the same network. However, cross-rail connectivity might affect network performance, such as increased latency. VMs that have MRDMA NICs use NCCL, which attempts to rail-align all network transfers, even for cross-rail communication, for example by using PXN to copy data through NVLink to a rail-aligned GPU before transferring it over the network.
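
As an illustration, the following sketch creates a VM whose nic0 is a gVNIC attached to a regular VPC network, as the network profile requires, and whose MRDMA NICs attach to two different subnets of the same RDMA network. This is a minimal sketch, not a verified command: the machine type, instance, network, and subnet names are placeholders, and the nic-type=MRDMA key in the --network-interface flag is an assumption.

# Minimal sketch: nic0 (gVNIC) attaches to a regular VPC network, and two
# MRDMA NICs attach to different subnets of the same RDMA VPC network.
# Names and the machine type are placeholders; nic-type=MRDMA is assumed.
gcloud compute instances create my-a3-ultra-vm \
    --project=PROJECT_ID \
    --zone=europe-west1-b \
    --machine-type=a3-ultragpu-8g \
    --network-interface=nic-type=GVNIC,network=my-regular-network,subnet=my-regular-subnet \
    --network-interface=nic-type=MRDMA,network=my-rdma-network,subnet=my-rdma-subnet-0,no-address \
    --network-interface=nic-type=MRDMA,network=my-rdma-network,subnet=my-rdma-subnet-1,no-address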

What's next