[SMART]Rack AI

Inference & Training at Scale

“Our requirements for computational power, density, and network speed are very different from what conventional servers provide. AMAX was able to deliver the [SMART]Rack AI, a fully customized rack-scale Deep Learning solution with comprehensive [SMART]DC power management, an in-rack cooling system, and ultra-fast network speed, that solved our problems above and beyond what we thought we were looking for.”

AMAX [SMART]Rack AI is a turnkey Machine Learning cluster designed for optimal manageability and performance, featuring 96x NVIDIA® Tesla P100, P40, or V100 GPUs for up to 1.34 PFLOPS per rack. Delivered plug-and-play and fully loaded, the solution features all-flash storage for an ultra-fast in-rack data repository, 25G high-speed networking, the [SMART]DC Data Center Manager, and an in-rack battery for graceful shutdown in power-loss scenarios. [SMART]Rack AI is the perfect platform for on-premise AI clouds and DL-as-a-Service, or can drop into any data center environment for the highest-performance training and inference at scale.

[SMART]DC Data Center Manager

[SMART]DC is an HPC-optimized, fully integrated DCIM used to remotely monitor, manage, and orchestrate power-dense, GPU-based ML deployments, where real-time temperature, power, and system health monitoring are critical to uninterrupted operation. Features include policy-based emergency power and resource management, remote KVM, alert and event notification, and advanced analytics.
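
To make the policy-based power management concrete, below is a minimal sketch of the kind of telemetry polling and emergency power capping a DCIM performs, written against the standard Redfish BMC API. The BMC address, credentials, chassis path, thresholds, and cap values are illustrative assumptions, not the actual [SMART]DC product API.

```python
# Minimal sketch: BMC telemetry polling and policy-based power capping,
# the kind of loop a DCIM such as [SMART]DC runs. Assumes a Redfish-capable
# BMC; host, credentials, and thresholds below are hypothetical.
import requests

BMC = "https://10.0.0.21"            # hypothetical BMC address
AUTH = ("admin", "password")          # hypothetical credentials
CHASSIS = f"{BMC}/redfish/v1/Chassis/1"

def read_power_watts():
    """Read current chassis power draw from the Redfish Power resource."""
    r = requests.get(f"{CHASSIS}/Power", auth=AUTH, verify=False)
    r.raise_for_status()
    return r.json()["PowerControl"][0]["PowerConsumedWatts"]

def set_power_cap(limit_watts):
    """Apply a power cap, e.g. as an emergency policy during an outage."""
    body = {"PowerControl": [{"PowerLimit": {"LimitInWatts": limit_watts}}]}
    requests.patch(f"{CHASSIS}/Power", json=body, auth=AUTH, verify=False)

if read_power_watts() > 9000:         # illustrative emergency threshold
    set_power_cap(5000)               # halve load to stretch battery hold-up
```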

25GbE Fabrics Featuring RoCE

Benefit from the latest 25G network technology for increased in-rack bandwidth and productivity. The 48-port 25G top-of-rack switch provides 48x 25G downlinks and 6x 100G uplinks, removing existing bottlenecks between compute and SSD/NVMe storage and accelerating application workloads. AMAX's 25G fabric supports RDMA over Converged Ethernet (RoCE) and link aggregation for connectivity previously available only to HPC architectures.
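
As a back-of-the-envelope check on the switch configuration quoted above, the following short calculation works out the aggregate capacities and the resulting oversubscription ratio:

```python
# Fabric math for the 48x 25G downlink / 6x 100G uplink TOR configuration.
downlink_gbps = 48 * 25    # server-facing capacity: 1200 Gb/s
uplink_gbps = 6 * 100      # spine-facing capacity:   600 Gb/s

print(f"Downlink capacity: {downlink_gbps} Gb/s")
print(f"Uplink capacity:   {uplink_gbps} Gb/s")
print(f"Oversubscription:  {downlink_gbps / uplink_gbps:.0f}:1")
# -> 2:1 at the uplinks; in-rack (east-west) traffic between compute and
#    flash storage runs at full 25G line rate, which is where RoCE matters.
```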

10kW In-Rack Backup Battery Solution

Designed to bridge short power outages and safely shut down servers without an external UPS, the in-rack battery provides 2.5 minutes of backup power at a 10kW load per battery. In addition, [SMART]DC smart power policies reduce the power consumption of the GPU servers, stretching the battery hold-up time to 5 minutes, comparable to state-of-the-art centralized UPS solutions.
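
The numbers behind this claim are worth working out. Assuming an ideal discharge, 2.5 minutes at 10kW fixes the usable energy per battery, and doubling the hold-up time from the same energy implies roughly halving the load:

```python
# Worked numbers behind the battery hold-up claim (ideal discharge assumed).
load_kw = 10.0
holdup_min = 2.5

energy_kwh = load_kw * holdup_min / 60          # ~0.42 kWh usable per battery
print(f"Usable energy per battery: {energy_kwh:.2f} kWh")

# To stretch hold-up to 5 minutes from the same energy, the [SMART]DC power
# policy must cap the average rack load at roughly half:
target_min = 5.0
capped_kw = energy_kwh / (target_min / 60)      # -> 5.0 kW
print(f"Required capped load for {target_min:.0f} min: {capped_kw:.1f} kW")
```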

Compute Module

One [SMART]Rack AI Compute Module consists of four MATRIX 280 8-GPU servers featuring P40, P100, or V100 GPUs. Each rack encloses up to three Compute Modules, providing over 1 PFLOPS of compute power.
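
A quick sanity check ties the module layout to the rack-level figures quoted earlier:

```python
# Sanity check on the rack-level compute figures.
gpus_per_server = 8
servers_per_module = 4
modules_per_rack = 3

gpus_per_rack = gpus_per_server * servers_per_module * modules_per_rack
print(f"GPUs per rack: {gpus_per_rack}")        # -> 96, matching the spec

# 1.34 PFLOPS / 96 GPUs ~= 14 TFLOPS per GPU, consistent with Tesla V100
# single-precision throughput (~14-15.7 TFLOPS depending on form factor).
per_gpu_tflops = 1340 / gpus_per_rack
print(f"Implied per-GPU throughput: {per_gpu_tflops:.1f} TFLOPS")
```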

4x MATRIX 280 2U 8-GPU Servers

All-Flash Storage Appliance

Scalable Multi-Framework Deep Learning IDE

MATRIX Powered by Bitfusion Flex

The MATRIX is an end-to-end Deep Learning platform geared toward fast-tracking AI development and simplifying GPU resource and workload management. Leverage pre-built Docker containers featuring the latest DL frameworks and data science libraries to kick-start and shorten training iterations. GPU-over-Fabrics technology enables the sharing and scaling of large numbers of GPUs across systems for multi-tenancy and highly customizable self-service features. A container-launch sketch follows the list below.

  • Fully integrated Jupyter Notebook application for a shareable GUI development environment that supports 40 different languages, including Python, Julia, R, and Scala.
  • Flexible GPU resource management via auto-allocation, with interactive workspace sharing across multiple users and multiple workloads.
  • Track development workloads remotely via both GUI and CLI.
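
As promised above, here is a minimal sketch of launching a containerized training run in the style the MATRIX workflow describes, using the Docker SDK for Python. The image name, script path, and volume mapping are hypothetical placeholders, not actual MATRIX or Bitfusion Flex artifacts:

```python
# Minimal sketch: launch a pre-built DL framework container with GPU access.
# Requires the Docker SDK for Python (pip install docker); names below are
# illustrative, not MATRIX/Bitfusion Flex-specific.
import docker

client = docker.from_env()

container = client.containers.run(
    image="nvcr.io/nvidia/tensorflow:latest",   # hypothetical DL image tag
    command="python /workspace/train.py",       # hypothetical training script
    detach=True,
    volumes={"/data": {"bind": "/workspace/data", "mode": "ro"}},
    device_requests=[                           # expose the allocated GPUs
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
)
print(f"Started training container {container.short_id}")
```

High Power Density Rack Cooling System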
