[SMART]Rack AI
Inference & Training at Scale
“Our requirements for computational power, density and network speed are very different from what conventional servers provide. AMAX was able to deliver the [SMART]Rack AI, a fully customized rackscale Deep Learning solution with comprehensive [SMART]DC power management, an in-rack cooling system, and ultra-fast networking, that solved our problems above and beyond what we thought we were looking for.”
Rokid
AMAX [SMART]Rack AI is a turnkey Machine Learning cluster designed for optimal manageability and performance, featuring 96x NVIDIA® Tesla® P100, P40 or V100 GPUs for up to 1.34 PFLOPS per rack. Delivered plug-and-play and fully loaded, the solution features all-flash storage for an ultra-fast in-rack data repository, 25G high-speed networking, the [SMART]DC Data Center Manager and an in-rack battery for graceful shutdown during power loss. [SMART]Rack AI is the perfect platform for on-premise AI clouds and DL-as-a-Service, or to drop into any data center environment for the highest performance in training and inference at scale.
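The 1.34 PFLOPS figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the V100 configuration and the Tesla V100's roughly 14 TFLOPS of peak FP32 throughput; the P40 and P100 configurations would land at different totals.

```python
# Back-of-the-envelope check of the per-rack peak compute figure.
# Assumption: Tesla V100, ~14 TFLOPS peak FP32 per GPU.
V100_FP32_TFLOPS = 14.0
GPUS_PER_RACK = 96

peak_pflops = GPUS_PER_RACK * V100_FP32_TFLOPS / 1000
print(f"Peak: {peak_pflops:.2f} PFLOPS")  # → Peak: 1.34 PFLOPS
```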
[SMART]DC Data Center Manager
[SMART]DC is an HPC-optimized, fully integrated DCIM to remotely monitor, manage and orchestrate power-dense GPU-based ML deployments, where real time temperature, power and system health monitoring are critical to ensure uninterrupted operation. Features include policy-based emergency power and resource management, remote KVM, alert and event notification, and advanced analytics.
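The policy-based emergency power management described above can be illustrated with a small sketch. This is a hypothetical stand-in, not [SMART]DC's actual interface: the function names and the proportional-capping strategy are assumptions, and a real DCIM would apply caps through IPMI or a similar management channel.

```python
# Hypothetical sketch of a policy-based rack power manager; names and
# strategy are illustrative, not the actual [SMART]DC implementation.
def apply_power_policy(node_watts, rack_budget_watts):
    """Return per-node power caps that keep the rack under budget.

    If total draw exceeds the budget, scale every node's cap down
    proportionally; otherwise leave nodes uncapped (None).
    """
    total = sum(node_watts.values())
    if total <= rack_budget_watts:
        return {node: None for node in node_watts}
    scale = rack_budget_watts / total
    return {node: int(w * scale) for node, w in node_watts.items()}

# Example: three 8-GPU servers drawing 3.2 kW each against a 9 kW
# emergency budget (e.g. while running on the in-rack battery).
caps = apply_power_policy({"node1": 3200, "node2": 3200, "node3": 3200}, 9000)
print(caps)  # → each node capped at 3000 W
```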
25GbE Fabrics Featuring RoCE
Benefit from the latest 25G network technology for increased in-rack bandwidth and productivity. 48x 25G downlinks and 6x 100G uplinks remove existing bottlenecks between compute and SSD/NVMe storage, and accelerate application workloads. AMAX’s 25G fabric supports RDMA over Converged Ethernet (RoCE) and link aggregation, delivering connectivity previously available only to HPC architectures.
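The port counts above imply a specific oversubscription ratio at the top-of-rack switch, which this quick calculation makes explicit:

```python
# Oversubscription implied by the stated port counts:
# 48x 25G downlinks toward servers/storage, 6x 100G uplinks to the spine.
downlink_gbps = 48 * 25   # 1200 Gbps aggregate downlink
uplink_gbps = 6 * 100     # 600 Gbps aggregate uplink
print(f"Oversubscription: {downlink_gbps // uplink_gbps}:1")  # → 2:1
```

A 2:1 ratio is typical for a leaf/top-of-rack tier; in-rack traffic between compute and the all-flash storage never crosses the uplinks at all.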
10kW In-Rack Backup Battery Solution
Designed to bridge short power outages and safely shut down servers without an external UPS, the in-rack battery provides 2.5 min of backup power at a 10kW load per battery. In addition, [SMART]DC smart power policies can reduce GPU server power consumption and stretch the battery hold-up time to 5 min, comparable to state-of-the-art centralized UPS solutions.
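The two hold-up figures are consistent with a fixed energy budget: 2.5 min at the full 10 kW load implies roughly 0.42 kWh of usable energy per battery, so halving the load via power capping doubles the hold-up time. The 5 kW post-capping draw below is an assumption chosen to match the stated 5 min figure.

```python
# Battery hold-up arithmetic from the figures above.
battery_kwh = 10 * (2.5 / 60)       # ~0.42 kWh usable energy per battery
capped_load_kw = 5                  # assumed draw after [SMART]DC capping
holdup_min = battery_kwh / capped_load_kw * 60
print(f"{holdup_min:.1f} min at {capped_load_kw} kW")  # → 5.0 min at 5 kW
```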
Compute Module
One [SMART]Rack AI Compute Module consists of four MATRIX 280 8-GPU Servers, featuring P40, P100, or V100 GPUs. Each rack encloses up to three Compute Modules to provide over 1 PetaFLOP of compute power.
[Rack diagram: 4x MATRIX 280 2U 8-GPU Servers; All-Flash Storage Appliance]
Scalable Multi-Framework Deep Learning IDE
MATRIX Powered by Bitfusion Flex
The MATRIX is an end-to-end Deep Learning platform geared toward fast-tracking AI development and simplifying GPU resource and workload management. Leverage pre-built Docker containers featuring the latest DL frameworks and data science libraries to kickstart projects and shorten training iterations. GPU-over-Fabrics technology enables sharing and scaling large numbers of GPUs across systems for multi-tenancy and highly customizable self-service features.
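The container workflow described above can be sketched with a generic example. The image tag and flags below are illustrative assumptions (a standard NVIDIA NGC-style TensorFlow container and Docker's GPU runtime), not the specific images or tooling shipped with the MATRIX platform.

```shell
# Illustrative only: pull a pre-built framework container and run a
# training script with GPU access. Image tag, mount paths, and script
# name are hypothetical examples, not MATRIX specifics.
docker pull nvcr.io/nvidia/tensorflow:23.10-tf2-py3
docker run --gpus all --rm -it \
    -v /data:/workspace/data \
    nvcr.io/nvidia/tensorflow:23.10-tf2-py3 \
    python train.py
```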
High Power Density Rack Cooling System (Add-On)
Cooling via an active rear-door heat exchanger efficiently removes heat from GPU racks where existing data center cooling is insufficient, and is substantially more cost-effective than employing CRAC units. Fans in the rear door enable shared cooling between adjacent racks (only every second or third rack requires a rear-door heat exchanger) and provide auxiliary cooling to supplement existing solutions.

