By Meiwen Wang, Sr. Solution Consultant – Shanghai Fastone Information Technology
By Wei Dai, Partner Solutions Architect – AWS
By Xueyao Bai, Solutions Architect – AWS


A growing number of simulations and calculations require real-time processing of large volumes of data, particularly in computer-aided engineering (CAE) and computational fluid dynamics (CFD).

Advances in cloud technology have made it practical to migrate CAE/CFD simulation tasks to the cloud, enabling multi-site collaboration, greater efficiency, and lower costs.

However, users still encounter a range of obstacles including effectively managing clusters, devising optimal scheduling strategies, fine-tuning performance, managing resources, and closely monitoring task progress.

In this post, we will share how to use the Fastone Compute Cloud-Enterprise Edition (Fastone FCC-E) to more efficiently handle large-scale CAE/CFD computations and simulations, while reliably leveraging cloud resources.

Shanghai Fastone Information Technology is an AWS Specialization Partner and AWS Marketplace Seller with the Manufacturing and Industrial Services Competency. Fastone delivers a readily available research and development (R&D) environment to help clients expedite application development across many industries, including biotech, electronic design automation, and financial technology.

Automating the Complexities of Cloud Simulations

Enterprises require considerable computational resources and storage capacity to successfully accomplish simulation tasks within a limited timeframe.

Cloud technology offers scalable compute and storage resources, a wide range of instance types, and global infrastructure. Fastone is a visual platform that puts these capabilities to work quickly and takes only a matter of hours to implement.

The Fastone platform leverages Amazon Web Services (AWS) to aid customers in conducting CAE/CFD simulations in the cloud in the following ways:

  • Enhancing efficiency: This is achieved by employing techniques like parallel computing and task decomposition.
  • Managing large-scale clusters: This is done with dynamic adjustment of resources within the cluster, ensuring the coherence of hardware and software environments.
  • Selecting suitable scheduling strategies: For instance, evenly distributing jobs across nodes or dispatching them to a single node until its resources are fully utilized before scaling out to the next node.
  • Optimizing the computational process: This is done by choosing the instance type and performance level best suited to the desired outcome.
  • Incorporating auto-scaling: Continuous integration and continuous delivery (CI/CD) mechanisms are leveraged for efficient utilization and allocation of cloud resources.
  • Monitoring tasks in real time: Real-time monitoring allows for timely detection of potential issues and swift action to adjust resource allocation, optimize workloads, and ensure high availability and stability across the entire cluster. Monitoring can also be leveraged to generate historical data, providing valuable insights for capacity planning and future system optimizations.
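The two scheduling strategies named in the list above can be illustrated with a minimal Python sketch. The `Node` class, the `place` function, and the strategy names `spread` and `pack` are hypothetical and are not part of FCC-E or Slurm; the sketch only contrasts even distribution with filling one node before moving on to the next.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpus: int                      # total cores on the node
    used: int = 0                  # cores already allocated
    jobs: list = field(default_factory=list)

    @property
    def free(self) -> int:
        return self.cpus - self.used


def place(job: str, cores: int, nodes: list, strategy: str) -> str:
    """Assign a job to one node and return that node's name.

    strategy="spread": pick the node with the most free cores (even distribution).
    strategy="pack":   pick the first node, in list order, that still fits the job.
    """
    candidates = [n for n in nodes if n.free >= cores]
    if not candidates:
        raise RuntimeError(f"no node can fit {job} ({cores} cores)")
    if strategy == "spread":
        chosen = max(candidates, key=lambda n: n.free)
    elif strategy == "pack":
        chosen = candidates[0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    chosen.used += cores
    chosen.jobs.append(job)
    return chosen.name


def demo(strategy: str) -> list:
    """Place four single-core jobs on two 4-core nodes."""
    nodes = [Node("node-1", 4), Node("node-2", 4)]
    return [place(f"job-{i}", 1, nodes, strategy) for i in range(1, 5)]
```

With `spread`, the four jobs alternate between the two nodes; with `pack`, all four land on `node-1`, and a fifth job would be the first to reach `node-2`.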

Consider one customer’s structural analysis and simulation application as an illustrative example.


Figure 1 – Reference architecture diagram.

The application is primarily employed for complex finite element analysis and calculations, encompassing a wide range of fields such as mechanical, structural, fluid, and geological domains. It requires substantial computational and storage resources, along with parallel computing capabilities, that existing local resources cannot adequately provide.

To meet these demands while keeping the architecture and data flows secure, the Fastone FCC-E platform draws on the broad range of AWS instance types using the architecture depicted in Figure 1.

After logging in to the Fastone FCC-E web portal, select New Job, set the desired number of CPU cores, specify the input file, and choose the appropriate instance type and number of data nodes.


Figure 2 – Reference job submission.

Subsequently, the Fastone FCC-E platform will seamlessly create the high-performance computing (HPC) cluster on AWS for the job through the API, and promptly initiate the job once the cluster is prepared.

What’s Behind the Automation?

Now, let’s examine more closely the automated process of building the cluster and submitting the CAE job for execution.

First, the Fastone FCC-E platform utilizes an infrastructure as code (IaC) tool to generate a comprehensive stack on AWS. This process employs a predefined YAML template that encompasses essential components such as key pairs, virtual private clouds (VPCs), subnets, security groups, and route tables.

The following example YAML template illustrates the automation process:

# Example template showing how FCC-E automates building an HPC cluster for a submitted job from AWS resources, including the VPC, subnets, security groups, images, and instances.
vendor: aws            
region: cn-northwest-1 
credential:            
  ACCESS_KEY_ID: '***'
  ACCESS_KEY_SECRET: '***'
bucket:
  region: "cn-northwest-1"                 
  bucket_name: "bucketname-s3"           
  access_key_id: "***"           # access key
  access_key_secret: "***"       # secret key
stack:
  prefix: "standard_public"                  
  master_image: "fs-master-version-***"     
  node_image: "fs-linux-version-***"    
  master_inst_type: t3a.xlarge                
  node_inst_type: t3a.large                  
  # ------------------------------ subnet name --
  subnet_vdi: "{{ stack.prefix }}-subnet-vdi"
  subnet_master: "{{ stack.prefix }}-subnet-master"
  subnet_storage: "{{ stack.prefix }}-subnet-storage"
  subnet_approval: "{{ stack.prefix }}-subnet-approval-cluster"
  subnet_login: "{{ stack.prefix }}-subnet-login"
  subnet_default: "{{ stack.prefix }}-subnet-default-cluster"
  # ----------------------------- firewall name --
  firewall_master: "{{ stack.prefix }}-firewall-master"
  firewall_common: "{{ stack.prefix }}-firewall-common"
  firewall_storage: "{{ stack.prefix }}-firewall-storage"
  firewall_monitor: "{{ stack.prefix }}-firewall-monitor"
  # ------------------------------------- image --
  image_master: "{{ stack.master_image }}"
  image_common: "{{ stack.master_image }}"
  image_monitor: "{{ stack.master_image }}"
configure:
  dest: "/opt/deploy"               
  store_dir: ".buri"     #directory for IaC tool            
ssh:
  keypair: "{{ stack.prefix }}-keypair" 
  public_key_path: ""         
  private_key_path: ""        
vpc:
  name: "{{ stack.prefix }}-vpc"
  cidr: "10.0.0.0/16"                  # VPC CIDR block
zone_id: "cn-northwest-1a"             # AWS Availability Zone
subnets:
  - name: "{{ stack.subnet_vdi }}"         
    cidr: "10.0.1.0/24"
    internet: true
  - name: "{{ stack.subnet_master }}"       
    cidr: "10.0.2.0/24"
    internet: true
    # ---------------------- Cluster Subnet ----------------------
  - name: "{{ stack.subnet_default }}"      
    cidr: "10.0.16.0/20"            #  10.0.16.1 ~ 10.0.31.254   4k hosts
    internet: false
  - name: "{{ stack.subnet_login }}"        
    cidr: "10.0.5.0/24"
    internet: true
    # -- firewall ------------------------------------------------
    firewalls: ...
    # -- stack service ------------------------------------------------
    nodes: ...
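The subnet layout in the template can be sanity-checked with Python's standard `ipaddress` module. The sketch below is illustrative and not part of FCC-E: the `check_layout` function is hypothetical, and the subnet names are shortened from the template. It verifies that each subnet lies inside the VPC CIDR, that no two subnets overlap, and it reports the host range of each subnet.

```python
import ipaddress

# CIDR blocks copied from the example template; names shortened for readability.
VPC_CIDR = "10.0.0.0/16"
SUBNETS = {
    "subnet-vdi": "10.0.1.0/24",
    "subnet-master": "10.0.2.0/24",
    "subnet-default-cluster": "10.0.16.0/20",
    "subnet-login": "10.0.5.0/24",
}


def check_layout(vpc_cidr: str, subnets: dict) -> dict:
    """Validate the subnets and return {name: (first_host, last_host, host_count)}."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}

    # Every subnet must sit inside the VPC address space.
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            raise ValueError(f"{name} ({net}) is outside the VPC {vpc}")

    # No two subnets may share addresses.
    names = list(nets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                raise ValueError(f"{a} overlaps {b}")

    # Host range excludes the network and broadcast addresses.
    return {
        name: (str(net[1]), str(net[-2]), net.num_addresses - 2)
        for name, net in nets.items()
    }
```

For the cluster subnet, `check_layout(VPC_CIDR, SUBNETS)["subnet-default-cluster"]` gives the range 10.0.16.1 to 10.0.31.254 with 4,094 hosts, which matches the "4k hosts" comment in the template. AWS reserves five addresses in every subnet, so the number actually assignable to instances is slightly lower.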

Next, the Fastone FCC-E platform retrieves configuration details from the job and determines the quantity and specifications of each node within the cluster. It then employs the AWS API to launch Amazon Elastic Compute Cloud (Amazon EC2) instances, utilizing the Fastone image that incorporates the necessary HPC cluster libraries, dependencies, and the installed CAE application.
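As a rough illustration of this step, the following Python sketch derives a node count from a requested core count and assembles the arguments for the EC2 RunInstances call. It is a sketch under stated assumptions, not FCC-E's implementation: the function names, the `cluster` tag key, and the placeholder image and subnet IDs are hypothetical. Only the `launch` function needs boto3 and AWS credentials.

```python
import math


def nodes_needed(requested_cores: int, cores_per_node: int) -> int:
    """Smallest number of nodes that provides at least the requested cores."""
    if requested_cores <= 0 or cores_per_node <= 0:
        raise ValueError("core counts must be positive")
    return math.ceil(requested_cores / cores_per_node)


def build_launch_request(image_id: str, instance_type: str, subnet_id: str,
                         count: int, cluster: str) -> dict:
    """Keyword arguments for the EC2 RunInstances call (boto3 run_instances)."""
    return {
        "ImageId": image_id,            # image with the HPC libraries and CAE app
        "InstanceType": instance_type,
        "SubnetId": subnet_id,
        "MinCount": count,              # launch exactly `count` instances
        "MaxCount": count,
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                # Hypothetical tag used here to group instances by cluster.
                "Tags": [{"Key": "cluster", "Value": cluster}],
            }
        ],
    }


def launch(request: dict, region: str) -> dict:
    """Send the request to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the helpers above run without the AWS SDK

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.run_instances(**request)


# Example values; the IDs are placeholders, not real resources.
count = nodes_needed(requested_cores=10, cores_per_node=4)
request = build_launch_request("ami-EXAMPLE", "t3a.large", "subnet-EXAMPLE",
                               count, "fastone-44")
```

Keeping the request builder separate from the AWS call makes the sizing logic easy to check without touching an AWS account.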

Furthermore, the platform automatically updates the hosts file so that the nodes can resolve one another by name.

Node updates in hosts file:

$ sudo cat /etc/hosts
172.31.32.13   public-1
172.31.16.27   head-1
172.31.16.96   login-1
172.31.16.60   partition9344-1
172.31.16.141  partition9344-a1
172.31.16.98   partition9344-a2
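Entries like these can be generated from a mapping of host names to private IP addresses. The sketch below is a hypothetical illustration of the formatting step only; how FCC-E collects the addresses and writes the file is not described in this post.

```python
def render_hosts(nodes: dict) -> str:
    """Render {hostname: ip} as /etc/hosts lines with the IP column padded."""
    width = max(len(ip) for ip in nodes.values())
    return "\n".join(f"{ip.ljust(width)}  {name}" for name, ip in nodes.items())


# Two of the entries from the listing above.
print(render_hosts({
    "head-1": "172.31.16.27",
    "partition9344-a1": "172.31.16.141",
}))
```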

Then, Fastone FCC-E constructs an HPC cluster on AWS by using the slurm.conf and partitions.conf files.

The slurm.conf file is the principal configuration file for a Slurm-based HPC cluster. It contains the global options that define the behavior of the entire cluster, covering aspects such as nodes, queues, resource limits, account management, and task scheduling policies.

Administrators can customize the cluster’s scale, performance parameters, permission controls, and other settings by editing slurm.conf to match specific application requirements.

The slurm.conf file defines the specifications of a Slurm-based cluster:

# This sample shows how FCC-E uses Amazon EC2 instances to build a Slurm-based HPC cluster.
$ sudo cat /etc/slurm/slurm.conf
ClusterName=fastone-44
SlurmctldHost=head-1
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
CacheGroups=0
DefMemPerNode=1
EnforcePartLimits=ALL
MaxArraySize=4000000
MaxJobCount=8000000
FirstJobId=1
MaxJobId=8000000
MinJobAge=180
ReturnToService=2 # return a DOWN node to service as soon as it registers with a valid configuration
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurm.spool
StateSaveLocation=/var/spool/slurm.state
SlurmctldParameters=cloud_dns,nohold_on_prolog_fail
SwitchType=switch/none
UsePAM=0
CommunicationParameters=NoAddrCache
GresTypes=gpu
MpiDefault=pmix
ProctrackType=proctrack/pgid
PrologFlags=X11
TaskPlugin=task/none
CpuFreqGovernors=Performance
PowerPlugin=none
PreemptType=preempt/none
TaskProlog=/etc/slurm/task_prolog.sh
TaskEpilog=/etc/slurm/task_epilog.sh
#
# SlurmdParameters
SlurmdParameters=config_overrides
#
# TIMERS
SlurmctldTimeout=300 # seconds a backup controller waits for the primary controller to respond before taking over
SlurmdTimeout=300 # seconds slurmctld waits for a slurmd to respond before marking its node DOWN
InactiveLimit=3600
KillWait=30
Waittime=0
ResumeTimeout=300
TCPTimeout=30
TreeWidth=65535
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory,CR_LLN
FastSchedule=1
# COMPUTE NODES
NodeName=DEFAULT
PartitionName=DEFAULT MaxTime=INFINITE State=UP
# Include Partitions
include partitions.conf

Within the partitions.conf file, users can observe the definition of compute nodes and partitions, along with their detailed specifications. This file enables users to submit tasks to the relevant partitions, depending on their specific requirements.

It also outlines the characteristics of each partition, including name, available nodes, allocation policies, priorities, and other relevant attributes.

The partitions.conf file defines the names and nodes of the partitions:

$ sudo cat /etc/slurm/partitions.conf
#
# PARTITION partition9344
PartitionName=partition9344 Nodes=partition9344-1,partition9344-a[1-999] Default=YES 
#  DUMMY
NodeName=partition9344-a[1-999] CPUs=4 RealMemory=6963 State=FUTURE
#  NODES
NodeName=partition9344-1 CPUs=4 RealMemory=6963 Weight=1 State=CLOUD
NodeName=partition9344-a1 CPUs=4 RealMemory=6963 Weight=1001 State=CLOUD
NodeName=partition9344-a2 CPUs=4 RealMemory=6963 Weight=1002 State=CLOUD
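The `Weight` values matter because, all else being equal, Slurm allocates the nodes with the lowest weight first. The following Python sketch reads the `NodeName` lines shown above and lists the nodes in that preference order. The helper names are hypothetical, and the parser handles only the simple `Key=Value` form used in this listing.

```python
PARTITIONS_CONF = """\
PartitionName=partition9344 Nodes=partition9344-1,partition9344-a[1-999] Default=YES
#  DUMMY
NodeName=partition9344-a[1-999] CPUs=4 RealMemory=6963 State=FUTURE
#  NODES
NodeName=partition9344-1 CPUs=4 RealMemory=6963 Weight=1 State=CLOUD
NodeName=partition9344-a1 CPUs=4 RealMemory=6963 Weight=1001 State=CLOUD
NodeName=partition9344-a2 CPUs=4 RealMemory=6963 Weight=1002 State=CLOUD
"""


def parse_node_line(line: str) -> dict:
    """Split one 'NodeName=... Key=Value ...' line into a dict of strings."""
    return dict(token.split("=", 1) for token in line.split())


def allocation_order(conf_text: str) -> list:
    """Return node names that declare a Weight, lowest weight first."""
    nodes = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line.startswith("NodeName="):
            continue  # skip comments and PartitionName lines
        fields = parse_node_line(line)
        if "Weight" in fields:
            nodes.append((int(fields["Weight"]), fields["NodeName"]))
    return [name for _, name in sorted(nodes)]


print(allocation_order(PARTITIONS_CONF))
```

Here `partition9344-1` is preferred over the `-a1` and `-a2` nodes because its weight is lowest.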

The Fastone FCC-E platform automatically submits the job to the cluster queue, and the execution of the job commences promptly once the compute nodes are prepared.

SLURM command to show the submitted job status in the queue:

# Checking job status with the standard Slurm command squeue
$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  210 partition  appname username PD  0:00     3 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
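For scripted status checks, `squeue` can print delimiter-separated fields through its `--format` option, which is easier to parse than the aligned table. The sketch below is illustrative and not part of FCC-E: the field names and helper functions are hypothetical, and `query_squeue` must run on a host where the Slurm client tools are installed.

```python
import subprocess

FIELDS = ("job_id", "partition", "name", "user", "state", "time", "nodes", "reason")
# %i job ID, %P partition, %j job name, %u user, %t compact state,
# %M time used, %D node count, %R node list or pending reason.
SQUEUE_FORMAT = "%i|%P|%j|%u|%t|%M|%D|%R"


def parse_squeue(text: str) -> list:
    """Turn '|'-separated squeue output into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = line.strip().split("|", len(FIELDS) - 1)
        jobs.append(dict(zip(FIELDS, values)))
    return jobs


def pending_jobs(jobs: list) -> list:
    """Jobs still waiting for resources (state PD)."""
    return [job for job in jobs if job["state"] == "PD"]


def query_squeue() -> list:
    """Run squeue and parse its output. Requires the Slurm client tools."""
    result = subprocess.run(
        ["squeue", "--noheader", f"--format={SQUEUE_FORMAT}"],
        check=True, capture_output=True, text=True,
    )
    return parse_squeue(result.stdout)


# The pending job from the listing above, in the delimited format.
SAMPLE = ("210|partition|appname|username|PD|0:00|3|"
          "(Nodes required for job are DOWN, DRAINED or reserved "
          "for jobs in higher priority partitions)")
jobs = parse_squeue(SAMPLE)
```

A job in state `PD` with this reason is typically waiting for its cloud nodes to come up, so a caller could poll `query_squeue()` until the state changes.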

Upon completion of the job, Fastone FCC-E automatically releases the allocated resources to prevent unnecessary costs. The output of the job becomes accessible on the platform’s file system. Downloading the output files requires approval from the system administrator.

Additional Benefits

Fastone FCC-E offers an intuitive billing management module that facilitates financial analysis, order details, and budget management, so users can effectively monitor and manage their expenses associated with the platform’s services.


Figure 3 – Intuitive billing status.

In addition, Fastone FCC-E provides integrated Secure Shell (SSH) and web-based virtual network computing (VNC) functionality, so users can access the command line or desktop directly from the web portal.

For customers who want a richer desktop experience, Fastone FCC-E also supports commercial virtual desktop infrastructure (VDI) solutions, including NICE DCV and Amazon WorkSpaces, so users can keep working the way they are used to.


Figure 4 – Integrated SSH and VNC.

During task execution, users can monitor job and resource status in real time through the platform’s monitoring and alerting modules. Alerts are sent when a job fails or a threshold is reached; for example, when memory usage reaches 90%, the platform suggests changing the instance type.


Figure 5 – Reference dashboard.
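The threshold rule described above can be expressed in a few lines of Python. This is a minimal sketch, assuming memory figures in megabytes and a 90% default threshold; the `memory_alert` function and its message text are hypothetical and do not reflect the platform's actual alerting API.

```python
from typing import Optional


def memory_alert(node: str, used_mb: float, total_mb: float,
                 threshold: float = 0.90) -> Optional[str]:
    """Return an alert message when memory use reaches the threshold, else None."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    ratio = used_mb / total_mb
    if ratio >= threshold:
        return f"{node}: memory at {ratio:.0%}, consider a larger instance type"
    return None


# 6963 MB is the RealMemory value of the example compute nodes.
print(memory_alert("partition9344-1", 6300, 6963))
```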

Alongside the built-in monitoring and alerting system, the platform offers a highly customizable monitoring and alerting stack built on Prometheus and Grafana, which lets users run detailed analyses tailored to their needs.


Figure 6 – Advanced dashboard.

Conclusion

This post demonstrated how the Fastone FCC-E platform automatically creates clusters, executes tasks, and releases resources within an AWS environment to support enterprise simulations on the cloud.

The Fastone platform can assist users in overcoming challenges related to enhancing efficiency, managing clusters, task scheduling, resource management, and monitoring.

Whether you’re a small startup or large enterprise, Fastone’s solution is designed to help automate workflows and streamline your simulations, all while providing top-notch performance and security. Take your simulations to the next level with Fastone FCC-E on AWS.

Learn more about Fastone FCC-E in AWS Marketplace.



Fastone – AWS Partner Spotlight

Shanghai Fastone Information Technology is an AWS Partner that delivers a readily available R&D environment to help clients expedite application development across many industries, including biotech, electronic design automation, and financial technology.

Contact Fastone | Partner Overview | AWS Marketplace