HPC Cluster

Accessing The System

DNS entry: eniac.iqsc.usp.br
SSH ports: 22 (internal) or 2122 (outside USP networks)
External IP: 143.107.225.64

After logging in to the head node for the first time, users must open an SSH
connection to one of the active computing nodes, e.g., "$ ssh n01".
This procedure ensures the propagation of the generated SSH keys.
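For reference, a typical first-time session might look like the sketch below (replace "username" with your cluster account name; the hostnames and ports are the ones listed above):

```shell
# From outside USP networks, use port 2122:
ssh -p 2122 username@eniac.iqsc.usp.br

# From inside USP networks, the default port 22 works:
ssh username@eniac.iqsc.usp.br

# Once on the head node, hop to a compute node to propagate the SSH keys,
# then return to the head node:
ssh n01
exit
```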

System Layout

Hardware characteristics:
- Intel OmniPath low latency and high throughput network
- 21 computing nodes (n01-n20 and the GPU-enabled gn01)
- Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
- 40 cores per node (80 cores considering hyperthreading)
- RAM available:
  - n01-n04: 188 GB
  - n05-n08, n13-n20: 377 GB
  - n09-n12, gn01: 755 GB
- GPU-enabled node gn01 with 4x NVIDIA Tesla V100 (16 GB RAM each)

Software information:
- All available software is listed under the /dados/softwares folder. Some
  binaries are publicly accessible; others are restricted to certain groups
  of users due to licensing limitations.
  Users are welcome to compile their own applications and run them from their
  personal home folders.

- Intel compilers and tools are available under /opt/intel
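To use the Intel toolchain, its environment script usually has to be sourced first. The exact script name and path depend on the installed product version, so both variants below are assumptions; check /opt/intel to see which layout is present:

```shell
# Intel oneAPI layout (newer installations):
source /opt/intel/oneapi/setvars.sh

# Intel Parallel Studio layout (older installations):
# source /opt/intel/bin/compilervars.sh intel64

# Afterwards the Intel compilers should be on the PATH:
which icc ifort
```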

Scheduler Information

We use the PBS Pro batch system; users should refer to its documentation to
learn how to submit and manage their jobs.
Specific queues are designed to prioritize the load according to resource
requirements and availability, so users are encouraged to choose and specify
the correct queue before submitting a job:

- workq: standard queue for average jobs (default when no queue is specified)
  - max walltime = 48:00:00
  - max simultaneous running processes = 160
  - priority = 50
  - nodes: all nodes, except gn01
  - max allocated cores per job = 20

- longq: jobs with long runtimes
  - max walltime = 720:00:00
  - max simultaneous running processes = 8
  - priority = 10
  - nodes: all nodes, except gn01
  - max allocated cores per job = 40

- shortq: quick jobs with short expected runtime
  - max walltime = 1:00:00
  - max simultaneous running processes = 4
  - priority = 100
  - nodes: all nodes
  - max allocated cores per job = 20

- mediumq: for jobs with regular memory requirements
  - max walltime = 48:00:00
  - max simultaneous running processes = 64
  - priority = 75
  - nodes: n05-n20 (excludes the nodes with less RAM)
  - max allocated cores per job = 40

- largeq: jobs with high memory requirements
  - max walltime = 48:00:00
  - max simultaneous running processes = 16
  - priority = 100
  - nodes: n09-n12 and gn01 (only the nodes with 755 GB RAM)
  - max allocated cores per job = 80

- gpuq: GPU-specific jobs
  - max walltime = 48:00:00
  - max simultaneous running processes = 4
  - priority = 100
  - nodes: gn01 only
  - max allocated resources per job = 80 CPUs, 4 GPUs

- serialq: queue for single-thread jobs
  - max walltime = 120:00:00
  - max simultaneous running processes = 20
  - priority = 50
  - nodes: n20 only
  - max allocated cores per job = 1
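As a concrete illustration, a minimal PBS Pro submission script for a 20-core job on the default workq might look like the sketch below. The job name, walltime, and application name "my_app" are placeholders; adapt the resource requests to the limits of the queue you target:

```shell
#!/bin/bash
#PBS -N example_job          # job name (placeholder)
#PBS -q workq                # queue; swap for shortq, mediumq, largeq, etc.
#PBS -l select=1:ncpus=20    # one chunk of 20 cores (workq per-job limit)
#PBS -l walltime=02:00:00    # must stay within the queue's max walltime
#PBS -j oe                   # merge stdout and stderr into one file

cd "$PBS_O_WORKDIR"          # start from the directory where qsub was run
./my_app > output.log        # "my_app" is a placeholder for your binary
```

Submit with "qsub script.pbs" and monitor with "qstat -u $USER". For a job on gpuq, a select statement such as "-l select=1:ncpus=80:ngpus=4" would be the usual PBS Pro way to request gn01's four V100s, assuming the ngpus resource is configured on this cluster.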