AMiCO

HELP: amico-troubles@fisica.unimi.it
MANAGEMENT: amico-resp@fisica.unimi.it
HTCONDOR: condor-troubles@mi.infn.it
ACCOUNT
JOBS SUBMISSION
CLUSTERS OVERVIEW
HOWTO
USE CASE
CLUSTERS DETAILS

AMiCO stands for “Apparato MIlanese per il Calcolo Opportunistico” (Milanese Apparatus for Opportunistic Computing).
Implemented by the Dipartimento di Fisica and INFN, AMiCO is a project that federates heterogeneous computing clusters at the Physics Department of Università degli Studi di Milano and at INFN Milano.
Clusters can

  • spill their excess jobs over to unused resources when they are overloaded,
  • while preserving the possibility for cluster owners to preempt alien jobs.
We use the job-start and flocking features of HTCondor. We added support for:
  • dynamic slots
  • parallel scheduling
  • Docker jobs
For jobs needing data access while running outside their home cluster we provide:
  • a CEPH readable/writeable storage accessible via S3 on RADOS gateway
  • CVMFS, mounted on worker nodes and inside Docker containers.
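
As a sketch of what a Docker job description could look like with the support listed above (the image name, command and resource requests are illustrative assumptions, not AMiCO defaults):

```
# docker_hello.sub — hypothetical HTCondor Docker-universe submit description
universe       = docker
docker_image   = python:3.11-slim
executable     = /usr/local/bin/python3
arguments      = -c "print('hello from a container')"
output         = docker_hello.out
error          = docker_hello.err
log            = docker_hello.log
request_cpus   = 1
request_memory = 1G
queue
```

The job is queued with `condor_submit docker_hello.sub`; CVMFS, as noted above, is also mounted inside the container.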

ACCOUNT: requirements and how to request access

You need an “idefix” account (INFN – Sezione di Milano) or a UNIMI account (@unimi.it or @studenti.unimi.it). An “idefix” account request must be submitted to the INFN Administration: on this web site, follow the menu ACCOUNTING E POSTA, menu item RICHIESTA DI ACCOUNT.
To request access to submit jobs with HTCondor you can
  • contact one of the following Cluster managers:
    Leonardo Carminati
    Laura Perini
    Physics of Matter –> Nicola Manini / Davide Galli
    Cosmology-CMB (Cosmic Microwave Back-ground radiation) –> Davide Maino
    Cosmology-LSS (Large-Scale Structure) –> Luigi Guzzo
    Theoretical Physics –> Alessandro Vicini
    Department’s cluster (“MAGI”) –> Giovanni Onida
  • Undergraduate and graduate students, and guests without a research-group manager, can request authorization for the “MAGI” cluster (network point of entry: “gaspare”) by writing to amico-troubles@mi.infn.it (specifying email, role and research activities); access comes with a 50 GB user quota and a storage quota on CEPH.

JOBS SUBMISSION:

TUTORIAL:
AMiCO: practical introduction, by Francesco Prelz (INFN – MI)
AMiCO: special cases, by Francesco Prelz (INFN – MI)
In short…
The computing resources are organised in a number of privately owned and operated computing clusters.
Inside each cluster, one “head node” is usually charged with co-ordinating the cluster, and sometimes also acts as a single network point of entry.
Other nodes in the cluster execute jobs (“execution nodes”). In AMiCO’s infrastructure, all execution nodes in any cluster can communicate directly over the local area network.
The TABLE below gives an overall look at the available clusters, and a rough idea of their computing power.
Nodes where jobs are submitted and queued are called “submit nodes”.
Typically, users who need to submit jobs share some interests with cluster owners, so they have priority access to some cluster.
Interactive execution and (possibly) various batch systems are used to organise the workload in each cluster, typically with less than 100% resource occupancy.
AMiCO aims to be friendly to local cluster owners: when local workload appears, alien jobs are first suspended and then migrated.
Current default policy:
  • Suspend after 2 minutes of local activity.
  • Vacate and migrate if the job cannot be restarted within 10 minutes.
An upper-tier service (or “Central Manager“, codename: superpool-cm) matching available computing resources with pending job requests can compensate load peaks across clusters and increase goodput.
The semantics of this resource sharing service is opportunistic: HTCondor is a specialised solution for this scenario.
If HTCondor is also used as a local cluster ‘batch system’, then local and AMiCO’s jobs can be handled in a uniform way.
This scenario cannot be serviced with any number of FIFO (first-in-first-out) queues.
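
To make the above concrete, a minimal “Hello world” submission on a submit node could look like the following; the file names and resource requests are illustrative:

```
# hello.sub — minimal HTCondor submit description (vanilla universe)
universe       = vanilla
executable     = /bin/echo
arguments      = "Hello world"
output         = hello.out
error          = hello.err
log            = hello.log
request_cpus   = 1
request_memory = 512M
queue
```

The job is submitted with `condor_submit hello.sub` and monitored with `condor_q`; `condor_status` shows the slots the Central Manager can match it to.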

Main characteristics of AMiCO’s federated CLUSTERS

Pool name     | Nodes | Total memory | Total cpus | Total disk | Max memory | Max cpus | Max disk
etsfmi-pool   | 10    | 402G         | 168        | 501G       | 62G        | 24       | 109G
doraemon      | 6     | 330G         | 84         | 4030G      | 63G        | 24       | 2025G
erebor-pool   | 7     | 658G         | 168        | 2867G      | 94G        | 24       | 411G
proof-pool    | 10    | 1221G        | 264        | 1232G      | 755G       | 40       | 356G
magi-pool     | 2     | 250G         | 128        | 328G       | 125G       | 64       | 168G
teor-pool     | 34    | 1154G        | 984        | 26261G     | 63G        | 40       | 1771G
lagrange-pool | 14    | 210G         | 112        | 13G        | 15G        | 8        | 1G
gamma-pool    | 1     | 126G         | 32         | 4G         | 126G       | 32       | 4G

Pool name     | Research group        | Head node name (central manager)
etsfmi-pool   | Condensed Matter      | etsfmi
doraemon      | Cosmology-LSS         |
erebor-pool   | Cosmology-CMB         |
proof-pool    | HEP – ATLAS           |
magi-pool     | Department of Physics | gaspare
teor-pool     | Theoretical Physics   |
lagrange-pool | Condensed Matter      | halley
gamma-pool    | Nuclear Physics       |

HOWTO:

  • 1) HOWTO AMICO
    Slides by Francesco Prelz (INFN – MI)
    1a) Conceptual introduction: distributed computing and storage, opportunistic computing. Practical introduction: available tools, AMiCO distributed storage (CEPH), AMiCO distributed computing (HTCondor).
    Job examples: “Hello world”, file transfer via sandbox, multiple/parametric job submission and control, file access via Object Storage, script submission, Object Storage file staging, interactive jobs
    AMiCO: practical introduction
    1b) More complex cases: common dependencies and how to require them, Docker and HTCondor, parallel jobs (MPI)
    AMiCO: special cases

  • 2) HOWTO HTCondor
    HTCondor – web site:
    http://research.cs.wisc.edu/htcondor/
    HTCondor – readthedocs:
    https://htcondor.readthedocs.io/en/latest/
    How to submit, monitor and manage a job, by Miguel Villaplana (Dip. di Fisica e INFN – MI): howto_condor.pdf (Sept. 2017 – download PDF)

  • 3) An overview: poster by David Rebatto (INFN – MI)
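
For the parallel (MPI) case covered in the “special cases” slides, an HTCondor parallel-universe submit description generally has the following shape; the wrapper script and program names here are hypothetical and site-dependent:

```
# mpi_job.sub — sketch of an HTCondor parallel-universe (MPI) submission
universe      = parallel
executable    = openmpiscript        # wrapper that launches mpirun on the claimed slots
arguments     = my_mpi_program       # hypothetical MPI binary
machine_count = 4                    # number of slots to claim in parallel
output        = mpi_job.$(Node).out
error         = mpi_job.$(Node).err
log           = mpi_job.log
request_cpus  = 1
queue
```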

Use Case:

Clusters details:

AMiCO
PROJECT MANAGEMENT: Leonardo Carminati, Laura Perini
TEAM: Franco Leveraro, Francesco Prelz, David Rebatto, Paolo Salvestrini
Collaborators: Miguel Villaplana, Francesca Milanini