Demand for high-performance compute (HPC) on the rise
Once the preserve of particle physicists, Formula 1 designers and climate forecasters, HPC is rapidly going mainstream as corporates increasingly introduce deep learning models, simulations and complex business decisioning into their daily operations.
HPC can play a pivotal role in accelerating product design, tackling complex problems and enabling businesses to generate insights faster and with more depth and accuracy, and has applicability across an ever-widening range of industries including financial services, media, gaming and retail.
To date, the only option for corporates seeking to access this level of compute has been to build, maintain and operate dedicated HPC facilities in-house, but this brings a number of challenges. The first is cost: HPC systems are expensive, and only around 7% of the budget actually goes toward the hardware, the rest being consumed by buildings, staffing, power, cooling, networking and so on. Moreover, because many of these systems are designed to support peak demand, utilisation can be as little as 60% for the majority of the time.
What’s more, significant additional capital is needed on a three-year upgrade cycle to keep pace with demand as the business grows and the volume and complexity of workloads increase, and/or to reap the benefits of the latest computing technology. This computing resource is also intrinsically finite, so projects within an organisation need to be prioritised, leading to many missing out or having to step aside if more urgent tasks come along.
On-premises HPC deployments are traditionally designed and optimised around particular use cases (such as climate forecasting), whereas corporates today need HPC resources that can support a much broader set of applications and adapt as workload characteristics evolve in reaction to the fast-moving competitive landscape.
Many companies are therefore turning to cloud-based HPC.
Benefits of cloud HPC
First and foremost, cloud HPC provides more flexibility for an organisation to gain access to HPC resources as and when needed and scale to match individual workload demands.
It also opens up more choices for the corporate: for instance, employing 10x the HPC resources to accelerate product design and gain competitive advantage by being first to market, or increasing productivity by removing compute barriers so that the corporate can use more detailed simulations or eliminate the effort of simplifying deep learning models to fit inside legacy hardware.
A cloud-based approach mitigates the cost and complexity risks of operating HPC on-prem by providing flexibility to manage the cost/performance trade-off. HPC environments can be created on the fly and then torn down as soon as the workloads have completed, so that the corporate avoids paying for resources and software licences that are no longer needed.
To accommodate this variability in customer demand, cloud service providers (CSPs) dimension their cloud infrastructure with excess capacity which is powered up and ready to use but otherwise sitting idle. To offset the monetary and environmental impact of this idle infrastructure, CSPs offer the excess capacity in the form of preemptible instances at massive discounts (up to 90% in some cases), with the caveat that the resource can be reclaimed by the CSP at a moment’s notice if required by a full-paying customer. A corporate choosing to use these preemptible instances is essentially trading availability guarantees for a variable but much reduced ‘Spot’ price.
HPC workloads such as running a simulation, training a deep learning model, analysing a big data set or encoding video are periodic and batch in nature and not dependent on continuous availability, and are hence a good fit for preemptible resources. If some of the resource instances within the HPC cluster are reclaimed during processing, the workload slows but does not completely stop. Ideally, though, it should be possible to quickly and seamlessly redistribute the part of the workload that was interrupted to alternative resources at the same or a different CSP, thereby ensuring that the workload still completes on time. This is possible but not easy, and is an area of speciality for some third-party tool providers.
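As a rough illustration of this trade-off, the sketch below compares the cost of a batch job on full-price and preemptible instances. The instance-hours, hourly price and rework overhead are hypothetical inputs chosen for illustration, not CSP figures; only the "up to 90%" discount comes from the text above.

```python
def job_cost(instance_hours: float, hourly_price: float,
             discount: float = 0.0, rework_fraction: float = 0.0) -> float:
    """Estimated cost of a batch job.

    discount        -- fractional discount versus full price (0.9 = 90% off)
    rework_fraction -- extra instance-hours spent re-running work lost when
                       instances are reclaimed (hypothetical, workload-dependent)
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    if rework_fraction < 0.0:
        raise ValueError("rework_fraction must be non-negative")
    effective_hours = instance_hours * (1.0 + rework_fraction)
    return effective_hours * hourly_price * (1.0 - discount)


# Hypothetical job: 1,000 instance-hours at an assumed full price of 2.00 per hour.
on_demand = job_cost(1000, 2.00)
# Same job on preemptible capacity, assuming 20% of the work has to be repeated.
spot = job_cost(1000, 2.00, discount=0.90, rework_fraction=0.20)
print(f"on-demand: {on_demand:.2f}, spot: {spot:.2f}")
```

Under these assumed inputs the preemptible run remains far cheaper even after allowing for repeated work; the real balance depends on how often instances are reclaimed and how much progress is lost each time.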
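The behaviour described above, where a reclaimed instance slows the job rather than stopping it, can be sketched as a simple simulation. This is a minimal illustration under stated assumptions: the round-based scheduling, the instance names and the `reclaimed_at` mapping are all hypothetical, and a real system would react to the CSP's own interruption signal rather than a pre-set schedule.

```python
from collections import deque


def run_batch(tasks, instances, reclaimed_at=None):
    """Simulate a batch run in rounds, one task per live instance per round.

    tasks        -- iterable of task names
    instances    -- list of instance names (hypothetical identifiers)
    reclaimed_at -- {round_number: instance_name} marking when the provider
                    takes an instance back (simulated, not a real CSP signal)
    Returns (list of completed tasks, number of rounds used).
    """
    reclaimed_at = reclaimed_at or {}
    pending = deque(tasks)
    live = list(instances)
    completed = []
    rounds = 0
    while pending:
        if not live:
            raise RuntimeError("no instances left to run the remaining tasks")
        rounds += 1
        # Hand one task to each live instance for this round.
        assignments = {}
        for inst in live:
            if not pending:
                break
            assignments[inst] = pending.popleft()
        lost = reclaimed_at.get(rounds)
        for inst, task in assignments.items():
            if inst == lost:
                # The interrupted task goes back on the queue for another instance.
                pending.append(task)
            else:
                completed.append(task)
        if lost in live:
            live.remove(lost)
    return completed, rounds


tasks = ["t1", "t2", "t3", "t4", "t5", "t6"]
done, rounds = run_batch(tasks, ["a", "b", "c"], reclaimed_at={1: "b"})
baseline_done, baseline_rounds = run_batch(tasks, ["a", "b", "c"])
print(rounds, baseline_rounds)
```

In this toy run all six tasks still complete after instance "b" is reclaimed, but the job takes three rounds instead of two. Adding replacement instances from the same or another CSP would be the natural next step to recover the original completion time.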
With the increase in availability of HPC resources twinned with the ability to closely manage cost, corporates get the opportunity to open up HPC resources to the wider organisation, enabling a wider range of teams, departments and geographically dispersed business units to access the processing power they need whilst being able to track cost and performance and focus on outcomes rather than managing operational complexity.
In a world that is speeding up, becoming more competitive, and being driven by continuous integration and continuous delivery (CI/CD), easy access to cost-effective HPC resources on-the-fly is likely to become a key requirement for any corporate wishing to stay ahead.
Considerations when leveraging cloud HPC
Running complex technical workloads in the cloud is not as simple as swiping a credit card and getting a cloud account.
Many of the companies coming to cloud HPC will be specialists in their area, and may also have cloud expertise, but will need support in composing their workloads to take advantage of the parallelism within the cloud HPC stack, and tools to help them optimise their use of cloud resources.
Such tools will need to work across both workload management and resource provisioning, balancing them to meet the corporate’s target SLAs, whether that means dynamically adding more resource to complete a workload on time, or prioritising and scheduling workloads to make the most effective use of resources within budgetary constraints.
More specifically, tools will be needed that can:
Conduct real-time analysis of workload snapshots to determine their compute requirements.
Sift through the bewildering array of 30,000 different compute resources offered by the CSPs to ensure the best fit for each individual workload whilst also abiding by any corporate policy or individual budgetary targets. Factors that may need to be taken into account when selecting appropriate resources include:
- Workloads that are limited by the number of cores they can use
- Workloads that require particular processor hardware to match the OS used within a virtual image
- Workloads that may have scaling limitations due to the nature of the application licensing model provided by the software provider(s) that may not allow for bursting above a small number of processors
- Workloads that require processor hyper-threading to be disabled and/or are dependent on bare metal servers as opposed to virtual machines to maximise performance
- Tightly-coupled workloads that have specific latency and bandwidth requirements for communication between the cluster nodes
- Workloads involving sensitive data or regulatory restrictions that require all processing to be conducted within a particular locale
- Preference for cloud resources powered via renewables to meet corporate ESG targets
Create clusters of mixed instance types, and do this across CSPs to avoid vendor lock-in and/or to circumvent constraints imposed by any single CSP when dealing with large clusters.
Make the workload data available in the relevant cloud by replicating it between CSPs and locations, so that a workload can be executed there if needed.
Monitor when workloads start and complete to ensure that resources are not left running when no workloads are executing.
Intelligently monitor spot/preemptible instances (where used) to ensure that workload cost stays within budget as the spot pricing fluctuates with demand, and reallocate workloads seamlessly if instances are reclaimed by the CSP to ensure that the composite cluster is able to deliver against the workload targets.
Integrate into a corporate’s DevOps and CI/CD processes to enable accessibility of HPC resources more broadly across the organisation.
Provide a single view of workload status and enable users to dynamically make changes to their workloads to deliver results on time and within their project budgetary constraints.
Coordinate with any third-party schedulers already used by the corporate (e.g., Slurm, IBM LSF, TIBCO DataSynapse GridServer) to provide a single meta system for workload submission and management across on-prem, fixed cloud and public cloud HPC resources.
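The resource-selection requirement in the list above (matching each workload to an instance type subject to core limits, bare-metal needs, locale restrictions and budget) can be sketched as a constraint filter followed by a cost ranking. Everything here is hypothetical: the catalogue entries, provider and region names, prices and field names are invented for illustration and do not describe any CSP's actual offering or any vendor's selection logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    provider: str
    region: str
    vcpus: int
    bare_metal: bool
    hourly_price: float


@dataclass(frozen=True)
class WorkloadNeeds:
    min_vcpus: int
    max_vcpus: Optional[int] = None              # e.g. a licence cap on cores
    needs_bare_metal: bool = False
    allowed_regions: Optional[frozenset] = None  # data-residency constraint
    max_hourly_price: Optional[float] = None     # budgetary target


def fits(inst: InstanceType, needs: WorkloadNeeds) -> bool:
    """Return True if the instance satisfies every stated constraint."""
    if inst.vcpus < needs.min_vcpus:
        return False
    if needs.max_vcpus is not None and inst.vcpus > needs.max_vcpus:
        return False
    if needs.needs_bare_metal and not inst.bare_metal:
        return False
    if needs.allowed_regions is not None and inst.region not in needs.allowed_regions:
        return False
    if needs.max_hourly_price is not None and inst.hourly_price > needs.max_hourly_price:
        return False
    return True


def cheapest_fit(catalogue, needs):
    """Pick the lowest-priced instance that meets the constraints, or None."""
    candidates = [inst for inst in catalogue if fits(inst, needs)]
    return min(candidates, key=lambda inst: inst.hourly_price, default=None)


# Hypothetical catalogue entries.
CATALOGUE = [
    InstanceType("small-vm", "csp-a", "eu-west", 8, False, 0.40),
    InstanceType("large-vm", "csp-a", "us-east", 64, False, 2.50),
    InstanceType("metal-64", "csp-b", "eu-west", 64, True, 3.80),
    InstanceType("large-vm-eu", "csp-b", "eu-west", 64, False, 2.90),
]

eu_only = WorkloadNeeds(min_vcpus=32, allowed_regions=frozenset({"eu-west"}))
eu_metal = WorkloadNeeds(min_vcpus=32, needs_bare_metal=True,
                         allowed_regions=frozenset({"eu-west"}))
print(cheapest_fit(CATALOGUE, eu_only).name)
print(cheapest_fit(CATALOGUE, eu_metal).name)
```

A production tool would also need to weigh inter-node latency, licensing terms and renewable-energy preferences, and would rank on expected total job cost rather than hourly price alone.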
Client types and associated requirements
The relative importance of these different tools and requirements will very much be determined by the type of company seeking to utilise cloud HPC, the level of resources they may already have in place and the type of workloads they need to support.
Three example client types are outlined:
Multi-national organisations and specialist corporates in sectors such as academia, engineering, life sciences, oil & gas, aircraft and automotive that already have an HPC data centre on-prem but aim to supplement it with cloud HPC resources to avoid the cost of building out and maintaining additional HPC resources themselves to increase capacity.
Such clients may use cloud resources as an extension of their existing HPC for use with all workloads, or segment and only use cloud for ad hoc, non-critical (and loosely coupled) workloads, or perhaps just for ‘bursting’ into the cloud to deal with peaks in demand, either because the planned workload exceeded expectations and bursting was needed to complete it on time (e.g., CGI rendering), or because bursting was employed to speed up execution and produce simulation results more quickly. Using the cloud as an adjunct enables these companies to extend the usefulness of their existing on-prem systems, and any new systems they deploy can be designed with less peak performance capacity given the ability to burst into the cloud whenever needed.
Given that the corporate will already have on-prem and/or private cloud infrastructure, cloud HPC tools will be needed that can interface with the existing third-party workload schedulers. Equally, any cloud HPC resources that are employed may need to be matched to the on-prem resource types already in use; hence intelligent tooling will be needed that can analyse individual workload requirements, provision the most appropriate cloud resources from the myriad of instance options available from the CSPs, and map the workload accordingly across the on-prem and cloud infrastructure.
Depending on the workload, the corporate may also decide to use spot/preemptible instances to complete batch processing tasks without loading other cloud resources and/or as a way of managing cost.
Corporates in sectors such as financial services, retail, media, gaming, manufacturing and logistics that are dependent on high-performance compute to drive their deep learning models, simulations and business decisioning to maintain a competitive edge, but with insufficient funds and/or interest in deploying and managing dedicated HPC resources on-prem, and hence reliant on such resources being provided via the cloud.
Given the mission-critical nature of their workloads, such corporates are likely to follow a multi-cloud strategy to provide resiliency and de-risk dependency on a single provider. Selection of resources may also be driven by corporate sustainability goals, with a preference for CSPs and/or specific CSP data centres that maximise use of renewables.
Intelligent tooling will also be needed for use by the corporate in parallelising their workloads and integrating into their existing DevOps processes, and a dashboard providing oversight of HPC resources employed and workload status.
Similar to the cloud-native corporates, many startups/scale-ups utilising deep learning for NLP, computer vision etc. are keen on gaining access to HPC resources to accelerate their product development and time to market, and/or would like to develop products and services that can scale up and down in the cloud, but may not have the budget or expertise to achieve this.
Such companies are therefore wholly dependent on automated tools that enable them to programmatically control their usage via DevOps interfaces and dynamically switch between different CSPs and instances to minimise their costs. Primary usage will be via preemptible resources, and startups may also choose to use older generation instances to meet budgetary constraints.
YellowDog is a pioneer in the cloud HPC space, providing solutions that enable intelligent orchestration, scheduling and provisioning at scale across on-prem, hybrid and multi-cloud environments and delivering on all the requirements outlined above.
In addition to providing benefits to companies already employing HPC, YellowDog is unique in being able to generate clusters delivering HPC levels of compute using spot/preemptible instances. It is hence well placed to support the new breed of companies needing access to HPC performance levels at an affordable price, and to provide startups with a base platform on which they can easily develop a new autoscaling product or service, reducing their time to market and simplifying development.
A particular speciality of YellowDog is the ability to rapidly spin up massive-scale HPC clusters that aggregate resources from multiple CSPs and/or across multiple regions to circumvent the scaling limits in any particular CSP; in 2021, YellowDog successfully demonstrated creation of a cluster utilising 3.2 million vCPUs on AWS to run an HPC workload with 95% utilisation, and achieved this feat in under an hour.
Figure 4 Scale-up to 3.2 million vCPUs and rapid scale-down on job completion (YellowDog; AWS)
The YellowDog platform provides a straightforward GUI enabling engineers and scientists to use the platform without needing to be HPC specialists, and also provides a sophisticated dashboard and API access for managing workloads and provisioning preferences, including an ML-based prediction of completion time thereby enabling customers to easily flex the resources being employed to meet a particular deadline or budgetary constraint.
Unique in the market, YellowDog also compiles real-time insight into the myriad of different instances offered by the main CSPs with regard to their machine performance, pricing and use of renewables, and utilises this intelligence within the YellowDog platform to deliver optimal provisioning for its clients.
Whilst there are other companies offering solutions to help clients with their cloud orchestration and management, only YellowDog provides orchestration twinned with intelligent scheduling and provisioning at sufficient scale to deliver compute capabilities at HPC performance levels, at a price point (using spot/preemptible resources) that meets the growing industry demand, via a platform and set of tools that enable all to enjoy the benefits of cloud HPC.
The world is speeding up.
Easy access to HPC levels of compute via the cloud is changing the economics of product development, increasing the pace of innovation and enabling corporates to increase agility, accuracy, and critical insights in today’s data-driven economy. By harnessing preemptible instances and spot pricing, even the smallest of companies and startups can now afford to run HPC workloads.
Preemptible instances ensure that cloud resources do not lie idle, and bring environmental benefits as well as incremental revenues for the CSPs and lower costs for companies utilising the cloud: a veritable win-win for all. They also demonstrate that HPC systems in the cloud can be cost-comparable to on-prem alternatives whilst bringing many advantages.
Harnessing the potential of cloud HPC whilst meeting all other business objectives is, though, no mean feat and will be dependent on intelligent tooling. YellowDog is a pioneer in this space and a perfect partner for any business looking to leverage cloud HPC resources to gain a competitive edge.
Note: the CSPs referred to in this article include Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, Oracle Cloud Infrastructure and Alibaba.
The science that connects our phones to cell towers remains one of the greatest technological achievements of the past century. Radio Access Networks (RAN) convert electromagnetic waves to data streams of electrons and back again at fibre-like speeds.
This is made possible by deep technology that takes theoretical physics out of the lab and turns it into commercial reality. Driving the generations of technology that have become familiar household terms (the G’s) is a rich ecosystem of academia, vendors and network operators co-ordinated through standards bodies and initiatives such as ITU-R, 3GPP, GSA and GSMA.
Open RAN (ORAN) is one such initiative gaining market momentum with engagement across a range of players (mobile operators, network equipment providers, chip component suppliers, system integrators, and test specialists). ORAN has moved beyond the peak of the hype-cycle and will become a major force in RAN equipment provision. Some predict it will grow from less than a tenth of total RAN spend in 2021 to over a half by 2030.
The theory of ORAN, and the driving force behind the initiative, is supply chain disruption. For Mobile Network Operators (MNOs) it promises greater flexibility, increased innovation and a broader choice of suppliers, whilst reducing cost through competition. Ultimately ORAN promises to break decades of vendor lock-in.
In some respects, there’s a sense of déjà vu with MNOs pushing a strategy of open interfaces between infrastructure elements to diversify supply chains.
In the early 2000s, at the peak of 3G hype, there was a broad set of infrastructure suppliers with Nokia, Ericsson, Nortel, Alcatel, Lucent, Motorola, Siemens, and Huawei all vying for business in 3G rollouts.
Even back then 3GPP standards specified intra-network element interfaces that enabled mobile networks to be built using multi-vendor products. Nokia and Ericsson developed complete end-to-end (E2E) solutions, whereas others like Motorola and Lucent concentrated on specific network elements.
But in practice, the single vendor E2E solution providers won out, the MNOs preferring a fully integrated solution as it simplified supply chains, reduced system integration overheads and streamlined Network Management. Consequently, only a small number of dominant suppliers have survived the industry consolidation that followed with the likes of Nortel, Siemens, Motorola, Alcatel and Lucent disappearing.
What’s different this time?
Valuable lessons have been learnt. This time around there is a focus on tackling proprietary product architectures and mitigating over-complicated vendor-specific Operations and Management (O&M) systems.
The O-RAN Alliance, founded in 2018 by AT&T, China Mobile, Deutsche Telekom, NTT Docomo and Orange, is a global community of MNOs, vendors and research institutions working together to ensure interoperability. The Alliance has published over 74 specifications that address gaps and ambiguities within the 3GPP specifications, defining the necessary O&M processes and systems.
In addition, security concerns about critical infrastructure being sourced predominantly from Chinese companies such as Huawei and ZTE (with implied state control) have led governments to forcibly open up the telecoms market by banning Chinese manufacturers from providing critical elements of 5G infrastructure. This has created a technology vacuum, stimulating innovation and creating opportunity for new entrant startups, all made possible by ORAN.
A number of startups have successfully entered the market providing RF front-end solutions. Examples in Europe include AccelerComm, Lime Micro, Pharrowtech and Software Radio Systems; examples outside Europe include DeepSig, EdgeQ, Metanoia Communications and Picocom.
Which companies are innovating in ORAN?
Southampton-based AccelerComm is a good example of how startups can bring fresh innovation into the ORAN space; in their case, developing deep tech that delivers a 10x improvement in information throughput speeds and latency reduction.
The 5G ORAN architecture also introduces the RAN intelligent controller (RIC), which allows third parties to generate xApps (near-real-time) or rApps (non-real-time) for optimising ORAN performance based on the environment. The higher-level Management & Orchestration functions of 5G also provide opportunities for new entrants such as Accelleran, IS-Wireless and Zeetta Networks in Europe, and from further afield Aarna Networks, Cellwize and Opanga.
Zeetta has developed multi-domain orchestration technology based on 5G network slicing principles, and innovative splicing technology to provide QoS management and improve resource utilisation across access networks and cell sites, a capability that is especially relevant to Industry 4.0 and is demonstrated via the DCMS-backed 5G-ENCODE project.
ORAN is driving demand for higher performance compute, especially to meet the higher levels of complexity in 5G compared with 4G. Massive MIMO, in particular, can prove challenging when significant antenna arrays are used in combination with high bandwidths: Xilinx estimates a 40x to 300x uplift in compute for 100MHz 64T64R 5G compared with 20MHz 8T8R 4G [source: “Telecom TV – OpenRAN Summit – October 2021”].
In response, chip suppliers are working to enrich existing CPU products with hardware accelerators to meet the demands of high-performance ORAN software whilst seeking to optimise power efficiency to enable a wider range of deployment topologies. Enter ARM, and watch this space, Intel.
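To give a feel for why the quoted compute uplift for Massive MIMO spans such a wide range, the sketch below applies a simple scaling model to the two configurations mentioned above. The model is an assumption made purely for illustration (compute proportional to bandwidth, and to antenna count raised to an exponent between 1 and 2); it is not the method behind the Xilinx estimate.

```python
def compute_uplift(bw_new_mhz: float, bw_old_mhz: float,
                   ant_new: int, ant_old: int,
                   antenna_exponent: float = 1.0) -> float:
    """Relative compute under an assumed scaling model.

    Assumes compute grows linearly with bandwidth and with the antenna
    count raised to `antenna_exponent` (1.0 = linear, 2.0 = quadratic).
    """
    return (bw_new_mhz / bw_old_mhz) * (ant_new / ant_old) ** antenna_exponent


# 100MHz 64T64R 5G versus 20MHz 8T8R 4G, as in the comparison above.
linear = compute_uplift(100, 20, 64, 8, antenna_exponent=1.0)
quadratic = compute_uplift(100, 20, 64, 8, antenna_exponent=2.0)
print(linear, quadratic)
```

Under this assumed model, linear scaling in both dimensions gives 5 x 8 = 40, while quadratic scaling in antenna count gives 5 x 64 = 320, which is the same order of magnitude as the range quoted above.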
How big is the prize?
The ORAN market will take time to become an established alternative to existing single-vendor solutions, especially for demanding dense-urban, high-capacity deployments. Indications are that the ORAN marketplace will mature in 2024/2025, providing an opportunity for companies to establish themselves in the short term and be well placed to capitalise on the maturity and growth phase of ORAN. ABI Research predicts ORAN revenue will grow to over a half of RAN revenue by 2030.
Having said that, many MNOs will have deployed their 5G RAN equipment by this time, and ORAN may end up being more significant during a 5G equipment refresh towards the end of the decade. ORAN adoption is being accelerated by state intervention:
- $750M of ORAN Wireless Network Funding in the USA
- €150Bn of funding from the EU to help MNOs roll out 5G
- UK orders removal of Huawei equipment from 5G network by 2027
These may lead to more ORAN investments over the next 2-3 years, especially for rural areas. In the meantime, the deployment of private cellular networks (PCNs) may drive the near-term commercial opportunity for ORAN. J’son and Partners Consulting estimate that annual spending on private 4G/5G reached $1Bn in 2020, with an estimated 10% YoY growth.
Whilst this represents a sizeable market for the ORAN ecosystem, it only equates to about 2% of the total expenditure in cellular infrastructure by the MNOs, small compared to the wide-area public network opportunities in today’s market. However, strong longer-term growth in enterprise and industrial PCNs is predicted by ABI Research with revenues growing to $65Bn by 2030.
The combination of ORAN technology readiness and political stimulus is a clear indicator that there’s a real opportunity for startups, provided, that is, that the MNOs don’t repeat history and opt for single-supplier solutions from established vendors, as Vodafone UK has decided to do with its selection of Samsung as its single vRAN and ORAN solution provider. BT has announced a Nokia ORAN trial in Hull and notably has been quite public that no one should assume that a single-vendor strategy is going to change anytime soon.
More positively, Vodafone Group has recently announced it is opening an R&D centre in Spain that will work with Intel and other silicon vendors to develop its own ORAN chip architecture, with half of the five-year investment of €250m coming from EU funding. Whether or not this will allow new innovators into the inner circle remains to be seen.
How does this impact early-stage deep tech?
One of the biggest challenges for early-stage companies in telecommunications remains as much a balance sheet one as it is a technology one. How do you convince the supply chain manager of an MNO that a loss-making startup is a safe bet for its critical infrastructure?
The answer is two-fold. First, deliver significant performance improvements that have economic impact. This will likely be in specialist areas that the generalist prime contractors are not agile enough, or don’t have the deep technical expertise, to address.
Such technology is likely to be very deep in the technology stack in areas such as L1 channel coding/equalisation, power efficient accelerator hardware and RF semiconductors, and at the higher layers in orchestration/resource management and QoS management using AI and Machine Learning in the RIC (xApps and rApps).
Second, partner with and sell to the OEMs rather than the MNOs. OEMs are the most obvious partners as they are also potential investors in deep tech companies.
Has the ship sailed for early-stage investment in these areas?
It possibly has for seed-stage startups with a focus on 5G ORAN. But work on the next developments, 5G-Advanced and 6G, has already started, just as it has in the parallel universe of the IEEE (Wi-Fi 6 and 7). So an opportunity for early-stage investment does exist, and it lies in those deep, dark pools of tech that will deliver on the vision of more efficient and cognitive networks.
Whether ORAN alone can break incumbent vendor lock-in remains to be seen.
Cybersecurity innovation critical in combatting the inexorable rise in cyber threats and ransomware attacks.
Bloc invests in technology areas that underpin the future growth and prosperity of the digital age. Cybersecurity, and in particular the challenges companies face as they move operations online and into the cloud, is a growing area of importance and innovation.
The landscape for security teams is rapidly changing. Digital transformation, accelerated by Covid and remote working, is driving a rapid uptake in cloud utilisation.
Hybrid multi-cloud & remote working practices are dramatically expanding the attack surface as workforces access company IT systems from unsecured devices (home PCs, tablets) and over unsecured WLANs (home, coffee shops), thereby tearing down the single security perimeter that security teams have previously come to rely upon.
Competitive pressures driven by DevOps & CI/CD working practices are leading to mistakes in cloud configuration and deployment of unauthorised shadow IT, both of which are creating additional vulnerabilities within company networks. Verizon estimates that 82% of enterprise breaches should have been stopped by existing security controls but weren’t, and 79% of observed exposures were in the cloud compared with 21% for on-premise assets.
Worse still, zero-day vulnerabilities introduced or exploited within the systems and software of companies’ suppliers are on the rise: in effect a Trojan horse that a business has very little control over, although startups such as Darkbeam are seeking to help companies manage the risk.
Cyber-attacks and the resultant data breaches are expensive, erode customer trust, damage brand reputation and can ultimately stop a company in its tracks.
And yet despite their efforts, many companies are being overwhelmed by the magnitude of threats they face, and are ill-equipped to differentiate between real threats and false alerts coming from their networks.
Survival will be dependent on the development of intelligent tools leveraging advanced AI/ML that can augment and support security teams in their never-ending battle with the cybercriminals.
Key areas for innovation identified by Bloc
We have identified a number of cybersecurity areas for innovation:
- Use of few-shot learning AI techniques for detecting zero-day exploits with unknown signatures such as those introduced through supply chain attacks
- Methods for obfuscating existing networks to inhibit attackers without the company needing to re-architect
  - Enclave Networks is one such company helping its clients to ‘darken’ their networks through the introduction of invisible network access gates
- Implementing zero-trust principles to prevent attackers from moving laterally through the network after gaining access via infected systems
  - Zero-trust assumes that everyone in the network could be a bad actor, hence all activity is continuously monitored for behavioural anomalies and access to individual systems managed via granular privileges and more robust authentication methods
- Introduction of cyber deception platforms and honeytraps that lure attackers into revealing themselves thereby enabling security teams to shut them down before they cause any serious damage
  - CounterCraft, for example, provides a cyber deception and counterintelligence platform designed to detect intrusion and insider threats before attacks are perpetrated
- Supporting anomaly detection at scale, especially for Industrial IoT networks comprising huge numbers of devices
  - Real-time anomaly detection becomes especially challenging in the IoT space as the number of devices scales into the millions. One way to tackle this (pioneered by Shield-IoT based on work conducted within MIT) is to compress the network and resulting data into a smaller coreset enabling context-free highly accurate anomaly detection in minutes instead of hours or days
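The zero-trust principle in the list above (no implicit trust, granular privileges, robust authentication and continuous behavioural monitoring) can be sketched as a per-request check. This is a minimal illustration only: the users, resources, grant table and anomaly threshold are hypothetical, and the anomaly score is assumed to come from a separate monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user: str
    resource: str
    action: str
    mfa_verified: bool
    device_compliant: bool
    anomaly_score: float  # 0.0 (normal) to 1.0 (highly unusual), from monitoring


# Hypothetical least-privilege grants: (user, resource) -> allowed actions.
GRANTS = {
    ("alice", "payroll-db"): {"read"},
    ("bob", "build-server"): {"read", "deploy"},
}

ANOMALY_THRESHOLD = 0.8  # assumed cut-off for blocking unusual behaviour


def authorise(req: AccessRequest, grants=GRANTS,
              threshold: float = ANOMALY_THRESHOLD) -> bool:
    """Evaluate every request on its own merits; being inside the network grants nothing."""
    if not req.mfa_verified:
        return False
    if not req.device_compliant:
        return False
    if req.anomaly_score >= threshold:
        return False
    # Granular privileges: only explicitly granted actions are allowed.
    return req.action in grants.get((req.user, req.resource), set())


ok = authorise(AccessRequest("alice", "payroll-db", "read", True, True, 0.1))
lateral = authorise(AccessRequest("bob", "payroll-db", "read", True, True, 0.1))
unusual = authorise(AccessRequest("alice", "payroll-db", "read", True, True, 0.95))
print(ok, lateral, unusual)
```

Because access is checked per resource and per action, a compromised account for one system (here "bob" on the build server) gains nothing on another, which is how the approach limits lateral movement.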
The market opportunity is clear
The market for cybersecurity software & tools was worth $12 billion in the UK, $26.5 billion in Europe and $78 billion globally in 2020, and is projected to grow to $118 billion globally by 2024. The cybersecurity market for hardware & software combined is expected to exceed $200 billion by 2024 and reach $372 billion globally by 2028.
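For context, the growth rates implied by these figures can be worked out with the standard compound annual growth rate formula. The sketch below uses only the numbers quoted above; treating $200 billion as the 2024 starting point for the combined market is an assumption, since the text says the market is expected to exceed that figure.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Global software & tools: $78bn (2020) to $118bn (2024).
software_growth = cagr(78, 118, 4)
# Hardware & software combined: assumed $200bn floor (2024) to $372bn (2028).
combined_growth = cagr(200, 372, 4)
print(f"{software_growth:.1%}, {combined_growth:.1%}")
```

On these inputs the software and tools segment grows at roughly 11% a year, and the combined market at up to about 17% a year if the 2024 figure is taken as a floor.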
Managing cloud vulnerabilities is a race between attacker and defender and therefore ripe for new entrants bringing fresh ideas and utilising the latest technology to deliver anomaly detection, behavioural profiling and automated tools for supporting security teams and those companies wanting to take their business operations into the cloud.