The focus of AI and ML innovation to-date has understandably been in those areas characterised by an abundance of labelled data with the goal of deriving insights, making recommendations and automating processes.
But not every potential application of AI produces enough labelled data to utilise such techniques – spotting manufacturing defects on a production line is a good example, where images of defects (for training purposes) are scarce and hence a different approach is needed.
Interest is now turning within academia and AI labs to the harder class of problems in which data is limited or more variable in nature, requiring a different approach. Techniques include: leveraging datasets in a similar domain (few-shot learning), auto-generating labels (semi-supervised learning), leveraging the underlying structure of data (self-supervised learning), or even synthesising data to simulate missing data (data augmentation).
Characterising limited-data problems
Deep learning using neural networks has become increasingly adept at performing tasks such as image classification and natural language processing (NLP), and seen widespread adoption across many industries and diverse sectors.
Machine learning is a data-driven approach, with deep learning models requiring thousands of labelled images to build predictive models that are accurate and robust. And whilst it’s generally true that more data is better, it can take much more data to deliver relatively marginal improvements in performance.
Figure 1: Diminishing returns of two example AI algorithms [Source: https://medium.com/@charlesbrun]
Manually gathering and labelling data to train ML models is expensive and time consuming. To address this, the commercial world has built large sets of labelled data, often through crowd-sourcing and through specialists like iMerit offering data labelling and annotation services.
But such data libraries and collection techniques are best suited to generalist image classification. For manufacturing, and in particular spotting defects on a production line, the 10,000+ images required per defect to achieve sufficient performance are unlikely to exist, the typical manufacturing defect rate being less than 1%. This is a good example of a ‘limited-data’ problem, and in such circumstances ML models tend to overfit (over-optimise) to the sparse training data, hence struggle to generalise to new (unseen) images and end up delivering poor overall performance as a result.
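To put rough numbers on the scarcity, the sketch below estimates how many parts would need to be photographed to collect a target number of defect images. The function name and the assumption that defects are spread evenly across defect types are illustrative, not figures from a real production line.

```python
import math
from fractions import Fraction

def units_to_inspect(images_needed: int, defect_rate: float, defect_types: int = 1) -> int:
    """Expected number of parts to photograph to collect `images_needed`
    images of EACH defect type, assuming defects occur at `defect_rate`
    and are spread evenly across `defect_types` (an assumption)."""
    if not 0 < defect_rate <= 1:
        raise ValueError("defect_rate must be in (0, 1]")
    # Use exact fractions so that 0.01 behaves as 1/100 rather than a float.
    per_type_rate = Fraction(str(defect_rate)) / defect_types
    return math.ceil(images_needed / per_type_rate)

print(units_to_inspect(10_000, 0.01))                  # 1000000
print(units_to_inspect(10_000, 0.01, defect_types=5))  # 5000000
```

Even at a 1% defect rate, a single defect type would need around a million inspected parts before 10,000 examples exist, which is why collecting more data is rarely a practical answer.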
So what can be done for limited-data use cases?
A number of different techniques can be used for addressing these limited-data problems depending on the circumstances, type of data and the amount of training examples available.
- Few-shot learning
Few-shot learning is a set of techniques that can be used in situations where there are only a few example images (shots) in the training data for each class of image (e.g. dogs, cats). The fewer the examples, the greater the risk of the model overfitting (leading to poor performance) or adversely introducing bias into the model’s predictions. To address this issue, few-shot learning leverages a separate but related larger dataset to (pre)train the target model.
Three of the more popular approaches are meta-learning (training a meta-learner to extract generalisable knowledge), transfer learning (utilising shared knowledge between source and target domains) and metric learning (classifying an unseen sample based on its similarity to labelled samples).
Once a human has seen one or two pictures of a new animal species, they’re pretty good at recognising that animal species in other images – this is a good example of meta-learning. When meta-learning is applied in the context of ML, the model consecutively learns how to solve lots of different tasks, and in doing so becomes better at learning how to handle new tasks; in essence, ‘learning how to learn’ similar to a human – illustrated below:
Figure 2: Meta-learning [Source: www.borealisai.com]
Transfer learning takes a different approach. When training ML models, part of the training effort involves learning how to extract features from the data; this feature extraction part of the neural network will be very similar for problems in similar domains, such as recognising different animal species, and hence can be used in instances where there is limited data.
Metric learning (or distance metric learning) determines similarity between images based on a distance metric and decides whether two images are sufficiently similar to be considered the same. Deep metric learning takes the approach one step further by using neural networks to automatically learn discriminative features from the images and compute the distance metric based on these features – very similar in fact to how a human learns to differentiate animal species.
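A minimal sketch of the metric-learning idea is shown below, using a nearest-prototype rule: each class is represented by the average of its few labelled feature vectors, and an unseen sample takes the label of the closest prototype. The two-dimensional feature vectors and the class names are made up for illustration; in practice the features would come from a (pre-trained) neural network rather than being written by hand.

```python
import math
from collections import defaultdict

def class_prototypes(support):
    """Average the feature vectors of each class in the few-shot support set."""
    grouped = defaultdict(list)
    for features, label in support:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def classify(query, prototypes):
    """Return the label whose prototype is nearest to `query` (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Two labelled examples ("shots") per class; values are illustrative only.
support = [
    ([0.9, 0.1], "scratch"), ([0.8, 0.2], "scratch"),
    ([0.1, 0.9], "dent"), ([0.2, 0.7], "dent"),
]
prototypes = class_prototypes(support)
print(classify([0.85, 0.25], prototypes))  # scratch
```

The choice of Euclidean distance is itself an assumption; deep metric learning would learn both the features and, implicitly, the notion of distance that separates the classes.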
- Self-supervised & semi-supervised learning
Techniques such as few-shot learning can work well in situations where there is a larger labelled dataset (or pre-trained model) in a similar domain, but this won’t always be the case.
Semi-supervised learning can address this lack of sufficient data by leveraging the data that is labelled to predict labels for the rest, hence creating a larger labelled dataset for use in training. But what if there isn’t any labelled data? In such circumstances, self-supervised learning is an emerging technique that sidesteps the lack of labelled data by obtaining supervisory signals from the data itself, such as the underlying structure in the data.
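The semi-supervised idea of using labelled data to predict labels for the rest can be sketched as a single round of pseudo-labelling. The example below is a toy under stated assumptions: features are one-dimensional numbers, the "model" is a nearest-neighbour rule, and the `max_distance` threshold stands in for a real model's confidence score.

```python
def pseudo_label(labelled, unlabelled, max_distance):
    """One round of self-training on 1-D features.

    An unlabelled point adopts the label of its nearest labelled point,
    but only if that neighbour lies within `max_distance`; otherwise the
    point stays unlabelled rather than risk a wrong label."""
    accepted, remaining = [], []
    for x in unlabelled:
        nearest_x, nearest_label = min(labelled, key=lambda pair: abs(pair[0] - x))
        if abs(nearest_x - x) <= max_distance:
            accepted.append((x, nearest_label))
        else:
            remaining.append(x)
    return labelled + accepted, remaining

labelled = [(0.0, "ok"), (10.0, "defect")]
unlabelled = [0.5, 9.0, 5.0]
grown, remaining = pseudo_label(labelled, unlabelled, max_distance=1.5)
print(grown)      # the two confident points join the labelled set
print(remaining)  # the ambiguous point (5.0) is left unlabelled
```

Leaving ambiguous samples unlabelled matters: accepting low-confidence pseudo-labels would feed the model's own mistakes back into its training data.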
Figure 3: Predicting hidden parts of the input (in grey) from visible parts (in green) using self-supervised learning [Source: Meta AI]
- Data augmentation
An alternative approach is simply to fill the gap through data augmentation, simulating real-world events and synthesising data samples to create a sufficiently large dataset for training. Such an approach has been used by Tesla to complement the billions of real-world images captured via its fleet of autonomous vehicles for training its AI algorithms, and by Amazon within its Amazon Go stores for determining which products each customer is taking from the shelves.
Figure 4: An Amazon Go store [Source: https://www.aboutamazon.com/what-we-do]
Whilst synthetic data might seem like a panacea for any limited-data problem, it’s too costly to simulate for every eventuality, and it’s impractical to predict anomalies or defects a system may face when put into operation.
Data augmentation also has the potential to reinforce any biases that may be present in the limited amount of original labelled data, and/or cause overfitting of the model by creating too much similarity within the training samples such that the model struggles to generalise to the real world.
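As a small illustration of both the benefit and the similarity risk, the sketch below applies the simplest form of augmentation (flips and rotations of an existing image, rather than full simulation of real-world events) to a toy image held as a list of pixel rows. The helper names are hypothetical.

```python
def flip_horizontal(image):
    """Mirror each row of the image left-to-right."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Return the distinct variants produced by four rotations,
    each taken with and without a horizontal flip."""
    variants, current = [], image
    for _ in range(4):
        for candidate in (current, flip_horizontal(current)):
            if candidate not in variants:
                variants.append(candidate)
        current = rotate_90(current)
    return variants

defect = [[0, 1],
          [0, 0]]   # a single "defect" pixel in the top-right corner
print(len(augment(defect)))             # 4 distinct training samples from 1
print(len(augment([[1, 1], [1, 1]])))   # 1: a symmetric image gains nothing
```

The second call shows the limitation in miniature: transformed copies add no new information when the variants are too similar to the original, so the dataset grows in size without growing in diversity.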
Applying these techniques to computer vision
Mindtrace is utilising the unsupervised and few-shot learning techniques described previously to deliver a computer vision system that is especially adept in environments characterised by limited input data and where models need to adapt to changing real-life conditions.
Pre-trained models bringing knowledge from different domains create a base AI solution that is fine-tuned from limited (few-shot) or unlabelled data to deliver state-of-the-art performance for asset inspection and defect detection.
Figure 6: Mindtrace [Source: https://www.mindtrace.ai]
This approach enables efficient learning from limited data, drastically reducing the need for labelled data (by up to 90%) and the time / cost of model development (by a factor of 6x) whilst delivering high accuracy.
Furthermore, the approach is auto-adaptive: the models continuously learn and adapt after deployment without needing to be retrained, and are better able to react to changing circumstances, for example in asset inspection or when new cameras are installed on a production line for detecting defects.
The solution is also specifically designed for deployment at the edge by reducing the size of the model through pruning (optimal feature selection) and reducing the processing and memory overhead via quantisation (reducing the precision using lower bitwidths).
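To make the two edge optimisations concrete, the sketch below shows generic versions of each: magnitude-based pruning (zeroing the smallest weights) and symmetric 8-bit quantisation. These are common textbook forms chosen for illustration and are assumptions here; they are not a description of Mindtrace's actual pruning or quantisation methods.

```python
def prune_by_magnitude(weights, keep_fraction):
    """Zero out the smallest-magnitude weights, keeping roughly `keep_fraction`."""
    keep = max(1, round(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def quantise_int8(weights):
    """Symmetric linear quantisation of floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantise(values, scale):
    """Map the 8-bit integers back to approximate float weights."""
    return [v * scale for v in values]

weights = [0.02, -1.27, 0.5, -0.01, 0.8]
pruned = prune_by_magnitude(weights, keep_fraction=0.6)
quantised, scale = quantise_int8(pruned)
print(pruned)     # the two smallest weights are now 0.0
print(quantised)  # each weight stored as a small integer plus one shared scale
```

Pruned weights can be skipped or stored sparsely, and each remaining weight needs 8 bits instead of 32, at the cost of a rounding error of at most half the scale per weight.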
Furthermore, through a process of swarm learning, insights and learnings can be shared between edge devices without having to share the data itself or process the data centrally, hence enabling all devices to feed off one-another to improve performance and quickly learn to perform new tasks (Bloc invested in Mindtrace in 2021).
The focus of AI and ML innovation to-date has understandably been in areas characterised by an abundance of labelled data to derive insights, make recommendations or automate processes.
Increasingly though, interest is turning to the harder class of problems with data that is limited and dynamic in nature such as the asset inspection examples discussed. Within Industry 4.0, limited-data ML techniques can be used by autonomous robots to learn a new movement or manipulation action in a similar way to a human with minimal training, or to auto-navigate around a new or changing environment without needing to be re-programmed.
Limited-data ML is now being trialled across cyber threat intelligence, visual security (people and things), scene processing within military applications, medical imaging (e.g., to detect rare pathologies) and smart retail applications.
Mindtrace has developed a framework that can deliver across a multitude of corporate needs.
Figure 7: Example Autonomous Mobile Robots from Panasonic [Source: Panasonic]
Industry 4.0 driving the need for 5G
Automation in Industry 4.0 sectors such as smart manufacturing, warehousing, mining and ports is driving increased demand for high performance connectivity. Wi-Fi is widely deployed today but is limited in terms of reliability and support for critical mobility use cases – 5G is much better placed to meet these needs.
In particular, 5G can meet requirements around high bandwidth and low latency, whilst also delivering resiliency through dedicated radio spectrum and has the flexibility to support full mobility ranging from indoor use to wide area outdoor coverage.
A common misconception is that many of these benefits are available within Wi-Fi 6, but whilst Wi-Fi 6 can offer high capacity, it can’t manage radio resources as efficiently as 5G and is intrinsically hampered by sharing unlicensed spectrum, whilst 5G using dedicated spectrum is inherently more reliable.
It would also be missing the point to say that 5G is simply a ‘faster 4G’ – 5G adopts a service-based architecture (SBA) which enables provisioning of customised network slices and zero-touch network operations, providing much finer granularity in how a 5G network can be set up and run.
5G is therefore growing in favour, with 75% of manufacturers indicating that 5G is a key enabler within their digital transformation strategies [Capgemini’s global enterprise 5G survey].
Nevertheless, it’s not a clear home run for 5G and to succeed it must provide the best of both worlds – the functionality, performance and reliability of 5G, twinned with the flexibility, control and ease of use of Wi-Fi deployments.
Delivering 5G to meet enterprise needs
- Option 1: a public 5G network slice
Network slicing is a new capability introduced in 5G that enables mobile network operators (MNOs) to leverage their public 5G infrastructure to provide virtualised private networks to enterprises.
A number of slice types have been defined within the 3GPP standards (3GPP TS 23.501):
- eMBB (enhanced mobile broadband) – for applications requiring stable connections with very high peak data rates
- URLLC (ultra-reliable low latency communications) – for applications that have strict reliability and latency requirements such as industrial automation and autonomous vehicles (i.e., devices requiring mission critical connectivity)
- mMTC (massive machine-type communications) – to support a massive number of IoT devices within a defined area which are only sporadically active in sending small data payloads (e.g., sensors)
In a manufacturing example, a computer vision system used for monitoring a production line may require consistent throughput with an ultra-reliable connection and be best served by a URLLC slice, whilst sensors for monitoring humidity levels may only need to connect intermittently to send signals to a control centre and be adequately served using an mMTC slice.
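The mapping from application needs to slice type in this example can be sketched as a simple rule of thumb. The profile fields and the numeric thresholds below are illustrative assumptions chosen to reproduce the manufacturing example; they are not values taken from the 3GPP specifications.

```python
from dataclasses import dataclass

@dataclass
class AppProfile:
    name: str
    max_latency_ms: float   # latency the application can tolerate
    peak_mbps: float        # peak throughput it needs
    intermittent: bool      # True if the device only sends sporadically

def suggest_slice(app: AppProfile) -> str:
    """Pick a slice type using illustrative (non-standard) thresholds."""
    if app.max_latency_ms <= 10:
        return "URLLC"   # strict latency and reliability needs
    if app.intermittent and app.peak_mbps < 1:
        return "mMTC"    # many sporadic, small-payload devices
    return "eMBB"        # everything else: high peak data rates

vision = AppProfile("production-line vision", max_latency_ms=5, peak_mbps=200, intermittent=False)
humidity = AppProfile("humidity sensor", max_latency_ms=1000, peak_mbps=0.01, intermittent=True)
video = AppProfile("training video", max_latency_ms=100, peak_mbps=50, intermittent=False)

for app in (vision, humidity, video):
    print(app.name, "->", suggest_slice(app))
```

A static rule like this also illustrates the constraint of predefined slices: an application whose needs change over time would have to be reclassified rather than having its resources adjusted.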
But this approach may be too constraining for some enterprises – the slices being statically defined, whereas what many enterprises really want is the ability to control their connectivity on a more dynamic basis to map resources to an application as circumstances change (adaptive slicing).
As 5G public networks evolve towards fully cloud-native architectures, it will become possible to provision highly customised network slices tailored to specific services. But for now, MNO public 5G offerings are limited by the current approach of predefined eMBB, URLLC, and mMTC slices.
Given these constraints, enterprises are increasingly exploring the option of procuring their own 5G mobile private network (MPN) that can be tailored specifically to their needs.
- Option 2: a 5G mobile private network (MPN)
A 5G MPN is a 5G network (RAN and 5G core) that has been designed, configured and deployed specifically for a given enterprise customer.
Mobile networks are designed to utilise specific licensed spectrum, so the logical choice would be to procure an MPN from an MNO. But with the introduction of shared spectrum in many countries (including the UK) and open flexible architectures (via OpenRAN), there are now many new entrants offering solutions to enterprises either directly or through partnership.
This gives enterprises the flexibility to decide whether to go with a Managed Service Provider (MSP) that can fully design, deploy, configure and optionally operate the MPN for them (e.g., a school campus), or work with a selection of vendors and partners to assemble their own MPN infrastructure tailored to their requirements (e.g., smart manufacturing, ports, mining etc.).
Currently, all options and potential partnerships are being explored in the marketplace.
MNOs and incumbents such as Ericsson and Nokia are partnering to bring MPN propositions to their enterprise client base (e.g., Ericsson Industry Connect). But equally MNOs are also partnering up with challengers (Affirmed Networks, Parallel Wireless, Metaswitch, Mavenir, Celona et al) and leveraging cloud resources (e.g., Azure, AWS Wavelength) and enterprise IT partners (Cisco, IBM, Oracle) to increase their flexibility and agility in bringing solutions to market that encompass not only connectivity but also provide the cloud, edge and AI capabilities needed by enterprises for their end-end application delivery.
Whilst the necessity of acquiring licensed spectrum for 5G MPN deployments drives many of these players into partnering with the MNOs, in those markets where shared spectrum has been allocated, these players are also able to step up, adopt the role of a Managed Service Provider, and offer complete MPN solutions directly to enterprise clients. Nokia, Ericsson, Mavenir, Celona, Federated Wireless, Expeto and many more all have direct-to-market propositions, and the hyperscalers are also eyeing up the opportunity with both Amazon and more recently Google announcing MPN offerings, either developed in-house or through partnership (Google working with Betacom, Boingo, Celona and Kajeet in the US).
Enterprises are faced with many options, but this also gives them huge flexibility in finding the best match for their functional and operational needs and also affords them higher levels of privacy by operating the infrastructure themselves rather than sharing infrastructure within a public network – for those in manufacturing, high security is a key driver in choosing an MPN over utilising a public 5G network slice.
Given the opportunity, it’s hardly surprising that deploying private 5G is a top priority now for IT decision makers in enterprises [Technalysis Research] and 76% of those in manufacturing plan to deploy 5G MPNs by 2024 [Accedian].
Optimising connectivity to match use cases
A key attraction for enterprises in deploying their own 5G MPN is the flexibility it gives them in optimising connectivity to match application requirements. This can be achieved through the definition of an ‘intent’ that states expectations on service delivery and network operation through the expression of a set of goals, functional requirements, and constraints.
The table below describes the requirements for example use cases within a factory automation context:
At a practical level, intents can be managed in a number of ways depending on the skillsets of the enterprise. For those enterprises with limited expertise, a set of low/no-code tools can be provided for defining intents, app/device group administration, and monitoring network and application performance as well as end-end security.
Conversely, for those wanting more fine-grained control, orchestration could be provided to DevOps teams through RESTful APIs with dynamic control over throughput, latency, packet error rate metrics, network segments / IP domains etc., and/or bootstrapped via Infrastructure as Code (IaC) templates – in short, the aim is to enable enterprises to configure and manage their 5G MPNs using DevOps-friendly interfaces as easily as Kubernetes enables them to do with cloud resources for their applications and services.
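As a hedged sketch of what a machine-readable intent might look like when passed to such an API, the example below defines an intent as a small validated data structure and serialises it to JSON. The schema, field names and example values are hypothetical and do not describe any vendor's actual interface.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Intent:
    """Hypothetical intent schema; field names are illustrative only."""
    application: str
    device_group: str
    min_throughput_mbps: float
    max_latency_ms: float
    max_packet_error_rate: float

    def validate(self) -> None:
        if self.min_throughput_mbps <= 0:
            raise ValueError("min_throughput_mbps must be positive")
        if self.max_latency_ms <= 0:
            raise ValueError("max_latency_ms must be positive")
        if not 0 <= self.max_packet_error_rate < 1:
            raise ValueError("max_packet_error_rate must be in [0, 1)")

def to_request_body(intent: Intent) -> str:
    """Serialise a validated intent as the JSON body of an API request."""
    intent.validate()
    return json.dumps(asdict(intent), sort_keys=True)

vision_intent = Intent(
    application="line-3-vision",
    device_group="cameras",
    min_throughput_mbps=50.0,
    max_latency_ms=10.0,
    max_packet_error_rate=1e-5,
)
body = to_request_body(vision_intent)
print(body)
```

Because the intent is plain data, the same definition could be kept in version control alongside IaC templates and reviewed like any other configuration change.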
Zeetta delivers on this vision by hiding the details of vendors and technology domains under a layer of abstraction and then enabling the enterprise application developers to consume these services in an end-to-end low/no-code fashion. This application-centric, end-to-end view also enables DevOps teams to independently innovate and operate applications without the need for centralized large networking groups.
The platform has been developed and trialled within the £9m 5G-ENCODE project, and provides enterprises with a ‘single pane of glass’ to visualise their end-to-end network as well as a set of automation features for optimal network management:
- Automates the design, scheduling and provisioning of network slices in line with intents specifying connectivity and QoS requirements; intents can be predefined or can be modelled; public 5G slices can also be sub-sliced through the use of a policy-based scheduler to provide more fine-grained control in multiplexing multiple applications over a single slice resource
- Facilitates performance monitoring of each network slice, flagging any deviations from the targeted intent and helping the enterprise team to determine the root cause and remediate, e.g., by adapting a slice as needed to best serve the affected application
- Similarly, automates fault isolation hence speeding issue resolution and delivering a better overall MPN quality & robustness
- Modifies application intent where needed to keep pace with varying application requirements as circumstances change
Zeetta translates the demand and intent into a set of parameters and complex actions for each domain, and leverages the open interfaces provided by the MNOs/MSPs supplying the MPN to create the connectivity slice and avoid over-dimensioning of the RAN, Core and BSS/OSS hence reducing cost (CAPEX and OPEX). This slice is then continuously monitored, compared and adapted based on the quality of experience (QoE) targets.
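The monitor-and-compare step can be sketched generically as a check of measured slice metrics against the targets set by the intent. The metric names, target format and values below are assumptions made for illustration and are not taken from the Zeetta platform.

```python
def find_deviations(targets, measured):
    """Compare measured slice metrics against intent targets.

    `targets` maps a metric name to ("min" | "max", limit).
    Returns a dict describing every metric that misses its target."""
    deviations = {}
    for metric, (kind, limit) in targets.items():
        value = measured.get(metric)
        if value is None:
            deviations[metric] = "no measurement"
        elif kind == "min" and value < limit:
            deviations[metric] = f"{value} below minimum {limit}"
        elif kind == "max" and value > limit:
            deviations[metric] = f"{value} above maximum {limit}"
    return deviations

targets = {"throughput_mbps": ("min", 50), "latency_ms": ("max", 10)}
measured = {"throughput_mbps": 62, "latency_ms": 14}
print(find_deviations(targets, measured))
# {'latency_ms': '14 above maximum 10'}
```

In a real deployment the flagged deviation would feed the adaptation step, for example by triggering a change to the slice configuration; that control logic is outside the scope of this sketch.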
Zeetta product architecture
5G offers high capacity, low latency, and full flexibility, coupled with reliability through dedicated spectrum. Whilst public 5G network slices will evolve over time, the current lack of in-building coverage and fine-grained control means that for many enterprises the best solution is to procure their own 5G MPN.
Many pilots [Vodafone & Ford] have already demonstrated the significant benefits of 5G MPNs and a number of initial deployments are already operational [Verizon & UK ports]. 5G MPN rollout is likely to reach around 25k installations by 2026 and accelerate rapidly to ~120k by 2030 [Analysys Mason; IDC; Polaris Market Research; ABI Research].
Whilst many have leant heavily on MNOs to help design, deploy and configure their MPNs, such an approach will be difficult to scale, and the growth projections are unlikely to be realised unless 5G MPNs can be as simple to deploy and manage as experienced with cloud resources today.
If achieved, this will open up 5G MPNs to enterprises of all sizes – in essence, similar to the democratisation of telco APIs brought about by the introduction of developer-friendly platforms (and RESTful APIs) from the likes of Twilio a decade or so ago.
Twilio growth in the past decade [source: Twilio]
The cloud emerged in a similar timeframe, but since those early launches of elastic processing and storage, a multibillion-dollar industry has grown up around them supplying tools and supplementary services to make the consumption of these resources simpler. To enable enterprise 5G MPNs to be built on-demand as simply as is now enjoyed with cloud resources will require a similar ecosystem of tools and services to emerge.
Zeetta is leading the vanguard in this regard by providing a sophisticated orchestration tool that acts essentially as a ‘Kubernetes for MPNs’, but extends across multiple technology domains (4G, 5G, Wi-Fi, SD-WAN, MEC, public 5G slices etc.) to provide comprehensive management, and all exposed via an intuitive ‘single pane of glass’ and DevOps-friendly interface.
Demand for high performance compute (HPC) on the rise
Once the preserve of particle physicists, Formula 1 designers, and climate forecasters, HPC is rapidly going mainstream as corporates increasingly introduce deep learning models, simulations and complex business decisioning into their daily operations.
HPC can play a pivotal role in accelerating product design, tackling complex problems and enabling businesses to generate insights faster and with more depth and accuracy, and has applicability across an ever-widening range of industries including financial services, media, gaming and retail.
To-date, the only option for corporates seeking to access this level of compute was to build, maintain and operate dedicated HPC facilities in-house, but this brings a number of challenges. The first is cost – HPC systems are expensive, and only around 7% of the budget actually goes toward the hardware, the rest being consumed by buildings, staffing, power, cooling, networking etc. Moreover, because many of these systems are designed to support peak demand, utilisation can be as little as 60% for the majority of the time.
What’s more, significant additional capital is needed on a three year upgrade cycle to keep pace with demand as the business grows and the volume and complexity of workloads increases, and/or to reap the benefits of the latest computing technology. But this computing resource is intrinsically finite and hence projects within an organisation need to be prioritised leading to many missing out or having to step aside if more urgent tasks come along.
HPC on-premise deployments are traditionally designed and optimised around particular use cases (such as climate forecasting) whereas corporates today need HPC resources that can support a much broader set of applications and be able to adapt as workload characteristics evolve in reaction to the fast-moving competitive landscape.
Many companies are therefore turning to cloud-based HPC.
Benefits of cloud HPC
First and foremost, cloud HPC provides more flexibility for an organisation to gain access to HPC resources as and when needed and scale to match individual workload demands.
It also opens up more choices for the corporate; for instance, employing 10x HPC resources to accelerate product design and gain competitive advantage by being first to market. Or increasing productivity by removing compute barriers so that the corporate can use more detailed simulations or eliminate the effort in simplifying deep learning models to fit inside legacy hardware.
A cloud-based approach mitigates the risk in cost & complexity of operating HPC on-prem by providing flexibility to manage the cost/performance trade-off, allowing HPC environments to be created on the fly and then torn down as soon as the workloads have completed to avoid the corporate paying for resources and software licenses that are no longer needed.
To accommodate this variability in customer demand, CSPs dimension their cloud infrastructure with excess capacity which is powered-up and ready to use but otherwise sitting idle. To offset the monetary and environmental impact of this idle infrastructure, CSPs offer this excess capacity in the form of preemptible instances at massive discounts (up to 90% in some cases) but with the caveat that the resource can be reclaimed by the CSP at a moment’s notice if required by a full-paying customer – a corporate choosing to use these preemptible instances is essentially trading availability guarantees for a variable but much reduced ‘Spot’ price.
HPC workloads such as running a simulation, training a deep learning model, analysing a big data set or encoding video are periodic and batch in nature and not dependent on continuous availability hence a good fit for preemptible resources. If some of the resource instances within the HPC cluster are reclaimed during processing, the workload slows but does not completely stop. Ideally though, it should be possible to quickly and seamlessly re-distribute the part of the workload that was interrupted to alternate resources at the same or a different CSP thereby ensuring that the workload still completes on time – this is possible but not easy, and an area of speciality for some 3rd party tool providers.
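The behaviour described here, where a reclaimed instance slows the workload without stopping it, can be illustrated with a small tick-based simulation. It is a simplified sketch: every task takes one tick, the reclaim schedule is supplied by hand, and the function and worker names are hypothetical.

```python
from collections import deque

def run_batch(tasks, workers, reclaimed_at):
    """Process `tasks`, one task per worker per tick.

    `reclaimed_at` maps a tick number to the workers the CSP reclaims
    during that tick; a task running on a reclaimed worker is put back
    on the queue instead of being lost."""
    queue, done, active = deque(tasks), [], set(workers)
    tick = 0
    while queue:
        if not active:
            raise RuntimeError("no workers left; provision replacements")
        tick += 1
        assigned = {w: queue.popleft() for w in sorted(active) if queue}
        for worker in reclaimed_at.get(tick, ()):
            active.discard(worker)
            if worker in assigned:
                queue.append(assigned.pop(worker))  # re-queue interrupted work
        done.extend(assigned.values())
    return done, tick

tasks = ["t1", "t2", "t3", "t4", "t5", "t6"]
workers = ["a", "b", "c"]
print(run_batch(tasks, workers, {}))            # finishes in 2 ticks
print(run_batch(tasks, workers, {1: ["c"]}))    # finishes in 3 ticks, nothing lost
```

Losing one of three workers in the first tick stretches the job from two ticks to three; keeping the original completion time would require provisioning a replacement instance, which is the harder problem the specialist tools address.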
With the increase in availability of HPC resources twinned with the ability to closely manage cost, corporates get the opportunity to open up HPC resources to the wider organisation, enabling a wider range of teams, departments and geographically dispersed business units to access the processing power they need whilst being able to track cost and performance and focus on outcomes rather than managing operational complexity.
In a world that is speeding up, becoming more competitive, and being driven by continuous integration and continuous delivery (CI/CD), easy access to cost-effective HPC resources on-the-fly is likely to become a key requirement for any corporate wishing to stay ahead.
Considerations when leveraging cloud HPC
Running complex technical workloads in the cloud is not as simple as swiping a credit card and getting a cloud account.
Many of the companies coming to cloud HPC will be specialists in their area, and may also have cloud expertise, but will need support in composing their workloads to take advantage of the parallelism within the cloud HPC stack, and tools to help them optimise their use of cloud resources.
Such tools will need to work across both workload management and resource provisioning, balancing them to meet the corporate’s target SLAs whether that be dynamically adding more resource to complete a workload on time, or prioritising and scheduling workloads to make most effective use of resources to meet budgetary constraints.
More specifically, tools will be needed that can:
- Conduct realtime analysis of workload snapshots to determine their compute requirements.
- Sift through the bewildering array of 30,000 different compute resources offered by the CSPs to ensure the best fit for each individual workload whilst also abiding by any corporate policy or individual budgetary targets. Factors that may need to be taken into account when selecting appropriate resources include:
  - Workloads that are limited by the number of cores they can use
  - Workloads that require particular processor hardware to match the OS used within a virtual image
  - Workloads that may have scaling limitations due to the nature of the application licensing model provided by the software provider(s) that may not allow for bursting above a small number of processors
  - Workloads that require processor hyper-threading to be disabled and/or are dependent on bare metal servers as opposed to virtual machines to maximise performance
  - Tightly-coupled workloads that have specific latency and bandwidth requirements for communication between the cluster nodes
  - Workloads involving sensitive data or regulatory restrictions that require all processing to be conducted within a particular locale
  - Preference for cloud resources powered via renewables to meet corporate ESG targets
- Create clusters of mixed instance types, and do this x-CSP to avoid vendor lock-in and/or to circumvent constraints imposed by any single CSP when dealing with large clusters.
- Ensure the workload data is available in the relevant cloud by replicating data between CSPs and locations to ensure availability should a workload need to be executed there.
- Monitor when workloads start and complete to ensure that resources are not left running when no workloads are executing.
- Intelligently monitor spot/preemptible instances (where used) to ensure that workload cost stays within budget as the spot pricing fluctuates with demand, and reallocate workloads seamlessly if instances are reclaimed by the CSP to ensure that the composite cluster is able to deliver against the workload targets.
- Integrate into a corporate’s DevOps and CI/CD processes to enable accessibility of HPC resources more broadly across the organisation.
- Provide a single view of workload status and enable users to dynamically make changes to their workloads to deliver results on time and within their project budgetary constraints.
- Coordinate with any 3rd party schedulers already used by the corporate (e.g., Slurm, IBM LSF, TIBCO DataSynapse GridServer etc.) to provide a single meta system for workload submission and management across on-prem, fixed cloud and public cloud HPC resources.
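The resource-selection requirement can be sketched as a filter over an instance catalogue followed by a cost-based choice. The catalogue entries, provider names and prices below are invented for illustration, and only a handful of the factors listed above (cores, locale, bare metal, renewables) are modelled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instance:
    name: str
    csp: str
    region: str
    cores: int
    bare_metal: bool
    renewable: bool
    hourly_price: float

def cheapest_fit(catalogue, min_cores, allowed_regions,
                 need_bare_metal=False, prefer_renewable=False):
    """Return the cheapest instance meeting the hard constraints,
    ranking renewable-powered options first when that is preferred."""
    candidates = [
        i for i in catalogue
        if i.cores >= min_cores
        and i.region in allowed_regions
        and (i.bare_metal or not need_bare_metal)
    ]
    if not candidates:
        return None
    return min(candidates,
               key=lambda i: (prefer_renewable and not i.renewable, i.hourly_price))

# Hypothetical catalogue; real CSP catalogues run to thousands of entries.
catalogue = [
    Instance("small-vm", "csp-a", "eu-west", 16, False, False, 0.80),
    Instance("big-vm", "csp-a", "eu-west", 64, False, False, 2.90),
    Instance("green-vm", "csp-b", "eu-north", 64, False, True, 3.10),
    Instance("metal", "csp-b", "us-east", 96, True, False, 4.50),
]
print(cheapest_fit(catalogue, 32, {"eu-west", "eu-north"}).name)
print(cheapest_fit(catalogue, 32, {"eu-west", "eu-north"}, prefer_renewable=True).name)
```

Treating locale and hardware as hard filters but renewables as a ranking preference is a design choice made for this sketch; a corporate policy could just as easily make renewable power mandatory.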
Client types and associated requirements
The relative importance of these different tools and requirements will very much be determined by the type of company seeking to utilise cloud HPC, the level of resources they may already have in place and the type of workloads they need to support.
Three example client types are outlined:
Multi-national organisations and specialist corporates in sectors such as academia, engineering, life sciences, oil & gas, aircraft and automotive that already have an HPC data centre on-prem but aim to supplement it with cloud HPC resources to avoid the cost of building out and maintaining additional HPC resources themselves to increase capacity.
Such clients may use cloud resources as an extension of their existing HPC for use with all workloads, or segment and only use cloud for ad hoc, non-critical (and loosely coupled) workloads, or perhaps just for ‘bursting’ into the cloud to deal with peaks in demand, either because the planned workload exceeded expectations and bursting was needed to complete it on time (e.g., CGI rendering), or because bursting was employed to speed up execution and produce simulation results more quickly. Using the cloud as an adjunct enables these companies to extend the usefulness of their existing on-prem systems, and any new systems they deploy can be designed with less peak performance capacity, since they can burst into the cloud whenever needed.
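A back-of-the-envelope sizing of a burst is sketched below. It assumes a perfectly parallel (loosely coupled) workload measured in node-hours and cloud nodes equivalent in performance to the on-prem ones; both are simplifying assumptions, and the figures are illustrative.

```python
import math

def burst_nodes_needed(work_node_hours, on_prem_nodes, deadline_hours):
    """Cloud nodes to add so the workload completes by the deadline,
    assuming the work parallelises perfectly across identical nodes."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    on_prem_capacity = on_prem_nodes * deadline_hours
    shortfall = work_node_hours - on_prem_capacity
    return max(0, math.ceil(shortfall / deadline_hours))

# A 1,200 node-hour rendering job, 40 on-prem nodes, 24 hours to deliver.
print(burst_nodes_needed(1200, 40, 24))  # 10
# A smaller job fits on-prem, so no bursting is needed.
print(burst_nodes_needed(900, 40, 24))   # 0
```

The same calculation run with a shorter deadline shows the second use of bursting: adding cloud nodes purely to bring results forward rather than to rescue an overrun.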
Given that the corporate will already have on-prem and/or private cloud infrastructure, cloud HPC tools will be needed that can interface with the existing 3rd party workload schedulers. Equally, any cloud HPC resources that are employed may need to be matched to the on-prem resource types already in-use, hence intelligent tooling will be needed that can analyse individual workload requirements and provision the most appropriate cloud resources across the myriad of available instance options from the CSPs, and map the workload accordingly across the on-prem and cloud infrastructure.
Depending on the workload, the corporate may also decide to use spot/preemptible instances to complete batch processing tasks without loading other cloud resources and/or as a way of managing cost.
Corporates in sectors such as financial services, retail, media, gaming, manufacturing and logistics that depend on high-performance compute to drive their deep learning models, simulations and business decisioning and so maintain a competitive edge, but have insufficient funds for, and/or interest in, deploying and managing dedicated HPC resources on-prem, and hence rely on such resources being provided via the cloud.
Given the mission-critical nature of their workloads, such corporates are likely to follow a multi-cloud strategy to provide resiliency and de-risk dependency on a single provider. Selection of resources may also be driven by corporate sustainability goals, with a preference for CSPs and/or specific CSP data centres that maximise use of renewables.
Intelligent tooling will also be needed to help the corporate parallelise its workloads and integrate with its existing DevOps processes, along with a dashboard providing oversight of the HPC resources employed and workload status.
Similar to the cloud-native corporates, many startups/scale-ups utilising deep learning for NLP, computer vision etc. are keen on gaining access to HPC resources to accelerate their product development and time to market, and/or would like to develop products and services that can scale up and down in the cloud, but may not have the budget or expertise to achieve this.
Such companies are therefore wholly dependent on automated tools that enable them to programmatically control their usage via DevOps interfaces and dynamically switch between different CSPs and instances to minimise their costs. Primary usage will be via preemptible resources, and startups may also choose to use older generation instances to meet budgetary constraints.
YellowDog is a pioneer in the cloud HPC space, providing solutions that enable intelligent orchestration, scheduling and provisioning at scale across on-prem, hybrid and multi-cloud environments and delivering on all the requirements outlined above.
In addition to providing benefits to companies already employing HPC, YellowDog is unique in being able to generate clusters delivering HPC levels of compute using spot/preemptible instances. It is therefore well placed to support the new breed of companies needing access to HPC performance levels at an affordable price, and to provide startups with a base platform on which they can easily develop a new autoscaling product or service, reducing their time to market and simplifying development.
A particular speciality of YellowDog is the ability to rapidly spin up massive-scale HPC clusters that aggregate resources from multiple CSPs and/or across multiple regions to circumvent the scaling limits of any particular CSP. In 2021, YellowDog successfully demonstrated the creation of a cluster utilising 3.2 million vCPUs on AWS to run an HPC workload with 95% utilisation, and achieved this feat in under an hour.
Figure 4: Scale-up to 3.2 million vCPUs and rapid scale-down on job completion [Source: YellowDog; AWS]
The YellowDog platform provides a straightforward GUI enabling engineers and scientists to use the platform without needing to be HPC specialists. It also provides a sophisticated dashboard and API access for managing workloads and provisioning preferences, including an ML-based prediction of completion time, enabling customers to easily flex the resources being employed to meet a particular deadline or budgetary constraint.
Unique in the market, YellowDog also compiles real-time insight into the myriad of different instances offered by the main CSPs with regard to their machine performance, pricing and use of renewables, and utilises this intelligence within the YellowDog platform to deliver optimal provisioning for its clients.
Whilst there are other companies offering solutions to help clients with their cloud orchestration and management, only YellowDog provides orchestration twinned with intelligent scheduling and provisioning at sufficient scale to deliver compute capabilities at HPC performance levels. It does so at a price point, using spot/preemptible resources, that meets the growing industry demand, and via a platform and set of tools that enable all to enjoy the benefits of cloud HPC.
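The trade-off between deadline and resources can be made concrete with some back-of-envelope arithmetic. The sketch below is not YellowDog's prediction model; it is a simplified illustration that assumes the remaining work is known in vCPU-hours, that the workload scales linearly across vCPUs (as for loosely coupled jobs), and that a fixed utilisation applies. The function names, the default utilisation and the price used in the example are all assumptions made for illustration.

```python
import math


def vcpus_needed(remaining_vcpu_hours: float,
                 hours_to_deadline: float,
                 utilisation: float = 0.95) -> int:
    """Estimate the vCPUs required to finish the remaining work by the deadline.

    Assumes ideal linear scaling; utilisation is the assumed fraction of
    provisioned vCPU time that does useful work.
    """
    if hours_to_deadline <= 0:
        raise ValueError("deadline must be in the future")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return math.ceil(remaining_vcpu_hours / (hours_to_deadline * utilisation))


def estimated_cost(vcpus: int, hours: float, price_per_vcpu_hour: float) -> float:
    """Cost of holding `vcpus` for `hours` at an assumed flat price."""
    return vcpus * hours * price_per_vcpu_hour


if __name__ == "__main__":
    # Illustrative job: 10,000 vCPU-hours of work remaining.
    for deadline_hours in (8, 4, 2):
        n = vcpus_needed(10_000, deadline_hours)
        cost = estimated_cost(n, deadline_hours, price_per_vcpu_hour=0.01)
        print(f"{deadline_hours}h deadline: {n} vCPUs, approx cost {cost:.2f}")
```

Under these assumptions, halving the deadline roughly doubles the vCPUs required while leaving the total cost broadly unchanged, which is why flexing resources to hit a deadline is attractive for workloads that parallelise well.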
The world is speeding up.
Easy access to HPC levels of compute via the cloud is changing the economics of product development, increasing the pace of innovation and enabling corporates to increase agility, accuracy, and critical insights in today’s data-driven economy. By harnessing preemptible instances and spot pricing, even the smallest of companies and startups can now afford to run HPC workloads.
Preemptible instances ensure that cloud resources do not lie idle, bringing environmental benefits as well as incremental revenues for the CSPs and lower costs for companies utilising the cloud – a veritable win-win for all. They also demonstrate that HPC systems in the cloud can be cost-comparable to on-prem alternatives whilst bringing many advantages.
Harnessing the potential of cloud HPC whilst meeting all other business objectives, though, is no mean feat and will be dependent on intelligent tooling. YellowDog is a pioneer in this space and a perfect partner for any business looking to leverage cloud HPC resources to gain a competitive edge.
 Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, Oracle Cloud Infrastructure and Alibaba
Cybersecurity innovation critical in combatting the inexorable rise in cyber threats and ransomware attacks.
Bloc invests in technology areas that underpin the future growth and prosperity of the digital age. Cybersecurity, and in particular the challenges companies face as they move operations online and into the cloud, is a growing area of importance and innovation.
The landscape for security teams is rapidly changing. Digital transformation, accelerated by Covid and remote working, is driving a rapid uptake in cloud utilisation.
Hybrid multi-cloud & remote working practices are dramatically expanding the attack surface as workforces access company IT systems from unsecured devices (home PCs, tablets) and over unsecured WLANs (home, coffee shops), thereby tearing down the single security perimeter that security teams have previously come to rely upon.
Competitive pressures driven by DevOps & CI/CD working practices are leading to mistakes in cloud configuration and deployment of unauthorised shadow IT, both of which are creating additional vulnerabilities within company networks – Verizon estimates that 82% of enterprise breaches should have been stopped by existing security controls but weren’t, and 79% of observed exposures were in the cloud compared with 21% for on-premises assets.
Worse still, zero-day vulnerabilities introduced or exploited within the systems and software of companies’ suppliers are on the rise – a Trojan horse in effect that a business has very little control over, although startups such as Darkbeam are seeking to help companies manage the risk.
Cyber-attacks and the resultant data breaches are expensive, erode customer trust, damage brand reputation and can ultimately stop a company in its tracks.
And yet despite their efforts, many companies are being overwhelmed by the magnitude of threats they face, and are ill-equipped to differentiate between real threats and false alerts coming from their networks.
Survival will be dependent on the development of intelligent tools leveraging advanced AI/ML that can augment and support security teams in their never-ending battle with the cybercriminals.
Key areas for innovation identified by Bloc
We have identified a number of cybersecurity areas for innovation:
- Use of few-shot learning AI techniques for detecting zero-day exploits with unknown signatures, such as those introduced through supply chain attacks
- Methods for obfuscating existing networks to inhibit attackers without the company needing to re-architect
  - Enclave Networks is one such company helping its clients to ‘darken’ their networks through the introduction of invisible network access gates
- Implementing zero-trust principles to prevent attackers from moving laterally through the network after gaining access via infected systems
  - Zero-trust assumes that everyone in the network could be a bad actor, hence all activity is continuously monitored for behavioural anomalies, and access to individual systems is managed via granular privileges and more robust authentication methods
- Introduction of cyber deception platforms and honeytraps that lure attackers into revealing themselves, thereby enabling security teams to shut them down before they cause any serious damage
  - CounterCraft, for example, provides a cyber deception and counterintelligence platform designed to detect intrusion and insider threats before attacks are perpetrated
- Supporting anomaly detection at scale, especially for Industrial IoT networks comprising huge numbers of devices
  - Real-time anomaly detection becomes especially challenging in the IoT space as the number of devices scales into the millions. One way to tackle this (pioneered by Shield-IoT based on work conducted within MIT) is to compress the network and resulting data into a smaller coreset, enabling context-free, highly accurate anomaly detection in minutes instead of hours or days
The market opportunity is clear
The market for cybersecurity software & tools was worth $12 billion in the UK, $26.5 billion in Europe and $78 billion globally in 2020, and is projected to grow to $118 billion globally by 2024. The cybersecurity market for hardware & software combined is expected to exceed $200 billion by 2024 and reach $372 billion globally by 2028.
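To put the global software & tools projection in perspective, the implied compound annual growth rate (CAGR) can be worked out from the two figures quoted above. The short calculation below is a sketch that assumes smooth year-on-year compounding between 2020 and 2024; the function name is illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by growth from start to end value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    # Global cybersecurity software & tools: $78bn (2020) to $118bn (2024).
    rate = cagr(78, 118, 4)
    print(f"Implied CAGR 2020-2024: {rate:.1%}")  # roughly 10.9% a year
```

On these figures, the global software & tools market would be growing at roughly 11% a year.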
Managing cloud vulnerabilities is a race between attacker and defender and therefore ripe for new entrants bringing fresh ideas and utilising the latest technology to deliver anomaly detection, behavioural profiling and automated tools for supporting security teams and those companies wanting to take their business operations into the cloud.