Key Trends on the Horizon for HPC Clouds in 2011

As we begin a new year, it’s time to take a look at the current environment and try to determine the shape of what’s to come. So to kick off the year, let’s begin with some predictions for HPC and the computing industry in general in 2011.

One-year predictions, however, must be about things that are already underway, so these may not be “predictions” to everyone, but they can, at the very least, serve as commentary on the shape of things to come.

While it would take a novel-length manifesto to go into depth about the following points, we can at least identify a host of key trends to watch as 2011 unfolds.

Infrastructure Management Evolution

  • Infrastructures are reaching a point of capacity and complexity at which manual intervention no longer makes sense. Much as we no longer write machine code to execute programs, we will generate policy, and policy will execute low-level constructs at the system level to implement changes.
  • Events that were previously not policy criteria begin to be integrated into management schema. Examples of criteria integrated into policy decision trees include storage performance tiers (IOPS, throughput, capacity), compute performance tiers (overclocked, latest generation, core width, GPGPU), network performance tiers (IB, 10GbE, 1GbE), and contiguous memory footprint.
  • Virtualization performance inefficiencies are partially addressed, allowing HPC environments to consider use of virtualization.
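
To make the idea of policy criteria concrete, here is a minimal Python sketch of a policy that filters nodes by compute tier, network tier, and memory footprint. The Node and Policy classes, the node names, and the tier labels are illustrative assumptions built from the examples in the list above; they do not describe any particular scheduler or product.

```python
from dataclasses import dataclass

# Hypothetical model: tier labels are borrowed from the examples in the list above.
@dataclass(frozen=True)
class Node:
    name: str
    compute_tier: str    # e.g. "overclocked", "latest_gen", "gpgpu"
    network_tier: str    # e.g. "IB", "10GbE", "1GbE"
    free_memory_gb: int  # contiguous memory footprint available on the node

@dataclass(frozen=True)
class Policy:
    compute_tiers: frozenset  # acceptable compute tiers
    network_tiers: frozenset  # acceptable network tiers
    min_memory_gb: int        # minimum contiguous memory required

def eligible_nodes(policy, nodes):
    """Return the names of nodes that satisfy every criterion in the policy."""
    return [
        n.name for n in nodes
        if n.compute_tier in policy.compute_tiers
        and n.network_tier in policy.network_tiers
        and n.free_memory_gb >= policy.min_memory_gb
    ]

# Made-up inventory for illustration only.
NODES = [
    Node("n01", "overclocked", "IB", 256),
    Node("n02", "latest_gen", "10GbE", 128),
    Node("n03", "gpgpu", "IB", 64),
    Node("n04", "latest_gen", "1GbE", 512),
]

# A tightly coupled job: needs a low-latency fabric and a large memory footprint.
mpi_policy = Policy(frozenset({"overclocked", "latest_gen"}), frozenset({"IB"}), 128)

print(eligible_nodes(mpi_policy, NODES))  # ['n01']
```

The point of the sketch is that the administrator states what the workload needs, and the policy layer, not a person, decides where it can run.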

Network Evolution

  • WAN bandwidth approaches commodity pricing.
  • Usage models evolve to thin clients with data hosted from centralized, consolidated datacenters. While this may seem to not directly apply to the HPC market (most HPC shops have been doing thin client for years), the efforts and research in this space may enable very different futures provided sufficient success.
  • Significant effort is invested in data-centric cached distribution models.
  • Thin client software will evolve to present a more real-time experience, narrowing the gap between remote execution and local execution performance.
  • Device-aware content delivery progresses. There will be work invested in sensing the configuration of the client device to determine the quality of the experience.

Compute Evolution

  • Overclocking goes mainstream. This year we will see a couple of different flavors of overclocked solutions emerge to allow a premium performance option on the compute front.
  • Larger variety of performance tiers. There will need to be accommodations on the provisioning side to allow for computational performance specification (overclocked, bin1, bin2, GPGPU) and prioritization.
  • DRAM capacities catch up (finally), but still lag on the performance side.

Storage Evolution

  • SSD technologies integrate into enterprise storage solutions. This will add a performance component that has been sorely missing in spindle-based solutions.
  • File systems start to look at the performance characteristics and capacities of components as storage decision criteria. Additional work will be invested in historical tracking of access patterns in order to fully flesh out this capability.
  • Storage solutions start taking part in policy-based solutions. Policies will enable real-time creation of cache copies of oversubscribed data sets, will constrain workload use of saturated file server resources, and will migrate data to higher-capacity, lower-performing storage based on policy at a file, directory, or volume level.
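
The three storage policies named in the last bullet can be sketched as simple rules. This is a rough Python illustration only: the DataSet and FileServer classes, the action names, and the thresholds are all hypothetical, and real values would come from site policy and from the historical access tracking mentioned above.

```python
from dataclasses import dataclass

@dataclass
class DataSet:
    path: str
    concurrent_readers: int  # clients reading the data set right now
    days_since_access: int   # taken from historical access tracking

@dataclass
class FileServer:
    name: str
    utilization: float       # fraction of available throughput in use, 0.0 to 1.0

# Hypothetical thresholds chosen for illustration.
MAX_READERS = 50        # above this, the data set counts as oversubscribed
SATURATION = 0.90       # at or above this, the file server counts as saturated
COLD_AFTER_DAYS = 90    # beyond this, the data counts as cold

def dataset_action(ds):
    """Pick a policy action for a single data set."""
    if ds.concurrent_readers > MAX_READERS:
        return "create_cache_copy"
    if ds.days_since_access > COLD_AFTER_DAYS:
        return "migrate_to_capacity_tier"
    return "none"

def server_action(fs):
    """Constrain workloads on a saturated file server."""
    return "throttle_new_workloads" if fs.utilization >= SATURATION else "none"

print(dataset_action(DataSet("/proj/hot", 120, 0)))      # create_cache_copy
print(dataset_action(DataSet("/proj/archive", 0, 400)))  # migrate_to_capacity_tier
print(server_action(FileServer("fs01", 0.95)))           # throttle_new_workloads
```

The same rules could be evaluated at the file, directory, or volume level; the sketch deliberately leaves that granularity out.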

Business Evolution

  • IT is recognized as a business enabler. Business will reassess how IT is funded, how it is staffed, and where it reports.
  • Continued growth of infrastructure drives reassessment of acquisition and management practices (getting too big, too complex with linear growth).
  • Internal IT organizations will evolve into a management function, looking to outsource significant portions of technology consumption.
  • Purpose-built clouds emerge to address specific business vectors. Over time, consolidation of these clouds can occur to accomplish additional cost benefit with the guidance of customer businesses.

This rounds out the list of what to watch in 2011 and provides some insight into the emerging trends in this rapidly evolving space. While some of these movements may be well underway, we can expect to see greater maturation of clouds as a whole this year, for high-performance computing and beyond.

Blog originally posted at HPC in the Cloud.

Cloud Control: Outsourcing an HPC Cluster

So, thus far in this series of posts, we have discussed the following issues:

  • IT is not a core competency of the business, so we should look to outsource if we can outsource without jeopardizing the business.
  • We should look to cloud computing to bring costs under control and to deliver cost efficiencies over time, not as an immediate cost reduction activity.
  • In order to outsource IT, we must trust the suppliers and vendors involved, which means developing relationships, not better bludgeoning weapons. And we have already done an extremely similar divestiture in our past, so we have a model to look at that says it can be done successfully.

Now we need to talk about what an organization would need to look like in order to properly manage the outsourcing of your HPC cluster. So what would that look like? Well, we should assume that all technical and operational capabilities necessary to execute the infrastructure are included in the outsource. The supplier is expected to provide the entirety of the technical function and carry out all operational duties. That is not to say that the customer is off the hook technically; just the opposite. The customer needs to assemble a small team of technically savvy, business-minded (specific to the core product of your company) individuals to measure and manage the outsource. This team needs to be very strong technically in order to vet and gauge any available technologies for potential use, as well as to identify flaws in the solutions delivered or in the methodology behind them. The size of the team would depend on the size of the company (and therefore the size of the outsource).

Functionally, the outsource management team is the control point for the outsourcing of your infrastructure. Through this group, you maintain control over your infrastructure, and therefore can have full trust in your outsource partner (because you know exactly what you want, and you know how to measure if you are getting it). The intent of this team is to stay abreast of the constantly changing needs of the business, understand the continuously evolving capabilities of technology, and combine the two to understand how the company should be leveraging technology to maximize benefit to the business and control costs. With that combined awareness, you can hold the outsource accountable for delivering an appropriate solution to your company’s need.

This is not to say that all responsibility falls to the customer outsource team. The supplier will need to have a disciplined focus in the specific space that your company does business, and be innovating their solution to specifically solve the problems of that industry. If they do not, then they will probably not be a cost competitive, viable supplier long term.

You will see many functions that fall under the customer outsource team. And remember, this team needs to remain small in order to avoid paying too much for your solution. There will be a constant loop for the outsource team to:

  1. Quantitatively measure the current solution
  2. Analyze cost and benefits of the current solution
  3. Assess best practices
  4. Revise the current solution
  5. Loop back to 1

There will be several technical responsibilities that the outsource team will participate in jointly. The supplier should be doing most of this work for the customer, but how do you know if the data they are presenting is 100% accurate or appropriate for your solution? When in doubt, the outsource team will generate their own data and share that data with the supplier to derive a more accurate solution. To that end, the outsource team will take on some, but not every facet, of the following:

  • Technical and cost benchmarking
  • Technical advisory / liaison (IT industry to customer business)
  • Technical architecture – designing the architecture of applications and services that are appropriate for the company’s consumption

There are many responsibilities of the outsource team that will fall into the relationship management arena. This team will be the primary point of contact and control between the customer and the supplier, and I can’t say enough how important having a positive relationship with the supplier is to the quality of the product you consume or the price you pay for that product. The outsource team will be responsible for communicating current and future requirements to the supplier, and many of those will take the form of Service Level Agreements (SLAs), which we will talk about in a moment. The outsource team will also be responsible for how technology is being consumed by the customer company. The outsource team needs to make sure the company is getting the appropriate solution from the supplier company at an appropriate price with appropriate constraints / limitations / boundaries.

Another very important responsibility of the outsource team will be to maintain flexibility from a quality-of-solution as well as a cost perspective. In this, staying standards based is very important. It is not an absolute requirement; there may be proprietary solutions that solve a problem much more efficiently or cost effectively. What you need to consider in this case is this: when the vendor thinks they have you locked in and starts raising the price because they think you can’t get out of their solution, what is your plan for defeating them? So, where possible, use industry standards so that you can move from vendor to vendor without losing time, money, or critical features. Where that is not possible, what is your plan for using one vendor’s proprietary solution while remaining able to migrate to another vendor’s solution without impact, so that you maintain your negotiating position?

Finally, there is the new component of infrastructure management. The outsource team will need to learn how to define and measure service level agreements (SLAs). The definition stage will have several components. What is the service level expectation (which defines the success and failure criteria)? This will sometimes have many different components for a single solution. An example would be storage: is there enough capacity, do we get enough IOPS, and is there enough throughput? All of these are different measurements, but all are critical to a storage infrastructure for HPC.

How will this service level be measured, and how often? We have all seen improper SLA measurements where IT informs the engineer that the environment has 99.997% availability, but the engineer knows that there were several outages that left him or her non-productive for days at a time. So do you measure component-level availability or solution-level availability? How frequently is availability polled? Is availability the right measure? This is all part of the definition.

And then, what happens when a failure criterion is met? This is where a lot of work is happening in the industry. It is not sufficient to refund the month’s colo fees when a power outage cost the company 6 weeks’ worth of work. There is a cost to failure, and that cost is usually very specific to the industry. An outage on a cluster for an EDA company has different implications than an outage on a scientific computing cluster for a university. The recourse needs to be negotiated based on impact. Does this sound at all familiar? Any insurance people reading this? Well, one of the solutions the industry is exploring is having insurance policies behind the supplier. Finally, we need to look at how service levels are reassessed over time. As the technology evolves, so should the service levels.
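
The gap between a reported availability figure and an engineer’s experience is easy to see with a little arithmetic. The Python sketch below is only an illustration: it assumes a 365-day year, assumes that component failures are independent, and uses made-up component figures and function names that do not come from any real SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year

def allowed_downtime_minutes(availability_pct):
    """Minutes of downtime per year that a given availability percentage permits."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

def serial_availability_pct(component_pcts):
    """Solution-level availability when every component must be up at once.

    Assumes component failures are independent, which is a simplification.
    """
    result = 1.0
    for pct in component_pcts:
        result *= pct / 100
    return result * 100

def availability_from_outage_pct(outage_minutes):
    """Availability over one year implied by a total amount of outage time."""
    return (1 - outage_minutes / MINUTES_PER_YEAR) * 100

# 99.997% availability leaves room for only about 15.8 minutes of downtime a year.
print(round(allowed_downtime_minutes(99.997), 1))

# Four hypothetical components (storage, network, scheduler, compute), each at
# 99.9%, give a solution-level figure of roughly 99.6%.
print(round(serial_availability_pct([99.9, 99.9, 99.9, 99.9]), 2))

# A single two-day outage already pulls yearly availability down to about 99.45%.
print(round(availability_from_outage_pct(2 * 24 * 60), 2))
```

Under these assumptions, an engineer who lost days of work cannot have experienced 99.997% availability at the solution level, which suggests the reported number was measured on individual components or polled too coarsely to catch the outages.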

The fabless semiconductor industry is fairly mature in its process for outsourcing the fabrication function. It has cost models and laws (Rock’s Law for the cost of a fab over time) that help decision processes, it has a collaborative (FSA) for arriving at better process, and it has an established track record showing that this can be accomplished very successfully and with cost benefit. The HPC Cloud industry needs to mature. That will just take time.

Original blog post found at HPC in the Cloud.

Cloud is History: The Sum of Trust

To continue where we left off with the last blog, this time we are focusing the discussion around trust. In considering cloud, this is probably the largest barrier we will encounter.

If we look at history, the issues associated with trusting someone else to perform what we view as a critical element of our business have been faced and successfully addressed in the past. Semiconductor companies once had to keep the entire manufacturing process under their direct oversight and control because portions of that process were considered business differentiating and proprietary, and close coupling between the design process and the manufacturing process was required for successful ASIC development (lots of iterative, back-and-forth process). As time marched on, capacity needs increased and complexity climbed, and cost increased with each of those dynamics, creating an ever higher barrier to entry for maintaining existing fabrication facilities or creating new ones. In the mid-1980s, we witnessed the birth of the first foundry, with TSMC coming onto the scene to create a differentiated business model (Fabless Semiconductor), where engineering companies could focus on just the process of design and then hand their designs off to TSMC to be manufactured. The Fabless Semiconductor Industry is a $50B market today, and growing.

So, are the issues we face with datacenters today any different? Not really, just a slightly different view of the same picture. The dynamics are the same: a non-linear cost increase due to capacity and complexity increases is the driver for re-evaluating the current position. The function is considered critical, and sometimes differentiating and/or proprietary to the business, and is therefore internally maintained at present. And finally, the function deals directly with the core product of the company, therefore security is a paramount concern. What we witnessed with the fabrication facilities is that many companies were able to realize the cost benefits of outsourcing that function without damaging the business, so we should be able to follow that model to realize the cost benefits that cloud computing offers with respect to the datacenter. And we even have a recipe for success to look at and use as a template for what to do and how to do it.

What customers of cloud will be looking for from service providers is multi-faceted:

  • Budget control – making sure they can continue to do the right thing for their company from a cost perspective and continue to come up with creative ways to keep budgets under control. This includes making sure they do not get locked into exclusive relationships, so they need multiple vendor options to keep competition alive. In the same light, they need to make sure that the solution they consume is standards based, so that moving to another provider is simple, straightforward, and not costly.
  • Do it my way – points to customer intimacy. The consumer company must understand the solution they are leveraging and the supplier must provide the solution in such a way that it makes sense to the customer. This sounds obvious, but in many cases, companies have been held hostage even by their own internal IT organizations through confusing terminology, overly complex descriptions of solutions, and territorial behavior. The customer should understand the solution on their terms, which implies that the service provider must intimately understand the customer’s core business. Customers should get the services and solution they need, which is something specific to their business, not something bootstrapped from another industry or something built for a different or generic purpose. And it is not sufficient to have really smart technology people on staff, and have the customer tell the service provider exactly what they need so the supplier can do the right thing – many times the customer doesn’t know what they need, they just want it to work right. That is why this needs to be domain specific, performed by domain experts in the customer’s space.
  • Honesty – do I believe you? The customer needs to have faith and confidence that the supplier has the best interest of the consumer as a driver. That means understanding intent and distinguishing positive behavioral characteristics from negative ones. Any competitive or adversarial behavior is a sign that trust should be called into question.
  • Focus on my business, not yours (counter-intuitive concept). This is really the crux of the issue. If the customer can really believe that the supplier is looking out for customer interests first, and not only trying to tell the customer whatever they think they want to hear, only then will the customer allow the supplier to absorb responsibility from them for their infrastructure to help make them successful. This is key because if the customer has to continue to drive success and own all the responsibility, then nothing has really changed, and it is probably easier for the customer to continue keeping all the resources in-house where they have much more direct control over hire/fire, retention, resource caliber, etc.

As a result, cloud service providers will need to demonstrate many things in order to establish trustworthiness. From an intent standpoint, make sure the focus is on the end customer. In the EDA space, that would be the engineer. Understand the customer’s business to the point that you can help them do their job. This implies an intimate understanding of the tools, what they do, how they work, and where they fit as well as business model, economic drivers, and a solid grasp of the industry dynamics. Also, the supplier should maintain a long term view (strategic) in addition to a short term perspective (tactical). Always do the right thing now, but how solutions are designed to scale into the future can have significant cost impacts over time. Finally, it should always be relationship focused. The ability to judge trustworthiness is measured over time, and your every action defines the integrity and character of your organization.

The behavior portion for the supplier is fairly straightforward. Deal with customers in a transparent, honest fashion. Don’t try to hide things, and don’t play the poker game of masking your agenda, worrying about what you’re leaving on the table, masking how much anyone is getting, or trying to optimize one variable in the whole equation (profit, one-sided benefit, etc.). Don’t create win/lose scenarios and don’t try to get some undeserved benefit. Exchanges should always be “appropriate” and fair; avoid developing adversarial relationships. If a relationship turns adversarial, be open to walking away. Customers, as well as service providers, need to be trained to conduct themselves in a trustworthy manner, and they have an equal hand in creating a trusting relationship. Make sure your relationships are cooperative, not competitive. If you compete with your customers about who is smarter or who is the better negotiator, or only believe a deal is good if you win and the customer loses, you are building a bomb, not a partnership.

There is an equal amount of responsibility on the consumer side of the equation in order to get a partner. From an intent standpoint, the customer should make sure the focus is on the business problem (not departmental issues, not policy issues, not contract issues, etc.), and help the service provider navigate the customer’s internal process in order to keep the focus on the business problem. The customer also needs to make sure there is strong communication with regard to the intended future direction of the company to ensure that plans are strategic and not focused only on the present. The concept of relationship implies a mutual dependence, and it is recognized that interdependence creates risk/exposure, but it also accomplishes the desired efficiencies, economies of scale, and superior solutions, and optimizes economic benefit.

Behaviorally, the customer should also demonstrate transparency and honesty, not hiding information from the provider. Create an environment where the supplier can feel safe being open and honest. The customer wants assurance that they are not being taken advantage of, and there are good ways and bad ways to get that assurance. We will talk more about the good ways in our next blog on organization changes. The good way is to have done all the homework necessary to know roughly what the right answer looks like prior to getting that answer (whether price, technical solution, or technology direction). There is a tremendous amount of work that goes into the development of instincts. The wrong way is adversarial behavior: just pounding vendors for a better price, a better discount, or more resources so that you feel you got a deal, without any comprehension of what an appropriate price or solution looks like. That will have fatal results for trust and for your relationship with your vendors. Competitive or adversarial behavior will result in an adversarial response, which causes a lack of honesty and leads to no trust.

You should not worry about “am I getting a better deal than anyone else in the world” or mask a lack of understanding by treating vendors brutally. Do your homework, know how much something is worth, and make sure you are getting an appropriate price and an appropriate solution. Don’t try to optimize one variable in the whole equation (overly custom for no benefit, only focus on cost, etc.), and don’t create win/lose scenarios or expect to get something undeserved. Everyone needs to care about the health of the ecosystem. Lack of trust means that you will not get good deals or appropriate solutions for the long run.

In conclusion, businesses should focus on the core competency of the business. All non-core portions of the business should be considered for outsourcing, consistent with good business practices. If there exists a trustworthy, cost-effective, customer-focused provider of non-core, non-strategically differentiated functions of the business, those providers should be patronized. If not, create them. Examples of this would be the GlobalFoundries spin-off from AMD, the Jazz Semiconductor spin-off from Conexant, etc. The outsource needs to be structured and contracted in such a way that it facilitates trustworthiness. Make sure the solutions can be moved to an alternate provider without significant modification or cost. Avoid getting committed to solutions that lock you in to a vendor (hardware, software, people, or process). Make sure the solution is standards based and non-proprietary. Make sure that the solution can take advantage of new innovations immediately. Ensure that you negotiate built-in growth ramps for normal business evolution while maintaining flat (predictable) cost to the business (budget control). And make sure the solution scales with the business use case (up or down).

Original blog post found at HPC in the Cloud.