Reinders' Blogs

Ready for 2X Moore's Law: Intel Cluster Studio XE

November 8, 2011 - 8:33am

Today we introduced Intel® Cluster Studio XE, an exciting collection of powerful tools for HPC programmers who use MPI along with other programming models to make the most of clusters and supercomputers. Intel Cluster Studio XE builds on the existing Intel Cluster Studio with two substantial new capabilities to assist in hybrid programming: additional MPI scaling and job control features, and substantial node-level analysis capabilities.

Hybrid programming combines MPI, used for internode parallelism, with a shared memory model such as OpenMP, Intel Threading Building Blocks (TBB) or Intel Cilk Plus, for intranode parallelism. To assist with hybrid programming, Cluster Studio XE includes cluster installation and usage support for Intel Inspector XE and Intel VTune Amplifier XE. Cluster installation for these tools makes getting started easier. Cluster usage of these tools allows them to gather node-level data on dozens, hundreds or thousands of processes. Both tools then take the results and present them in a hierarchical format starting with a “by rank” view of the application.

Intel Inspector XE pinpoints memory errors, such as memory leaks, as well as threading errors, such as race conditions and deadlocks. (Learn more with "Using Intel® Inspector XE 2011 to Find Data Races in Multithreaded Code.")

Intel VTune Amplifier XE provides precise performance information so you can fully understand what is affecting application performance. VTune Amplifier XE probes node-level performance, and beautifully complements the Intel Trace Analyzer and Collector, which probes MPI communication performance. Together, they offer an unequaled view of performance in a hybrid program.

New SLURM job manager support
The Intel MPI Library 4.0.3 offers better integration with SLURM job manager(s).  This provides for tighter control over job submission and startup time. It also provides information to allow process cleanup when a program terminates prematurely due to errors.

The MPI Library has been extended to give the job scheduler visibility into, and control over, how many ranks are running and their respective resource utilization (memory, CPU usage, access to cache, etc.). Before this Intel MPI work, a job scheduler didn’t know if a rank died or ended, leading to a condition akin to a resource leak (requiring a "kill -9" on the process). With many processes running, this could be quite a problem. Now SLURM has visibility into the process state across the ranks and is able to clean up properly. There is additional information in the documentation on how to set up and use this capability with any SLURM-compatible job scheduler.

Faster than Moore’s Law?
I’m fascinated by a trend that is running at a little more than 2X the annual rate of Moore’s Law: the increase in performance of supercomputers.  The Top 500 cluster growth graph clearly shows that the performance of the Top 500 supercomputers has been consistently growing at an annual rate of over 80%, whereas Moore’s Law is a 40% per year growth rate.

Of course, Moore’s Law is about the doubling of transistor densities about every two years. The transistor density increases have in turn driven the computer industry to deliver more and more computer performance. Supercomputer designs have been able to use parallelism at multiple levels to double down on this trend and grow performance at a spectacular rate.
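
As a rough check on those two rates, here is the arithmetic, assuming "doubling every two years" means a constant compound annual rate and taking 80% as the supercomputer figure (both simplifications are mine):

```latex
\[
r_{\text{Moore}} = 2^{1/2} - 1 \approx 0.41
\qquad \text{(about 41\% per year)}
\]
\[
(1 + 0.80)^{2} = 3.24
\qquad \text{versus} \qquad
(1 + 0.41)^{2} \approx 2
\]
```

In other words, at roughly 80% per year the Top 500 performance more than triples in the two years it takes transistor density to double.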

Hybrid programming rides this wave. Ten years ago, MPI programming was most often enough for large systems. Over the last decade, we have seen the individual cluster nodes continue to get “fatter.” This “obesity” at the node level has driven HPC developers to program for internode level parallelism differently than node level parallelism. This is most often seen as MPI + OpenMP, and the node level programming continues to get richer with more options all the time.

Which brings us back to why Cluster Studio XE is so important, especially considering the new hybrid programming insights.

Cluster Studio XE is more than Inspector and VTune Amplifier
Cluster Studio XE is a combination of almost every HPC software development tool that Intel makes. This is because the largest scale machines, and the applications that go with them, exercise every method possible to keep up this “pacing at twice the rate of Moore’s Law.”

Cluster Studio XE includes the Intel C/C++ and Fortran compilers and related libraries including the Intel Math Kernel Library (MKL) that offer unequaled optimization for Intel and compatible processors. Our goal is to offer superior performance and standards support. We have great performance, and we’ve included industry leading support for (most of) C++11, Fortran 2003, Fortran 2008, and IEEE 754-2008. All four of these newer standards are mostly supported but not completely. No one has all four of these implemented – and we believe we have made at least as much progress as anyone else. Consult our documentation for details on what is done, and what is not. I think you will find that we have implemented the most important and most requested portions of each standard already (with more to come). We also have the latest Cilk Plus 1.1, TBB 4.0 and OpenMP 3.1 standards fully implemented. MKL offers core math functions including BLAS, LAPACK, sparse solvers, fast Fourier transforms, vector math, and more.  It also includes a highly optimized version of ScaLAPACK on clusters and delivers significant performance improvements.

Multicore today, and ready for a many-core future
Cluster Studio XE contains the tools and models for multicore programming today, and we are aligned and ready for many-core programming tomorrow. We believe strongly that the growth in cores in our future should not force a developer to split methods of programming. Writing scalable applications is not an easy job, but we can at least make it a single job instead of two jobs. The techniques and tools for scaling on multicore today are the same ones we will employ for many-core as well. Future generations of Cluster Studio XE will include multicore and many-core support throughout. Today, you can rest assured that the multicore support for today’s systems is aligned with this future. We have many-core support in limited usage today with many-core prototype systems (Knights Ferry), and are getting ready for one of the first many-core systems to be delivered (Stampede). Come see us at Supercomputing (in Seattle) to learn more. I’ll be there all week as will many other members of the Intel Software Development Products team. You might even run into Dr. Fortran. Look for us at future software conferences around the world – we really enjoy meeting developers and talking about how we can help!

Intel Cluster Studio XE: try it now

Intel Cluster Studio XE provides the key functionality that MPI programmers need to develop optimal programs for HPC needs. Cluster Studio XE offers a single package that simplifies installation at an economical price for those who want it all. Please try it, and let us know what you think, and what more we can do for you.

Fortran is more popular than ever; Intel makes it FAST

September 24, 2011 - 1:28am

Just this past week, a senior radio telescope astronomer told me about the shift from C++ back to Fortran in his corner of the world. It is all about efficiency. He believes this is a trend that will get stronger as we head to ExaFLOP scale machines at the end of this decade.

I'm sure C++ has nothing to fear, but neither does Fortran.

As far as we can tell, there are more Fortran programmers today than ever. Fortran is almost certainly a smaller % of the market than ten years ago but numerically it has grown. This is because the Fortran population is not growing nearly as fast as programming in general. But it is an important piece of the pie.

And judging by the growth of science and high performance computing, this will continue.

Intel Fortran compilers are leading the way in performance, features and support for new standards.

Dr. Fortran gave a very nice interview about "Why Fortran Matters."  (picked up by HPCwire too)

We've posted performance data with leading results on key benchmarks with the Intel Fortran Compiler which supports Linux, Windows and Mac OS X.

But you do not have to rely on us alone - independent Polyhedron benchmark results show Intel with leading performance on both Intel and non-Intel processors.

Intel is supporting Fortran related standards very strongly:

    • Coarray Fortran (optimized support for both shared memory and distributed memory)

    • much of Fortran 2003 and 2008 (we are working to lead in getting these both implemented)

Our aim is to be the best Fortran compiler available anywhere. While we believe our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, we recommend that you evaluate other compilers and libraries to determine which best meet your requirements.  We hope to win your Fortran business by striving to offer the best performance of any compiler or library; please let us know if you find we do not. For more information about compiler optimizations, see our Optimization Notice.

We know Fortran matters, and we aim to earn your Fortran business.

And if you were thinking Fortran was a dying language - think again.  It is doing quite nicely, thank you.

Punchcard and magnetic tape usage is dying out, though, should you be wondering about that.


Parallel Studio XE SP1: Extreme Computing is a journey to the future, not a detour

September 15, 2011 - 9:30am

I am not a fan of detours. The challenge of scaling to extreme computing is a milestone on the road to everyday computing.

In Justin Rattner's keynote this morning at IDF, we got to see another example of how we make programs written for multicore processors run on many-core processors. Andrzej Nowak from CERN openlab demonstrated "Track Fitter," which looks for tracks of particles in the data from a particle detector, on an Intel MIC software development platform. This online processing near the detectors on the Large Hadron Collider uses advanced algorithms to determine what the real data from the detectors means. The code scales well on multicore processors and it scales well on our Intel MIC software development platform (which we call Knights Ferry). The code demonstrated required no source code changes in moving from running on multicore systems to running on a many-core system. These results by the team at CERN openlab, using our tools, show very well how our investments help software development stay clear of detours.

We introduced Intel Parallel Studio XE 2011 in November 2010. We updated it, with SP1, this month (September 2011). It is designed for "Scaling to Extreme Computing" with the assumption that every method at our disposal should deliver this as a continuous journey using programming methods that will make sense long term. Intel Parallel Studio XE 2011 SP1 really delivers four ways:

    • High performance. To paraphrase Lee Iacocca, "If you can find a higher performance compiler, buy it." We pride ourselves in being the very best we can be for YOUR code. Try it out, and be sure to let us know if we are anything other than #1 for you for your IA (x86 or x86-64) code.

    • Parallel programming models. This is where the no detours part comes in. Programming models for multicore processors today that scale to many-core processors tomorrow (actually their prototypes today - more on that shortly). OpenMP 3.1, Coarray Fortran (part of Fortran 2008), Intel Threading Building Blocks (TBB) 4.0 with the very important Flow Graph feature, and Cilk Plus 1.1 support.

Of course, there are a lot more gems in SP1... including the ability to attach Intel VTune™ Amplifier to a running process instead of requiring that VTune launch the process you want to tune, and the ability to use Intel Parallel Advisor with XE.

And, of course, be sure to take Intel Parallel Studio XE 2011 SP1 for a spin.  Evaluation copies are waiting for you!

Parallelism as a First Class Citizen in C and C++, the time has come.

August 9, 2011 - 12:10pm

It is time to make Parallelism a full First Class Citizen in C and C++.  Hardware is once again ahead of software, and we need to close the gap so that application development is better able to utilize the hardware without low level programming.

The time has come for high level constructs for task and data parallelism to be explicitly added to C and C++.  This will enable Parallel Programming in C and C++ to be fully portable, easily intelligible, and consistently decipherable by a compiler.

Language solutions are superior to library solutions. Library solutions provide fertile grounds for exploration. Intel Threading Building Blocks (TBB) has been the most popular library solution for parallelism for C++. For more than five years, TBB has grown and proven itself. It is time to take it to the next level, and move to language solutions to support what we can be confident is needed in C and C++.

We need mechanisms for parallelism that have strong space-time guarantees, simple program understanding, and serialization semantics – all things we do not have in C and C++ today.

We should tackle task and data parallelism both, and as an industry we know how.

    • Task parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to individual processor cores. Mapping should be the job of tools (including run time schedulers), not explicit programming. Hence, the goal is to shift all programming to tasks, not threads. This has many benefits that have been demonstrated often. A simple fork/join support is fundamental (spawn/sync in Cilk Plus terminology). Looping with a “parallel for” avoids looping to spawn iterations serially and thereby expresses parallelism well.

    • Data parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to SIMD instructions vs. multiple processor cores vs. attached computing (GPUs or co-processors). Mapping should be the job of tools, not explicit programming. Hence, the goal is to shift all programming back to mathematical expressions, not intrinsics or explicitly parallel algorithm decompositions.

The solutions that show the most promise are documented in the Cilk™ Plus open specification.  They are as follows:

For data parallelism:

    • Explicit array syntax, to eliminate the need for explicit looping in a program to loop (serially) across elements to do the same operation multiple times

    • Elemental functions, to eliminate the need for authors of functions to worry about explicitly writing anything other than the simple, single wide, version of functions. This leaves a compiler to create wider versions for efficiency by coalescing operations in order to match the width of SIMD instructions.

    • Support for reduction operations in a way that makes semantic sense. For instance, this can be done via an explicit way to have private copies of data to shadow global data that needs to be used in parallel tasks that are created (the number of which is not known to the program explicitly). Perhaps this starts to overlap with task parallelism…
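
To make the first and third items concrete without relying on any language extension, here is a minimal sketch in standard C++ using std::valarray. It is only an analogy to the explicit array syntax (Cilk Plus writes the first operation as a[:] = b[:] + c[:]); the function names add_arrays and sum_of_products are mine, chosen for illustration.

```cpp
#include <valarray>

// Whole-array addition with no explicit loop in the source. How the
// elements are visited is left to the library and compiler, which is
// what leaves room for SIMD code generation.
std::valarray<double> add_arrays(const std::valarray<double>& b,
                                 const std::valarray<double>& c) {
    return b + c;
}

// A reduction written as one operation instead of a hand-written loop.
double sum_of_products(const std::valarray<double>& x,
                       const std::valarray<double>& y) {
    std::valarray<double> products = x * y;  // element-wise multiply
    return products.sum();                   // reduce to a single value
}
```

Calling add_arrays on {1, 2, 3} and {10, 20, 30} yields {11, 22, 33}. The point is the shape of the source: the program states what is computed across the arrays, and the mapping to hardware is left to the tools.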

For task parallelism:

    • spawn/join to spawn a function, and to wait for spawns to complete

    • parallel for, to specifically avoid the need to serially spawn individual loop body instances – and make it very clear that all iterations are ready to spawn concurrently.
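
The spawn and wait pattern can be sketched in standard C++ with std::async standing in for "spawn" and future::get() standing in for "sync". This is an approximation under stated assumptions, not Cilk Plus syntax: std::launch::async starts a real thread per call, whereas a Cilk-style scheduler maps tasks onto a fixed pool of workers. The serial cutoff of 20 is an arbitrary value I picked to keep the thread count small.

```cpp
#include <future>

// Fork/join Fibonacci: one branch is "spawned", the other runs in the
// calling thread, and the results are joined before returning.
long fib(int n) {
    if (n < 2) return n;
    if (n < 20) return fib(n - 1) + fib(n - 2);  // small subproblems run serially
    std::future<long> left =
        std::async(std::launch::async, fib, n - 1);  // "spawn"
    long right = fib(n - 2);                         // keep working here
    return left.get() + right;                       // "sync"
}
```

Note that the programmer never names a core or a thread count; that mapping is exactly the part the text argues should belong to the tools.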

Like other popular programming languages, neither C nor C++ was designed as a parallel programming language. Parallelism is always hidden from a compiler and needs “discovery.” Compilers are not good at “complex discovery” – they are much better at optimizing and packaging up things that are explicit. Explicit constructs for parallelism solve this and make compiler support more likely. The constructs do not need to be numerous, just enough for other constructs to build upon… fewer is better!

For something as involved, or complex, as parallelism, incorporating parallelism semantics into the programming language improves both the expressiveness of the language and the efficiency with which the compiler can implement parallelism.

Years of investigation and experimentation have had some great results. Compiler writers have found they can offer substantial benefits for ease of programming, performance, debugging and portability.  These have appeared in a variety of papers and talks over the years, and could be the topic of future blogs.

Top of mind thoughts are:

    • Both C and C++ are important.  No solution should be specific to only one.

    • There is strong value in adding some basic task parallelism and data parallel support as a first class citizen into both C and C++. The time has come.

    • Task parallelism: nothing is more proven than the simple spawn/sync and parallel for of Cilk Plus.

    • Data parallelism: nothing is simpler than extending syntax to make data parallel operation explicit via array operations such as a[:]=b[:]+c[:]   Fortran 90 added similar capabilities over two decades ago!

    • Data parallelism: elemental functions have an important role and should be included

    • Task parallelism goal: shift parallel programming to explicit language features, making task parallelism easy to express and exploit, able to be optimized by a compiler, and more easily tested and debugged for data races/deadlock.

    • We need strong space-time guarantees in parallelism constructs.

    • Data parallelism goal: shift programming to EXPLICIT and EASY-TO-FIND (exploit) data parallelism, so all varieties of hardware can be addressed

    • Everything proposed here is incredibly easy to teach. The power that can be placed underneath via a compiler is a big bonus, of course! That power makes all these very compelling. The data parallelism is immediately convincing by its compact form, but the task parallel constructs are convincing as well.

None of this is radical – and none of it need be proprietary.

If we don’t get carried away adding other stuff, KISS, we can add these and make a fundamental and important advance for task and data parallelism in C and C++… the languages that lead the evolution to parallelism, and are becoming more (not less) important in the future.

- james


New Parallel Studio: Intel Parallel Studio 2011

September 14, 2010 - 4:10pm

This month, we introduced Intel Parallel Studio 2011. It is a very worthy successor to the original Intel Parallel Studio by expanding both on the tooling and the parallel programming models it offers.

On the tooling, we have the Intel Parallel Advisor tool. It is an exciting tool that is a joy to use when considering where to add parallelism into an existing program. It has a straight-forward interface to find "hot spots" and add annotations about what you are considering doing to the program. Specifically, you can say "I'm thinking of doing this region in parallel" and "I'll put some sort of a lock around this code." Adding such annotations is done with a few mouse clicks and no work on syntax. Parallel Advisor then offers interactive estimates of speed-up and options to improve, as well as feedback on the correctness of the algorithm. If you forget a lock, you may see great speed-up estimates but will get exact feedback on where race conditions will exist (errors!!!). Having this tool can change lives of programmers adding parallelism to programs. The five steps to success in the tool are:

    1. Survey Target – This step helps you focus on the hot call trees and loops as locations to experiment with parallelism.

    2. Annotate Sources – Here you add Advisor annotations into source code to describe the parallel experiment.  You do this without modifying your source code!

    3. Check Suitability – This step evaluates the performance of the parallel experiment. It displays a performance projection for each parallel site and shows how each impacts the entire program.  This way you can pick the areas that have the most performance impact.

    4. Check Correctness – Identifies data issues (races) of each parallel experiment so you can fix these before committing your changes to code.

    5. Add Parallel Framework – After you have fixed any correctness issues, you replace the Advisor framework with real parallel code using a variety of methods.

The other BIG addition with Intel Parallel Studio 2011 is the expansion of programming model support. We have introduced an umbrella project called Intel Parallel Building Blocks (Intel PBB). It is a collection of three offerings that include and build upon Intel Threading Building Blocks (Intel TBB). Intel TBB is in its fifth year and is more popular than ever. Intel TBB, by design, leaves two opportunities for us to address with complementary models. First, we introduce Intel Cilk Plus to show what can be done by implementing extensions in a compiler instead of the compiler-independent (and highly portable) approach used by Intel TBB. Secondly, we introduce Intel Array Building Blocks (ArBB) to tackle data parallelism directly. Specifically, Intel ArBB focuses on using SIMD parallelism (such as SSE and AVX) in conjunction with multicore parallelism. In other words, it takes simple-looking programs and automatically vectorizes and parallelizes the work to be done. Previously, this was best done by making your source code complex and difficult to read.

Intel Cilk Plus is the product result of combining our compiler efforts with the team acquired from Cilk Arts a year ago, all based on the award winning Cilk research that began around 1995 at MIT.

Intel Array Building Blocks is the result of combining the Intel Ct research project with the RapidMind team, also acquired a year ago. The product experience of the RapidMind team forms a solid foundation for this new offering. Intel ArBB is "beta" - and anyone can ask to join our beta.

Intel Parallel Studio 2011 maintains full compatibility with Microsoft Visual Studio 2005 and 2008, and adds support for 2010 which was released by Microsoft earlier this year.

I look forward to experiences and feedback. There is a lot more to write about the gems of this release... I'll work on posting more thoughts and experiences in the future.

By the way - I was on sabbatical this summer... hence my being behind on answering email and calls.  I'm catching up. Ask again if you don't hear back soon!

SP1 for Intel Parallel Studio - service pack worth installing!

November 19, 2009 - 3:48pm

Intel® Parallel Studio Service Pack 1 is now available, adding support for Windows* 7.

SP1 is well worth downloading and installing - here are some of the reasons:

    1. Parallel Inspector and Parallel Amplifier can be driven (for automating test suites) from the command line now.

    2. Bug fixes - of course - not many issues needed fixing, but you may appreciate the bugs that were found and fixed!

    3. Windows 7 support (Parallel Studio came before Windows 7, now that it is released - we had a few things to update)

    4. TBB 2.2 and other improvements to align with the upcoming Microsoft Visual Studio 2010

I'm sure there are more - these are the highlights as I see them.

Download SP1 - you'll be glad you did!

See the release notes for more details - skip the main document if you want to read about what is new and useful - read the three individual documents.

Along for the ride

October 9, 2009 - 11:26am
Some of my most memorable and influential experiences happened because someone invited me along for the ride.

Presentations at IDF about Software Tools, available for download

September 23, 2009 - 2:52pm

Today, at Intel's Developer Forum, we have taught many classes on our tools, and have a few left to go.

If you could not join us in San Francisco, the presentations are available online for downloading at

Intel OpenCL solution - Rapidmind?!

September 4, 2009 - 9:42am

I am just wondering if the Rapidmind acquisition has anything to do with the adoption of OpenCL? Are there any demos of Rapidmind products? I read the whitepapers and it does appear to be quite a flexible and powerful solution.

Is Intel going to launch an OpenCL implementation soon?

Version 2.2, Intel Threading Building Blocks, worth a look

August 4, 2009 - 7:11am

If you write C or C++ code, and you haven’t given Intel Threading Building Blocks (TBB) a try, you really should. Intel Threading Building Blocks has emerged as the most popular high level programming method for writing parallel programs (see Evans Data Corp). The low level methods (using pthreads or Windows threads directly) popular before high level methods existed should be avoided by those writing new parallel programs because of their substantial learning curve, plus their high costs to create and maintain.

C programmers will want to take another look at Intel Threading Building Blocks (TBB), which has been popularized primarily by C++ programmers. Because C++ didn't have lambda functions, too much of C++ templates showed through when coding common operations. It was intimidating unless you knew and liked C++ templates. With version 2.2 and the latest compilers, lambda functions make coding with Intel TBB reasonable for C programmers too (using a C++ compiler of course!)

Whether you are new to Intel TBB, or a current user, you’ll want to know about the latest version – 2.2. Intel TBB 2.2 can help you improve the scalability and portability of your code while being productive writing parallel programs.

Version 2.2 of Intel TBB is now available, in both the commercial and open source releases. These are built from identical sources – the only real difference is the license and support offerings. Get a copy and learn more at (open source) or (commercial).

Small version number change, but lots to offer

Intel TBB 2.2 maintains the functionality and platform support of previous versions and adds numerous feature and performance improvements, including full support for the lambda capabilities of the new C++ draft standard (C++0x) and more flexibility for developers to redistribute with their applications. Autodesk Maya and Epic Games Unreal Engine are among the applications that will be reshipping some or all of Intel TBB 2.2 to support their developers.

I’m not completely used to the small version increments common with open source projects. I’d have no trouble considering this version 3.0 or 4.0 of TBB as a commercial-only product. Yet 2.2 seems fitting from a point of being modest – a bit understated.

This release is packed with a bunch of additions, which continue to show the maturity you’d expect from a package as popular as Threading Building Blocks has proven itself to be. Users give great feedback, and that leads to improvements.

Automatic memory allocator replacement available

The memory allocator is one of the most popular features of Intel TBB. However, it can be time consuming to replace your own memory allocator calls. Version 2.2 uses a dynamic instrumentation method on Windows and the LD_PRELOAD mechanism on Linux to offer automatic memory allocator replacement throughout your application.

Ron Henderson at DreamWorks Animation summed it up: "The Intel® TBB malloc was an important tool in achieving good parallel speedups for our threaded applications, and a drop-in replacement for the memory allocator in the C standard library."

Memory allocator faster than ever

Version 2.2 extends the performance lead of Intel TBB’s memory allocator over the competition by delivering even better large-block (over 8K in size) allocation performance.

Scaling of scheduler enhanced significantly

Version 2.2 features a reworked task scheduler that behaves more like an ideal Cilk-style scheduler, yielding even more scalable behavior. True to the promise of using Intel TBB - the benefits of this work come to programs written using Intel TBB without requiring any code changes. Version 2.2 also has improvements to the affinity partitioner, and changes the default for loop templates from the simple_partitioner to the easier to use and adaptive auto_partitioner.

Automatic initialization available

Version 2.2 no longer requires an explicit initialization. Users of prior versions have told us that in a large application it is not easy to initialize in the right place. Version 2.2 takes care of automatically initializing the scheduler when it is first needed.

Parallel algorithms enhancements

  • Version 2.2 has a new parallel_invoke for running a group of functors simultaneously in parallel.
  • Version 2.2 has a new parallel_for_each and a simplified parallel_for interface to make writing some common for loops easier.
    • parallel_for_each(first, last, f) is like parallel_do(first, last, body) but without the feeder functionality that allows adding more work items. In other words, tbb::parallel_for_each is the parallel equivalent of std::for_each.
    • The new overload parallel_for(first, last, step, f) takes an integer first (auto i=first), last (i<last), and step (i+=step) and calls the given function f(i) for each value of i. It handles simple cases easily, especially with the use of lambdas. The original interface parallel_for(range, body, partitioner) has been retained. It's more general but also more complicated to write, even with the use of lambdas.
  • Intel TBB's pipeline can now perform DirectX, OpenGL, and I/O parallelization by using the new thread_bound_filter feature. There are certain types of operations that require that they are used from the same thread every time and by using a filter bound to a thread, you can guarantee that the final stage of the pipeline will always use the same thread.
  • Exception safety support has been expanded significantly. Prior versions had support for exception propagation only in parallel_for, parallel_reduce and parallel_sort. Support is expanded to include parallel_do, the new parallel_invoke and parallel_for_each as well as the new forms of parallel_for and parallel_reduce.
  • Lambda support has been extended to cover not only parallel_for, but also parallel_reduce, parallel_sort, and the new parallel_for_each and parallel_invoke algorithms. In addition, the new combinable and enumerable_thread_specific classes for thread local storage can accept lambdas. The documentation and code examples are expanded to show lambdas in action. The Intel® Compiler 11.0 and Intel® Parallel Studio offer lambda support today, and Microsoft will support it in Visual Studio 2010 (it is in the beta currently). Based on feedback, I expect lambdas to be easily one of the most used features of the new C++ standard. It certainly makes code using Intel TBB easier to read – hence our long desire to see them a part of C++ (there is a section in my 2007 book about the desire for lambdas – we are very happy to have them now!). See "Hello Lambdas" C++ 0x, a quick guide to Lambdas in C++ for more background on lambdas, and see parallel_for is easier with lambdas, Intel Threading Building Blocks for more on parallel_for and lambdas.
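
To show what the (first, last, step, f) loop form means when paired with a lambda, here is a minimal stand-in written with only the standard library. It is not TBB and not how TBB schedules work: parallel_for_sketch and double_even_slots are hypothetical names of mine, and the fixed round-robin split across threads is a simplification chosen so the sketch compiles without TBB installed.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in, NOT TBB's implementation: calls f(i) for
// i = first; i < last; i += step, spreading iterations over a few threads.
template <typename Index, typename Func>
void parallel_for_sketch(Index first, Index last, Index step, const Func& f) {
    if (step <= 0 || first >= last) return;
    const Index count = (last - first + step - 1) / step;  // number of iterations
    const Index hw = static_cast<Index>(
        std::max(1u, std::thread::hardware_concurrency()));
    const Index workers = std::min(count, hw);
    std::vector<std::thread> pool;
    for (Index w = 0; w < workers; ++w) {
        pool.emplace_back([=, &f] {
            // Worker w takes iterations w, w + workers, w + 2 * workers, ...
            for (Index k = w; k < count; k += workers) f(first + k * step);
        });
    }
    for (std::thread& t : pool) t.join();  // wait for every iteration to finish
}

// Usage with a lambda: double every other element of v (indices 0, 2, 4, ...).
void double_even_slots(std::vector<int>& v) {
    parallel_for_sketch<std::size_t>(0, v.size(), 2,
                                     [&v](std::size_t i) { v[i] *= 2; });
}
```

The call site is the part worth noticing: the loop bounds and the body sit together in one expression, with no separate functor class to write. That is the readability gain lambdas bring to this style of interface.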

Concurrent container enhancements

  • Thread local storage, which is portable across platforms, is now possible with the new enumerable_thread_specific and combinable classes. This can be useful for algorithms that reduce shared memory contention by creating local copies and then combining results later through something like a reduce operation.
  • Unbounded non-blocking interface for concurrent_queue and new blocking concurrent_bounded_queue. Some operations require synchronization and may or may not block depending on whether or not the queue is bounded. To get the best behavior, use the unbounded form if you need only basic non-blocking push/try_pop operations to modify the queue. Otherwise use the bounded form which supports both blocking and non-blocking push/pop operations.
  • Simplified interfaces for concurrent_hash_map that make it easier to utilize for common data types using the new tbb_hasher.
  • Improved interfaces for concurrent_vector that remove a common extra step needed to use the vector output.
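
To make the "local copies, then combine" idea from the thread local storage item concrete, here is a minimal sketch. It deliberately uses only standard C++ threads rather than the TBB classes, so the function name sum_with_local_copies and the chunking scheme are illustrative assumptions, not Intel TBB API. The pattern is the one that combinable and enumerable_thread_specific are meant to package: each thread accumulates into private storage, and the results are merged once at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` using `num_threads` workers. Each worker accumulates into a
// private variable and writes once to its own slot in `partial`, so no lock
// or atomic is needed while accumulating.
long long sum_with_local_copies(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);   // one local result per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            long long local = 0;                      // thread-private accumulator
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;                       // single write to this thread's slot
        });
    }
    for (auto& w : workers) w.join();

    // The "combine" step: a serial reduction over the per-thread results.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The shared vector is touched only once per thread, which is where the reduction in memory contention comes from; a TBB version would let the library manage the per-thread slots instead of indexing them by hand.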

Redistribution is easier

The licensing of the commercial version has been modified to allow redistribution of required DLLs and header files. This means you can redistribute DLLs and header files from version 2.2 with your application, to enable your customers to write Intel TBB code that will use the master application's DLLs and therefore the same infrastructure.

Also, Intel is offering additional redistribution rights for commercial customers who need more than just the DLLs and header files. If that is of interest, drop us a line and we’ll talk.

Of course, none of this really matters for the open source version, but if the nuisances of using the commercial version have you wanting more, you should ask, as Intel is trying to help out.

This effectively makes Intel TBB freely available for the strong community of developers that support some of the world's best software. Gordon Bradley, Maya Performance Team Lead at Autodesk, summed it up: "The Maya team has successfully used Intel's TBB technology to internally parallelize Maya for several releases. Now thanks to Intel, TBB 2.2 lets Maya plug-in developers access the same advanced parallelism features that we've used at no additional charge."

Current users have a little work to do to upgrade

There are some changes you may need to make to move from prior versions of Intel TBB to the new 2.2 version. Personally, I don’t like doing anything to upgrade from one version to another – but sometimes it is necessary. You can simply add "#define TBB_DEPRECATED 1" to your code, and the old interfaces remain available to you (at least for now) – or adjust to the following changes:

  • auto_partitioner() is now the default instead of simple_partitioner(). To this I say: it’s about time! When I wrote my book on Intel TBB, I included auto_partitioner despite some concerns from the TBB team that it was new and somewhat experimental! Well – the writing was on the wall… this was the way to go! Now it’s the default. Of course, if you specified a perfect grain size, you might see a slow-down. In such a case, you should specify simple_partitioner() explicitly and drop us a note telling us about it – we’d like to know if the auto_partitioner() is not good enough. Or, you can use TBB_DEPRECATED to force the old default.
  • Concurrent queue API changes: four interfaces have been renamed. Change pop_if_present to try_pop, push_if_not_full to try_push, begin to unsafe_begin and end to unsafe_end to be consistent with the latest API.
  • Concurrent vector API changes: renamed compact to shrink_to_fit, and changed three interfaces to all consistently have return types of iterator. Previously grow_by returned size_type, grow_to_at_least returned nothing, push_back returned size_type.
  • The notion of task depth has been eliminated, so the following four members of class task have no effect: depth_type, depth, set_depth and add_to_depth. These have no effect in 2.2 even if you use TBB_DEPRECATED, but are nonetheless defined to permit their use without error messages.
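
To see why a hand-picked grain size matters under simple_partitioner, it helps to model how a range gets carved up. The sketch below is a standalone model in standard C++, not TBB code: split_by_grainsize and Chunk are hypothetical names, and the rule it implements (keep halving a range while it is larger than the grain size) is my understanding of how a blocked range behaves with simple_partitioner.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Chunk = std::pair<std::size_t, std::size_t>;   // half-open [begin, end)

// Recursively halve [begin, end) until a piece is no larger than `grainsize`,
// mimicking how a grain-size-driven partitioner carves up a range.
// Assumes begin <= end.
void split_by_grainsize(std::size_t begin, std::size_t end,
                        std::size_t grainsize, std::vector<Chunk>& out) {
    assert(grainsize > 0);
    if (end - begin <= grainsize) {      // small enough: run as one chunk
        out.emplace_back(begin, end);
        return;
    }
    const std::size_t mid = begin + (end - begin) / 2;
    split_by_grainsize(begin, mid, grainsize, out);
    split_by_grainsize(mid, end, grainsize, out);
}
```

With 1000 iterations and a grain size of 100, this model produces 16 chunks of 62 or 63 iterations each. The chunk size is fixed entirely by the number you chose, which is great when that number is perfect and wasteful when it is not; auto_partitioner instead picks chunk sizes adaptively at run time.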

Try it today!

Get a copy and learn more at (open source) or (commercial.)

Cilk + Intel

July 31, 2009 - 7:54pm

Parallelism can be smooth as Cilk? (pronounced "Silk")

If you've visited today, you see that the Cilk engineering team has joined Intel. I was surprised how fast I've gotten questions from a note on the Cilk web site on a Friday afternoon - it happened only minutes after the posting!  I've been a follower of Cilk technology for some time now - and it is exciting to have the opportunity to work with the team that is joining us! Working together will result in even more options for parallel programming!

Updates today for our compilers, libraries and cluster toolkits

June 23, 2009 - 9:29am

Today we released updates for our C++ and Fortran compilers, our Intel Math Kernel Library (MKL) and Intel Integrated Performance Primitives (IPP) libraries and Cluster toolkits. Noteworthy additions include outstanding performance enhancements, support of Intel® Advanced Vector Extensions (AVX) and inclusion of some elements that debuted in Intel® Parallel Studio last month.

I can share some notes on the features, including our AVX and AES support in the tools (which I believe is the first product support in tools for Intel and compatible processors), our adaptation of some of the new features from Parallel Studio to Linux and Mac OS X, and really great tuning of our performance-leading MPI library.

The specific new product versions are:

    • Intel® Professional Edition Compilers 11.1 (Fortran & C/C++, for Windows, Linux, Mac OS X)

    • Intel® Integrated Performance Primitives (IPP) 6.1  (for Windows, Linux, Mac OS X)

    • Intel® Math Kernel Library (MKL) 10.2 (for Windows, Linux, Mac OS X)

    • Intel® Cluster Toolkit, Compiler Edition 3.2.1 (for Windows, Linux)

    • Intel® MPI Library 3.2.1 (for Windows, Linux)


If you've not moved from the 10.x to 11.x compilers, you will want to consider doing that. Aside from new functionality such as parallel debugging, OpenMP 3.0 and AVX support, you are very likely to see a pleasing performance boost, especially on the latest Intel and compatible processors. Several customers have told us of 10% performance gains in moving from 10.x to 11.1. While I can't promise such gains to everyone, you have a reasonable shot at seeing performance gains based on the enhancements we made and the feedback we have been getting from users.

Likewise, moving from version 9.x to 10.x for the Intel Math Kernel Library (MKL) has shown up to 45% gains in key routines. This is incredible given how consistently MKL is the library to beat in performance - a leadership position our MKL developers are not just maintaining, they are enlarging it! Of course, you don't have to take my word for it that we do this well for Intel and compatible processors - you can find reviews on the web including a recent one at

Of course, Intel Integrated Performance Primitives (IPP) and Intel MPI library have similar success stories - and you will want to stay up-to-date for the latest performance. With IPP 6.1, task parallelism usage gains give as much as 250% multicore performance scaling while the PNG codec added to Unified Image Codec framework offers 300% faster encoding than the open source reference version. Intel MPI 3.2.1 offers industry leading performance with low latency and high bandwidths, and now uses direct inter-process memory copy for increased bandwidth on Windows systems.

Intel® Advanced Vector Extensions (AVX) support

We have offered support for developing AVX (AVX is a 256 bit instruction set extension to SSE and is designed for applications that are floating point intensive) from for about a year now, and we've enjoyed the feedback on these offerings and input on our future direction. One recurring request has been for us to make the support for AVX a feature in our compilers and library products now, before the processors supporting AVX are available to purchase. This makes it possible to create and ship software now that is ready to utilize processors with AVX support. We have validated our code using simulators for our future processors (you can get the Intel® Software Development Emulator from

Many software vendors will want to do some testing on real processors before they ship - and having these compilers and libraries now makes that easy and realistic. There is plenty of time to incorporate our latest versions into your build systems, validate them for usage, and be fully ready for testing with processors using AVX. We've been reminded often that it's naive to expect that releasing compilers and libraries concurrent with new processors shipping can be adopted quickly. We have listened and acted on this feedback!

Tilo Kühn at Maxon Computer said, “We’ve been enthusiastically using the new version of Intel® C++ Compiler Professional Edition that includes support for Intel® Advanced Vector Extensions (Intel AVX.) Being able to performance tune our software well in advance of processor availability gives us a major development head start to ensure that our Cinebench product will be ready when the first Intel AVX-enabled processor is delivered.”

Performance using AVX can be incredible, but it is important to know its limits. In general, code using AVX for data parallel problems should outperform code using SSE. That is in general the key thing to know about using AVX - it should do at least as well as code using SSE. This does, of course, assume you overcome the overhead of any alignment and loop setup/tear down. If you have short vectors that are a multiple of 128 bits in length but not 256 bits, you may be better off with SSE. That is understandable. Aside from that, AVX should win in performance - which begs the question "by how much?" The answer, of course, is "it depends." It depends on the exact processor design, your algorithm, and your system design. The highest gains will come from code with intense computations running out of data cache. It isn't hard to imagine gains on such code approaching the theoretical doubling that moving 128 bits to 256 allows, but that will depend on the processor and system design. The rest of the gains will depend on the factors mentioned. With AVX generally better than SSE, the migration to AVX is easy to choose. Our compilers and libraries make it even easier by easily producing both code paths (support for SSE-only, and for AVX+SSE).
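
The short-vector caveat is easy to check with a little arithmetic. The sketch below uses no intrinsics at all: plan_loop and VectorPlan are hypothetical names, and all it does is count how many full registers a loop fills and how many elements are left over for remainder handling at a given register width.

```cpp
#include <cassert>
#include <cstddef>

struct VectorPlan {
    std::size_t full_vectors;   // iterations that fill a whole register
    std::size_t leftover;       // elements left for remainder handling
};

// Split `n` elements into full registers of `register_bits` plus a remainder.
VectorPlan plan_loop(std::size_t n, std::size_t element_bits, std::size_t register_bits) {
    assert(element_bits > 0 && register_bits >= element_bits);
    const std::size_t lanes = register_bits / element_bits;   // elements per register
    return {n / lanes, n % lanes};
}
```

For 12 single precision floats (384 bits), plan_loop(12, 32, 128) gives three full SSE vectors and nothing left over, while plan_loop(12, 32, 256) gives one full AVX vector plus four leftover elements that need separate handling. For 1024 floats the AVX plan is simply half as many iterations with no remainder, which is where the theoretical doubling comes from.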

Advanced Encryption Standard (AES)

In addition to the anticipated AVX support, we have our earliest Advanced Encryption Standard (AES) support. Unlike AVX, I expect AES will be used by very few developers directly - but our compilers have intrinsics and inline assembly support for AES. Future versions of our Intel IPP library cryptographic algorithms will use AES, but those did not make it in the current release.

Ripped from Parallel Studio

I know the big news of Intel Parallel Studio last month created a few questions like "when will you have that for Linux? or Mac OS X?" I assure you - you will see many features adapted in time! Now - some of that is here now. Specifically, compiler and library features - including debugger extensions - from Intel Parallel Studio have arrived!

The Intel® Parallel Debugger Extensions have been added to Intel® C++ Compiler, Professional Edition for Windows. These allow serializing parallel regions, finding data sharing violations, breaking on re-entrant functions, viewing all active thread structures, OpenMP* task teams and trees, barriers, locks, and waits. Of course, this works in current versions of Visual Studio (both 2005 and 2008).

The Intel C++ Compiler 11.1 offers all the functionality of Intel Parallel Composer, plus the AVX support, on Windows, Linux and Mac OS X. We've added Eclipse CDT 5.0 support, SLES11 support, Native Intel®64 compiler for Mac OS X, and we support the new Mac Xcode IDE ability to relocate the tools installation directories.

If you are wondering about updates for Intel Thread Checker and Intel VTune Performance Analyzer - you'll see that we are updating the compilers, libraries and cluster tools with this release - while analysis and tuning tool updates are still in the works. Rest assured we are working to make it a matter of "when" not "if."

Math Kernel Library

Like the compilers, we have performance improvements in many areas (reason enough to upgrade), AVX support, and well tuned support for the latest Intel® Xeon® 5500 processors.

FFT routines have been enhanced by adding scaling factors 1/N, 1/sqrt(N), adding DFTI_FORWARD_SIGN, implementing radices mix of 7, 11 and 13 and optimized real data transforms in Cluster FFT. All this with strong support of FFTW interfaces. We've added single precision support in PARDISO (Parallel Direct and Iterative Solvers) and complete support for LAPACK 3.2. For .NET users, we have included .NET/C# examples for calling MKL functions.
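
If the 1/N and 1/sqrt(N) scaling factors look arbitrary, a tiny experiment shows what they are for. The sketch below is a naive O(N^2) DFT in standard C++, written only to illustrate the scaling conventions; it is not the MKL API, and the names dft, max_error and Signal are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Signal = std::vector<std::complex<double>>;

// Naive O(N^2) DFT. `sign` is -1 for forward, +1 for backward;
// every output element is multiplied by `scale`.
Signal dft(const Signal& x, int sign, double scale) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    Signal out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle =
                sign * 2.0 * pi * static_cast<double>(j * k) / static_cast<double>(n);
            sum += x[j] * std::polar(1.0, angle);
        }
        out[k] = scale * sum;
    }
    return out;
}

// Largest element-wise distance between two equally sized signals.
double max_error(const Signal& a, const Signal& b) {
    assert(a.size() == b.size());
    double worst = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        worst = std::max(worst, std::abs(a[i] - b[i]));
    return worst;
}
```

An unscaled forward transform followed by an unscaled backward transform returns N times the input. Applying 1/N on one side, or 1/sqrt(N) on both sides, makes the round trip return the original signal; having the library apply the factor saves a separate pass over the data.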

Fortran 2003 and beyond

For Fortran, as always we have focused aggressively on performance while implementing features from Fortran 2003 in the order our customers have encouraged. Version 11.1 adds most of the "object oriented" features. As of version 11.1, we have a majority of Fortran 2003 features implemented, with only a few smaller items remaining plus two bigger items: parameterized derived types (PDT) and user-defined derived type I/O (UDDTIO). These two features are demanding to implement and in low demand. They are also not supported in most Fortran compilers, which keeps them out of anything portable. While we plan to support these eventually, we expect to finish off all other issues and embark on some of Fortran 2008 first (much stronger customer demand).

The next revision of the Fortran standard is called Fortran 2008, with an expected publication in mid-2010.  While there are many small changes in Fortran 2008, there are a few new features for which we’ve already received requests - coarrays, submodules and bumping up array dimensions from 7 to 15. We are gathering feedback on these now.

Intel Integrated Performance Primitives (IPP)

Like MKL and the compilers, we have performance improvements in many areas, AVX support, and well tuned support for the latest Intel® Xeon® 5500 processors. From early simulator-based evaluation, a select set of 65 optimized functions showed an average 50% speedup.

Intel IPP has loaded up on new functionality including a novel way to beat back Amdahl's Law and get better cache utilization at the same time with the Deferred Mode Image Processing (DMIP) Framework. DMIP is worth a look - it's a new feature to help deliver pipelined parallelism. This reduces the serialization effect you normally see when using repeated library calls, and through better cache and multicore utilization it dramatically improves performance of pipelined image operations, especially on larger images.  The 6.1 version introduces task parallelism and as much as 250% multicore performance scaling.
PNG codec added to Unified Image Codec framework offers 300% faster encoding than the open source reference implementation. This is very important as PNG is replacing GIF in usage around the world. Visual Studio integration improvements allow IntelliSense to autocomplete function names and expose parameter details for faster, more accurate inclusion of Intel IPP functions. Texture compression, advanced lighting, and 3D geometric super sampling functions have been added for improved image processing performance. Improved data compression deflate/inflate APIs provide better zlib compatibility and superior performance. New cryptography functions (RSA_SSA1.5,RSA_PKCSv1.5) have been added to support the HDCP 2.0 standard.
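
The deferred, pipelined idea behind DMIP can be sketched without the library. The code below is standard C++ and is not the DMIP API: brighten, scale, run_eager and run_tiled are hypothetical stand-ins, and the tile size is an assumed tuning knob. The eager version mimics repeated library calls that each sweep the whole image; the tiled version pushes one cache-sized slice through every stage before moving on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<float>;

// Two hypothetical per-pixel stages standing in for library calls.
inline float brighten(float p) { return p + 10.0f; }
inline float scale(float p)    { return p * 0.5f; }

// Eager style: each stage sweeps the whole image before the next starts.
Image run_eager(Image img) {
    for (float& p : img) p = brighten(p);
    for (float& p : img) p = scale(p);
    return img;
}

// Deferred style: apply every stage to one tile while it is still in cache.
Image run_tiled(Image img, std::size_t tile) {
    assert(tile > 0);
    for (std::size_t start = 0; start < img.size(); start += tile) {
        const std::size_t stop = std::min(img.size(), start + tile);
        for (std::size_t i = start; i < stop; ++i) img[i] = brighten(img[i]);
        for (std::size_t i = start; i < stop; ++i) img[i] = scale(img[i]);
    }
    return img;
}
```

Both functions produce the same pixels. The difference is that the tiled form touches each slice once while it is hot in cache, and because the tiles are independent they could be handed to different cores, which is the kind of pipelined parallelism the paragraph above describes.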

.NET users will find an intuitive programming layer for both C++ & .NET image processing application development.

Intel MPI

Intel® MPI Library 3.2.1 leads with low latency and high bandwidth, and now uses direct inter-process memory copy for increased bandwidth on Windows. It features improved automatic process pinning for more performance. Scalable mpdboot startup is offered for faster cluster application launching.

Intel Cluster Toolkit Installations easier on Windows and Linux

Windows users will find Active Directory based user authorization for seamless integration into Windows* environment. Linux users will find full support of Linux* Standard Base (LSB) compliant RPMs.

Of course, the Cluster Toolkits contain all the wonderful compiler and library enhancements mentioned earlier and support for the Intel Cluster Ready program continues as well!

Links to more information

There is more information about AVX and AES at

Evaluation copies of everything I've mentioned are downloadable from - you can download the new compilers and libraries to evaluate the new versions starting today!

Parallelism tidbits heard at PDC

October 27, 2008 - 6:41pm
Here at Microsoft's Professional Developers Conference, I'm busy attending every session they have on parallelism.  Microsoft engineers deserve high marks for talking about parallelism at PDC very well - not hyping it, not ducking it - very good presentations.  I suspect much of it will end up on Channel 9 and will be worth watching if you were not fortunate enough to be at PDC 2008.

How Software Is Built - includes interviews with me now...

July 31, 2007 - 5:17pm

There is an interesting project called 'How Software Is Built'- which I enjoy reading. They did some video interviewing at OSCON when I was there - and it is now posted at

Open Source - TBB 2.0

July 24, 2007 - 10:25am

Today we announced Intel Threading Building Blocks (TBB) 2.0 including the creation of an open source project for TBB at We remain committed to our TBB commercial product and support, but now we have the added dimension which open source and community contributions will add. This will help us follow through on our goals to be very broadly available for most (if not all) processors, OSes and compilers.

Nutshell Book on Intel Threading Building Blocks Now Available

July 15, 2007 - 9:33am

My book on Intel Threading Building Blocks (TBB) is now available from O'Reilly Media.

It was a lot of work, and many people helped with the book, and I'm very pleased with the results. I hope this book helps make TBB more popular and easy for more people to use and understand. Even if you have been using TBB - I think I've provided insights, examples and motivational information that will be very useful to you.