Posted by reinders on Tuesday June 14, 2016 at 09:26:16
I have created a BibTeX file with an entry for every chapter of the two "Pearls" books, the Xeon Phi books (the original Knights Corner edition and the Knights Landing edition), and the Structured Parallel Programming book. I also included entries for all the other books I have been involved with, including TBB, VTune, and Multithreading for Visual Effects, and much more.
The entries include DOI numbers for the chapters of the Xeon Phi books, the two "Pearls" books, and the Structured Parallel Programming book.
This is a resource for the many people who have contributed to these books, and anyone who would like to cite these works.
I will gladly take feedback, and update the file from time to time based on feedback and new publications.
This "version 4" now includes all page numbers and DOI information for our latest book covering Knights Landing.
Posted by reinders on Sunday May 29, 2016 at 03:30:24
We have finished our latest book project: "Intel® Xeon Phi™ Processor High Performance Programming, Knights Landing Edition," by Jim Jeffers, James Reinders, and Avinash Sodani. Books are available as of mid-June 2016.
Our book has three sections: I. Knights Landing, II. Parallel Programming, III. Pearls. The book has an extensive Glossary and Index to facilitate jumping around the book.
Section I: Knights Landing. Focuses on Knights Landing itself, diving into the architecture, the high bandwidth memory, the cluster modes and the integrated fabric.
Chapter 1: Introduction. Introduces many-core programming. Explains why many-core is important, how to measure readiness for many-core, and the importance of tuning for performance on multi-core and many-core. Parallel programming models play a key role. The dual-tuning advantage of many-core (with multi-core) is introduced, and is validated in Section III of the book.
Chapter 2: Knights Landing Overview. Introduces Knights Landing, a many-core processor that delivers massive thread and data parallelism with high memory bandwidth. Knights Landing is the second generation of Intel® Xeon Phi™ products using a many-core architecture, which both benefits from, and relies on, parallel programming. Key new innovations such as MCDRAM, cluster modes and memory modes are explained at a high level.
Chapter 3: Programming MCDRAM and Cluster Modes. The essentials of programming to utilize the high bandwidth memory known as MCDRAM, and to utilize the cluster modes. The memkind library, and the use of numactl, are discussed.
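As a taste of the numactl side (a sketch of common invocations, not from the book; `./myapp` is a placeholder name): in flat memory mode, MCDRAM appears as a separate, CPU-less NUMA node, so even an unmodified program can be placed in it from the command line:

```shell
# List NUMA nodes; in flat mode the MCDRAM shows up as a
# CPU-less node (commonly node 1).
numactl --hardware

# Bind all of the program's allocations to MCDRAM (node 1);
# allocations fail if the MCDRAM fills up.
numactl --membind=1 ./myapp

# Prefer MCDRAM, falling back to DDR when it is exhausted.
numactl --preferred=1 ./myapp
```

The memkind library offers finer control from within the code, such as placing individual arrays in MCDRAM with hbw_malloc while the rest of the program uses DDR.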
Chapter 4: Knights Landing Architecture. Dives deeply into the Knights Landing architecture. Describes the tile and core architecture, as well as the cluster modes and memory modes supported by Knights Landing.
Chapter 5: Intel Omni-Path Fabric. Details on the next generation fabric with heritage from the Intel® TrueScale product line and the Cray Aries interconnect. Some versions of Knights Landing have this fabric integrated on-package.
Chapter 6: μarch Optimization Advice. Tuning advice that is specific to the Knights Landing design, which is known as the microarchitecture and is abbreviated as μarch. Focuses on tuning advice arising specifically from the Knights Landing μarch design when compared with the Knights Corner μarch (found in the first generation Intel Xeon Phi products) or the μarch of a recent Intel® Xeon® processor.
Section II: Parallel Programming. Focuses on application programming with consideration for the scale of many-core.
Chapter 7: Programming Overview for Knights Landing. Discusses the keys to effective parallel programming. While getting maximal performance from Knights Landing is largely the same challenge as with any processor, the challenge of parallel programming remains. The basics of managing parallelism at the domain, thread, data and locality levels are discussed. The provocative “To Refactor, or Not to Refactor” question is examined.
Chapter 8: Tasks and Threads. Discusses the key techniques (OpenMP, Fortran 2008, TBB, and MKL) expected to be the most popular on Knights Landing. Emerging trends and options are discussed briefly. The compatibility of Knights Landing means that much more is possible than can be covered in a short chapter.
Chapter 9: Vectorization. Discusses the AVX-512 vector parallel capabilities of Knights Landing and introduces how to utilize them. This chapter gives the fundamentals, which are the same techniques found in almost any tutorial or reference on vectorization for processors.
Chapter 10: Vectorization Advisor. Introduces the Intel Vectorization Advisor, which provides AVX-512 analysis capabilities to help reach the vectorization potential of Knights Landing. For scalar loops, it helps to discover what prevents code from being vectorized. For vectorized loops, it provides detailed AVX-512 performance characterization. Recommendations are additionally supplemented with the AVX-512 Traits and FLOPs, masks, Roofline and Gather/Scatter reports.
Chapter 11: Vectorization with SDLT. Introduces Intel® SIMD Data Layout Templates (SDLT) containers (used in place of std::vector). For C++ code, this can be an effective method to achieve superior performance by increasing vectorization through "AOS to SOA or AOSOA" conversions. This can enhance performance on Knights Landing or any processor. Includes sample code, and a discussion of how to transition from Array of Structures (AOS) to Structure of Arrays (SOA) or Array of Structure of Arrays (AOSOA) using SDLT while maintaining a high-level object-oriented structure.
Chapter 12: Vectorization with AVX-512 Intrinsics. Introduces programming with intrinsics for Intel® Advanced Vector Extensions 512 (AVX-512). Helps directly harness the richness of AVX-512 instructions by bypassing limitations of languages and compilers.
Chapter 13: Performance Libraries. Discusses three libraries from Intel: Intel® Math Kernel Library (MKL), Intel® Data Analytics Acceleration Library (DAAL), and Intel® Integrated Performance Primitives (IPP), collectively referred to as the Intel® Performance Libraries. These libraries provide high performance versions of important, computationally complex algorithms. Knights Landing can utilize each of them; Intel has endowed these libraries with Knights Landing optimizations including support for AVX-512.
Chapter 14: Profiling and Timing. Discusses insight based on event counters built into Knights Landing, and using those counters with the Intel® VTune Amplifier. Also discusses timing, a critical element in evaluating performance.
Chapter 15: MPI. Discusses MPI on Knights Landing, which has the same interfaces as on Intel Xeon processor based systems. Discusses how the characteristics of hybrid MPI/OpenMP performance may require tuning as the optimal balance of MPI ranks and OpenMP threads may vary.
Chapter 16: PGAS Programming Models. Takes a look at Partitioned Global Address Space (PGAS) programming models, which scale across cores and nodes while preserving a shared memory-like programming model. While Knights Landing will be programmed mostly with MPI, OpenMP and TBB, utilizing PGAS models will be increasingly important in the future. Examples illustrate that PGAS can be an effective programming model for the large number of cores on a Knights Landing.
Chapter 17: Software Defined Visualization. Visualizations of large data sets are best done on processors, and this chapter explains why and how by highlighting three key open source libraries that are fundamental for SDVis work (i.e., OpenSWR, Embree, and OSPRay). These libraries benefit from the SDVis capabilities of Knights Landing.
Chapter 18: Offload to Knights Landing. Covers two topics: the offload programming model, and Knights Landing coprocessor specific considerations. They are separate, but related, topics which are addressed together and separately.
Chapter 19: Power Analysis. Explores the fundamentals of power and performance analysis on Knights Landing using both open-source and Intel tools. Because Knights Landing is compatible with other Intel Xeon processors, the power measurement techniques covered are also applicable to server systems based on other Intel processors.
Section III: Pearls. Focuses on parallel programming in full applications, with examples and notes on Knights Landing specific results and optimizations.
Chapters 20-26: Results on LAMMPS, SeisSol, WRF, N-Body Simulations, Machine Learning, Trinity mini-applications and QCD are discussed.
Posted by reinders on Sunday January 24, 2016 at 04:59:36
The download file for Pearls 2 is complete. Note: I have moved the ZIP file to this server instead of Dropbox, based on feedback that some employers block Dropbox access. This download has the code (1.2GB in size) used in our book "High Performance Parallelism Pearls Volume Two", complete with Makefiles and build instructions - for the whole book (this complete version was posted in January 2016).
Call this "version 3." We hope you find it useful. Please drop us a note with any feedback or suggestions! DOWNLOAD CODE - 1.4GB ZIP FILE LINK
Posted by reinders on Sunday September 27, 2015 at 08:43:02
We have created PowerPoint summaries of the High Performance Parallelism Pearls books. If you expand on these - please share with us! I will be happy to grow and expand (and correct) these PowerPoint decks. I have uploaded completely open and unlocked PPTX files. The files are a bit large, but I did not want to over-compress the images. I doubt anyone would ever use more than a quarter of the slides in any one talk, probably less - but having them all is useful.
Posted by reinders on Tuesday August 25, 2015 at 01:52:10
An article about our discussion of the work from Chapter 10 ran in HPCwire: COSMOS Team Achieves 100x Speedup on Cosmology Code. Unknown to us at the time, Tiffany Trader at HPCwire attended our talk at IDF in San Francisco on August 19, 2015. She enjoyed our talk... I think our enthusiasm about this work showed!
The "100X" speed-up is real - and compares Intel to Intel. It was not a comparison of products from different companies, and nothing in it was an attempt to mislead anyone.
The team truly gets their analysis done 100X faster than when they started. It's a great example of "code modernization" - the authors shared, step by step, their thinking as they made nine distinct changes to their code, discussing each one, on the path to higher performance on both processors and the Intel Xeon Phi coprocessor. The tracking of the performance improvement on both from the same changes is remarkable as well. There is a lot to learn from their example. In fact, readers of our Pearls books know that both volumes are full of teaching examples like this. "Just parallelism," as we are guilty of saying on occasion. It's not easy - but neither is regular programming.
We really like how the article captured our enthusiasm in presenting this work.
Posted by reinders on Friday July 31, 2015 at 11:22:33
We have ALL the figures from both Volumes of High Performance Parallelism Pearls available for download.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example:
High Performance Parallelism Pearls Volume One by Jim Jeffers and James Reinders, copyright 2015, published by Morgan Kaufmann, ISBN 978-0-128-02118-7.
High Performance Parallelism Pearls Volume Two by James Reinders and Jim Jeffers, copyright 2015, published by Morgan Kaufmann, ISBN 978-0-128-03819-2.
All figures are available in TIFF format, many are also available in EPS format. For most uses, the TIFF files are what you want.
Posted by reinders on Tuesday March 24, 2015 at 05:00:00
I had the privilege of giving a talk today in Maryland that covered many topics, ranging from parallelism and Intel Xeon Phi to Intel Parallel Studio XE (tools) and our books. I have posted the slides for the students and anyone else who is interested.
Posted by reinders on Tuesday January 27, 2015 at 11:27:22
Jim Dempsey provided this video related to his chapter: High Performance Parallelism Pearls, Chapter 5, Plesiochronous Phasing Barriers, by Jim Dempsey.
This is a video of the Plesiochronous Phasing Barriers in action. The video is not annotated nor does it have a voice over... a short explanation is provided below the video.
The left half of the screen represents the optimized tiled version and the right half represents the plesiochronous version. Each half is divided into two parts:
Top) A view of the Y/Z plane, with the X dimension going into the screen. Each pixel in the top portion of each side changes color upon completion of the computation of a column along X. Color changes indicate the rate of computation; the position of a change indicates where and when in the Y/Z plane the computation occurred.
Bottom) Each thread displays an individual line progressing in time from left to right, wrapping around (raster-like), in two colors: green while the thread is computing, red while it waits in the barrier (red "ticks" may appear dark rather than red).
In the left half (traditional tiled), note that the Y/Z columns of X are in at most two colors (time phases) at any moment. The bottom of the left half shows that the traditional tiled method runs well until the threads start completing their designated tile(s) and reach the barrier. It looks like a cascade of cars reaching a traffic jam, which doesn't clear until all threads reach the barrier.
In the right half (plesiochronous), note that the Y/Z columns of X are in at most three colors (time phases). The bottom half shows that the barrier wait times of the threads are, for the most part, not synchronized. You may notice that four threads appear to be synchronized, and they are: these are the threads of the same core, and the plesiochronous barrier scheme uses core barriers. These threads are not adjacent because of KMP_AFFINITY=scatter. You may also note that each thread computes its X columns along the Y direction, so in essence the thread's tile is not rectangular. Notice too that the time-domain edge is ragged, indicating the time skew between threads. Occasionally you will also notice threads getting delayed, presumably by worst-case memory latencies due to evictions.
The programs were instrumented to collect (RDTSC) time stamp counter information for each thread as it entered and left a computational region. The time interval between computational regions is the barrier wait time.
You may click on the video to bring it up full size (double the width and height from that shown here).
Posted by reinders on Saturday November 15, 2014 at 03:14:06
Jim and I got to see the first copies of the new book today - together. They are here in time for SC'14. We have a book signing in the Intel booth on Thursday (Nov 20, 2014) at noon (drop by with your copy and we can sign it! - hopefully some of our coauthors will be there too.) Many thanks to the amazing team at Morgan Kaufmann Publishing, and to the wonderful contributors who worked so hard to share their work.
Posted by reinders on Thursday July 18, 2013 at 04:23:41
All the figures, tables, charts and drawings are available for download.
Please use them freely with attribution. You should find them all to be high-quality artwork, suitable for presentations and other uses.
Suggested attribution: (c) 2013 Jim Jeffers and James Reinders, used with permission.
Feel free to mention the book too: "Intel Xeon Phi Coprocessor High Performance Programming."
If you like our book - please let others know! If you have suggestions or feedback, please let us know!
Posted by reinders on Tuesday January 8, 2013 at 12:18:59
This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come.
—Robert J. Harrison
Institute for Advanced Computational Science,
Stony Brook University
Posted by reinders on Sunday December 16, 2012 at 10:09:15
Our book Intel Xeon Phi Coprocessor High Performance Programming (ISBN 978-0-124-10414-3) will be available from the publisher Morgan Kaufmann in February 2013, and from many booksellers (including Amazon.com). Pushing computing to new heights is one of the most exciting human endeavors, both for the thrill of doing it and the thrill of what it makes possible.
The Intel® Many Integrated Core (MIC) architecture and the first Intel® Xeon Phi™ coprocessor have brought us one of those rare, and very important, new chapters in this quest to push computing to new limits. Jim and James spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel® Xeon Phi™ coprocessor. They have distilled their own experiences coupled with insights from many expert customers, Intel Field Engineers, Application Engineers and Technical Consulting Engineers, to create this authoritative first book on programming for this new architecture and these new products. This book is useful even before you ever touch a system with an Intel® Xeon Phi™ coprocessor. The key techniques emphasized in this book are essential to programming any modern parallel computing system whether based on Intel Xeon processors, Intel Xeon Phi coprocessors, or other high performance microprocessors. Applying these techniques will generally increase your program performance on any system, and better prepare you for Intel Xeon Phi coprocessors and the Intel MIC architecture.