We are working on a new book packed with information about programming for Knights Landing. If you are interested in helping us by reviewing some chapters in December 2015 and January 2016, please drop us an email at Review.KNL.Book [at] lotsofcores.com. Please be patient... we may not reply to your e-mail until we have material to review.
Parallelism Pearls for Multicore and Many-core Programming
|We have created a PowerPoint summary of the Parallelism Pearls book Volume Two. If you expand on these slides, please share with me! I will be happy to grow, expand, and correct this PowerPoint deck. I have uploaded a completely open and unlocked PPTX. The file is a bit large, but I did not want to over-compress the images. I doubt anyone would ever use more than a quarter of the slides in any one talk, probably less, but having them all is useful.
PowerPoint: Download PowerPoint for Pearls Volume Two (14.5MB ZIP file) - updated slide 2 on 28-Sep-15
Last year, we created a similar deck for Volume One, also PowerPoint without restrictions.
Download PowerPoint: PowerPoint for Pearls Volume One (20.6MB PPTX file)
Note: I have moved the ZIP file to this server instead of Dropbox, based on feedback that some employers block Dropbox access. This download contains the code (1.2GB in size), complete with Makefiles and build instructions, used in our book "High Performance Parallelism Pearls Volume Two" - for Chapters 2-4, 6-10, 12-13, and 15-24. We will supply code from the remaining chapters when a few issues (licensing included) are worked out.
Errors in "version 1" of this download... I posted a set of code in late July, but I accidentally used pre-publication chapter numbers (I used s##, meaning submission ##; the correct download uses c## for Chapter ##, which matches the book!). Oops! I've fixed that now and included some additional chapters.
Call this "version 2." We'll update it in late September when we have code from additional chapters. We hope you find it useful. Please drop us a note with any feedback or suggestions! DOWNLOAD CODE - 1.2GB ZIP FILE LINK
Thanks to Ryan Coleman at Sandia National Labs, we have this on GitHub at https://github.com/ryancoleman/lotsofcoresbook2code
|Code from Volume One is a separate download (90MB in size), complete with Makefiles and build instructions, used in our book "High Performance Parallelism Pearls." Please drop us a note with any feedback or suggestions! DOWNLOAD CODE - 90MB ZIP FILE LINK
Thanks to Ryan Coleman at Sandia National Labs, we have this on GitHub at https://github.com/ryancoleman/lotsofcoresbook1code
An article about our discussion of the work from Chapter 10 ran in HPCwire: COSMOS Team Achieves 100x Speedup on Cosmology Code. Unknown to us at the time, Tiffany Trader at HPCwire attended our talk at IDF in San Francisco on August 19, 2015. She enjoyed our talk... I think our enthusiasm about this work showed!
The "100X" speedup is real, and it compares Intel to Intel; it was not a comparison of products from different companies in an attempt to mislead anyone.
The team truly gets their analysis done 100X faster than when they started. It's a great example of "code modernization": the authors shared their thinking step by step as they made nine distinct changes to their code, discussing each one, on the path to higher performance on Intel Xeon processors and the Intel Xeon Phi coprocessor. Tracking the performance improvement on both, from the same changes, is remarkable as well. There is a lot to learn from their example. In fact, readers of our Pearls books know that both volumes are full of teaching examples like this. "Just parallelism," as we are guilty of saying on occasion. It's not easy, but neither is regular programming.
We really like how the article captured our enthusiasm in presenting this work.
I have created a BibTeX file which has an entry for every chapter of the two "Pearls" books, the Xeon Phi book, and the Structured Parallel Programming book.
I have also included entries for all the other books I've been involved with, including TBB, VTune, Multithreading for VFX, and much more.
The entries include DOI numbers for the chapters of the two "Pearls" books, the Xeon Phi book, and the Structured Parallel Programming book.
This is a resource for the many people who have contributed to these books, and for anyone who would like to cite these works.
I will gladly take feedback, and update the file from time to time based on feedback and new publications.
We got our first copies of our latest book today!
|We have all the figures (diagrams, photos, etc.) from the book available to download (221MB).
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: High Performance Parallelism Pearls Volume Two by James Reinders and Jim Jeffers, copyright 2015, published by Morgan Kaufmann, ISBN 978-0-128-03819-2.
You can get them all by this one simple download: All Figures, TIFF format (221MB).
Most of the figures are also available in EPS by this one simple download: Many Figures, EPS format (162MB). NOTE: not all figures are represented in this ZIP.
NOTE: To download the figures from Volume One, refer to: lotsofcores.com/pearls1.figures
I had the privilege of giving a talk today in Maryland that covered many topics, including parallelism, Intel Xeon Phi, Intel Parallel Studio XE (tools), and our books. I have posted the slides for the students and anyone else who is interested.
|Jim Dempsey provided this video related to his chapter: High Performance Parallelism Pearls, Chapter 5, Plesiochronous Phasing Barriers, by Jim Dempsey.
This is a video of the Plesiochronous Phasing Barriers in action. The video is not annotated, nor does it have a voice-over... a short explanation is provided below the video.
The left half of the screen represents the optimized tiled version and the right half represents the plesiochronous version. Each half is divided into two parts:
Top) A view of the Y/Z plane with the X dimension into the screen. Each pixel in the top portion of each side changes color upon completion of computation of a column along X. Color changes are an indication of the rate of computation; the position of a change indicates where and when in the Y/Z plane the computation occurred.
Bottom) Each thread displays an individual line progressing in time from left to right, wrapping around (raster-like), with two colors: green for a thread computing, red for a thread waiting in the barrier (red "ticks" may appear dark rather than red).
In the left half (traditional tiled), you can see that the Y/Z columns of X are in at most two colors (time phases). The bottom of the left half illustrates that the traditional tiled method runs well until the threads start completing their designated tile(s) and reach the barrier. It looks like a cascade of cars reaching a traffic jam, which doesn't clear until all threads reach the barrier.
In the right half (plesiochronous), you can see that the Y/Z columns of X are in at most three colors (time phases). The bottom half illustrates that the barrier wait times of the threads are, for the most part, not synchronized. You may notice that four threads appear to be synchronized, and they are: these are the threads of the same core, and the plesiochronous barrier scheme uses core barriers. These threads are not adjacent because of KMP_AFFINITY=scatter. You may also note that each thread advances its X columns along the Y direction; essentially, the thread's tile is not rectangular. You will also notice the time-domain edge is ragged, indicating the time skew between threads. Occasionally you will also notice threads getting delayed, presumably by worst-case memory latencies due to evictions.
The programs were instrumented to collect (RDTSC) time stamp counter information for each thread as it entered and left a computational region. The time interval between computational regions is the barrier wait time.
You may click on the video to bring it up full size (double the width and height from that shown here).
|We have all the figures (diagrams, photos, etc.) from the book available to download.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: High Performance Parallelism Pearls by Jim Jeffers and James Reinders, copyright 2015, published by Morgan Kaufmann, ISBN 978-0-128-02118-7.
You can get them all by this one simple download: All Figures, TIFF format (97MB).
Most of the figures are also available in EPS by this one simple download: Many Figures, EPS format (130MB). NOTE: not all figures are represented in this ZIP.
NOTE: To download the figures from Volume Two, refer to: lotsofcores.com/pearls2.figures
Jim and I got to see the first copies of the new book today - together. They are here in time for SC'14. We have a book signing in the Intel booth on Thursday (Nov 20, 2014) at noon - drop by with your copy and we can sign it! Hopefully some of our coauthors will be there too. Many thanks to the amazing team at Morgan Kaufmann Publishing, and to the wonderful contributors who worked so hard to share their work.
|We have created a PowerPoint summary of the Parallelism Pearls book. If you expand on these slides, please share with me! I will be happy to grow, expand, and correct this PowerPoint deck. I have uploaded a completely open and unlocked PPTX as well as a PDF version. Call this "v1." I'll post updates as appropriate. I will be posting the CODE and FIGURES to this web site soon as well.
October 22 update: all editing is done... it heads to the printer now. 548 pages by my count.
We have a publication date: November 17! (and an ISBN: 978-0128021187)
High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches
Where to order:
There are some early reviews/write-ups based on a draft of the book:
(check for even more being posted for other chapters... Xeon Phi articles)
Colfax Research has just posted the 280-slide deck from their “Parallel Programming and Optimization with Intel Xeon Phi Coprocessors” developer training program.
The world's fastest computer, for the third time in a row on the biannual Top500 list, uses Intel Xeon Phi coprocessors to make its performance possible.
Intel Xeon Phi coprocessors are used in the #1, #7, #15, #39, #50, #51, #65, #92, #101, #102, #103, #134, #157, #186, #235, #251 and #451 systems.
No wonder we are working on another book about programming for highly parallel systems!
All the figures, tables, charts and drawings are available for download.
Please use them freely with attribution. You should find them all to be high-quality artwork, suitable for presentations and other uses.
Suggested attribution: (c) 2013 Jim Jeffers and James Reinders, used with permission.
Feel free to mention the book too: "Intel Xeon Phi Coprocessor High Performance Programming."
If you like our book - please let others know! If you have suggestions or feedback, please let us know!
GZipped TAR file: XeonPhiBookFiguresEtc.tar.gz
ZIP file: XeonPhiBookFiguresEtc.zip
Check out the download page for the code samples from Chapters 2, 3, and 4...
Our book has been reviewed at Dr. Dobb's - online at http://www.drdobbs.com/tools/developer-reading-list/240152134
I was excited to get a copy (sent to each author express from the printer) this week. It is available for purchase from many stores including http://store.elsevier.com/product.jsp?isbn=9780124104143
As of today, the book is in its final production steps... we still have proofreading to do, but everything is in the production department at Morgan Kaufmann, on track to see books in February 2013.
As a teaser - here is the outline for the book:
Chapter 1 - Introduction
Chapter 2 - High Performance Closed Track Test Drive!
Chapter 3 - A Friendly Country Road Race
Chapter 4 - Driving Around Town: Optimizing A Real-World Code Example
Chapter 5 - Lots of Data (Vectors)
Chapter 6 - Lots of Tasks (not Threads)
Chapter 7 - Offload
Chapter 8 - Coprocessor Architecture
Chapter 9 - Coprocessor System Software
Chapter 10 - Linux on the Coprocessor
Chapter 11 - Math Library
Chapter 12 - MPI
Chapter 13 - Profiling and Timing
Chapter 14 - Summary
We expect it to come in at just over 400 pages.
This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come.
—Robert J. Harrison
Institute for Advanced Computational Science,
Stony Brook University
(this will be in the Preface to the book)
|Our book Intel Xeon Phi Coprocessor High Performance Programming (ISBN 978-0-124-10414-3) will be available from the publisher Morgan Kaufmann in February 2013, and from many booksellers (including Amazon.com).
Pushing computing to new heights is one of the most exciting human endeavors, both for the thrill of doing it and the thrill of what it makes possible.
The Intel® Many Integrated Core (MIC) architecture and the first Intel® Xeon Phi™ coprocessor have brought us one of those rare, and very important, new chapters in this quest to push computing to new limits.
Jim and James spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel® Xeon Phi™ coprocessor. They have distilled their own experiences coupled with insights from many expert customers, Intel Field Engineers, Application Engineers and Technical Consulting Engineers, to create this authoritative first book on programming for this new architecture and these new products.
This book is useful even before you ever touch a system with an Intel® Xeon Phi™ coprocessor. The key techniques emphasized in this book are essential to programming any modern parallel computing system whether based on Intel Xeon processors, Intel Xeon Phi coprocessors, or other high performance microprocessors. Applying these techniques will generally increase your program performance on any system, and better prepare you for Intel Xeon Phi coprocessors and the Intel MIC architecture.