Workshop Schedule

Summit on Software and Algorithms for the Cell Processor

October 25th and 26th, 2006

Held at the

Crowne Plaza Hotel

401 W Summit Hill Dr.
Knoxville, Tennessee 37902

Hosted by the Innovative Computing Laboratory at the University of Tennessee

From the organizers:

We would like to thank you for agreeing to participate in the workshop. The purpose of this workshop is to focus on the development of algorithms and software for the IBM Cell processor, and to look for directions for future research and development. The idea is to bring together a small group of users to share experiences and ideas on the system. We would like to keep things informal.

Dress at the workshop is informal. Please tell us if you have special requirements (vegetarian food, etc.). We expect to have internet and wireless connections at the meeting. There will be a laptop projector for giving talks.

Because of time constraints, not everyone will be able to present a talk at the meeting. We hope everyone will participate in the discussions that occur during the sessions and outside the meeting room. We are looking for more cutting-edge, provocative, honest, and/or controversial talks.

You should plan to fly into the Knoxville airport (McGhee Tyson Airport - TYS, see http://www.tys.org/). From the airport you should be able to take a taxi to the Crowne Plaza Hotel; it's about 15 miles (~$25.00 taxi ride). The address of the hotel is:

Crowne Plaza Hotel

401 W Summit Hill Dr.
Knoxville, Tennessee 37902
(865) 522-2600

http://www.ichotelsgroup.com/h/d/cp/1/en/hotel/tyssh/transportation

The workshop will take place in the Crowne Plaza Hotel in Salon A.

Agenda:

Wednesday	October 25th, 2006
8:30 - 9:00	Continental Breakfast	Meeting Room: Salon A
9:00 - 9:15	Welcome and Introduction	Jack Dongarra, UTK and ORNL and Gary Rancourt, IBM
9:15 - 9:45	David Bader, GA Tech	Building a Cell Ecosystem
9:45 - 10:15	Robert Cooper, Mercury	Programming the Cell Broadband Engine Processor
10:15 - 10:45	Jakub Kurzak, U Tennessee	New Approaches to Numerical Linear Algebra on the CELL Processor
10:45 - 11:15	Break
11:15 - 11:45	Ken Koch, LANL	The New Roadrunner Supercomputer: What, When, How
11:45 - 12:15	Joseph Czechowski, GE	MR Processing on Cell
12:15 - 12:45	Chris Mueller, Indiana U	Synthetic Programming on the Cell BE
12:45 - 1:45	Lunch	Meeting Room: Salon A
1:45 - 2:15	Jesús Labarta, Barcelona	Programming and Understanding the Cell
2:15 - 2:45	Jeremy Meredith, ORNL	Experiences Programming the Cell Across a Diverse Set of Applications
2:45 - 3:15	Christopher Anand, McMaster U	Developing an SPU libm Using Coconut
3:15 - 3:45	Break
3:45 - 4:15	Mike Acton, CellPerformance	Tapping the Cell for Game Development
4:15 - 5:15	Open discussion
6:30	Dinner	Chesapeake's Restaurant

Thursday	October 26th, 2006
8:30 - 9:00	Continental Breakfast	Meeting Room: Salon A
9:00 - 9:30	Mike Houston, Stanford	Experiences Building the Sequoia Cell Backend
9:30 - 10:00	Virat Agarwal, GA Tech	List Ranking on the Cell Processor
10:00 - 10:30	David Kunzman, UIUC	Experience Porting Charm++ to the Cell Processor
10:30 - 11:00	Break
11:00 - 11:30	Sam Williams, UC Berkeley	LBMHD3D on the Cell Processor
11:30 - 12:00	Jon Greene, Mercury	Developing Optimized SPU Assembly Code for the Cell
12:00 - 12:30	Fabrizio Petrini, PNL	Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors
12:30 - 1:30	Lunch	Meeting Room: Salon A
1:30 - 2:00	Yuan Zhao, Rice U	A Compiler for CELL Processor
2:00 - 2:30	Michael Perrone, IBM	Cell BE Programming Gotchas!
2:30 - 3:00	Luke Cico, Mercury	FFT Related
3:00 -	Open discussion

List of Attendees:

Attendee Name		Affiliation	email
Mike	Acton	CellPerformance	macton@gmail.com
Virat	Agarwal	GATech	virat9@gmail.com
Christopher	Anand	McMaster U	anandc@mcmaster.ca
David	Bader	Georgia Tech	bader@cc.gatech.edu
George	Bosilca	U of Tenn	Bosilca@cs.utk.edu
John	Brickman	Mercury Computer	jbrickman@mc.com
Alfredo	Buttari	U of Tenn	buttari@cs.utk.edu
Luke	Cico	Mercury	lcico@mc.com
Robert	Cooper	Mercury Computer	rcooper@mc.com
Joseph	Czechowski	GE	czechowski@crd.ge.com
Jack	Dongarra	U of Tenn	dongarra@utk.edu
Peng	Du	U of Tenn	du@cs.utk.edu
Kayvon	Fatahalian	Stanford	kayvonf@graphics.stanford.edu
Jon	Greene	Mercury	greene@mc.com
Paul	Henning	LANL	phenning@lanl.gov
Mike	Houston	Stanford	mhouston@graphics.stanford.edu
Kirk	Jordan	IBM	kjordan@us.ibm.com
Laxmikant	Kale	UIUC	kale@uiuc.edu
Ken	Koch	LANL	krk@lanl.gov
David	Kunzman	UIUC	kunzman2@uiuc.edu
Jakub	Kurzak	U of Tenn	kurzak@cs.utk.edu
Jesus	Labarta	Barcelona	jesus@ac.upc.edu
Piotr	Luszczek	U of Tenn	luszczek@cs.utk.edu
Ben	Martin	Indiana U	benjmart@cs.indiana.edu
Jeremy	Meredith	ORNL	jsmeredith@ornl.gov
Chris	Mueller	Indiana U.	chemuell@cs.indiana.edu
Michael	Perrone	IBM	mpp@us.ibm.com
Fabrizio	Petrini	PNL	fabrizio.petrini@pnl.gov
Gary	Rancourt	IBM	rancourt@us.ibm.com
Bob	Szabo	IBM	rszabo@us.ibm.com
Stan	Tomov	U of Tenn	tomov@cs.utk.edu
Samuel	Williams	UC Berkeley	samw@EECS.Berkeley.EDU
Yuan	Zhao	Rice	yzhao@cs.rice.edu

Abstracts

Mike Acton, CellPerformance

Tapping the Cell for Game Development

Harnessing the tremendous power of the PS3 and Cell processor presents pitfalls for game programmers not accustomed to the platform. The first challenge that programmers transitioning to PS3/Cell must overcome is to unlearn their old habits. The focus of this presentation is to present experiences and strategies to smooth the transition from developing for conventional platforms onto the PS3/Cell.

Virat Agarwal, GA Tech

List Ranking on the Cell Processor

Given a linked list, the list ranking problem finds the distance from each node to the head of the list. List ranking, representative of combinatorial and graph-theoretic applications, is difficult to parallelize due to its highly irregular memory access patterns. In this talk, we present an efficient implementation of list ranking on the Cell Broadband Engine that uses a general work-partitioning technique to hide memory access latency. We run our algorithm on a 3.2 GHz Cell processor and demonstrate a substantial speedup in comparison with traditional cache-based microprocessors. For a random linked list of 1 million nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation.
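
For readers unfamiliar with the problem, the following is a minimal Python sketch of list ranking as defined above. It is an illustration only and is not the work-partitioning implementation presented in the talk. The list is assumed to be stored as a successor array (succ[i] is the node after i, or -1 at the tail). The function names are hypothetical. The second function uses classic pointer jumping, a textbook parallel formulation, simulated here sequentially.

```python
def list_rank_sequential(succ, head):
    """Walk the list from the head; rank[i] is the distance of node i from the head."""
    rank = [0] * len(succ)
    node, dist = head, 0
    while node != -1:
        rank[node] = dist
        dist += 1
        node = succ[node]
    return rank


def list_rank_pointer_jumping(succ):
    """Textbook pointer jumping, simulated one synchronous round at a time.

    Each node keeps a pointer toward the head and the distance covered by
    that pointer. In every round all nodes double the span of their pointer,
    so about log2(n) rounds are needed.
    """
    n = len(succ)
    # Build predecessor links; the head is the only node without one.
    pred = [-1] * n
    for i, s in enumerate(succ):
        if s != -1:
            pred[s] = i
    dist = [0 if pred[i] == -1 else 1 for i in range(n)]
    while any(p != -1 for p in pred):
        new_dist = dist[:]
        new_pred = pred[:]
        for i in range(n):  # every iteration is independent within a round
            p = pred[i]
            if p != -1:
                new_dist[i] = dist[i] + dist[p]
                new_pred[i] = pred[p]
        dist, pred = new_dist, new_pred
    return dist
```

Both functions return the same ranks for the same list. The scattered reads of dist[p] and pred[p] in the inner loop are an instance of the irregular memory access pattern the abstract mentions.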

Christopher Anand, McMaster U

Developing an SPU libm Using Coconut

- Performance and accuracy results for SPU elementary math functions.

- Techniques to leverage the SPU ISA.

- Coconut Project: from mathematical specifications to efficient parallel implementations.

- Coconut Tools: declarative assembly language, simulation tool and scheduling algorithms.

This library has been developed using IBM's MASS as a model, with assistance from IBM. IBM has the resulting code, which they may release in some form in the future.

David Bader, GA Tech

Building a Cell Ecosystem

Robert Cooper, Mercury Computer Systems

Programming the Cell Broadband Engine Processor

The Cell Broadband Engine processor can be viewed as a distributed memory multiprocessor on a single chip. We have been able to apply over a decade of experience with a variety of distributed memory architectures to the programming of Cell-based systems. We have taken a very practical approach in order to achieve early application success for our users. This is exemplified by the Mercury MultiCore Framework, an API for explicitly programming heterogeneous multicore architectures. This talk will contrast our approach with that of emerging tools for multicores that automate more aspects of programming and optimization, and will discuss the challenges of ensuring wide adoption of the Cell processor by the programming community.

Joseph Czechowski, GE

MR Processing on Cell

One method of examining the Cell's capabilities is to apply it to a real-world problem such as Magnetic Resonance (MR) imaging. The processing required for MR imaging is very regular, and the Cell is a good match for the types of computations involved (so long as the volume of data is manageable). This presentation will briefly describe our experience using the Cell to perform MR imaging.

Luke Cico, Mercury Computer Systems

FFT Related

Mike Houston, Stanford

Experiences Building the Sequoia Cell Backend

I'll give a quick introduction to Sequoia and discuss how the compiler backend was designed and the issues with the Cell toolchain and hardware we ran into along the way. I'll discuss the main difficulties that persist for our users when using Cell, and the performance impacts of working around some of the issues. We also have some results from Sequoia applications running on Cell.

Ken Koch, LANL

The New Roadrunner Supercomputer: What, When, How

The new Los Alamos National Laboratory supercomputer named Roadrunner is described in this talk. Roadrunner will be deployed in multiple phases. This talk covers details of the actual machine architecture from the current Base System now being delivered through the final Cell-accelerated system in early 2008. There will be a focus on the hardware and software of the Cell-accelerated final system configuration.

David Kunzman, UIUC

Experience Porting Charm++ to the Cell Processor

There are several features of the Charm++ programming model that make it a good fit for the Cell processor, including data encapsulation, virtualization, peek-ahead in the message queue, and portability. We at the Parallel Programming Lab have begun porting the Charm++ Runtime System (RTS) to the Cell processor; this will allow Charm++ applications to take advantage of the computational power of the Cell processor. We will present our experience in porting the Charm++ RTS along with our initial impressions of Charm++ applications running on the Cell. The Charm++ RTS takes advantage of an interface called the Offload API to move computation to the SPEs. We will discuss the Offload API, which allows any C/C++-based program to easily "offload" computation onto the SPEs.

Jakub Kurzak, University of Tennessee

New Approaches to Numerical Linear Algebra on the CELL Processor

From the standpoint of numerical linear algebra, the CELL processor can be characterized by two distinct features: the hybrid nature of its floating point capabilities in terms of speed and compliance with the IEEE standard, and the potential for parallelization at a much finer level of granularity than common processors. We present preliminary results with mixed-precision algorithms for solving dense linear systems of equations, where the bulk of the work is done in single precision, and the technique of iterative refinement is used to correct the solution to double precision accuracy. As of today, speeds in excess of 100 GFlop/s are achieved using a single CELL processor. The opportunity of fine grain parallelization on the CELL processor exposes the shortcomings of the model relying on parallelization encapsulated in the layer of BLAS (Basic Linear Algebra Subroutines). We present ongoing work on algorithms utilizing pipelining and streaming techniques directly at the topmost level of linear algebra algorithms.
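
The mixed-precision idea described above can be illustrated with a small, self-contained Python sketch. It is not the code presented in the talk. Single precision is emulated by rounding every intermediate result to a 32-bit float with the standard struct module, the factorization is a plain LU with partial pivoting, and all function names, the tolerance and the iteration limit are assumptions chosen for illustration.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU factorization with partial pivoting, every result rounded to single."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))  # perm[i] = original index of row i after pivoting
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm


def lu_solve_f32(lu, perm, b):
    """Forward and back substitution in emulated single precision."""
    n = len(b)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):  # solve L y = P b (unit lower triangular)
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):  # solve U x = y
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def solve_mixed_precision(a, b, max_iters=10, tol=1e-14):
    """Solve A x = b: factor once in single, refine the solution in double."""
    n = len(b)
    lu, perm = lu_factor_f32(a)    # the O(n^3) bulk of the work, single precision
    x = lu_solve_f32(lu, perm, b)  # initial solution, single-precision accuracy
    for _ in range(max_iters):
        # Residual r = b - A x, computed in double precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * max(abs(v) for v in b):
            break
        d = lu_solve_f32(lu, perm, r)  # cheap correction, reuses the factors
        x = [x[i] + d[i] for i in range(n)]
    return x
```

The expensive factorization runs only once, in the fast precision. Each refinement step costs one double-precision matrix-vector product and two triangular solves. The scheme is only expected to converge when the matrix is reasonably well conditioned relative to single precision.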

Jesús Labarta, CEPBA - UPC, Barcelona

Programming and Understanding the Cell

The Cell architecture offers huge processing power on a chip, at the expense of complexity in its use and in the understanding of its behaviour. The talk will describe current efforts at BSC on the development and use of programming models that should ease the porting of general applications to the Cell architecture. IBM's Octopiler is a compiler that accepts OpenMP programs and outlines the body of parallel regions to the SPE, with the run time taking care of accessing data on demand. Cell Superscalar takes as input a sequential program annotated with directives that specify, for each potentially outlined computation, the input and output arguments. From it, the run time determines the actual parallelism exploitable and orchestrates the work of the different SPEs. An instrumentation framework is in place to obtain traces of the actual behaviour of the chip that can then be analyzed with Paraver.

Jeremy Meredith, ORNL

Experiences Programming the Cell Across a Diverse Set of Applications

The heterogeneous cores of the Cell processor are capable of high performance, but developers must explicitly manage data movement, scheduling, and synchronization. While these attributes provide the Cell with its greatest performance strengths, they also form its greatest weaknesses in terms of developer productivity, code portability, and initial performance efficiencies. I will explore optimization strategies and performance results with the standard high level toolchain available for the Cell system, using a workload drawn from scientific, imaging, and cognitive problem domains.

Chris Mueller, Indiana University

Synthetic Programming on the Cell BE

In this talk, we introduce the Synthetic Programming Environment (SPE*) for the Cell BE. The SPE is a meta-programming tool for developing high performance computational kernels in Python. Originally developed for the PowerPC, the SPE allows developers to synthesize machine instruction streams (synthetic programs) at run time and provides direct access to processor resources previously available only through intermediate languages. After a brief introduction to synthetic programming, we will discuss our implementation of the SPE for the Cell BE and, using BLAST as an example, demonstrate how to use the SPE to develop high-performance code for both the PPU and SPU.

Additional information is available at www.synthetic-programming.org

Michael P. Perrone, IBM T.J. Watson Research Center

Cell BE Programming Gotchas!

When programmed properly, the Cell BE processor can achieve tremendous performance; however, certain peculiarities of the architecture and tool set can lead to surprising "gotchas" that can negatively impact performance. This presentation will describe some of these issues and how to deal with them. The audience will be strongly encouraged to share their own anecdotes about programming the Cell BE. Questions on how best to program Cell BE will be opened up for the entire audience to debate.

Fabrizio Petrini, PNL

Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors

Numerous applications require the exploration of large graphs. The problem has been tackled in the past through a variety of solutions, either based on commodity processors or dedicated hardware. Processors based on multiple cores, like the Cell Broadband Engine (CBE), are gaining popularity as basic building blocks for high performance clusters. Nevertheless, no studies have yet investigated how effectively the CBE architecture can explore large graphs, and how its performance compares with other architectural solutions.

In this paper, we describe the challenges and design choices involved in mapping a breadth-first search (BFS) algorithm on the CBE. Our implementation has been driven by an accurate performance model that has allowed seamless coordination between on-chip communication, off-chip memory access, and computation.

Preliminary results obtained on a pre-production prototype running at 2.4 GHz show almost linear speedups when using multiple synergistic processing units and impressive levels of performance when compared to other processors. With small arity graphs, a single CBE can provide the same processing rate as 512 BlueGene/L processors, and it is five times faster than a top-of-the-line AMD Opteron clocked at the same frequency. The performance gap narrows with a larger graph arity, where the CBE is still able to outperform 128 BlueGene/L processors and it is almost three times as fast as the AMD Opteron.
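
As background for the abstract above, here is a minimal Python sketch of a level-synchronous breadth-first search, the frontier-by-frontier formulation commonly used as a starting point for parallel BFS. It is only an illustration under assumptions: the adjacency-list dictionary and the function name are hypothetical, and the sketch does not reproduce the CBE mapping or the performance model described in the talk.

```python
def bfs_levels(adj, source):
    """Return a dict mapping each reachable vertex to its BFS level.

    The graph is explored one frontier at a time. In a parallel setting the
    vertices of a frontier could be split among processing units, with a
    synchronization point between consecutive levels.
    """
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:  # first visit fixes the level
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

The number of neighbours scanned per vertex is the graph arity referred to in the abstract, so it determines how much work each frontier vertex generates.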

Samuel Williams, UCB/LBL

LBMHD3D on the Cell Processor

In this talk, I will discuss the implementation and performance of the core of the 3D lattice Boltzmann magnetohydrodynamic turbulence code (LBMHD3D) on a Cell blade. As each grid point requires more than 1 KB of data, the small amount of blade DRAM limits the problem size. For a 62x64x4 weak scaling problem, each SPE delivers over 1 GFlop/s in double precision and blade performance scales to over 16 GFlop/s. This compares very favorably against vector machines, and is more than 10x faster than superscalar processors.

Yuan Zhao, Rice University

A Compiler for CELL Processor

In this talk, we will present a source-to-source compiler for the CELL processor. The compiler focuses on the loop nests that often represent computation kernels in scientific applications, and uses a fork-and-join model to offload computation to SPEs. Requiring no parallelism directives/pragmas in user applications, the compiler relies on the dependence analysis information to perform automatic parallelization, vectorization, data movement, data alignment and synchronization generation.

We will show preliminary results and discuss the ongoing and future research directions.
