
 

Workshop Schedule

 

Summit on Software and Algorithms for the Cell Processor

October 25th and 26th, 2006

 

Held at the

Crowne Plaza Hotel

401 W Summit Hill Dr.
Knoxville, Tennessee 37902


Hosted by the Innovative Computing Laboratory at the University of Tennessee

 

Sponsored by:


 


From the organizers:

We would like to thank you for agreeing to participate in the workshop. The purpose of this workshop is to focus on the development of algorithms and software for the IBM Cell processor, and to look for directions for future research and development. The idea is to bring together a small group of users to share experiences and ideas on the system. We would like to keep things informal.

 

Dress at the workshop is informal. Please tell us if you have any special requirements (vegetarian food, etc.). We expect to have internet and wireless connections at the meeting. There will be a laptop projector for giving talks.

 

Because of time constraints not everyone will be able to present a talk at the meeting. We hope everyone will participate in the discussions that occur during the sessions and outside the meeting room. We are looking for more cutting edge, provocative, honest, and/or controversial talks.

 

You should plan to fly into the Knoxville airport (McGhee Tyson Airport - TYS, see http://www.tys.org/). From the airport you should be able to take a taxi to the Crowne Plaza Hotel; it's about 15 miles (roughly a $25.00 taxi ride). The address of the hotel is:

Crowne Plaza Hotel

401 W Summit Hill Dr.
Knoxville, Tennessee 37902

(865) 522-2600

http://www.ichotelsgroup.com/h/d/cp/1/en/hotel/tyssh/transportation

 

The workshop will take place in the Crowne Plaza Hotel in Salon A.


 

 

Agenda:

Wednesday

October 25th, 2006

 

8:30 - 9:00    Continental Breakfast          Meeting Room: Salon A
9:00 - 9:15    Welcome and Introduction       Jack Dongarra, UTK and ORNL, and Gary Rancourt, IBM
9:15 - 9:45    David Bader, GA Tech           Building a Cell Ecosystem
9:45 - 10:15   Robert Cooper, Mercury         Programming the Cell Broadband Engine Processor
10:15 - 10:45  Jakub Kurzak, U Tennessee      New Approaches to Numerical Linear Algebra on the CELL Processor
10:45 - 11:15  Break
11:15 - 11:45  Ken Koch, LANL                 The New Roadrunner Supercomputer: What, When, How
11:45 - 12:15  Joseph Czechowski, GE          MR Processing on Cell
12:15 - 12:45  Chris Mueller, Indiana U       Synthetic Programming on the Cell BE
12:45 - 1:45   Lunch                          Meeting Room: Salon A
1:45 - 2:15    Jesús Labarta, Barcelona       Programming and Understanding the Cell
2:15 - 2:45    Jeremy Meredith, ORNL          Experiences Programming the Cell Across a Diverse Set of Applications
2:45 - 3:15    Christopher Anand, McMaster U  Developing an SPU libm Using Coconut
3:15 - 3:45    Break
3:45 - 4:15    Mike Acton, CellPerformance    Tapping the Cell for Game Development
4:15 - 5:15    Open discussion
6:30           Dinner                         Chesapeake's Restaurant

 


 

Thursday

October 26th, 2006

 

8:30 - 9:00    Continental Breakfast          Meeting Room: Salon A
9:00 - 9:30    Mike Houston, Stanford         Experiences Building the Sequoia Cell Backend
9:30 - 10:00   Virat Agarwal, GA Tech         List Ranking on the Cell Processor
10:00 - 10:30  David Kunzman, UIUC            Experience Porting Charm++ to the Cell Processor
10:30 - 11:00  Break
11:00 - 11:30  Sam Williams, UC Berkeley      LBMHD3D on the Cell Processor
11:30 - 12:00  Jon Greene, Mercury            Developing Optimized SPU Assembly Code for the Cell
12:00 - 12:30  Fabrizio Petrini, PNL          Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors
12:30 - 1:30   Lunch                          Meeting Room: Salon A
1:30 - 2:00    Yuan Zhao, Rice U              A Compiler for CELL Processor
2:00 - 2:30    Michael Perrone, IBM           Cell BE Programming Gotchas!
2:30 - 3:00    Luke Cico, Mercury             FFT Related
3:00 -         Open discussion

 

 


 

 

List of Attendees:

 

 

Attendee Name       Affiliation       Email
Mike Acton          CellPerformance   macton@gmail.com
Virat Agarwal       GATech            virat9@gmail.com
Christopher Anand   McMaster U        anandc@mcmaster.ca
David Bader         Georgia Tech      bader@cc.gatech.edu
George Bosilca      U of Tenn         Bosilca@cs.utk.edu
John Brickman       Mercury Computer  jbrickman@mc.com
Alfredo Buttari     U of Tenn         buttari@cs.utk.edu
Luke Cico           Mercury           lcico@mc.com
Robert Cooper       Mercury Computer  rcooper@mc.com
Joseph Czechowski   GE                czechowski@crd.ge.com
Jack Dongarra       U of Tenn         dongarra@utk.edu
Peng Du             U of Tenn         du@cs.utk.edu
Kayvon Fatahalian   Stanford          kayvonf@graphics.stanford.edu
Jon Greene          Mercury           greene@mc.com
Paul Henning        LANL              phenning@lanl.gov
Mike Houston        Stanford          mhouston@graphics.stanford.edu
Kirk Jordan         IBM               kjordan@us.ibm.com
Laxmikant Kale      UIUC              kale@uiuc.edu
Ken Koch            LANL              krk@lanl.gov
David Kunzman       UIUC              kunzman2@uiuc.edu
Jakub Kurzak        U of Tenn         kurzak@cs.utk.edu
Jesus Labarta       Barcelona         jesus@ac.upc.edu
Piotr Luszczek      U of Tenn         luszczek@cs.utk.edu
Ben Martin          Indiana U         benjmart@cs.indiana.edu
Jeremy Meredith     ORNL              jsmeredith@ornl.gov
Chris Mueller       Indiana U.        chemuell@cs.indiana.edu
Michael Perrone     IBM               mpp@us.ibm.com
Fabrizio Petrini    PNL               fabrizio.petrini@pnl.gov
Gary Rancourt       IBM               rancourt@us.ibm.com
Bob Szabo           IBM               rszabo@us.ibm.com
Stan Tomov          U of Tenn         tomov@cs.utk.edu
Samuel Williams     UC Berkeley       samw@EECS.Berkeley.EDU
Yuan Zhao           Rice              yzhao@cs.rice.edu

 


 

 

Abstracts

 

Mike Acton, CellPerformance

Tapping the Cell for Game Development

Harnessing the tremendous power of the PS3 and Cell processor presents pitfalls for game programmers not accustomed to the platform. The first challenge that programmers transitioning to PS3/Cell must overcome is to unlearn their old habits. The focus of this presentation is to present experiences and strategies to smooth the transition from developing for conventional platforms onto the PS3/Cell.

 

 

Virat Agarwal, GA Tech

List Ranking on the Cell Processor
Given a linked list, the list ranking problem finds the distance from each node to the head of the list. List ranking, representative of combinatorial and graph-theoretic applications, is difficult to parallelize due to its highly irregular memory access patterns.  In this talk, we present an efficient implementation of list ranking on the Cell Broadband Engine that uses a general work-partitioning technique to hide memory access latency. We run our algorithm on a 3.2 GHz Cell processor and demonstrate a substantial speedup in comparison with traditional cache-based microprocessors. For a random linked list of 1 million  nodes, we achieve an overall speedup of 8.34 over a PPE-only implementation.
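
To make the problem statement concrete, here is a minimal Python sketch of list ranking. It is not the speakers' Cell implementation and says nothing about their work-partitioning technique. It assumes the list is stored as a successor array `succ` (an illustrative representation, not taken from the talk), and contrasts a plain sequential walk with a pointer-jumping formulation (Wyllie's technique), simulated here one round at a time. The function names are hypothetical.

```python
def list_rank_sequential(succ, head):
    """Walk the list from the head; rank[v] is the distance from v to the head."""
    rank = [None] * len(succ)
    node, dist = head, 0
    while node is not None:
        rank[node] = dist
        node, dist = succ[node], dist + 1
    return rank


def list_rank_pointer_jumping(succ):
    """Pointer jumping, simulated sequentially.

    Assumes every node belongs to one list. Within a round, each node's
    update reads only the previous round's arrays, so the inner loop
    could run in parallel.
    """
    n = len(succ)
    # Build predecessor pointers; the head is the node without a predecessor.
    pred = [None] * n
    for node, nxt in enumerate(succ):
        if nxt is not None:
            pred[nxt] = node
    # Invariant: rank[i] is the distance from i to jump[i]
    # (or to the head once jump[i] is None).
    rank = [0 if pred[i] is None else 1 for i in range(n)]
    jump = pred[:]
    while any(j is not None for j in jump):
        new_rank, new_jump = rank[:], jump[:]
        for i in range(n):
            if jump[i] is not None:
                new_rank[i] = rank[i] + rank[jump[i]]
                new_jump[i] = jump[jump[i]]
        rank, jump = new_rank, new_jump
    return rank


# A four-node list stored out of order: 2 -> 0 -> 3 -> 1
succ = [3, None, 0, 1]
print(list_rank_sequential(succ, head=2))
print(list_rank_pointer_jumping(succ))
```

The scattered reads of `rank[jump[i]]` and `jump[jump[i]]` illustrate the irregular memory access pattern the abstract refers to.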

 

 

Christopher Anand, McMaster U

Developing an SPU libm Using Coconut

- Performance and accuracy results for SPU elementary math functions.

- Techniques to leverage the SPU ISA.

- Coconut Project: from mathematical specifications to efficient parallel implementations.

- Coconut Tools: declarative assembly language, simulation tool and scheduling algorithms.

 

This library has been developed using IBM's MASS as a model, with assistance from IBM. IBM has the resulting code, which they may release in some form in the future.

 

 

David Bader, GA Tech

Building a Cell Ecosystem

 

 

Robert Cooper, Mercury Computer Systems

Programming the Cell Broadband Engine Processor

The Cell Broadband Engine processor can be viewed as a distributed memory multiprocessor on a single chip. We have been able to apply over a decade of experience with a variety of distributed memory architectures to the programming of Cell-based systems. We have taken a very practical approach in order to achieve early application success for our users. This is exemplified by the Mercury MultiCore Framework, an API for explicitly programming heterogeneous multicore architectures. This talk will contrast our approach with that of emerging tools for multicores that automate more aspects of programming and optimization, and will discuss the challenges of ensuring wide adoption of the Cell processor by the programming community.

 

 

Joseph Czechowski, GE

MR Processing on Cell

One method of examining Cell capabilities is by attempting to apply it to a real world problem such as Magnetic Resonance (MR) imaging. The processing required for MR imaging is very regular, and the Cell is a good match for the types of computations involved (so long as the volume of data is manageable). This presentation will briefly describe our experience using the Cell to perform MR imaging.

 

 

Luke Cico, Mercury Computer Systems

FFT Related

 

 

Mike Houston, Stanford

Experiences Building the Sequoia Cell Backend

I'll give a quick introduction to Sequoia and discuss how the compiler backend was designed and the issues with the Cell toolchain and hardware we ran into along the way. I'll discuss the main difficulties that persist for our users when using Cell, and the performance impacts of working around some of the issues. We also have some results from Sequoia applications running on Cell.

 

 

Ken Koch, LANL

The New Roadrunner Supercomputer: What, When, How

The new Los Alamos National Laboratory supercomputer named Roadrunner is described in this talk. Roadrunner will be deployed in multiple phases. This talk covers details of the actual machine architecture from the current Base System now being delivered through the final Cell-accelerated system in early 2008. There will be a focus on the hardware and software of the Cell-accelerated final system configuration.

 

 

Dave Kunzman, UIUC

Experience Porting Charm++ to the Cell Processor

There are several features of the Charm++ programming model that make it a good fit for the Cell processor, including data encapsulation, virtualization, peek-ahead in message queue, and portability. We, at the Parallel Programming Lab., have begun porting the Charm++ Runtime System (RTS) to the Cell processor; this will allow Charm++ applications to take advantage of the computational power of the Cell processor. We will present our experience in porting the Charm++ RTS along with our initial impressions of Charm++ applications running on the Cell. The Charm++ RTS takes advantage of an interface called the Offload API to move computation to the SPEs. We will discuss the Offload API, which allows any C/C++ based programs to easily "offload" computation onto the SPEs.

 

 

Jakub Kurzak, University of Tennessee

New Approaches to Numerical Linear Algebra on the CELL Processor

From the standpoint of numerical linear algebra, the CELL processor can be characterized by its two distinct features, the hybrid nature of its floating point capabilities in terms of speed and compliance with the IEEE standard, and the potential for parallelization at a much finer level of granularity than common processors. We present preliminary results with mixed-precision algorithms for solving dense linear systems of equations, where the bulk of the work is done in single precision, and the technique of iterative refinement is used to correct the solution to double precision accuracy. As of today, speeds in excess of 100 GFlop/s are achieved using a single CELL processor. The opportunity of fine grain parallelization on the CELL processor exposes the shortcomings of the model relying on parallelization encapsulated in the layer of BLAS (Basic Linear Algebra Subroutines). We present ongoing work on algorithms utilizing pipelining and streaming techniques directly at the topmost level of linear algebra algorithms.
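
The mixed-precision idea in this abstract can be illustrated with a small, self-contained Python sketch. It is only a sketch under stated assumptions, not the authors' CELL code: single precision is emulated by rounding every intermediate result through `struct`, the helper names (`f32`, `solve_single`, `refine`) are hypothetical, and for brevity each correction re-runs the elimination, whereas a real implementation would factor the matrix once and reuse the factors.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_single(a, b):
    """Gaussian elimination with partial pivoting in emulated single precision."""
    n = len(b)
    # Augmented matrix [A | b], with every entry rounded to single precision.
    m = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        m[k], m[p] = m[p], m[k]
        for r in range(k + 1, n):
            factor = f32(m[r][k] / m[k][k])
            for c in range(k, n + 1):
                m[r][c] = f32(m[r][c] - f32(factor * m[k][c]))
    # Back substitution, still rounding to single precision.
    x = [0.0] * n
    for r in reversed(range(n)):
        s = m[r][n]
        for c in range(r + 1, n):
            s = f32(s - f32(m[r][c] * x[c]))
        x[r] = f32(s / m[r][r])
    return x


def refine(a, b, iterations=5):
    """Iterative refinement: single-precision solves, double-precision residuals."""
    x = solve_single(a, b)
    for _ in range(iterations):
        # Residual r = b - A x, computed in double precision.
        residual = [bi - sum(aij * xj for aij, xj in zip(row, x))
                    for row, bi in zip(a, b)]
        # Correction solved cheaply in single precision, applied in double.
        correction = solve_single(a, residual)
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x


a = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
x_true = [1.0 / 3.0, -2.0 / 7.0, 0.1]
b = [sum(aij * xj for aij, xj in zip(row, x_true)) for row in a]
print("single only:", solve_single(a, b))
print("refined:    ", refine(a, b))
```

For a well-conditioned matrix such as this one, the refined solution should approach double-precision accuracy even though all the elimination work is done in single precision.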

 

 

Jesús Labarta, CEPBA - UPC, Barcelona

Programming and Understanding the Cell

The Cell architecture offers huge processing power on a chip, at the expense of complexity in its use and in understanding its behaviour. The talk will describe current efforts at BSC on the development and use of programming models that should ease the portability of general applications to the Cell architecture. IBM's Octopiler is a compiler that accepts OpenMP programs and outlines the body of parallel regions to the SPE, the run time taking care of accessing data on demand. Cell Superscalar takes as input a sequential program annotated with directives that specify for each potentially outlined computation the input and output arguments. From it, the run time determines the actual parallelism exploitable and orchestrates the work of the different SPEs. An instrumentation framework is in place to obtain traces of the actual behaviour of the chip that can then be analyzed with Paraver.

 

 

Jeremy Meredith, ORNL

Experiences Programming the Cell Across a Diverse Set of Applications

The heterogeneous cores of the Cell processor are capable of high performance, but developers must explicitly manage data movement, scheduling, and synchronization. While these attributes provide the Cell with its greatest performance strengths, they also form its greatest weaknesses in terms of developer productivity, code portability, and initial performance efficiencies. I will explore optimization strategies and performance results with the standard high level toolchain available for the Cell system, using a workload drawn from scientific, imaging, and cognitive problem domains.

 

 

Chris Mueller, Indiana University

Synthetic Programming on the Cell BE

In this talk, we introduce the Synthetic Programming Environment (SPE*) for the Cell BE. The SPE is a meta-programming tool for developing high performance computational kernels in Python. Originally developed for the PowerPC, the SPE allows developers to synthesize machine instruction streams (synthetic programs) at run time and provides direct access to processor resources previously available only through intermediate languages. After a brief introduction to synthetic programming, we will discuss our implementation of the SPE for the Cell BE and, using BLAST as an example, demonstrate how to use the SPE to develop high-performance code for both the PPU and SPU.

Additional information is available at www.synthetic-programming.org

 

 

Michael P. Perrone, IBM T.J. Watson Research Center

Cell BE Programming Gotchas!
Abstract: When programmed properly, the Cell BE processor can achieve tremendous performance; however, certain peculiarities of the architecture and tool set can lead to surprising "gotchas" that can negatively impact performance. This presentation will describe some of these issues and how to deal with them. The audience will be strongly encouraged to share their own anecdotes about programming the Cell BE. Questions on how best to program Cell BE will be opened up for the entire audience to debate.

 

 

Fabrizio Petrini, PNL

Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors

Numerous applications require the exploration of large graphs. The problem has been tackled in the past through a variety of solutions, either based on commodity processors or dedicated hardware. Processors based on multiple cores, like the Cell Broadband Engine (CBE), are gaining popularity as basic building blocks for high performance clusters. Nevertheless, no studies have yet investigated how effectively the CBE architecture can explore large graphs, and how its performance compares with other architectural solutions.


In this paper, we describe the challenges and design choices involved in mapping a breadth-first search (BFS) algorithm on the CBE. Our implementation has been driven by an accurate performance model that has allowed seamless coordination between on-chip communication, off-chip memory access, and computation.

 

Preliminary results obtained on a pre-production prototype running at 2.4 GHz show almost linear speedups when using multiple synergistic processing units and impressive levels of performance when compared to other processors. With small arity graphs, a single CBE can provide the same processing rate as 512 BlueGene/L processors, and it is five times faster than a top-of-the-line AMD Opteron clocked at the same frequency. The performance gap narrows with a larger graph arity, where the CBE is still able to outperform 128 BlueGene/L processors and it is almost three times as fast as the AMD Opteron.
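
For readers unfamiliar with the algorithm being mapped, the following is a plain sequential breadth-first search in Python. It is only a reference sketch: the adjacency-dictionary representation and the name `bfs_levels` are assumptions made for illustration, and nothing here reflects the authors' CBE design, their performance model, or their handling of on-chip communication.

```python
from collections import deque


def bfs_levels(adjacency, source):
    """Return a dict mapping each reachable vertex to its BFS level from source."""
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        # Neighbour lookups land wherever the graph structure dictates,
        # which is why graph exploration is dominated by memory access.
        for w in adjacency.get(v, ()):
            if w not in level:
                level[w] = level[v] + 1
                frontier.append(w)
    return level


graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs_levels(graph, 0))
```

The "arity" mentioned in the abstract corresponds here to the length of each adjacency list.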

 

 

Samuel Williams (UCB/LBL)

LBMHD3D on the Cell Processor

In this talk, I will discuss the implementation and performance of the core of the 3D lattice Boltzmann magnetohydrodynamic turbulence code (LBMHD3D) on a Cell blade. As each grid point requires more than 1KB of data, the small amount of blade DRAM limits the problem size. For a 62x64x4 weak scaling problem, each SPE delivers over 1GFlop/s in double precision and blade performance scales to over 16GFlop/s. This compares very favorably against vector machines, and is more than 10x faster than superscalar processors.

 

 

Yuan Zhao, Rice University

A Compiler for CELL Processor

 

In this talk, we will present a source-to-source compiler for the CELL processor. The compiler focuses on the loop nests that often represent computation kernels in scientific applications, and uses a fork-and-join model to offload computation to SPEs. Requiring no parallelism directives/pragmas in user applications, the compiler relies on the dependence analysis information to perform automatic parallelization, vectorization, data movement, data alignment and synchronization generation.

We will show preliminary results and discuss the ongoing and future research directions.