Workshop Schedule
October 25th and 26th, 2006
Held at the
Crowne Plaza Hotel
Hosted by the Innovative Computing Laboratory at the
Sponsored by:
From the organizers:
We would like to thank you
for agreeing to participate in the workshop.
The purpose of this workshop is to focus on the development of
algorithms and software for the IBM Cell processor, and to look for directions
for future research and development. The idea is to bring together a small
group of users to share experiences and ideas on the system. We would like to
keep things informal.
Dress at the workshop is
informal. Please tell us if you have
special requirements (vegetarian food, etc.). We are expecting to have
internet and wireless connections at the meeting. There will be a laptop
projector for giving talks.
Because of time constraints
not everyone will be able to present a talk at the meeting. We hope everyone
will participate in the discussions that occur during the sessions and outside
the meeting room. We are looking for more cutting edge, provocative, honest,
and/or controversial talks.
You should plan to fly into
the
Crowne
Plaza Hotel
(865) 522-2600
http://www.ichotelsgroup.com/h/d/cp/1/en/hotel/tyssh/transportation
The workshop will take place
in the Crowne Plaza Hotel in Salon A.
Agenda:
Wednesday | October 25th, 2006
8:30 - 9:00 | Continental Breakfast | Meeting Room: Salon A
9:00 - 9:15 | Welcome and Introduction | Jack Dongarra, UTK and ORNL and Gary Rancourt, IBM
9:15 - 9:45 | David Bader, GA Tech |
9:45 - 10:15 | Robert Cooper, Mercury |
10:15 - 10:45 | Jakub Kurzak, U | New Approaches to Numerical Linear Algebra on the CELL Processor
10:45 - 11:15 | Break |
11:15 - 11:45 | Ken Koch, LANL |
11:45 - 12:15 | Joseph Czechowski, GE |
12:15 - 12:45 | Chris |
12:45 - 1:45 | Lunch | Meeting Room: Salon A
1:45 - 2:15 | Jesús Labarta, |
2:15 - 2:45 | Jeremy Meredith, ORNL | Experiences Programming the Cell Across a Diverse Set of Applications
2:45 - 3:15 | Christopher Anand, McMaster U |
3:15 - 3:45 | Break |
3:45 - 4:15 | Mike Acton, CellPerformance |
4:15 - 5:15 | Open discussion |
6:30 | Dinner |
Thursday | October 26th, 2006
8:30 - 9:00 | Continental Breakfast | Meeting Room: Salon A
9:00 - 9:30 | Mike Houston, Stanford |
9:30 - 10:00 | |
10:00 - 10:30 | David Kunzman, UIUC |
10:30 - 11:00 | Break |
11:00 - 11:30 | Sam Williams, UC Berkeley |
11:30 - 12:00 | Jon Greene, Mercury |
12:00 - 12:30 | Fabrizio Petrini, PNL | Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors
12:30 - 1:30 | Lunch | Meeting Room: Salon A
1:30 - 2:00 | Yuan Zhao, Rice U |
2:00 - 2:30 | Michael Perrone, IBM |
2:30 - 3:00 | Luke Cico, Mercury |
3:00 - | Open discussion |
List of Attendees:
Attendee Name | Affiliation | email
Mike | CellPerformance |
Virat Agarwal | GATech |
Christopher Anand | McMaster U |
David Bader | Georgia Tech |
George Bosilca | U of |
John Brickman | Mercury Computer |
Alfredo Buttari | U of |
Luke Cico | Mercury |
Robert Cooper | Mercury Computer |
Joseph Czechowski | GE |
Jack Dongarra | U of |
Peng Du | U of | du@cs.utk.edu
Kayvon Fatahalian | Stanford | kayvonf@graphics.stanford.edu
Jon Greene | Mercury |
Paul Henning | LANL |
Mike | Stanford |
Kirk | IBM |
Laxmikant Kale | UIUC |
Ken Koch | LANL | krk@lanl.gov
David Kunzman | UIUC |
Jakub Kurzak | U of |
Jesus Labarta | |
Piotr Luszczek | U of |
Ben Martin | |
Jeremy Meredith | ORNL |
Chris Mueller | |
Michael Perrone | IBM |
Fabrizio Petrini | PNL | fabrizio.petrini@pnl.gov
Rancourt | IBM |
Bob Szabo | IBM |
Stan Tomov | U of |
Samuel Williams | UC Berkeley |
Yuan Zhao | Rice | yzhao@cs.rice.edu
Mike Acton, CellPerformance
Tapping the Cell for
Game Development
Harnessing
the tremendous power of the PS3 and Cell processor presents pitfalls for game
programmers not accustomed to the platform. The first challenge that
programmers transitioning to PS3/Cell must overcome is to unlearn their old
habits. The focus of this presentation is to present experiences and strategies
to smooth the transition from developing for conventional platforms onto the
PS3/Cell.
List Ranking on the
Cell Processor
Given a linked list, the list
ranking problem finds the distance from each node to the head of the list. List
ranking, representative of combinatorial and graph-theoretic applications, is
difficult to parallelize due to its highly irregular memory access
patterns. In this talk, we present an efficient implementation of
list ranking on the Cell Broadband Engine that uses a general work-partitioning
technique to hide memory access latency. We run our algorithm on a 3.2 GHz Cell
processor and demonstrate a substantial speedup in comparison with traditional
cache-based microprocessors. For a random linked list of 1 million nodes,
we achieve an overall speedup of 8.34 over a PPE-only implementation.
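To make the problem statement above concrete, here is a minimal sequential sketch of list ranking in Python. It is not the Cell implementation described in the talk; the array-of-successors layout and the names `list_rank`, `succ` and `head` are assumptions chosen for illustration. Because each step depends on the successor index just read, the traversal hops around memory unpredictably, which is the irregular access pattern the abstract refers to.

```python
def list_rank(succ, head):
    """Return rank[i] = distance from node i to the head of the list.

    succ[i] is the index of the node that follows node i,
    or -1 if node i is the tail (a hypothetical layout for this sketch).
    """
    rank = [-1] * len(succ)
    node, distance = head, 0
    while node != -1:
        rank[node] = distance   # record how far this node is from the head
        distance += 1
        node = succ[node]       # data-dependent jump to the next node
    return rank


# A 5-node list stored in scrambled order: 3 -> 0 -> 4 -> 1 -> 2
succ = [4, 2, -1, 0, 1]
print(list_rank(succ, head=3))
```

A parallel version has to break this chain of dependent loads, for example by splitting the list into sublists that are ranked independently and then combined.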
Christopher Anand,
McMaster U
Developing an SPU libm
Using Coconut
- Performance and accuracy results for SPU elementary math functions.
- Techniques to leverage the SPU ISA.
- Coconut Project: from mathematical specifications to efficient parallel implementations.
- Coconut Tools: declarative assembly language, simulation tool and scheduling algorithms.
This library has been developed using IBM's MASS as a
model, with assistance from IBM. IBM has
the resulting code, which they may release in some form in the future.
David Bader, GA Tech
Building a Cell Ecosystem
Robert Cooper, Mercury
Computer Systems
Programming the Cell
Broadband Engine Processor
The Cell Broadband Engine
processor can be viewed as a distributed memory multiprocessor on a single
chip. We have been able to apply over a decade of experience with a variety of
distributed memory architectures to the programming of Cell-based systems. We
have taken a very practical approach in order to achieve early application
success for our users. This is exemplified by the Mercury MultiCore Framework, an
API for explicitly programming heterogeneous multicore architectures. This talk
will contrast our approach with that of emerging tools for multicores that
automate more aspects of programming and optimization, and will discuss the
challenges of ensuring wide adoption of the Cell processor by the programming
community.
Joseph Czechowski, GE
MR Processing on Cell
One method of examining Cell
capabilities is by attempting to apply it to a real world problem such as
Magnetic Resonance (MR) imaging. The processing
required for MR imaging is very regular, and the Cell is a good match for the
types of computations involved (so long as the volume of data is
manageable). This presentation will
briefly describe our experience using the Cell to perform MR imaging.
Luke Cico, Mercury
Computer Systems
FFT Related
Mike Houston, Stanford
Experiences Building the Sequoia Cell Backend
I'll give a quick introduction to Sequoia and discuss
how the compiler backend was designed and the issues with the Cell toolchain
and hardware we ran into along the way.
I'll discuss the main difficulties that persist for our users when using
Cell, and the performance impacts of working around some of the issues. We also have some results from Sequoia
applications running on Cell.
Ken Koch, LANL
The New Roadrunner Supercomputer: What, When, How
The new Los Alamos National Laboratory supercomputer
named Roadrunner is described in this talk.
Roadrunner will be deployed in multiple phases. This talk covers details of the actual
machine architecture from the current Base System now being delivered through
the final Cell-accelerated system in early 2008. There will be a focus on the hardware and
software of the Cell-accelerated final system configuration.
Dave Kunzman, UIUC
Experience Porting
Charm++ to the Cell Processor
There are several features of
the Charm++ programming model that make it a good fit for the Cell processor,
including data encapsulation, virtualization, peek-ahead in message queue, and
portability. We, at the Parallel Programming Lab., have begun porting the
Charm++ Runtime System (RTS) to the Cell processor; this will allow Charm++
applications to take advantage of the computational power of the Cell
processor. We will present our experience in porting the Charm++ RTS along with
our initial impressions of Charm++ applications running on the Cell. The
Charm++ RTS takes advantage of an interface called the Offload API to move
computation to the SPEs. We will discuss the Offload API, which allows any
C/C++ based programs to easily "offload" computation onto the SPEs.
Jakub Kurzak,
New Approaches to Numerical Linear Algebra on the CELL
Processor
From the standpoint of numerical linear algebra, the
CELL processor can be characterized by its two distinct features, the hybrid
nature of its floating point capabilities in terms of speed and compliance with
the IEEE standard, and the potential for parallelization at a much finer level
of granularity than common processors. We present preliminary results with
mixed-precision algorithms for solving dense linear systems of equations, where
the bulk of the work is done in single precision, and the technique of
iterative refinement is used to correct the solution to double precision
accuracy. As of today, speeds in excess of 100 GFlop/s are achieved using a
single CELL processor. The opportunity of fine grain parallelization on the
CELL processor exposes the shortcomings of the model relying on parallelization
encapsulated in the layer of BLAS (Basic Linear Algebra Subroutines). We
present ongoing work on algorithms utilizing pipelining and streaming
techniques directly at the topmost level of linear algebra algorithms.
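The mixed-precision idea in this abstract (do the expensive solve in single precision, then use iterative refinement to recover double-precision accuracy) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' CELL code: the names `f32`, `solve_single` and `refine` are hypothetical, single precision is emulated by rounding through the standard `struct` module, and the sketch repeats the elimination on every correction step for brevity, whereas a practical solver would factor the matrix once and reuse the factors.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_single(A, b):
    """Gaussian elimination with partial pivoting, rounding every
    intermediate result to single precision (emulated)."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(rhs)] for row, rhs in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x


def refine(A, b, iterations=5):
    """Iterative refinement: residuals and updates in double precision,
    correction solves in (emulated) single precision."""
    x = solve_single(A, b)
    for _ in range(iterations):
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        d = solve_single(A, r)
        x = [xi + di for xi, di in zip(x, d)]
    return x


A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [1.0, 2.0, 3.0]
x = refine(A, b)
residual = max(abs(bi - sum(a * xj for a, xj in zip(row, x))) for row, bi in zip(A, b))
print(x, residual)
```

The point of the design is that the cubic-cost solve runs at single-precision speed, while only the cheaper residual and update steps need double precision. The approach relies on the matrix being reasonably well conditioned, as the small example here is.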
Jesús Labarta, CEPBA -
UPC,
Programming and Understanding
the Cell
The Cell architecture offers
huge processing power on a chip, at the expense of complexity in its use and
the understanding of its behaviour. The talk will describe current efforts at
BSC on the development and use of programming models that should ease the portability
of general applications to the Cell architecture. IBM's Octopiler is a compiler that accepts
OpenMP programs and outlines the body of parallel regions to the SPE, with the run
time taking care of accessing data on demand.
Cell Superscalar takes as input a sequential program annotated with
directives that specify for each potentially outlined computation the input and
output arguments. From it, the run time
determines the actual parallelism exploitable and orchestrates the work of the
different SPEs. An instrumentation framework is in place to obtain traces of
the actual behaviour of the chip that can then be analyzed with Paraver.
Jeremy Meredith, ORNL
Experiences Programming
the Cell Across a Diverse Set of Applications
The heterogeneous cores of
the Cell processor are capable of high performance, but developers must
explicitly manage data movement, scheduling, and synchronization. While these attributes provide the Cell with
its greatest performance strengths, they also form its greatest weaknesses in
terms of developer productivity, code portability, and initial performance
efficiencies. I will explore
optimization strategies and performance results with the standard high level
toolchain available for the Cell system, using a workload drawn from
scientific, imaging, and cognitive problem domains.
Chris Mueller,
Synthetic Programming on the Cell BE
In this talk, we introduce the Synthetic Programming
Environment (SPE*) for the Cell BE. The
SPE is a meta-programming tool for developing high performance computational
kernels in Python. Originally developed for the PowerPC, the SPE allows
developers to synthesize machine instruction streams (synthetic programs) at
run time and provides direct access to processor resources previously available
only through intermediate languages.
After a brief introduction to synthetic programming, we will discuss our
implementation of the SPE for the Cell BE and, using BLAST as an example,
demonstrate how to use the SPE to develop high-performance code for both the
PPU and SPU.
Additional information is available at www.synthetic-programming.org
Michael P. Perrone,
Cell BE Programming
Gotchas!
Abstract: When programmed properly, the Cell BE processor can achieve
tremendous performance; however certain peculiarities of the architecture and
tool set can lead to surprising "gotchas" that can negatively impact
performance. This presentation will describe some of these issues and how
to deal with them. The audience will be strongly encouraged to share
their own anecdotes about programming the Cell BE. Questions on how best
to program Cell BE will be opened up for the entire audience to debate.
Fabrizio Petrini, PNL
Challenges in Mapping
Graph Exploration Algorithms on Advanced Multi-core Processors
Numerous applications require
the exploration of large graphs. The problem has been tackled in the past
through a variety of solutions, either based on commodity processors or
dedicated hardware. Processors based on
multiple cores, like the Cell Broadband Engine (CBE), are gaining popularity as
basic building blocks for high performance clusters. Nevertheless, no studies
have yet investigated how effectively the CBE architecture can explore large
graphs, and how its performance compares with other architectural solutions.
In this paper, we describe
the challenges and design choices involved in mapping a breadth-first search
(BFS) algorithm on the CBE. Our implementation has been driven by an accurate
performance model, that has allowed seamless coordination between on-chip
communication, off-chip memory access, and computation.
Preliminary results obtained
on a pre-production prototype running at 2.4 GHz show almost linear speedups
when using multiple synergistic processing units and impressive levels of
performance when compared to other processors. With small arity graphs, a
single CBE can provide the same processing rate as 512 BlueGene/L processors,
and it is five times faster than a top-of-the-line AMD Opteron clocked at the
same frequency. The performance gap narrows with a larger graph arity, where
the CBE is still able to outperform 128 BlueGene/L processors and it is almost
three times as fast as the AMD Opteron.
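For readers unfamiliar with the algorithm named in this abstract, the following is a minimal sequential sketch of breadth-first search in Python. It is not the CBE implementation from the talk; the adjacency-list dictionary and the name `bfs_levels` are assumptions made for illustration. The number of neighbours per vertex corresponds to what the abstract calls the graph's arity.

```python
from collections import deque


def bfs_levels(adj, source):
    """Return a dict mapping each reachable vertex to its BFS level,
    i.e. the number of edges on a shortest path from source.

    adj maps a vertex to the list of its neighbours (hypothetical layout).
    """
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in level:          # first visit fixes the level
                level[v] = level[u] + 1
                frontier.append(v)
    return level


# Small directed example graph
adj = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs_levels(adj, 0))
```

Each visit reads an adjacency list whose location depends on earlier results, so the work is dominated by memory access rather than arithmetic. That is why the abstract stresses coordinating on-chip communication, off-chip memory access, and computation.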
Samuel Williams (UCB/LBL)
LBMHD3D on the Cell Processor
This talk will discuss the implementation and
performance of the core of the 3D lattice Boltzmann magnetohydrodynamic
turbulence code (LBMHD3D) on a Cell blade.
As each grid point requires more than 1KB of data, the small amount of
blade DRAM limits the problem size. For
a 62x64x4 weak scaling problem, each SPE delivers over 1GFlop/s in double
precision and blade performance scales to over 16GFlop/s. This compares very favorably against vector
machines, and is more than 10x faster than superscalar processors.
Yuan Zhao,
A Compiler for CELL
Processor
In this talk, we will present
a source-to-source compiler for the CELL processor. The compiler focuses on the loop nests that
often represent computation kernels in scientific applications, and uses a
fork-and-join model to offload computation to SPEs. Requiring no parallelism directives/pragmas
in user applications, the compiler relies on the dependence analysis
information to perform automatic parallelization, vectorization, data movement,
data alignment and synchronization generation.
We will show preliminary
results and discuss the ongoing and future research directions.