Large Scale Graph Mining on GPU

Project maintained by hanzhou and srw

By Han Zhou (hanzhou) and Shannon Williams (srw)

Main Page

Summary

We want to implement large scale graph mining on GPUs and compare the runtime to the versions available for CPU.

Background

This project is inspired from PEGASUS, which runs graph mining algorithms on the top of Hadoop. The core idea of PEGASUS is converting graph mining problems to a matrix-vector multiplication and using sql operations to achieve the computation. Therefore the primitive operation is GIM-V (Generalized Iterated Matrix-Vector multiplication). Then it is implemented in Hadoop with some optimizations. (See: this paper for details)

Previously, we implemented a sequential version of GIM-V in SQL. We plan to use this as a baseline to measure the improvement of our code.

Challenge

The challenge of this project is scaling up GPU to work on a large dataset. Local memory is limited, so we need to do many disk reads to compute over the entire graph. We need to efficiently hide the latency of these reads, and also improve the compute time such that, even with the additional latency, we still see a speed-up.

Another challenge would be scaling up to use multiple GPUs. Communication between GPUs will be difficult to implement, and also add further latency. A key point of our project will be analyzing these trade offs.

Resources

PEGASUS is an open source graph mining program, so we can directly download it from their website.

cuBLAS is provided by Nvidia to support the basic linear algebra computing. And these operations should have been well optimized. But we don’t know whether we can still use them when we want to implement a scalable version, so we will try both this library and functions developed by ourselves.

Goals and Deliverables

PLAN TO ACHIEVE

A GPU version of one or more graph mining algorithms. Specifically, we plan to implement PageRank and Connected Components.
An analysis of PEGASUS, a sequential SQL version, and the GPU version on GB and TB sized graphs.
A GPU version which effectively uses multiple GPUs.

HOPE TO ACHIEVE

A hybrid version of CPU and GPU which can beat the performance of GPU alone.
Find an effective method to store sparse matrix on GPU (see http://www.vldb.org/pvldb/vol4/p231-yang.pdf )

DEMO

At our demo, we plan to present several graphs comparing the runtime of our 3 algorithms on multiple input graphs for multiple graph mining algorithms. This should clearly show when each algorithm is superior to the others.

We probably won’t show a demo of our programs actually running, because it wouldn’t look very interesting, and it might take a while to run.

WHAT WE HOPE TO LEARN

We want to discover if the improved compute time of GPUs is worth the possible overhead of moving large quantities of data, such as those found in modern day graphs.

Platform Choice

We plan to use the Latedays cluster to run our code, which will use Tesla K40 GPUs. The K40 are advanced GPUs used for scientific computing. Also, Latedays has a large memory and many multi-core CPUs making it ideal for parallelizing on the CPU as well. For consistency, we will also run the sequential SQL version on Latedays. In contrast, PEGASUS uses Hadoop cloud computing to run, so we can just run this with AWS.

Schedule

Week 0: (March 30 - April 5)

Write project proposal.
Investigate background material and available resources.
Run PEGASUS on AWS with sample graph sets and record data.

Week 1: (April 6 - April 12)

Implement basic GPU version of PageRank within small graph.
Run SQL version on Latedays, record data results.

Week 2: (April 13 - April 19)

Extend GPU version to handle TB graph on single core.

Week 3 - 4: (April 20 - May 3)

Extend GPU version to use multiple GPUs.
Optimize GPU version.

Week 5: (May 3 - May 10)

Consider using hybrid of CPU and GPU.
Run GPU version on Latedays with sample graph sets and record data.
Compare and analyze results.

Having trouble with Pages? Check out the documentation at https://help.github.com/pages or contact support@github.com and we’ll help you sort it out.