EECS 395 / EECS 495 -- Hot Topics in Distributed Systems: Data-Intensive Computing

 

Final Presentations

Date: Thursday, March 18th, 2010
Time: 12:30PM - 3:50PM
Location: TECH L168
Questions: Dr. Ioan Raicu
(iraicu@eecs.northwestern.edu)

The EECS 495 course on "data-intensive distributed computing" had a great quarter, filled with interesting discussions and topics! Students in the course also completed four distinct projects, which they have been preparing for the last 8 weeks. Below is an outline of their presentations, which are open to anyone interested in learning more about these topics.

Time Talk Title Abstract Speaker Bio
12:30PM  Opening Remarks    
12:45PM Tunebot in the Cloud
(Slides)
Tunebot is an online music search engine for query-by-humming (QBH). Users search for songs by singing or humming the melody, and Tunebot responds with a ranked list of potential matching songs. Tunebot is currently hosted locally on a single server. User queries require a linear scan of the database and currently take several seconds to complete. Given that the size of the database is expected to grow several fold and that the number of simultaneous users is expected to increase, the current architecture is insufficient to meet the anticipated demands on the system. I present a new cloud-based architecture for Tunebot. In this architecture, the various components of Tunebot are decoupled and computation is distributed across virtual machines hosted in a cloud environment. The architecture makes it possible to reduce the response time for user queries regardless of the size of the database and allows the system to scale dynamically in response to increased or fluctuating demand.
Arefin Huq is a Ph.D. student in Computer Science at Northwestern University. He is interested in using techniques from Machine Learning and Music Information Retrieval to aid the process of creating and improvising music. Before attending Northwestern, he most recently worked as a consultant for Sourcetone, LLC, developing algorithms for automated music emotion recognition.
1:30PM Trace Collection in Large Scale Peer to Peer Networks
(Slides)
As present-day peer-to-peer networks grow in scale, monitoring and diagnosis become more difficult and yet more important. Traces from each of the connected peers need to be collected at a central location for the purpose of diagnosis. In this project, we propose the design and implementation of a novel protocol for trace collection in peer-to-peer networks. To the best of our knowledge, the only existing solutions in this problem domain rely on network coding. We propose a complementary approach that addresses the structural design of the network to optimize trace collection. As in previous work, we operate under the paradigm of delay-tolerant data collection, so as not to overwhelm the network with the data collection process, which is secondary to the service for which the network actually exists.
Yinzhi Cao is a second-year Ph.D. student at Northwestern University. He received his B.S. from Tsinghua University in China. He is now working with Prof. Yan Chen. His research interest is web security.
Vaibhav Rastogi is a first-year Ph.D. student at Northwestern University. He received his B.S. from IIT in India. He is now working with Prof. Yan Chen. His research interest is P2P streaming.
2:15PM Automatic Parallelism Discovery
(Slides)
With the ever-growing scale of computation, both in the amount of computation and in the size of the data involved, it is becoming increasingly difficult, and sometimes impossible, to handle a computation task on a single processor. Given the decreasing cost of commodity computers, it is a natural evolution to leverage parallel and distributed computing to cope with the increasing computation cost. However, one serious challenge stands in the way of this natural solution: the need for parallel programming. In this project, we design and develop a tool that hides the overhead of parallel programming from developers. It takes in code written in the traditional sequential fashion and automatically executes it in parallel. The tool does not cause large performance degradation compared to hand-written parallel code and existing tools like Swift. In addition, it is designed to scale, with the goal of distributing a computation task across up to hundreds of thousands of processors.
Hongyu Gao is currently a second-year Ph.D. student in the Department of EECS at Northwestern University, advised by Prof. Yan Chen. He received his B.S. in 2008 from Peking University, China. His current research focuses on network intrusion detection and online social network security. He has worked on a variety of research topics, including web information retrieval and performance optimization using GPUs.
3:00PM A Distributed File System
(Slides)
Data-intensive file systems are the core of data-intensive computing paradigms like Map/Reduce. However, most applications are highly customized in order to run on top of the Hadoop Distributed File System (HDFS). Applications that need general file system operations cannot run on it as is. Also, the scalability of HDFS is limited by its centralized metadata management. In this project, we propose DHT-DFS, a new distributed file system (DFS) whose metadata servers are distributed using a distributed hash table (DHT). It can provide high scalability through the distributed metadata server (MDS) architecture and high portability through userspace file operations.
Chen joined the Ph.D. program under Prof. Alok Choudhary in Spring 2009. Previously, she worked as a software developer on parallel I/O middleware at the National Center for Computational Sciences (NCCS) at ORNL. Her current research focuses on data-intensive distributed systems and parallel I/O storage (middleware and file systems).
3:45PM Final Remarks    
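
As an illustration of the idea behind the 3:00PM talk, the following is a minimal sketch of how file metadata might be spread across several metadata servers with a distributed hash table. The abstract does not say which DHT scheme DHT-DFS uses; the consistent-hash ring, the class name MetadataRing, the server names, and the replicas parameter are all assumptions made for this example, not the project's actual design.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (a 160-bit integer from SHA-1)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class MetadataRing:
    """Hypothetical consistent-hash ring assigning file paths to metadata servers."""

    def __init__(self, servers, replicas=64):
        # Each server gets `replicas` virtual points to even out the load.
        self._points = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._points]

    def server_for(self, path: str) -> str:
        """Return the owner of `path`: the first ring point at or after its hash."""
        # Wrap around to the first point when the hash is past the last one.
        idx = bisect.bisect_left(self._keys, _hash(path)) % len(self._keys)
        return self._points[idx][1]


if __name__ == "__main__":
    # Server names and paths are made up for the example.
    ring = MetadataRing(["mds-a", "mds-b", "mds-c"])
    for path in ["/data/run1/part-0000", "/data/run1/part-0001", "/home/user/notes.txt"]:
        print(path, "->", ring.server_for(path))
```

With this kind of placement, no single server holds all metadata, and adding a server moves only the paths that now hash to the new server's points, which is one way a distributed MDS design could avoid the centralized bottleneck the abstract attributes to HDFS.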

 

Last modified: July 07, 2011