Research Interest
My research interests cover parallel and distributed computing, large-scale data mining and machine learning, big data systems and applications, and optimization algorithms. In specific, I would like to work on scalable parallel data mining algorithms which facilitate computational and scientific applications by dealing with large-scale data via employing efficient parallel computing strategies to them via appropriate parallel runtime models, such as MPI, GraphLab, and (iterative) MapReduce, on various parallel and distributed environments. I am also highly interested in applying the optimization techniques underlying data mining and machine learning algorithms in order to achieve better outputs.
Current Projects
- SparseBench: Sparse Matrix Operation Benchmark [2015 - Present]
- GossipMap: Scalable Graph Community Detection Algorithm [2012 - Present]
Sparsity of the data is a typical phenomenon in the most of the big graph data. Although most dense matrix operations have been researched several decades, such as BLAS, PBLAS, LAPACK, ScaLAPACK, and so on, research community of Linear Algebra (LA) has been working on the Sparse LA to tackle more efficiently, recently. On the other hand, the database systems are known as not suitable to do LA operations, due to high overhead. However, the story could be changed when we think about sparse matrix operations, since most of the sparse matrix operations could be implemented with Join and Aggregation operations in DB systems. Therefore, it is an interesting and essential research work to do a benchmark study for comparison recent Big Data Systems and released Sparse LA libraries.
Community-detection is a powerful approach to uncover important structures in large networks. Infomap is a flow-based and information-theoretic community detection algorithm. While Infomap is known to be an effective algorithm, it is critical to utilize parallel and distributed computing for taking care of huge graphs, such as web-scale graphs with billions of edges. I worked on a novel and efficient parallel generalization of Infomap, called RelaxMap. I have also proposed a prioritized method of the RelaxMap which can be much more efficient than the non-prioritized version. Recently, I extended the RelaxMap algorithm in distibuted-memory cluster systems for achieving high scalability, called GossipMap. Currently, I am exerting my efforts to extend the GossipMap algorithm for analyses of large-scale temporal graph datasets as well as hierarchical community structures of the given graphs.
Past Projects
- Deterministic Annealing Multidimensional Scaling (DA-MDS) [2009 - 2012]
- MDS Interpolation and High performance Parallel MDS for Large High-dimensional Data Visualization. [2008 - 2012]
- Chemical compounds data visualization. [2010 - 2012]
- Parallel Runtime Comparison [2010]
- GIS data clustering and Visualization(using Indiana Census data & SAVI data) [2007]
- Message Passing Overhead Benchmarking [2007]
- dPattern: Transcription Factor Binding Site (TFBS) Discovery [2006]
- Adaptive Mutation Rates for Hybrid Genetic Algorithm (GA) [2002 - 2004]
- GIS Data Representation based on SVG [2001]
I applied Deterministic Annealing (DA) optimization approach to MDS algorithms, for the purpose of avoiding well-known local optima problem in conventional MDS solutions, and originated DA-MDS algorithm. DA-MDS improves mapping quality in terms of the well-known objective function, called STRESS, compared to Expectation-Maximization (EM)-like majorization approach named SMACOF. In addition, DA-MDS turns out more efficient than SMACOF algorithm for larger data sets. Since DA-MDS shows better performance in terms of mapping quality, I parallelized the DA-MDS algorithm by using MPI.
Developed high-performance parallel MDS, which is a data-intensive application as well as computation-intensive application for large high-dimensional scientific data, by using Thread and MPI.
Since the time complexity of MDS is quadratic, it is still difficult and suffers from long running time to deal with millions of data, which is a normal data size nowadays, even if we parallelized the algorithm. Therefore, I invented a MDS interpolation algorithm to reduce time complexity based on majorizing method used in SMACOF.
Applied MDS algorithm to chemical compounds data for data point visualization of millions of chemical compounds in PubChem database with respect to 166-bit fingerprint.
Analyzed different parallel runtimes' performance, including Microsoft's Concurrency and Coordination Runtime (CCR), Task Parallel Library (TPL), and MPI.NET through various algorithms.
Analyzed DA clustering results with Indiana Census data, and visualize clustering results on Google Earth and Microsoft's Virtual Earth.
Compared message passing overhead of Microsoft's CCR with various MPI parallel runtimes, for instance, MPICH2, mpiJava, and MPJ Express (MPJE), upon multicore systems based on simple message passing patterns.
Devised an algorithm to discover TFBS in Human Genome by a discriminative pattern analysis. The given profile model is improved by a mixture model approach which combines the original profile model with the complimentary model which is a k-order Markov model with respect to the predefined alignment result in Genome Browser of UCSC. [dPattern web server]
Investigated an adaptive method for mutation rates of Hybrid Genetic Algorithm (GA). Since hybrid GA combines normal GA procedure with strong local optimum algorithm, the level of perturbation decreases considerably. Thus, we tried to increase mutation rate over the generations to maintain the perturbation rate, in contrast to the conventions, and the experimental results supported the proposed strategy.
Transformed eXtensible Markup Language (XML) type of GIS data into Scalable Vector Graphics (SVG) using eXtensible Stylesheet Language Transformation (XSLT).