Pradipto Das - SUNY Buffalo, CSE Dept. Research Software

Links to Software

Some Java code related to recent papers

> Our latest CVPR paper mentions computing inversions. Here is the Java code that simultaneously performs a stable sort and counts inversions in N log(N) compares. The inversion count is the number of adjacent exchanges required to restore a permutation of N natural numbers to its sorted order.
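As an illustration of the idea (not the linked code itself), a minimal merge sort sketch can count inversions while it sorts. The class and method names here are hypothetical.

```java
final class InversionCounter {
    /** Stably sorts a in place and returns the number of inversions it contained. */
    static long sortAndCount(int[] a) {
        int[] aux = new int[a.length];
        return sortAndCount(a, aux, 0, a.length - 1);
    }

    private static long sortAndCount(int[] a, int[] aux, int lo, int hi) {
        if (lo >= hi) return 0; // zero or one element: nothing to count
        int mid = lo + (hi - lo) / 2;
        long count = sortAndCount(a, aux, lo, mid) + sortAndCount(a, aux, mid + 1, hi);
        System.arraycopy(a, lo, aux, lo, hi - lo + 1);
        int i = lo, j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if (i > mid) {
                a[k] = aux[j++];
            } else if (j > hi) {
                a[k] = aux[i++];
            } else if (aux[j] < aux[i]) {
                // aux[j] jumps ahead of every element still waiting in the left half.
                a[k] = aux[j++];
                count += mid - i + 1;
            } else {
                // Ties take the left element first, which keeps the sort stable.
                a[k] = aux[i++];
            }
        }
        return count;
    }
}
```

For example, calling `InversionCounter.sortAndCount` on `{2, 4, 1, 3, 5}` sorts the array and reports the pairs (2,1), (4,1) and (4,3).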

> The Java code to extract color histogram features from an arbitrary video can be found here. The command to execute the code is
java -cp VideoColorHistogram.jar edu.buffalo.cse.VideoProcessing.ColorHistogramFromVideo data/videos data/features/ColorHist 8
The code calls the ffmpeg utility in Linux to extract frames. If the number of bins per color channel is 8, then the total number of bins for the histogram is 8x8x8 = 512. The sample data directory consisting of just one example test video is here.
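The bin arithmetic can be sketched as follows. This is only one plausible binning, assuming 8-bit channels, packed 0xRRGGBB pixels, and a histogram normalized to sum to 1; the class name, bin ordering, and normalization are assumptions and may differ from what the jar does.

```java
final class ColorHistogramSketch {
    /** Maps 8-bit r, g, b values to a bin index in [0, binsPerChannel^3). */
    static int binIndex(int r, int g, int b, int binsPerChannel) {
        int rb = r * binsPerChannel / 256;
        int gb = g * binsPerChannel / 256;
        int bb = b * binsPerChannel / 256;
        // Assumed ordering: red is the slowest-varying axis, blue the fastest.
        return (rb * binsPerChannel + gb) * binsPerChannel + bb;
    }

    /** Builds a normalized histogram over packed 0xRRGGBB pixels. */
    static double[] histogram(int[] rgbPixels, int binsPerChannel) {
        double[] h = new double[binsPerChannel * binsPerChannel * binsPerChannel];
        for (int p : rgbPixels) {
            int r = (p >> 16) & 0xFF;
            int g = (p >> 8) & 0xFF;
            int b = p & 0xFF;
            h[binIndex(r, g, b, binsPerChannel)] += 1.0;
        }
        if (rgbPixels.length > 0) {
            for (int i = 0; i < h.length; i++) h[i] /= rgbPixels.length;
        }
        return h;
    }
}
```

With 8 bins per channel the returned array has 512 entries, matching the count given above.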

> Java code to perform multithreaded N-Way merge. The easiest way to time the code on arrays with random doubles in [0,10) is
time java -cp NWayMerge.jar NWayMerge -t 100 -l 1000000
A better way is to use the code as needed. The test client can be found in the NWayMerge.java file. The general usage is:
java -cp NWayMerge.jar NWayMerge -t <numThreads> -l <numItemsToSort> -i <inputFile> -o <outputFile> [--verbose]
Type java -cp NWayMerge.jar NWayMerge -h for usage.
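For readers new to the merge step, here is a minimal single-threaded sketch of an N-way merge over already sorted runs using a heap. It is not the code in the jar: the class name is hypothetical, and the threading that sorts the individual runs is left out.

```java
import java.util.PriorityQueue;

final class KWayMergeSketch {
    /** Merges already sorted runs into one sorted array. */
    static double[] merge(double[][] runs) {
        int total = 0;
        for (double[] run : runs) total += run.length;

        // Each heap entry is {run index, position within that run},
        // ordered by the value it currently points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (x, y) -> Double.compare(runs[x[0]][x[1]], runs[y[0]][y[1]]));
        for (int r = 0; r < runs.length; r++) {
            if (runs[r].length > 0) heap.add(new int[]{r, 0});
        }

        double[] out = new double[total];
        int k = 0;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out[k++] = runs[top[0]][top[1]];
            // Advance within the same run, if it has more items.
            if (top[1] + 1 < runs[top[0]].length) {
                heap.add(new int[]{top[0], top[1] + 1});
            }
        }
        return out;
    }
}
```

With N runs and M items in total, the heap never holds more than N entries, so the merge costs on the order of M log(N) compares.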

The AToM Framework

I worked on AToM (Another Topic Model), a C++ framework for statistical topic modeling code. [ Gibbs-LDA download]

Note: This version implements only a basic Gibbs sampler type LDA (Latent Dirichlet Allocation). The software here is intended as quick and dirty prototypes for beginners. Some additional comments:

The code uses the IP (Imputation-Posterior) iterative framework for sampling.
The goal was to have an interpreted counterpart to Hal Daume's HBC, much as Java relates to C++. HBC is really good, but the generated code is often very hard to follow for the uninitiated.
I am not sure I will get to completing a full topic model interpreter any time soon. Feel free to try it out and extend it any way you like.
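For beginners who want to see the shape of a Gibbs sampler for LDA before opening the download, here is a minimal sketch in Java. It is a standard collapsed Gibbs sampler with symmetric priors, which is an assumption on my part and not necessarily the IP scheme used in AToM; all names are hypothetical.

```java
import java.util.Random;

final class GibbsLdaSketch {
    final int[][] docs;      // docs[d][i] = word id of the i-th token in document d
    final int[][] z;         // z[d][i] = topic currently assigned to that token
    final int[][] docTopic;  // docTopic[d][k] = tokens in document d assigned to topic k
    final int[][] topicWord; // topicWord[k][w] = times word w is assigned to topic k
    final int[] topicTotal;  // topicTotal[k] = tokens assigned to topic k
    final int numTopics, vocabSize;
    final double alpha, beta; // symmetric Dirichlet hyperparameters (assumed)
    final Random rng;

    GibbsLdaSketch(int[][] docs, int numTopics, int vocabSize,
                   double alpha, double beta, long seed) {
        this.docs = docs;
        this.numTopics = numTopics;
        this.vocabSize = vocabSize;
        this.alpha = alpha;
        this.beta = beta;
        this.rng = new Random(seed);
        z = new int[docs.length][];
        docTopic = new int[docs.length][numTopics];
        topicWord = new int[numTopics][vocabSize];
        topicTotal = new int[numTopics];
        // Start from a uniformly random topic assignment.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = rng.nextInt(numTopics);
                update(d, docs[d][i], z[d][i], +1);
            }
        }
    }

    private void update(int d, int w, int k, int delta) {
        docTopic[d][k] += delta;
        topicWord[k][w] += delta;
        topicTotal[k] += delta;
    }

    /** One full pass that resamples the topic of every token. */
    void sweep() {
        double[] p = new double[numTopics];
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                int w = docs[d][i];
                update(d, w, z[d][i], -1); // remove the token from the counts
                double sum = 0;
                for (int k = 0; k < numTopics; k++) {
                    p[k] = (docTopic[d][k] + alpha) * (topicWord[k][w] + beta)
                            / (topicTotal[k] + vocabSize * beta);
                    sum += p[k];
                }
                // Draw a topic with probability proportional to p[k].
                double u = rng.nextDouble() * sum;
                int k = 0;
                double acc = p[0];
                while (u >= acc && k < numTopics - 1) {
                    k++;
                    acc += p[k];
                }
                z[d][i] = k;
                update(d, w, k, +1); // add the token back under its new topic
            }
        }
    }
}
```

After enough calls to `sweep()`, the rows of `docTopic` and `topicWord`, smoothed by `alpha` and `beta`, give rough estimates of the per-document topic mixtures and the per-topic word distributions.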

Prototypes - require polishing

Code from an Ancient Time

Discovering Voter Preferences in Blogs using Mixtures of Topic Models - Pradipto Das, Rohini Srihari and Smruthi Mukund, AND'09, July 23-24, 2009, Barcelona, Spain [Noisy Text Data Analytics Workshop under ICDAR]

Indexed Obama speeches (text) as used by the model
Indexed McCain speeches (text) as used by the model
Indexed Blog responses to the speeches as used by the model
David Blei's Correlated Topic Model code in C (ctmC), only slightly modified to output the per-document variational word-topic distribution matrices
The code (in C++) as used for the paper. Requires the outputs from ctmC as well as all index files. The executable is in the Debug directory. Note that this project can be imported into the Eclipse CDT IDE; the makefiles are those generated by Eclipse. You may need to change subdir.mk and the Makefile for the gsl and atlas library paths.
Readme.txt

TA Corner

TA1. Ad hoc Data Structures

[ This Eclipse CDT project] serves as a repository for standard algorithms that are not found in the C++ STL (except heaps). Source code has been borrowed from several sources. Written entirely in C++, the package currently includes implementations of Multi-Way-Merge (K-Way-Merge), a B+ Tree, a minimal Trie, heaps using vectors, and a minimal on-disk binary search (often used for inverted index searching based on query words). Wish list: include code from Google's sparse hash, SGI STL hash_map, and TPIE. See the readme file inside the tarball for compilation instructions. The SGI STL hash_map can be accessed through the standard C++ library on most *nix systems and can be used in code as:

#include<ext/hash_map>
using namespace __gnu_cxx;

TA2. CSE 4/535 IR course - Fall 2010

[ Warmup code] This is an ad hoc implementation for the first project. The code is expected to extract internal wiki links from wiki markup. For more information on what the markup files look like, see the files in the "Wiki" subdirectory of the data directory.
(+) First project.
(+) Second project.
(+) Third project.
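The warmup task of pulling internal links out of wiki markup can be sketched with a regular expression. This is a rough illustration in Java, not the warmup code: the class name is hypothetical, nested links inside labels are not handled, and skipping namespaced targets such as File: or Category: is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WikiLinkSketch {
    // Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
    // group 1 captures the target page title only.
    private static final Pattern LINK = Pattern.compile(
            "\\[\\[([^\\[\\]|#]+)(?:#[^\\[\\]|]*)?(?:\\|[^\\[\\]]*)?\\]\\]");

    /** Returns the target page titles of internal links, in order of appearance. */
    static List<String> internalLinks(String markup) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(markup);
        while (m.find()) {
            String target = m.group(1).trim();
            // Assumption: a colon marks a namespace (File:, Category:), not an article.
            if (!target.contains(":")) out.add(target);
        }
        return out;
    }
}
```

For instance, `[[Buffalo, New York|Buffalo]]` yields the title "Buffalo, New York", while `[[File:Logo.png|thumb]]` is skipped.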

TA3. CSE 4/535 IR course - Fall 2009

[ Warmup notes on C++ STL] The content in this PDF was geared towards use in the first project. The focus of the project was document language identification using character bigrams. The STL bitset was used for Unicode extraction. The Unicode encoding format is well described in the UTF-8 article on Wikipedia.
(*) Find the UTF-8 code chart here
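Character bigram counting, the feature behind the language identification project, can be sketched as below. The project itself used C++ and decoded UTF-8 by hand with bitset; this Java sketch assumes the text has already been decoded into a String, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class CharBigramSketch {
    /** Counts adjacent pairs of Unicode code points in the text. */
    static Map<String, Integer> bigramCounts(String text) {
        // Work on code points so characters outside the BMP count as one character.
        int[] cps = text.codePoints().toArray();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            String bigram = new String(cps, i, 2);
            counts.merge(bigram, 1, Integer::sum);
        }
        return counts;
    }
}
```

Comparing the bigram counts of a document against per-language reference profiles is one common way to guess its language; how the profiles are built and compared is left open here.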

Please report bugs to me. My email is at the bottom right corner of this page; replace "university domain" with "buffalo . edu". Thanks!

email: pdas3 at [university domain]