readme.txt

Created on: Dec 9, 2010
Author: data

===== COMPILING =====

$> cd bin
$> make clean
$> make
$> cd ..

===== DATASET FORMAT =====

Sample data and resource files can be found in the _sample_data folder.

Each row of the model.input file looks like:

#num_of_terms term1_id:term1_count .... termN_id:termN_count

Note - A model.input file in which every termM_count is 1 indicates that the terms of each document are listed in sequential order.

===== TRAINING =====

$> bin/VanillaLDA _config_files/sample-train-config.txt train

The fields in sample-train-config.txt are self-explanatory and are reproduced here:

file sample-train-config.txt:
#+-------------------------------------------------------------------------+
input file: _sample_data/YA09/index/model.input
#data input mode: grouped - not needed
#TODO: support for space saving sequential format is not available yet
input dict file: _sample_data/YA09/index/term_to_id.txt
num topics: 10

# Number of iterations
# --------------------
#
em iter: 20
em var iter: 50

# the model root directory where the training model will be stored
# ----------------------------------------------------------------
#
model root dir: _sample_data/YA09/lda_store/train/10

# the symmetric dirichlet prior for topic proportions
# ---------------------------------------------------
#
topic proportion prior : 0.01
#+-------------------------------------------------------------------------+

===== INFERENCE =====

$> bin/VanillaLDA _config_files/sample-test-config.txt test

The fields in sample-test-config.txt are self-explanatory and are reproduced here:

file sample-test-config.txt:
#+-------------------------------------------------------------------------+
input file: _sample_data/YA09/index/model.input
#data input mode: grouped
input dict file: _sample_data/YA09/index/term_to_id.txt
#num topics: 10

# Number of iterations
# --------------------
#
em iter: 50
em var iter: 30

# the model root directory where the training model will be stored
# ----------------------------------------------------------------
#
model root dir: _sample_data/YA09/lda_store/test/10
model train root dir: _sample_data/YA09/lda_store/train/10
#+-------------------------------------------------------------------------+

===== CODE DESCRIPTION =====

1. The CVanillaLDAEMFunctionoid class, implemented in EMAlgorithms/CVanillaLDAEMFunctionoid.cpp, creates the VanillaLDA model, which is implemented in GraphicalModels/CVanillaLDA.cpp.

2.a. The VanillaLDA model does the main job of reading the model input file and of creating and initializing the model matrices.

2.b. The VanillaLDA model also calculates the log likelihood corresponding to the model.
     The log likelihood is calculated using the method
     double CVanillaLDA::compute_doc_log_likelihood(size_t doc_id, SVanillaLDADocument* doc) {}

3. The CVanillaLDAEMFunctionoid class has three major methods:

3.a. void CVanillaLDAEMFunctionoid::operator ()() {}
     drives the EM algorithm for VanillaLDA.

3.b. void CVanillaLDAEMFunctionoid::expectation_step() {}
     is the driver for the E-step.
     The sufficient statistic matrices are updated here on a per-document basis, as in VB for LDA.

3.b.i. double CVanillaLDAEMFunctionoid::doc_expectation_step(size_t doc_id, SVanillaLDADocument* doc) {}
       The optimal values of the variational matrices for the current iteration are calculated in this step.
       Called by CVanillaLDAEMFunctionoid::expectation_step() for every document.

3.c. void CVanillaLDAEMFunctionoid::maximization_step() {}
     finds the optimal values of the model parameters for the current iteration.

4. src/VanillaLDA.cpp is the driver file.
   VanillaLDA.cpp invokes CVanillaLDATrainCPPUnit or CVanillaLDATestCPPUnit depending on the model operation mode.

+-----------------------------------------------------------+
For training purposes, the following statements are executed:

/**
 * @brief the model_description_file and the model_compiled_file are not used in this project
 */
CVanillaLDATrainCPPUnit::CVanillaLDATrainCPPUnit(string model_description_file, string model_compiled_file, string config_file) {

    CVanillaLDAEMFunctionoid* p_em_model = new CVanillaLDAEMFunctionoid();
    CVanillaLDAGeneralModel* VanillaLDA_model = p_em_model->get_concrete_model();
    VanillaLDA_model->set_bytecode("VanillaLDA");
    string model_operation_mode = "train";
    VanillaLDA_model->set_model_operation_mode(model_operation_mode); // MUST SET THIS ---- IMPORTANT!!!
    VanillaLDA_model->set_model_dump_mode("text"); // default is text file dumping; binary dumping is an easy extension

    // model initialization
    VanillaLDA_model->read_config_file(config_file);
    VanillaLDA_model->read_data_from_file();

    // use the functionoid for EM
    (*p_em_model)();

    delete p_em_model;
}
+-----------------------------------------------------------+
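
===== EXAMPLE: READING A model.input ROW =====

The row layout given under DATASET FORMAT can be illustrated with a small stand-alone parser. This is only a sketch and is not taken from the VanillaLDA sources: the function name parse_model_input_row is hypothetical, and it assumes that the first field is the integer number of id:count entries that follow (that is, that "#" stands for "number of"). In case the "#" is meant literally, a leading "#" on the first field is tolerated.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the VanillaLDA sources.
// Parses one row of the form "N id1:count1 ... idN:countN"
// into a list of (term_id, term_count) pairs.
std::vector<std::pair<std::size_t, std::size_t>>
parse_model_input_row(const std::string& row) {
    std::istringstream in(row);
    std::string token;
    if (!(in >> token)) {
        throw std::invalid_argument("empty row");
    }
    // Assumption: accept an optional literal '#' before the term count.
    if (token[0] == '#') {
        token.erase(0, 1);
    }
    const std::size_t expected = std::stoul(token);

    std::vector<std::pair<std::size_t, std::size_t>> terms;
    while (in >> token) {
        const std::string::size_type colon = token.find(':');
        if (colon == std::string::npos) {
            throw std::invalid_argument("expected id:count, got " + token);
        }
        terms.emplace_back(std::stoul(token.substr(0, colon)),
                           std::stoul(token.substr(colon + 1)));
    }
    // The leading field must agree with the number of entries actually read.
    if (terms.size() != expected) {
        throw std::invalid_argument("term count does not match the leading field");
    }
    return terms;
}
```

Calling parse_model_input_row("3 0:2 5:1 9:4") yields three pairs, the second of which is term id 5 with count 1. A row whose leading field disagrees with the number of entries raises std::invalid_argument, so a malformed row is reported instead of being read silently.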