On Information Extraction

Started work on information extraction. Should be done in a couple of weeks. Finance is on the back burner for now; it has become boring.

I think that if I use numpy to store word embeddings (GloVe, etc.), I can use them much like LSA for document comparison. LSA stores a document-component matrix along with a word-component matrix; embeddings store only the word-component matrix.

I think LSA gets the components for each word and adds them to get a final vector. This vector is then compared by cosine similarity against all the document vectors (which are also sums of word components) to identify similar documents. I can replicate this using word embeddings.
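A minimal sketch of this bag-of-embeddings comparison, using hypothetical random vectors in place of real GloVe embeddings (in practice they would be loaded from a pretrained file):

```python
import numpy as np

# Toy stand-in for GloVe-style embeddings: word -> 50-d vector.
# These are random placeholders, not real pretrained vectors.
rng = np.random.default_rng(0)
words = ["cat", "dog", "pet", "stock", "market"]
embeddings = {w: rng.standard_normal(50) for w in words}

def doc_vector(tokens):
    """Bag of embeddings: sum the vectors of the in-vocabulary tokens."""
    return np.sum([embeddings[t] for t in tokens if t in embeddings], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = doc_vector(["cat", "dog", "pet"])
doc_b = doc_vector(["dog", "pet"])
doc_c = doc_vector(["stock", "market"])

# Documents that share words score higher than unrelated ones.
print(cosine(doc_a, doc_b), cosine(doc_a, doc_c))
```

The same loop works unchanged whether the vectors come from an LSA word-component matrix or from pretrained embeddings; only the lookup table differs.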

Essentially, LSA is cosine similarity computed over components (more popularly referred to as dimensions) rather than over raw words, with the added benefit of dimensionality reduction: it exploits the latent semantics of words to give better results than simple cosine similarity over words alone.

Both are essentially doing the same thing; only the method of obtaining the embedding differs. This way I won't have to train a new LSA matrix. Moreover, I think embeddings should give better results. Ultimately, both are bag-of-embeddings methods. Combining them with TF-IDF weighting should further improve the results.

With embeddings I don't need to store a document matrix; I can create document vectors on the fly. If some preprocessing is applied, then only the relevant document vectors need to be created. Perhaps even that will not be necessary.
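The on-the-fly idea might look like this: a cheap keyword pre-filter decides which documents are relevant, and vectors are built only for those (toy data and random placeholder embeddings):

```python
import numpy as np

rng = np.random.default_rng(2)
embeddings = {w: rng.standard_normal(20) for w in ["tax", "income", "cat", "dog"]}

def doc_vector(tokens):
    """Sum of in-vocabulary word vectors, computed on demand."""
    return np.sum([embeddings[t] for t in tokens if t in embeddings], axis=0)

docs = [["tax", "income"], ["cat", "dog"], ["income", "tax", "tax"]]
query_terms = {"tax"}

# Build vectors lazily, only for documents that pass the keyword pre-filter.
relevant = {i: doc_vector(d) for i, d in enumerate(docs) if query_terms & set(d)}
print(sorted(relevant))
```

Nothing is precomputed or stored; the word-embedding table is the only persistent matrix.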

The code needs to be worked out.

Domain-specific training should result in better embeddings. Embedding training is essentially unsupervised, so there are no labeling hassles.
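To make the "unsupervised" point concrete, here is a deliberately tiny skip-gram training loop with one negative sample per pair, on a two-sentence hypothetical corpus. A real run would use a library such as gensim's Word2Vec on actual domain text; this is only a sketch of the mechanism:

```python
import numpy as np

# Hypothetical toy corpus; no labels anywhere, only raw sentences.
corpus = [["price", "rose", "sharply"], ["price", "fell", "sharply"]]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}

dim, lr = 10, 0.05
rng = np.random.default_rng(3)
W_in = rng.standard_normal((len(vocab), dim)) * 0.1   # target-word vectors
W_out = rng.standard_normal((len(vocab), dim)) * 0.1  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Skip-gram with window size 1 and one random negative sample per pair.
for _ in range(200):
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in (i - 1, i + 1):
                if 0 <= j < len(sent):
                    t, c = idx[w], idx[sent[j]]
                    n = rng.integers(len(vocab))
                    for ctx, label in ((c, 1.0), (n, 0.0)):
                        g = lr * (sigmoid(W_in[t] @ W_out[ctx]) - label)
                        w_t = W_in[t].copy()
                        W_in[t] -= g * W_out[ctx]
                        W_out[ctx] -= g * w_t

# Words sharing contexts ("rose"/"fell") should drift toward similar vectors.
print(W_in.shape)
```

The only input is unlabeled text, which is why switching to a domain corpus is cheap.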

I remain skeptical of the efficacy of embeddings. I think they are useful but need to be complemented. For information extraction, I suppose they are fine. But, as usual, I will add my special ingredient: LDA with word embeddings should be quite helpful.

ELMo embeddings are the next step.

A typical sentence embedding generated from word embeddings has 300 dimensions, whereas one from a dedicated sentence encoder is typically 4800 dimensions. Doesn't that seem a bit incongruous?

I guess that, using LDA, I will do the training at the document level and extract information at the sentence, paragraph, and document levels. This should result in good keyword and theme extraction with minimal supervision. The problem of unseen words can be handled if I use embeddings alongside LDA.
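One way embeddings could patch LDA's unseen-word problem: before inference, map any token outside the trained LDA vocabulary to its nearest in-vocabulary word by embedding cosine similarity. All names and vectors below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)
lda_vocab = ["tax", "income", "market", "stock"]
# The embedding table covers more words than the trained LDA model does
# (random placeholder vectors standing in for real pretrained ones).
embeddings = {w: rng.standard_normal(20) for w in lda_vocab + ["equity", "levy"]}

def map_to_lda_vocab(token):
    """Map an out-of-vocabulary token to its nearest LDA word by cosine."""
    if token in lda_vocab:
        return token
    if token not in embeddings:
        return None  # nothing we can do without a vector
    v = embeddings[token]
    sims = {w: float(v @ embeddings[w]
                     / (np.linalg.norm(v) * np.linalg.norm(embeddings[w])))
            for w in lda_vocab}
    return max(sims, key=sims.get)

tokens = ["tax", "equity", "levy"]
mapped = [map_to_lda_vocab(t) for t in tokens]
print(mapped)
```

After this substitution, every token the LDA model sees is one it was trained on, so inference at sentence, paragraph, or document level proceeds normally.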
