Latent Semantic Analysis (LSA)

Use Latent Semantic Analysis (LSA) to discover hidden semantics of words in a corpus of documents. Available in Excel using the XLSTAT software.

What is Latent Semantic Analysis?

Latent Semantic Analysis (LSA) lets you discover the hidden, underlying (latent) semantics of words in a corpus of documents by constructing concepts (or topics) related to documents and terms. LSA takes as input a document-term matrix that describes the occurrences of terms in documents. It is a sparse matrix whose rows correspond to documents and whose columns correspond to terms.
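To make the input format concrete, here is a minimal sketch of how a document-term matrix can be built from raw text. The toy corpus, the whitespace tokenization and the variable names are assumptions made for illustration; they do not describe how XLSTAT prepares its data.

```python
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only).
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Tokenize by whitespace; real pipelines usually also remove stop words and stem.
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary: one column per distinct term, in sorted order.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Document-term matrix: one row per document, one column per term.
# Each cell holds the number of times the term occurs in the document.
doc_term_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    doc_term_matrix.append([counts[term] for term in vocabulary])

for doc, row in zip(documents, doc_term_matrix):
    print(row, "<-", doc)
```

Most cells are zero because each document uses only a small part of the vocabulary, which is why the matrix is described as sparse.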

There are several applications for LSA, including:

  • Compare documents in the low-dimensional space (data clustering, document classification).

  • Find similar documents across languages, after analyzing a base set of translated documents (cross language retrieval).

  • Find relations between terms (synonymy and polysemy).

Latent Semantic Analysis options in XLSTAT

Several options are available in the LSA dialog box, including:

Number of topics: enter the number of topics to retain in the Latent Semantic Analysis.

Document clustering: enable this option if you want to create document classes in the created semantic space. These classes can be displayed via the Color by class option located below the Document-document correlation matrix check box in the charts tab.

Term clustering: enable this option if you want to create term classes in the created semantic space. These classes can be displayed via the Color by class option located below the Term-term correlation matrix check box in the charts tab.

Type of clustering: select one of the two options below to set the type of clustering used by the two clustering options above.

  • Hard: choose this option to perform a classification in the newly created semantic space in which each element (term or document) can belong to only one topic at a time to represent a class (hard clustering).

  • Fuzzy: choose this option to perform a classification in the newly created semantic space in which each element (term or document) can belong to several topics at once to represent a class (soft clustering).
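The difference between the two clustering types can be illustrated with a small sketch. The weights, the "largest absolute weight" rule and the normalization used for memberships are all assumptions chosen to show the idea; the actual clustering algorithm used by XLSTAT is not described here.

```python
# Hypothetical weights of four terms on three topics (illustrative values only).
topic_weights = {
    "cat":   [0.90, 0.10, 0.00],
    "dog":   [0.70, 0.30, 0.00],
    "stock": [0.05, 0.15, 0.80],
    "bank":  [0.10, 0.45, 0.45],
}

def hard_assignment(weights):
    """Hard clustering: return the index of the single strongest topic."""
    return max(range(len(weights)), key=lambda k: abs(weights[k]))

def fuzzy_membership(weights):
    """Soft clustering: return membership degrees over all topics, summing to 1.

    Assumes at least one weight is nonzero.
    """
    total = sum(abs(w) for w in weights)
    return [abs(w) / total for w in weights]

hard = {term: hard_assignment(w) for term, w in topic_weights.items()}
fuzzy = {term: fuzzy_membership(w) for term, w in topic_weights.items()}

for term in topic_weights:
    print(term, "hard:", hard[term], "fuzzy:", [round(m, 2) for m in fuzzy[term]])
```

In this sketch an ambiguous term such as "bank" is forced into a single class by the hard rule, while the fuzzy memberships keep its link to two topics visible.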

Latent Semantic Analysis results in XLSTAT

Summary table: for each topic, the summary table shows the total number of documents and terms that compose it. These can then be displayed in the charts associated with the correlation matrices and in the topics table.

The eigenvalues and the corresponding scree plot are also displayed. The cumulative variance indicates the relevance of the calculated topics: the higher it is, the better the approximation produced by the truncated SVD.
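The link between cumulative variance and the quality of the truncated SVD can be checked numerically. The sketch below uses NumPy on a made-up matrix; treating the eigenvalues as squared singular values of the uncentered matrix is an assumption, and the exact quantities reported by XLSTAT may be computed differently.

```python
import numpy as np

# Hypothetical 4-document x 5-term count matrix (rows = documents, columns = terms).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

n_topics = 2  # plays the role of the "Number of topics" option

# SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Assumption: eigenvalues are taken as the squared singular values.
eigenvalues = s ** 2
cumulative_variance = np.cumsum(eigenvalues) / eigenvalues.sum()

# Truncated SVD: keep only the first n_topics components.
X_approx = U[:, :n_topics] @ np.diag(s[:n_topics]) @ Vt[:n_topics, :]

# Relative squared (Frobenius) reconstruction error of the truncated approximation.
relative_error = np.linalg.norm(X - X_approx) ** 2 / np.linalg.norm(X) ** 2

print("cumulative variance:", np.round(cumulative_variance, 3))
print("relative error with", n_topics, "topics:", round(float(relative_error), 3))
```

Under these assumptions the relative error equals one minus the cumulative variance at the chosen number of topics, so a higher cumulative variance means a closer approximation.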

Topics table: this table lists the terms of each topic, from left to right, in descending order of their relationship with that topic.

Nearest neighbor terms: this table displays the n nearest neighbor terms of the term selected in the drop-down list, in descending order of similarity.
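A nearest-neighbor lookup of this kind can be sketched as follows. The term coordinates are invented, and using cosine similarity as the ranking measure is an assumption based on the similarity measure mentioned for the correlation matrices.

```python
import math

# Hypothetical term coordinates in a 2-topic semantic space.
term_vectors = {
    "cat":    [0.9, 0.1],
    "kitten": [0.8, 0.2],
    "dog":    [0.7, 0.4],
    "stock":  [0.1, 0.9],
    "bond":   [0.2, 0.8],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(term, vectors, n):
    """Return the n terms most similar to `term`, in descending order of similarity."""
    scores = [
        (other, cosine_similarity(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

neighbors = nearest_neighbors("cat", term_vectors, n=2)
for other, score in neighbors:
    print(other, round(score, 3))
```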

Correlation matrices: the correlation charts (term-term, document-document) visualize the degree of similarity (cosine similarity) between terms (Term-term correlation matrix) or between documents (Document-document correlation matrix) in their respective spaces. The similarities range from 0 to 1, where 1 corresponds to perfect similarity regardless of direction (positive or negative).
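One way to obtain similarities on a 0 to 1 scale that treats both directions alike is to take the absolute value of the cosine similarity. The sketch below follows that reading; it is an assumption about how the scale arises, and the document coordinates are invented for illustration.

```python
import numpy as np

# Hypothetical coordinates of three documents in a 2-topic space.
doc_vectors = np.array([
    [1.0, 0.0],
    [-2.0, 0.0],   # opposite direction to the first document
    [0.0, 3.0],    # orthogonal to the first document
])

# Normalize each row to unit length, then take all pairwise dot products.
unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
cosine = unit @ unit.T          # values in [-1, 1]

# Absolute values fold the [-1, 1] range onto [0, 1].
similarity = np.abs(cosine)

print(np.round(similarity, 3))
```

Here the first two documents point in opposite directions, so their raw cosine is negative, yet they reach the maximum similarity of 1 once the sign is dropped; the orthogonal third document has similarity 0 with both.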