Chapter 1 – Introduction: This chapter introduces the book and its intended audience. A very brief introduction to MATLAB® and the Text Analytics Toolbox™ is also provided.
Part I: Fundamentals
Chapter 2 – Handling Text Data: This chapter focuses on the different types of variables that can be used for manipulating text. It also introduces some basic MATLAB® and Text Analytics Toolbox™ built-in functions for operating on strings.
Chapter 3 – Regular Expressions: This chapter is devoted entirely to regular expressions and the specifics of their use in the MATLAB® programming environment.
Chapter 4 – Basic Operations with Strings: This chapter focuses on basic operations with strings, including search, replacement, segmentation, concatenation, and some basic set operations that can be applied to string and character sets.
Chapter 5 – Reading and Writing Files: This chapter deals with reading and writing text files and describes some commonly used file formats. It also introduces some basic functions for operating with directories and files.
Chapter 6 – The Structure of Language: This chapter introduces the structure of language. It also introduces some functionalities of the Text Analytics Toolbox™ that are useful for processing and analyzing text data at different levels of language structure.
Part II: Mathematical Models
Chapter 7 – Basic Corpus Statistics: This chapter introduces the main concepts related to corpus statistics. First, it presents some fundamental properties of language, such as Zipf’s law of frequencies and the burstiness phenomenon. Then, it introduces the notion of word co-occurrences and the incidence of word order information.
Chapter 8 – Statistical Models: This chapter is devoted to the statistical modeling approach. It introduces the basic n-gram model and the fundamental concepts of discounting and model interpolation. Statistical bag-of-words and topic models are also presented and discussed.
Chapter 9 – Geometrical Models: This chapter focuses on the geometrical modeling approach. It starts by introducing the concept of the term-document matrix and the vector space model. Then, the geometrical modeling notions of distance, similarity, and association scores are introduced.
Chapter 10 – Dimensionality Reduction: This chapter is devoted to the specific problem of dimensionality reduction. It presents the ideas of vocabulary pruning and merging, as well as some fundamental linear and non-linear projection methods. Additionally, it introduces the notion of embedded representations.
Part III: Methods and Applications
Chapter 11 – Document Categorization: This chapter focuses on the problem of document categorization. It presents basic techniques for unsupervised clustering and supervised classification. The case of supervised classification is addressed from both the vector space and statistical modeling approaches. The chapter also illustrates basic methods for extracting terminology that is relevant to a given document category.
Chapter 12 – Document Search: This chapter focuses on the problem of document search. More specifically, the binary search and the vector-based search approaches are described and illustrated. This chapter also introduces the basic metrics of precision and recall, as well as some other fundamental concepts of information retrieval, such as query expansion and relevance ranking. Finally, the problem of cross-language search is introduced.
Chapter 13 – Content Analysis: This chapter deals with the problem of content analysis. Although this is indeed a very broad concept, this chapter focuses on two specific types of content analysis: polarity estimation and property extraction. In the first case, the problems of detecting polarity and estimating its intensity within the context of opinionated texts are presented and discussed. In the second case, the problem of extracting properties and other specific informational elements by means of text-pattern matching is introduced and illustrated.
Chapter 14 – Keyword Extraction and Summarization: This chapter focuses on the problems of keyword extraction and text summarization. With regard to keyword extraction, the concept of word centrality is presented and two standard approaches to keyword extraction are introduced. In the case of text summarization, the extractive summarization approach is introduced and discussed.
Chapter 15 – Question Answering and Dialogue: This chapter focuses on the topics of question answering and dialogue systems. In the first case, the generic problem of question answering is introduced along with the specific problems of question understanding and intent detection. With regard to dialogue, the basic operation of a dialogue manager is described, briefly introducing the specific problems of dialogue state tracking and response selection.