Noun Phrases in Documents: Preprocessing, Automatic Extraction, and Statistical Analysis in Different Categories of Text

Youngin Kim. Noun Phrases in Documents: Preprocessing, Automatic Extraction, and Statistical Analysis in Different Categories of Text. Ph.D. dissertation. Advisor: Michael Cooper. University of California, Berkeley. 2002.


The primary objective of this study is to analyze noun phrase patterns in full-text documents. Knowledge of noun phrase patterns could facilitate the automatic indexing of electronic documents stored in an information retrieval system.

This dissertation examines several questions concerning the identification of noun phrase patterns with the aid of natural language processing techniques. First, how does the preprocessing stage (preparing raw text for automated linguistic analysis) affect noun phrase extraction? Second, are some natural language processing techniques more effective than others in extracting noun phrases? And finally, do properties of text such as subject domain (e.g., humanities, social sciences, engineering), genre (academic research articles vs. newspaper articles), and research method (quantitative vs. qualitative) affect noun phrase patterns?

To investigate these questions, a data set consisting of 1,099 full-text documents (450 academic research articles and 649 newspaper articles) was developed for this study. An examination of the raw data set revealed significant inconsistencies in document format and content in different subject domains. Besides presenting practical problems to overcome, these inconsistencies provide additional evidence for the existence of domain-specific textual characteristics and the need for domain-specific methods of term identification and extraction.

A comparative evaluation of three different automatic language analysis tools (a probabilistic parser, a rule-based tagger, and a statistical tagger) showed that all had comparable effectiveness rates (97-99%). The statistical tagger was chosen for noun phrase identification in the remainder of this study because of its combination of effectiveness and efficiency.
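As an illustration of how noun phrases can be identified once a tagger has assigned part-of-speech labels, the sketch below groups tagged tokens into simple noun phrases. It is not the procedure used in the dissertation: the Penn Treebank style tag names, the pattern (optional determiner, any adjectives, one or more nouns), and the function name extract_noun_phrases are all assumptions made for this example, and the tagged sentence is written by hand rather than produced by any of the tools evaluated.

```python
from typing import List, Tuple

# A token paired with its part-of-speech tag, e.g. ("tagger", "NN").
TaggedToken = Tuple[str, str]

# Assumed Penn Treebank style tag sets (not taken from the dissertation).
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}
DET_TAGS = {"DT"}


def extract_noun_phrases(tagged: List[TaggedToken]) -> List[str]:
    """Return phrases matching: optional determiner, adjectives, 1+ nouns."""
    phrases: List[str] = []
    i = 0
    n = len(tagged)
    while i < n:
        start = i
        j = i
        # Optional single determiner.
        if j < n and tagged[j][1] in DET_TAGS:
            j += 1
        # Zero or more adjectives.
        while j < n and tagged[j][1] in ADJ_TAGS:
            j += 1
        # One or more nouns are required for a match.
        noun_start = j
        while j < n and tagged[j][1] in NOUN_TAGS:
            j += 1
        if j > noun_start:
            phrases.append(" ".join(word for word, _ in tagged[start:j]))
            i = j
        else:
            # No noun found here; move on by one token and try again.
            i = start + 1
    return phrases


# Hand-tagged example sentence (hypothetical input, not study data).
sentence = [
    ("The", "DT"), ("statistical", "JJ"), ("tagger", "NN"),
    ("identifies", "VBZ"), ("noun", "NN"), ("phrases", "NNS"),
    ("in", "IN"), ("academic", "JJ"), ("research", "NN"),
    ("articles", "NNS"), (".", "."),
]
print(extract_noun_phrases(sentence))
```

A pattern this simple misses many constructions (prepositional attachments, conjunctions, possessives), so it should be read only as a picture of the tag-then-chunk idea, not as a measure of what any of the three tools can do.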

A statistical analysis of the document data set found that both subject domain and research method influence noun phrase patterns in academic documents. The statistical frequency of noun phrases and the proportion of different types of noun phrases differ from one domain to the next.

The findings confirmed the significance of domain-specific characteristics in text. They suggest that different document types, with different textual characteristics, may require different methods of content representation in information retrieval system design to optimize performance.

Last updated: September 20, 2016