 

Newsletter: October 2003 / Issue 5
Concept-Based Document Analysis: A Human-Centered Approach to Information Discovery

by Marina Skumanich, PNWCGS Staff

We live in a time of information overload, in which the amount of information on any given subject seems almost limitless and is often beyond easy assimilation. Intelligence and security professionals are keenly aware of this situation, since information analysis (the analysis of published materials, written reports, and even open source news stories) plays such a central role in their work on proliferation and other security concerns. Fortunately, for those struggling to manage this information, more help is on the way. Researchers Alan Chappell, Judi Thomson, and Alan Willse of the Pacific Northwest National Laboratory (PNNL) are working on a lab-funded research and development project to build a “Concept-Based Document Analysis Program.”

One of the biggest issues for analysts seeking to identify relevant trends and events in a set of documents is how to sort and group the documents by appropriate topics. The most straightforward, but tedious, approach is to sort the documents “by hand,” that is, to use the analyst’s own knowledge to determine appropriate groupings or clusters of documents. While this approach has the advantage of drawing on the analyst’s internal “expert system,” it clearly becomes harder and harder to carry out as the quantity of data grows.

The obvious limitations of the manual approach have led to the development of a series of computer-based systems such as PNNL’s SPIRE and Starlight. These commercialized systems use automated rule-based document association, with visual outputs. In this approach, certain rules of association (e.g., similarities of words) are applied to a set of documents. As documents show greater or lesser association, they are automatically “clustered” together and displayed on a visual graph, so that the user can see various themes “emerge” from the documents. In the case of SPIRE, document themes appear as mountains, with greater concentrations of documents in a theme leading to higher peaks. In the case of Starlight, document themes appear as clusters of stars in a constellation. Both programs have proven very helpful in analyzing large numbers of documents.
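To make the idea of a “rule of association” concrete, here is a minimal sketch in Python. It is not the method used by SPIRE or Starlight, whose internals this article does not describe. It assumes one simple rule, the share of words two documents have in common, and groups documents that meet a similarity threshold. The function names, the sample texts, and the threshold value are all illustrative assumptions.

```python
import re


def word_set(text):
    """Return the set of lowercased words in a text (a crude representation)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_word_overlap(documents, threshold=0.3):
    """Group documents that are similar directly or through a chain of
    similar documents. The threshold is an assumed, illustrative value."""
    words = [word_set(d) for d in documents]
    cluster_of = list(range(len(documents)))  # each document starts alone
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            if jaccard(words[i], words[j]) >= threshold:
                # Merge the cluster containing j into the cluster containing i.
                old, new = cluster_of[j], cluster_of[i]
                cluster_of = [new if c == old else c for c in cluster_of]
    clusters = {}
    for index, label in enumerate(cluster_of):
        clusters.setdefault(label, []).append(index)
    return sorted(clusters.values())


# Hypothetical sample documents, invented for illustration.
sample = [
    "reactor fuel shipment delayed",
    "fuel shipment to reactor delayed again",
    "election results announced",
    "election results announced today",
]
print(cluster_by_word_overlap(sample))
```

Note that the themes here emerge purely from the rule and the data; the analyst has no say in what the groups mean, which is the gap the approach described next tries to close.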

The innovation of the Concept-Based Document Analysis approach is to bring the analyst’s knowledge back into the process. In particular, Chappell explained, “Concept-Based Document Analysis seeks to create more of a partnership between the automated processing in which computers excel and the expert judgment of analysts regarding what is important and what particular topics are of interest.”

Concept-Based Document Analysis starts by asking the analyst: What are the topics or subjects you are concerned about? What are representative documents, or sections of a document, that would reflect these topics? Once the analyst has identified “cluster themes” and the representative documents for each, the Concept-Based Document Analysis Program can cluster the rest of the document set based on these themes. It also allows a given document to fall into multiple clusters, if more than one theme is identified in it. Finally, documents that don’t fit into any defined cluster can be re-analyzed to determine what novel concepts they might represent. This last feature gives the analyst some assistance in identifying knowledge or issues that don’t fit current concepts, as a way to keep his or her thinking fresh.
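The workflow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PNNL program, whose algorithm the article does not give. It assumes that each theme is represented by the combined word counts of its representative texts, that similarity is measured by cosine similarity, and that a fixed threshold decides membership. All names, sample texts, and the threshold are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_to_themes(themes, documents, threshold=0.2):
    """themes: {theme name: [representative texts chosen by the analyst]}.
    documents: {document id: text}.
    Returns (assignments, unassigned). A document may match several themes;
    documents matching none are returned separately for a closer look."""
    profiles = {}
    for name, examples in themes.items():
        profile = Counter()
        for example in examples:
            profile.update(bag_of_words(example))
        profiles[name] = profile

    assignments, unassigned = {}, []
    for doc_id, text in documents.items():
        vector = bag_of_words(text)
        matched = [name for name, profile in profiles.items()
                   if cosine(vector, profile) >= threshold]
        if matched:
            assignments[doc_id] = matched
        else:
            unassigned.append(doc_id)
    return assignments, unassigned


# Hypothetical themes and documents, invented for illustration.
themes = {
    "enrichment": ["uranium enrichment centrifuge facility"],
    "export": ["export control license shipment"],
}
documents = {
    "a": "new centrifuge facility for uranium",
    "b": "shipment of uranium centrifuge parts without export license",
    "c": "local football results",
}
assignments, unassigned = assign_to_themes(themes, documents)
print(assignments)
print(unassigned)
```

In this sketch, document "b" lands in both themes, and document "c" matches neither, so it is set aside as a candidate for a concept the analyst has not yet defined.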

“One of the biggest advantages of Concept-Based Document Analysis,” Chappell pointed out, “is that it uses themes or cluster topics that are meaningful to the analyst and that reflect the way humans think about issues. In addition, it helps make the clustering process more transparent and explainable to others. While it may not serve all the needs of a security analyst, it will provide a real addition to the toolbox of document analysis.”

This past year, Chappell and his colleagues have developed an initial proof-of-concept for the Concept-Based Document Analysis program, using funding provided by PNNL for lab-directed R&D exploratory efforts. This next year they will be seeking to test their prototype with actual analysts and their document sets.
