Newsletter:
October 2003 / Issue 5
Concept-Based Document Analysis: A Human-Centered Approach to Information Discovery
by Marina Skumanich, PNWCGS Staff
We live in a time of
information overload, where the amount of information on
any given subject seems almost limitless and is often
beyond easy assimilation. Intelligence and security
professionals are keenly aware of this situation, since
information analysis—the analysis of published materials,
written reports, and even open source news stories—plays
such a central role in their work on proliferation and
other security concerns. Fortunately, for those struggling
to manage this information, more help is on the way.
Researchers Alan Chappell, Judi Thomson, and Alan Willse of
the Pacific Northwest National Laboratory (PNNL) are
developing a “Concept-Based Document Analysis Program” under
a lab-funded research and development project.
One of the biggest issues for
analysts seeking to identify trends and events of relevance
in a set of documents is how to sort and group the documents
by appropriate topics. The most straightforward, but tedious,
approach is to sort the documents “by hand,” that is,
by using the analyst’s own knowledge to determine appropriate
groupings or clusters of documents. While this approach has
the advantage of drawing on the analyst’s internal “expert
system,” it clearly becomes harder and harder to accomplish
with increasing quantities of data.
The obvious limitations to the
manual approach have led to the development of a series of
computer-based systems such as PNNL’s SPIRE and Starlight
systems. These commercialized systems use automated
rule-based document association, with visual outputs. In this
approach, certain rules of association (e.g., similarities of
words) are applied to a set of documents. As documents
show greater or lesser association, they are automatically
“clustered” together and displayed on a visual graph, so that
the user can see various themes “emerge” from the documents.
In the case of SPIRE, document themes appear as mountains,
with greater concentrations of documents in a theme leading
to higher peaks. In the case of Starlight, document themes
appear as clusters of stars in a constellation. Both of these
programs have proven to be very helpful in the effort to
analyze large numbers of documents.
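
To make the idea of rule-based association concrete, here is a minimal sketch in Python. It is not the algorithm used by SPIRE or Starlight, which the article does not describe; it only illustrates one possible "rule of association" (cosine similarity of word counts) and groups documents that are linked by a similarity above a cutoff. The function names and the `threshold` value are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase alphabetic words in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_by_similarity(docs, threshold=0.3):
    """Group documents connected by pairwise similarity >= threshold.

    Returns a sorted list of clusters, each a list of document indices.
    The threshold is an illustrative assumption, not a recommended value.
    """
    vectors = [word_counts(d) for d in docs]
    parent = list(range(len(docs)))  # union-find forest over documents

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


docs = [
    "uranium enrichment centrifuge plant",
    "centrifuge plant uranium shipment",
    "wheat harvest forecast",
    "forecast for wheat harvest prices",
]
print(cluster_by_similarity(docs))
```

In this toy example the themes "emerge" purely from word overlap: no one tells the program which topics matter, which is exactly the gap the concept-based approach aims to fill.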
The innovation of the Concept-Based
Document Analysis approach is to bring the analyst’s
knowledge back into the process. In particular, Chappell
explained, “Concept-Based Document Analysis seeks to create
more of a partnership between the automated processing in
which computers excel and the expert judgment of analysts
regarding what is important and what particular topics are of
interest.”
Concept-Based
Document Analysis starts by asking the analyst: What are the
topics or subjects you are concerned about? What are
representative documents—or sections of a document—that would
reflect these topics? Then, once the analyst has identified
“cluster themes” and the representative documents, the Concept-Based
Document Analysis Program can cluster the rest of the set of
documents based on these themes. It also allows a given
document to fall into multiple clusters, if more than one
theme is identified in it. Finally, documents that don’t fit
into any defined cluster can be re-analyzed to determine what
novel concepts they might represent. This last feature
means that the analyst has some assistance in identifying
knowledge or issues that don’t fit current concepts, as a way
to keep his or her thinking fresh.
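
The workflow described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the PNNL program: the article does not say how documents are matched to themes, so the sketch assumes a simple score (the fraction of a document's distinct words that also appear in a theme's representative documents) and an arbitrary cutoff. The names `assign_to_themes`, `seeds`, and `threshold` are invented for the example.

```python
import re


def distinct_words(text):
    """Return the set of lowercase alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def assign_to_themes(seeds, docs, threshold=0.3):
    """Assign each document to every analyst-defined theme it matches.

    seeds: dict mapping a theme name to a list of representative texts
           chosen by the analyst.
    docs:  list of document texts to cluster.
    Returns (assignments, unassigned): assignments maps each theme to the
    indices of matching documents; unassigned lists documents that fit no
    theme and may point to novel concepts.
    """
    theme_vocab = {
        theme: distinct_words(" ".join(texts)) for theme, texts in seeds.items()
    }
    assignments = {theme: [] for theme in seeds}
    unassigned = []

    for idx, doc in enumerate(docs):
        words = distinct_words(doc)
        matched = False
        for theme, vocab in theme_vocab.items():
            # Assumed score: share of the document's words seen in the theme.
            score = len(words & vocab) / len(words) if words else 0.0
            if score >= threshold:
                assignments[theme].append(idx)  # a document may match several themes
                matched = True
        if not matched:
            unassigned.append(idx)

    return assignments, unassigned


seeds = {
    "nuclear": ["uranium enrichment centrifuge facility"],
    "export": ["export license shipment controls"],
}
docs = [
    "centrifuge shipment for uranium export license",
    "new uranium enrichment facility",
    "local football results",
]
assignments, unassigned = assign_to_themes(seeds, docs)
print(assignments)
print(unassigned)
```

Here the first document lands in both themes, the second in one, and the third in none, so it is set aside for the analyst to review as a possible new concept.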
“One of the biggest advantages of
Concept-Based Document Analysis,” Chappell pointed out, “is
that it uses themes or cluster topics that are meaningful to
the analyst and that reflect the way humans think about
issues. In addition, it helps make the clustering process
more transparent and explainable to others. While it may not
serve all the needs of a security analyst, it will provide a
real addition to the toolbox of document analysis.”
This past year, Chappell and his
colleagues have developed an initial proof-of-concept for the
Concept-Based Document Analysis program, using funding
provided by PNNL for lab-directed R&D exploratory efforts.
Over the next year, they plan to test their prototype
with actual analysts and their document sets.