Thursday, November 5, 2009

Learning Correspondence Representations for Natural Language Processing, John Blitzer

John Blitzer, a postdoc at the University of California, Berkeley, will give a talk
on Friday, November 6, 2009, at 10:00 a.m. in room 2120 AVW.

TITLE:

Learning Correspondence Representations for Natural Language Processing

ABSTRACT: The key to creating scalable, robust natural language
processing (NLP) systems is to exploit correspondences between known
and unknown linguistic structure. NLP has experienced tremendous
success over the past two decades, but our most
successful systems are still limited to the domains and languages
where we have large amounts of hand-annotated data. Unfortunately,
these domains and languages represent a tiny portion of the total
linguistic data in the world. No matter the task, we inevitably
encounter unknown linguistic features, such as words and syntactic
constituents, that never appeared in the data used to estimate our
models. This talk
is about linking these linguistic features to one another through
correspondence representations.

The first part describes a technique to learn lexical correspondences
for domain adaptation of sentiment analysis systems. These systems
predict the general attitude of an essay toward a particular topic.
In this case, words which are highly predictive in one domain may not
be present in another. We show how to build a correspondence
representation between words in different domains using projections to
low-dimensional, real-valued spaces. Unknown words are projected onto
this representation and related directly to known features via
Euclidean distance. The correspondence representation allows us to
train significantly more robust models in new domains, and we achieve
a 40% relative reduction in error due to adaptation over a
state-of-the-art system.
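
To make the first technique concrete, here is a minimal Python sketch
using invented toy data: words from two domains are represented by
their co-occurrence counts with shared "pivot" features, projected to
a low-dimensional, real-valued space with a truncated SVD, and an
unknown target-domain word is then linked to known words by Euclidean
distance. This is only an illustration of the correspondence idea, not
the actual learning algorithm from the talk; all words, pivots, and
counts below are made up.

    import numpy as np

    # Toy vocabulary: four source-domain sentiment words plus one
    # target-domain word ("noisy") that never appears in the labeled
    # source data. All data here is invented for illustration.
    words = ["excellent", "terrible", "great", "awful", "noisy"]

    # Hypothetical co-occurrence counts of each word with four shared
    # pivot features (think "not", "very", "highly recommend", "waste"),
    # collected from unlabeled text in BOTH domains.
    cooc = np.array([
        [0., 5., 4., 1.],   # excellent
        [4., 1., 0., 5.],   # terrible
        [0., 4., 5., 0.],   # great
        [5., 0., 1., 4.],   # awful
        [4., 1., 0., 4.],   # noisy (unknown in the source domain)
    ])

    # Low-dimensional, real-valued projection via truncated SVD.
    U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
    k = 2
    proj = U[:, :k] * S[:k]   # each row is a word's correspondence vector

    # Link the unknown word to known words by Euclidean distance.
    unknown = proj[words.index("noisy")]
    for w, v in zip(words[:-1], proj[:-1]):
        print(f"{w:10s} distance = {np.linalg.norm(unknown - v):.3f}")

In this toy data, "noisy" co-occurs with the pivots the way the known
negative words do, so it lands nearest "terrible" and "awful" in the
projected space, and a classifier trained on the source domain can
treat it as negative evidence.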

The second part describes a technique to learn syntactic
correspondences between languages for machine translation. Syntactic
machine translation models exploit syntactic correspondences to
translate grammatical structures (e.g., subjects, verbs, and objects)
from one language to another. Given pairs of sentences which are
translations of one another, we build a latent correspondence grammar
which links grammatical structures in one language to grammatical
structures in another. The syntactic correspondences induced by our
grammar significantly improve a state-of-the-art Chinese-English
machine translation system.
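
The latent grammar itself is learned from data, but the notion of a
syntactic correspondence can be sketched with a toy example: given
word alignments for a sentence pair, each source-side constituent is
linked to the smallest target span covering the words it aligns to.
The sentences, constituents, and alignments below are invented, and
this is only an illustration of span linking, not the grammar
induction described in the talk.

    # Toy Chinese-English sentence pair (Chinese romanized for readability).
    src = ["wo", "xihuan", "zhe", "ben", "shu"]   # "I like this book"
    tgt = ["I", "like", "this", "book"]

    # Hypothetical word alignments as (source index, target index) pairs.
    alignments = [(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)]

    # Hypothetical source-side constituents as (label, start, end) spans,
    # with the end index exclusive.
    src_constituents = [
        ("NP-subj", 0, 1),   # wo
        ("VP",      1, 5),   # xihuan zhe ben shu
        ("NP-obj",  2, 5),   # zhe ben shu
    ]

    def project_span(start, end):
        """Smallest target span covering all words aligned to src[start:end]."""
        linked = [t for s, t in alignments if start <= s < end]
        return (min(linked), max(linked) + 1) if linked else None

    for label, s, e in src_constituents:
        span = project_span(s, e)
        if span:
            source = " ".join(src[s:e])
            target = " ".join(tgt[span[0]:span[1]])
            print(f"{label:8s} {source:22s} <-> {target}")

Linking "zhe ben shu" to "this book" is exactly the kind of
grammatical correspondence a syntactic translation system can exploit
when translating or reordering whole constituents.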

BIO: John Blitzer is a postdoctoral fellow in the computer science
department at the University of California, Berkeley, working with Dan
Klein. He completed his PhD in computer science at the University of
Pennsylvania under Fernando Pereira, and in 2008 spent 6 months as a
visiting researcher in the natural language computing group at
Microsoft Research Asia. John's research focuses on applications of
machine learning to natural language. In particular, he is interested
in exploiting unlabeled data and other sources of side information to
improve supervised models. He has applied these techniques to
tagging, parsing, entity recognition, web search, and machine
translation. More info on John's research interests is available at
http://john.blitzer.com.

