by Benjamin Recchie, AB’03

A child doesn’t have to be told that cat and cats are variants of the same word; she picks it up just by listening. To a computer, though, they’re as different as, well, cats and dogs. (At least, if that particular rule isn’t given explicitly; computers are only as smart as their programmers, after all.) Still, it’s computers, not four-year-olds, that we usually assume are superior at detecting patterns and rules. John Goldsmith, Edward Carson Waller Distinguished Service Professor of Linguistics and Computer Science, and graduate student Jackson Lee are trying, if not to solve that puzzle definitively, then at least to provide the tools to do so.

Studying natural language morphology has both a practical aspect and a theoretical one. On the theoretical side, linguists and cognitive scientists have long sought a better understanding of how humans learn language. “Computational modeling of how natural language morphology may be learned from raw text is an explicit attempt to answer this question,” explains Lee. And on the practical side, better understanding of natural language morphology can lead to better designed human-machine interfaces and a better way to search large databases.
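To make the idea of learning morphology from raw text concrete, here is a toy sketch in Python. It is not Goldsmith and Lee's method; the function name `group_by_stem` and the `min_stem` cutoff are assumptions made up for illustration. Given nothing but a word list, it looks for stems that appear both on their own and with other endings, which is enough to notice that cat and cats belong together.

```python
from collections import defaultdict

def group_by_stem(words, min_stem=3):
    """Map each candidate stem to the set of endings it appears with.

    Hypothetical helper for illustration only: it uses no dictionary
    and no grammar rules, just the spellings in the word list.
    """
    vocab = set(words)
    endings = defaultdict(set)
    for word in vocab:
        # Try every split point that leaves a stem of at least min_stem letters.
        for cut in range(min_stem, len(word) + 1):
            stem, suffix = word[:cut], word[cut:]
            endings[stem].add(suffix)
    # Keep stems that occur as a bare word ("" ending) and with at least one other ending.
    return {stem: found for stem, found in endings.items()
            if "" in found and len(found) > 1}

# A tiny stand-in for a corpus vocabulary.
words = "cat cats dog dogs walk walks walked walking".split()
groups = group_by_stem(words)

for stem in sorted(groups):
    print(stem, sorted(groups[stem]))
```

On this small list the sketch groups cat with cats, dog with dogs, and walk with walks, walked, and walking. A real corpus is far messier (a rule this naive would also pair car with card), which is part of why the problem calls for more careful models and much more data.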

“We are trying to do computationally what linguists have always done,” explains Goldsmith: “collect large amounts of texts in a language, and produce grammatical analyses of the language. We would like to understand that process of what we” (that is, humans, and human linguists) “do so well that we can implement it computationally.”

To provide examples for their analysis, Goldsmith and Lee used standard bodies of written language called corpora. Each corpus contains millions, sometimes billions, of words, taken from many different genres of writing. (The Brown corpus, the first of its kind in American English, contained roughly 1 million words; the Google N-gram corpus contains 155 billion words.) Their combined data set was enormous, far too big to be handled on a desktop computer. Instead, they turned to the Research Computing Center (RCC) and the Midway supercomputing cluster for help.

RCC consultants also helped Lee and Goldsmith to visualize their results. “A typical scenario for us is that, given some raw data, we have some intuition about certain patterns in the data, and we collaborate with RCC to create visualization tools to display data in a way that enables us to explore these patterns,” Lee says. He gives the example of the query word “going”; the visualization showed what words occur most frequently on the left and right of it in a natural language corpus.

“The construction of this visualization tool grew out of the observation that overall word distribution patterns are sensitive to the specific distribution of individual words, and we need a tool to ‘see’ what the grammar of a given word really looks like.” Lee and Goldsmith demonstrated this work in a poster presented at this past year’s Mind Bytes symposium, where it won a special award from the judges for novel uses of computational resources.
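The counting behind a left-and-right view of a query word such as “going” can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `neighbor_counts` is hypothetical, the one-sentence text stands in for a corpus of millions of words, and nothing here reflects how the RCC tool itself is built.

```python
from collections import Counter

def neighbor_counts(tokens, query):
    """Count the words immediately left and right of each occurrence of query."""
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != query:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1       # word just before the query
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1      # word just after the query
    return left, right

# Toy text standing in for a real corpus (an assumption for illustration).
text = "she is going to school and he is going home while they are going to work"
tokens = text.split()

left, right = neighbor_counts(tokens, "going")
print("left: ", left.most_common())
print("right:", right.most_common())
```

Even in this tiny sample, “going” is preceded by forms of “to be” and most often followed by “to”, the kind of pattern a visualization over a full corpus would bring out at scale.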

Lee and Goldsmith are taking their work and developing it into an integrated research and visualization tool. “This includes not only the suite of the visualization tools developed, but also implementations of algorithms and ideas—both from us and other researchers—with regard to the unsupervised learning of linguistic structure,” says Lee. The final product will allow different research groups to visualize their results and compare their methods.

But beyond just the computational problem, Goldsmith sees a deeper question waiting to be answered. Philosophers and linguists have long argued about whether a language can only be learned by understanding the meaning of the sentences that make it up. “At the end of the day,” says Goldsmith, “yes, language exists with the function of organizing and communicating meaning. But is it possible to define and detect grammatical structure even before knowing the meaning in a text?” With any luck, we may find out soon.