
Googlewhacks for Fun: Recent Grad Gives a Google Talk

Recent graduate Jonathan Lansey

Just one week after graduating from NJIT, Jonathan Lansey (2008) was invited to give a talk at Google’s Manhattan office. 

While an NJIT student, Lansey studied the probabilities of Internet searches.  He also designed a computer model that calculates the hits a pair of words should get on a search engine, based on the hits each word gets individually.

Intrigued by his research, in May of 2008 Google invited Lansey to its Manhattan office to discuss his findings.  His talk, “Googlewhacks for Fun and Profit,” was well received by Google’s NYC employees.

Lansey also summarized the results of his research in an academic paper -- “Internet Search Result Probabilities, Heaps’ Law and Word Associativity” -- that is due to appear in the upcoming issue of the Journal of Quantitative Linguistics.

To be invited to give a Google Talk is considered something of an honor -- one usually reserved for prominent speakers and leaders in their fields: Senator Hillary Clinton, Apple co-founder Steve Wozniak and Mayor Michael Bloomberg, for instance, all recently gave Google Talks. So for Lansey to have given a talk one week after graduating from NJIT is quite an achievement.   

At NJIT, Lansey was known for his unalloyed passion for math.  He was president of the Math Club, where he and fellow members spent many hours after class working on math demonstrations, just for the fun of it.  Even his Internet probability research began as an independent project -- one he did of his own volition and to satisfy his own intellectual curiosity.  In addition, Lansey was a scholar in the Albert Dorman Honors College who took part in the Undergraduate Biology and Math Training Program (UBMTP).  That program taught him how to use math to solve vexing biological problems and introduced him to the field of computational neuroscience.

Lansey is working now on a master’s degree in that field -- Cognitive and Neural Systems -- at Boston University.  In this interview, he talks about his Google Talk, his studies at NJIT and his passion for the two subjects he learned to love at NJIT: math and biology.


Can you describe the research you did on Internet search probabilities?
Well, I looked at how big web pages are (how many are big, medium or small) and also at how many distinct words are on each page (really, how the number of distinct words depends on the size of a page -- a webpage that is twice as long as another will NOT typically have twice as many different words). Using a generally accepted assumption for the distribution of page sizes, I computed the coefficient -- called the Heaps’ Law coefficient -- for how many distinct words there are on a page on the indexed web.

What was new about your research?
Previously, the data sets for computing this coefficient were no more than 20,000 pages. My work dealt with an index of about 8 billion pages -- the result was beta = 0.52, which is in line with previous estimates.  From this, one can take a pair or triplet of words and automatically see how “related” they are. For example, “stairway” and “heaven” (from the title of the famous Led Zeppelin song) appear together more often than would be expected if they were unrelated. The model can quantify how strongly related pairs of words are.  Finally, my computer model can be used to estimate how many web pages are indexed by the major search sites (they no longer publish this information).
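To make the Heaps’ Law idea concrete, here is a minimal sketch. It is not Lansey’s actual model. Heaps’ Law is usually written V = K * n^beta, where n is the number of words on a page and V is the number of distinct words. The exponent beta = 0.52 comes from the interview; the scale constant K and the page lengths are hypothetical values chosen only for illustration.

```python
BETA = 0.52  # Heaps' Law exponent reported in the interview


def distinct_words(total_words: int, k: float = 10.0, beta: float = BETA) -> float:
    """Heaps' Law estimate V = k * n**beta of the distinct words on a page.

    k is a hypothetical scale constant, used here only for illustration.
    """
    return k * total_words ** beta


def growth_ratio(length_factor: float, beta: float = BETA) -> float:
    """Factor by which the distinct-word count grows when a page becomes
    length_factor times longer. The constant k cancels out of the ratio."""
    return length_factor ** beta


if __name__ == "__main__":
    ratio = growth_ratio(2.0)
    print(f"Doubling a page's length multiplies its distinct words by {ratio:.2f}")
```

With beta = 0.52, a page twice as long has only about 1.4 times as many distinct words, which is the point Lansey makes about longer pages.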

How did you get interested in this research project?
Someone introduced me to the game "Googlewhack," and after some playing I began thinking there must be a way to predict what will be a googlewhack.  So I created a simple model to figure this out, and when I got a positive result I got excited.  I then emailed the man who created googlewhack.com.  He told me no formal study had ever been done on this, and that there was no way it could be done. In math it’s normal for everything imaginable to have been done 100 years ago, so the fact it was new got me quite excited and I made it into a full research project.
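The interview does not spell out that first simple model, so the sketch below is only a naive baseline, not Lansey’s method. It assumes the two words appear on pages independently of each other, so the expected number of pages containing both is the product of the individual hit counts divided by the size of the index. The index size of about 8 billion pages is the figure quoted in the interview; the hit counts, the function names and the threshold are hypothetical.

```python
INDEX_SIZE = 8_000_000_000  # approximate number of indexed pages cited in the interview


def expected_pair_hits(hits_a: int, hits_b: int, index_size: int = INDEX_SIZE) -> float:
    """Expected pages containing both words if they occur independently.

    P(both) = (hits_a / N) * (hits_b / N), so expected pages = hits_a * hits_b / N.
    """
    return hits_a * hits_b / index_size


def is_googlewhack_candidate(hits_a: int, hits_b: int,
                             index_size: int = INDEX_SIZE,
                             threshold: float = 1.0) -> bool:
    """Flag pairs whose expected joint hit count is at or below the threshold.

    The threshold of 1.0 is an illustrative assumption: a googlewhack is a
    two-word query that returns a single result.
    """
    return expected_pair_hits(hits_a, hits_b, index_size) <= threshold


if __name__ == "__main__":
    # Hypothetical hit counts for two rare words and two common words.
    print(expected_pair_hits(40_000, 100_000))
    print(is_googlewhack_candidate(40_000, 100_000))
    print(is_googlewhack_candidate(4_000_000, 10_000_000))
```

Under this baseline, only pairs of fairly rare, unrelated words are plausible googlewhacks. Related words appear together far more often than the independence assumption predicts.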

And you teamed up with Professor Bruce Bukiet?
Yes, I worked on this research as an independent project with Professor Bukiet.  He’s been a great mentor to me since I was in high school, when he accompanied my Boy Scout troop on a trip to South Dakota.  After that trip, I’d email him questions about math and send him photos of the things I built, such as a harmonograph.  I came to NJIT in large part because he was here.  For my Internet probability research, which I did when I was a senior, I needed an adviser to teach me how to write a paper worthy of a peer-reviewed journal (this is my first and so far only one). It was really great to have an experienced mathematician such as Professor Bukiet to bounce ideas off and to make sure that my research and my writing all made sense.

How were you invited to give a Google Talk? Google usually invites only prominent speakers who are much older than you are.
When I was a student at NJIT, I worked two summers at the Keck Graduate Institute in California.  One of the professors I worked for there, Chris Adami, had attended a Google conference at its campus in Mountain View, California. Professor Adami mentioned my project to a Google employee, Aaron Joyner, who was interested enough to invite me in for a talk.

Is your research on Internet search probabilities an example of a fun way to use math? 
Of course it is! I thought it was very fun being able to peer into the landscape of the Internet by just comparing a few numbers. I used some computational models to verify some approximations I used to scale the model up to the monstrous size of the Internet. I did a few other fun projects at NJIT, like the math biology program with Professor Amit Bose, as well as some things I did under the auspices of the NJIT Math Club. You can see some of the fun projects I did at NJIT by viewing: http://web.njit.edu/~jcl7.

What’s your focus now at Boston University? 
I have now switched my research focus to studying Neuroscience and Neural Systems, but my mathematics education is a very important asset in this field. BU is a big place, and so is Boston, with a huge number of people throughout the city doing exciting research.  I usually go to a few talks a week on interesting topics.

You mentioned the importance of your math background. Did NJIT give you a good education in math, and in general?
I’m just finishing up my first semester here, and now that I’ve gotten to know the students in my master’s program better, I can see, relative to them, that NJIT did indeed give me a great undergraduate education.  I studied with great professors like Professor Bukiet, I had the chance to do research in math biology, I worked summer internships at the Keck Institute and I did my Googlewhack/Internet search project as an independent study.  I don’t think a student could ask for more opportunities from a university, and I’m grateful to NJIT.  

(By Robert Florida, University Web Services)

Sidebar:

As this graph shows, any number above zero means the words are more related than “random,” and a number around 1 means they are 10 times as related (they show up together 10 times as often as would be expected). A number around 2 means they are 100 times as related and around 3 means 1,000 times as related -- the number is just the count of zeroes after the 1.

In this table, the right-hand column gives a result based on the mathematical function “log.” If the number is around 0, the combination appears about the “right” amount.  If the number is greater than zero, the words are associated with each other. A number around 1 means the combo appears together 10 times as much as would be expected if things were random; a 2 means about 100 times, etc. So a better example is “Smashing Pumpkins,” which at 2.159 appears together about 140 times as much as would be expected if the two words weren’t associated.

Word Pair               Actual Results  Expected Results  Strength of Associativity
Paintable Paintability             635                 1                      2.780
Smashing Pumpkins               930000              6442                      2.159
Britney Spears                 5130000             49016                      2.020
Stairway Heaven                 893000             35151                      1.405
Surge Protector                1150000             59259                      1.288
Paradigm Shift                 3400000            761719                      0.650
Grand Slam                     2470000            701296                      0.547
Train Station                 16700000           8755605                      0.280
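
The strength column can be checked with a few lines of code. The sketch below assumes, as the sidebar text implies, that the strength is the base-10 logarithm of the actual results divided by the expected results. The function names are illustrative, and the figures are copied from the table.

```python
import math


def associativity_strength(actual: float, expected: float) -> float:
    """Base-10 log of how many times more often a pair appears than expected."""
    return math.log10(actual / expected)


def times_as_often(strength: float) -> float:
    """Convert a strength back into a plain ratio (1 -> 10x, 2 -> 100x)."""
    return 10 ** strength


# (word pair, actual results, expected results) copied from the table
ROWS = [
    ("Smashing Pumpkins", 930000, 6442),
    ("Britney Spears", 5130000, 49016),
    ("Stairway Heaven", 893000, 35151),
    ("Train Station", 16700000, 8755605),
]

if __name__ == "__main__":
    for pair, actual, expected in ROWS:
        strength = associativity_strength(actual, expected)
        print(f"{pair:<20} {strength:.3f}  ({times_as_often(strength):.0f}x)")
```

These rows reproduce the published strengths to within rounding. The “Paintable Paintability” row does not reproduce exactly from the figures shown (the log of 635 / 1 is about 2.80, not 2.780), presumably because its expected value is rounded to 1 in the table.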