Marc Najork's home page

Welcome to my personal home page.  I recently joined Google as a Senior Staff Research Scientist.

From October 2001 to March 2014, I was a researcher at the (now defunct) Microsoft Research Silicon Valley lab.  In that role, I worked on various aspects of social search; link-based ranking algorithms for web search results; the Scalable Hyperlink Store, a distributed in-memory store for web graphs; heuristics for detecting spam web pages; PageTurner, a large-scale study of the evolution of web pages; and Boxwood, a distributed B-Tree system.

From October 1993 to September 2001, I worked at Digital Equipment's (later Compaq's) Systems Research Center.  Projects at SRC included Mercator, a high-performance distributed web crawler; JCAT, a web-based algorithm animation system; and Obliq-3D, a scripting system for 3D animations.

I am the editor-in-chief of the ACM Transactions on the Web. I served as co-chair of the news section of the Communications of the ACM from 2008 until 2014, conference chair of WSDM 2008, and program co-chair of WWW 2004. I received a Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign for my work on Cube, a 3D visual programming language.

You can visit me on LinkedIn or Facebook, and you can download my CV.

Papers
 
Here is a fairly complete list of the papers that I have written over the years (you can also try Google Scholar):
  1. Marc Najork. Social Search. Keynote abstract, 14th International Conference on Web Engineering (ICWE), 2014.
  2. Omar Alonso, Catherine C. Marshall, and Marc Najork. Crowdsourcing a Subjective Labeling Task: A Human-Centered Framework for Ensuring Reliable Results. Microsoft Research Technical Report MSR-TR-2014-91, 2014.
  3. Omar Alonso, Catherine C. Marshall, and Marc Najork. A Human-Centered Framework for Ensuring Reliability on Crowdsourced Labeling Tasks. Conference on Human Computation & Crowdsourcing (HCOMP), 2013.
  4. Omar Alonso, Catherine C. Marshall, and Marc Najork. Are Some Tweets More Interesting Than Others? #HardQuestion. Symposium on Human-Computer Interaction and Information Retrieval (HCIR), 2013.
  5. Moises Goldszmidt, Marc Najork, and Stelios Paparizos. Boot-strapping Language Identifiers for Short Colloquial Postings. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD), 2013.
  6. Nick Craswell, Bodo Billerbeck, Dennis Fetterly and Marc Najork. Robust Query Rewriting using Anchor Data. 6th ACM Intl. Conference on Web Search and Data Mining (WSDM), 2013.
  7. Marc Najork. Detecting Quilted Web Pages at Scale. 35th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2012.
  8. Rina Panigrahy, Marc Najork and Yinglian Xie. How User Behavior is Related to Social Affinity. 5th ACM Intl. Conference on Web Search and Data Mining (WSDM), 2012.
  9. Marc Najork, Dennis Fetterly, Alan Halverson, Krishnaram Kenthapadi and Sreenivas Gollapudi. Of Hammers and Nails: An Empirical Comparison of Three Paradigms for Processing Large Graphs. 5th ACM Intl. Conference on Web Search and Data Mining (WSDM), 2012.
  10. Bodo Billerbeck, Nick Craswell, Dennis Fetterly and Marc Najork. Microsoft Research at TREC 2011 Web Track. 20th Text Retrieval Conference (TREC), 2011.
  11. Nick Craswell, Dennis Fetterly and Marc Najork. The Power of Peers. 33rd European Conference on Information Retrieval (ECIR), 2011.
  12. Nick Craswell, Dennis Fetterly and Marc Najork. Microsoft Research at TREC 2010 Web Track. 19th Text Retrieval Conference (TREC), 2010.
  13. Marc Najork. Querying the Web Graph. 17th International Symposium on String Processing and Information Retrieval (SPIRE), 2010.
  14. Atish Das Sarma, Sreenivas Gollapudi, Marc Najork and Rina Panigrahy. A Sketch-Based Distance Oracle for Web-Scale Graphs. 3rd ACM Intl. Conference on Web Search and Data Mining (WSDM), 2010.
  15. Christopher Olston and Marc Najork. Web Crawling. Foundations and Trends in Information Retrieval 4(3):175-246, 2010.
  16. Nick Craswell, Dennis Fetterly, Marc Najork, Stephen Robertson and Emine Yilmaz. Microsoft Research at TREC 2009: Web and Relevance Feedback Tracks. 18th Text Retrieval Conference (TREC), 2009.
  17. Marc Najork. Web Crawler Architecture. Entry in Encyclopedia of Database Systems, 2009.
  18. Hugo Zaragoza and Marc Najork. Web Search Relevance Ranking. Entry in Encyclopedia of Database Systems, 2009.
  19. Marc Najork. Web Spam Detection. Entry in Encyclopedia of Database Systems, 2009.
  20. Marc Najork. The Scalable Hyperlink Store. 20th ACM Conference on Hypertext and Hypermedia (HT), 2009.
  21. Marc Najork, Sreenivas Gollapudi and Rina Panigrahy. Less is More: Sampling the Neighborhood Graph Makes SALSA Better and Faster. 2nd ACM Intl. Conference on Web Search and Data Mining (WSDM), 2009.
  22. Marc Najork and Nick Craswell. Efficient and Effective Link Analysis with Precomputed SALSA Maps. 17th ACM Conference on Information and Knowledge Management (CIKM), 2008.
  23. Frank McSherry and Marc Najork. Computing Information Retrieval Performance Measures Efficiently in the Presence of Tied Scores. 30th European Conference on Information Retrieval (ECIR), 2008.
  24. Sreenivas Gollapudi, Marc Najork and Rina Panigrahy. Using Bloom Filters to Speed Up HITS-Like Ranking Algorithms. 5th Workshop on Algorithms and Models for the Web Graph (WAW), 2007.
  25. Marc Najork. Comparing the Effectiveness of HITS and SALSA. 16th ACM Conference on Information and Knowledge Management (CIKM), 2007.
  26. Marc Najork, Hugo Zaragoza and Michael Taylor. HITS on the Web: How does it Compare? 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2007.
  27. Brian Davison, Marc Najork and Tim Converse. SIGIR Workshop Report: Adversarial Information Retrieval on the Web (AIRWeb 2006). ACM SIGIR Forum 40(2):27-30, 2006.
  28. Alexandros Ntoulas, Marc Najork, Mark Manasse and Dennis Fetterly. Detecting Spam Web Pages Through Content Analysis. 15th International World Wide Web Conference (WWW), 2006.
  29. Dennis Fetterly, Mark Manasse and Marc Najork. Detecting Phrase-Level Duplication on the World Wide Web. 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2005.
  30. John MacCormick, Nick Murphy, Marc Najork, Chandramohan Thekkath and Lidong Zhou. Boxwood: Abstractions as the Foundation for Storage Infrastructure. 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
  31. Dennis Fetterly, Mark Manasse and Marc Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. Journal of Web Engineering 2(4):228-246, 2004.
  32. Dennis Fetterly, Mark Manasse and Marc Najork. Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages. 7th International Workshop on the Web and Databases (WebDB), 2004.
  33. Dennis Fetterly, Mark Manasse, Marc Najork and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. Software: Practice & Experience 34(2):213-237, 2004.
  34. Dennis Fetterly, Mark Manasse and Marc Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. 1st Latin American Web Congress (LA-WEB), 2003.
  35. Dennis Fetterly, Mark Manasse, Marc Najork and Janet Wiener. A Large-Scale Study of the Evolution of Web Pages. 12th International World Wide Web Conference (WWW), 2003.
  36. Andrei Broder, Marc Najork and Janet Wiener. Efficient URL Caching for World Wide Web Crawling. 12th International World Wide Web Conference (WWW), 2003.
  37. Marc Najork and Allan Heydon. High-Performance Web Crawling. Compaq Systems Research Center (SRC) Research Report 173, 2001.
  38. Marc Najork and Marc Brown. Three-Dimensional Web-based Algorithm Animation. Compaq Systems Research Center (SRC) Research Report 170, 2001.
  39. Marc Najork. Web-Based Algorithm Animation. 38th Design Automation Conference (DAC), 2001.
  40. Marc Najork and Janet L. Wiener. Breadth-First Search Crawling Yields High-Quality Pages. 10th International World Wide Web Conference (WWW), 2001.
  41. Allan Heydon and Marc Najork. Performance Limitations of the Java Core Libraries. Concurrency: Practice & Experience 12(6):363-373, 2000.
  42. Monika Henzinger, Allan Heydon, Michael Mitzenmacher and Marc Najork. On Near-Uniform URL Sampling. 9th International World Wide Web Conference (WWW), 2000.
  43. Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler. World Wide Web 2(4):219-229, 1999.
  44. Allan Heydon and Marc Najork. Performance Limitations of the Java Core Libraries. 1st ACM Conference on Java Grande (JAVA), 1999.
  45. Monika Henzinger, Allan Heydon, Michael Mitzenmacher and Marc Najork. Measuring Index Quality Using Random Walks on the Web. 8th International World Wide Web Conference (WWW), 1999.
  46. Marc Brown, Johannes Marais, Marc Najork and William Weihl. Focus+Context Displays of Web Pages: Implementation Alternatives. DEC Systems Research Center (SRC) Technical Note 1997-010, 1997.
  47. Marc Brown and Marc Najork. Distributed Applets. Conference on Human Factors in Computing Systems (CHI), extended abstracts, 1997.
  48. Marc Brown, Marc Najork and Roope Raisamo. A Java-Based Implementation of Collaborative Active Textbooks. 13th IEEE Symposium on Visual Languages (VL), 1997.
  49. Marc Brown and Marc Najork. Collaborative Active Textbooks. Journal of Visual Languages and Computing 8(4):453-486, 1997.
  50. Marc Brown and Marc Najork. Collaborative Active Textbooks: A Web-Based Algorithm Animation System for an Electronic Classroom. 12th IEEE Symposium on Visual Languages (VL), 1996.
  51. Marc Brown and Marc Najork. Distributed Active Objects. 5th International World Wide Web Conference (WWW), 1996.
  52. Marc Najork. Programming in Three Dimensions. Journal of Visual Languages and Computing 7(2):219-242, 1996.
  53. Marc Najork and Marc Brown. Obliq-3D: A High-Level, Fast-Turnaround 3D Animation System. IEEE Transactions on Visualization and Computer Graphics 1(2):175-193, 1995.
  54. Marc Najork. Obliq-3D Tutorial and Reference Manual. DEC Systems Research Center (SRC) Research Report 129, 1994.
  55. Marc Najork and Marc Brown. A Library for Visualizing Combinatorial Structures. 5th IEEE Visualization (VIS), 1994.
  56. Marc Brown and Marc Najork. Algorithm Animation Using 3D Interactive Graphics. 6th ACM Symposium on User Interface Software and Technology (UIST), 1993.
  57. Marc Najork and Simon Kaplan. Specifying Visual Languages with Conditional Set Rewrite Systems. 9th IEEE Workshop on Visual Languages (VL), 1993.
  58. Marc Najork and Simon Kaplan. A Prototype Implementation of the Cube Language. 8th IEEE Workshop on Visual Languages (VL), 1992.
  59. Marc Najork and Simon Kaplan. The CUBE Language. 7th IEEE Workshop on Visual Languages (VL), 1991.
  60. Marc Najork and Eric Golin. Enhancing Show-and-Tell with a polymorphic type system and higher-order functions. 6th IEEE Workshop on Visual Languages (VL), 1990.
  61. Sharon Kuck, Roland John, Arnd Lewe and Marc Najork. Roles and their role in posing recursive queries. Information Systems 15(2):173-186, 1990.

Patents

I am a sole or co-inventor on 24 issued US patents, with more applications pending.  The US Patent & Trademark Office has up-to-date information on these patents. Here is a list, including convenient PDF copies:
  1. US Patent 8856112. Considering document endorsements when processing queries.
  2. US Patent 8666920. Estimating shortest distances in graphs using sketches.
  3. US Patent 8392366. Changing number of machines running distributed hyperlink database.
  4. US Patent 8209305. Incremental update scheme for hyperlink database.
  5. US Patent 7962510. Using content analysis to detect spam web pages.
  6. US Patent 7818334. Query dependant link-based ranking using authority scores.
  7. US Patent 7792854. Query dependent link-based ranking.
  8. US Patent 7783671. Deletion and compaction using versioned nodes.
  9. US Patent 7739281. Systems and methods for ranking documents based upon structurally interrelated information.
  10. US Patent 7680785. System and method for inferring uniform resource locator (URL) normalization rules.
  11. US Patent 7627777. Fault tolerance scheme for distributed hyperlink database.
  12. US Patent 7340467. System and method for maintaining a distributed database of hyperlinks.
  13. US Patent 7139747. System and method for distributed web crawling.
  14. US Patent 7082438. Algorithm for tree traversal using left links.
  15. US Patent 7072904. Deletion and compaction using versioned nodes.
  16. US Patent 7007027. Algorithm for tree traversal using left links.
  17. US Patent 6952730. System and method for efficient filtering of data set addresses in a web crawler.
  18. US Patent 6910077. System and method for identifying cloaked web servers.
  19. US Patent 6594694. System and method for near-uniform sampling of web page addresses.
  20. US Patent 6377984. Web crawler system using parallel queues for queing data sets having common address and concurrently downloading data associated with data set in each queue.
  21. US Patent 6351755. System and method for associating an extensible set of data with documents downloaded by a web crawler.
  22. US Patent 6321265. System and method for enforcing politeness while scheduling downloads in a web crawler.
  23. US Patent 6301614. System and method for efficient representation of data set addresses in a web crawler.
  24. US Patent 6263364. Web crawler system using plurality of parallel priority level queues having distinct associated download priority levels for prioritizing document downloading and maintaining document freshness.

Videos

In addition to three "video research reports" dating back to my DEC SRC days, a number of video recordings of talks I gave are available online:
  1. Algorithm Animation Using 3D Interactive Graphics. DEC SRC Research Report 110b, 1993. 
  2. A Library for Visualizing Combinatorial Structures. DEC SRC Research Report 128b, 1994.
  3. Distributed Active Objects. DEC SRC Research Report 141b, 1996.
  4. Atrax, a distributed web crawler. Talk at Microsoft Research Redmond, 2001.
  5. The Scalable Hyperlink Store. Talk at Microsoft Research Cambridge, 2005.
  6. WebSpam. Lecture at University of California at Berkeley, 2005.
  7. Efficient and Effective Link Analysis with Precomputed SALSA Maps. Presentation at ACM Conference on Information and Knowledge Management (CIKM), 2008.
  8. Less is More: Sampling the Neighborhood Graph Makes SALSA Better and Faster. Presentation at 2nd ACM Intl. Conference on Web Search and Data Mining (WSDM), 2009.
  9. Social Search. Keynote talk at European Conference on Information Retrieval (ECIR) Industry Day, 2013.
  10. Past and Future of Web Search. Interview at European Conference on Information Retrieval (ECIR), 2013.


Code

Some of the code I wrote is available online:

  1. Anim3D, a 3D animation library written in Modula-3.
  2. Obliq3D, a wrapper around Anim3D that allows animations to be scripted in the Obliq language, fully described in the Obliq-3D reference manual.
  3. WinVBT, a port of the Trestle Windowing System to Microsoft Windows.
  4. Tie-aware IR performance measures, a C# package for computing IR measures (e.g. mean average precision, mean reciprocal rank, and normalized discounted cumulative gain) in a tie-aware manner.
  5. A C# package for language identification.
  6. The Scalable Hyperlink Store, a distributed in-memory store for large web graphs.