Yushi Jing (Kevin Jing)

Currently the head of visual search at Pinterest.  Passionate about everything visual search: machine vision/learning, systems/interfaces, and product/go-to-market strategies.

-  Pinterest Visual Discovery, 2014~
-  VisualGraph, 2013~
-  Google Machine Vision Team, 2004 - 2012

Previously co-founder and CEO at VisualGraph, acquired by Pinterest.  At Google Research from 2004 to 2012, co-developed Google's first vision application.  Launched Google Image Swirl as a 20% project.  Technical adviser for the National Center for Missing and Exploited Children.  BS, MS, PhD in CS from Georgia Tech; studied Stats/MS&E at Stanford.  Best-student-paper award recipient at ICML.  Visiting scientist at Tokyo University in 2012.  40+ filed patents on visual search / ads.


Products Launched:


-  Pinterest Flashlight, interactive visual search, 2015  [Wall Street Journal] [Venture Beat] [KDD2015]
-  Google Image Swirl, hierarchical visual search, 2010  [Wall Street Journal] [WWW2010]

Work covered by Press:   

Pinterest Sharpens its Visual Search Skills, Wall Street Journal, 2015
New technology will let users scour for objects like those they highlight in photos ...

AI advance makes it possible to search, shop with images, MIT Technology Review, 2015
Deep Learning software has dramatically improved image recognition tools.  Pinterest and Shoes.com are testing it out on shoppers.  Both companies have turned to a technique known as deep learning, which has recently enabled software to match humans on some benchmarks for image recognition ....  

Pinterest improves Related Pins with deep learning, Venture Beat, 2015
Pinterest is getting smarter when it comes to spotting things in all those images that millions of users pin to boards.  Engineers at the company have developed a technology called visual search, which can find and display visually similar images.  The system powers a Pinterest feature called Related Pins.  It's getting more engagement than previous implementations for recommending images.  And later this year Pinterest will roll out a new feature, called Similar Looks, that relies on the technology ...

Pinterest acquires Image Recognition and Visual Search Startup VisualGraph, TechCrunch, 2014
Pinterest has just acquired two-man startup VisualGraph, which creates machine vision, image recognition and visual search technologies.  The company's founder Kevin Jing and his partner David Liu are joining the Pinterest engineering team today.   Pinterest says the "acquisition of VisualGraph will help us build technology to better understand what people are pinning.  By doing so, we hope to make it easier for people to find the things they love ..."

In search of images worth 1000 results, Wall Street Journal, 2012
If you've ever visualized something in your head but couldn't think of its name, you might appreciate a new method of online discovery: visual search. This week, I tested forms of visual search from two companies that hold some serious clout when it comes to online search: Google and Microsoft.  Although Google has become our go-to site for looking anything up on the Internet, its searches are dense with text. Microsoft's Bing search engine is marketed as a Google alternative that aims to return more useful query data on the first results page.   Users can use Google's Image Swirl search to sift through some 200,000 queries of images...

A Google Prototype for a Precision Image Search, New York Times, 2008
Google researchers say they have a software technology intended to do for digital images on the Web what the company's original PageRank software did for searches of Web pages.  On Thursday at the International World Wide Web conference in Beijing, two Google scientists presented a paper describing what the researchers called VisualRank, an algorithm for blending image-recognition software methods with techniques for weighting and ranking images that look most similar.

A farewell to keywords, Scientific American, 2006
A picture may be worth a kilo of words, but typing into Google Image the single word "rosebud" returns about 60,000 pictures.  The power of individual keywords is both good and bad.  It can find a virtual stack of Web pages.  But it is unable to differentiate between the flower in bloom and the legendary film director Orson Welles's scrowl.  Ideally, an internet user should be able to use the likeness of a rose to tell the search engine to find others like it ...


Search and Recommendation Systems and Interfaces:

Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Sarah Tavel, Visual Search at Pinterest,  KDD, 2015
We demonstrate that, with the availability of distributed computation platforms such as Amazon Web Services and open-source tools, it is possible for a small engineering team to build, launch and maintain a cost-effective, large-scale visual search system with widely available tools. We also demonstrate, through a comprehensive set of live experiments at Pinterest, that content recommendation powered by visual search improves user engagement. By sharing our implementation details and the experiences learned from launching a commercial visual search engine from scratch, we hope visual search becomes more widely incorporated into today’s commercial applications.

Yusuke Matsui, Kiyoharu Aizawa, Yushi Jing, Sketch-based manga retrieval, ICIP, 2013
We propose a sketch-based method for manga image retrieval, in which users draw sketches via a Web browser that enables the automatic retrieval of similar images from a database of manga titles.  The characteristics of manga images are different from those of naturalistic images.  Despite the widespread attention given to content-based image retrieval systems, the question of how to retrieve manga images effectively has been little studied. 

Yushi Jing, Henry Rowley, Meng Wang, David Tsai, Chuck Rosenberg, Michele Covell,  Google Image Swirl:  A Large-scale Content-based Image Visualization System, World Wide Web (2012)
Web image retrieval systems, such as Google or Bing image search, present search results as a relevance-ordered list. Although alternative browsing models (e.g. results as clusters or hierarchies) have been proposed in the past, it remains to be seen whether such models can be applied to large-scale image search. This work presents Google Image Swirl, a large-scale, publicly available, hierarchical image browsing system that automatically groups the search results based on visual and semantic similarity. This paper describes methods used to build such a system and shares the findings from two years' worth of user feedback and usage statistics.

(Journal) Yushi Jing, David Tsai, James Rehg, Michele Covell,  Learning Query-specific Distance Functions for Large-scale Image Search, Transactions on Multimedia (TMM) (2012)
Current Google image search adopts a hybrid search approach in which a text-based query (e.g. "Paris Landmarks") is used to retrieve a set of relevant images, which are then refined by the user (e.g. by re-ranking the retrieved images based on similarity to a selected example).  We conjecture that given such hybrid image search engines, learning per-query distance functions over image features can improve the estimation of image similarity.   We propose scalable solutions to learning query-specific distance functions by 1) adopting a simple large-margin learning framework, and 2) using query-logs of text-based image search engines to train distance functions used in content-based systems.  We evaluate the feasibility and efficacy of our proposed system through comprehensive human evaluation, and compare the results with the state-of-the-art distance function used by Google similar image search.

Shumeet Baluja, Rohan Seth, Sivakumar, Yushi Jing, Jay Yagnik, Kumar, Ravichandran, Aly, Video Suggestion and Discovery for YouTube, World Wide Web (WWW), 2008
The rapid growth of the number of videos in YouTube provides enormous potential for users to find content of interest to them. Unfortunately, given the difficulty of searching videos, the size of the video repository also makes the discovery of new content a daunting task. In this paper, we present a novel method based upon the analysis of the entire user–video graph to provide personalized video suggestions for users. The resulting algorithm, termed Adsorption, provides a simple method to efficiently propagate preference information through a variety of graphs. We extensively test the results of the recommendations on a three month snapshot of live data from YouTube.
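
The idea of propagating preference information through a user-video graph can be illustrated with a much simplified label-propagation sketch. This is not the Adsorption algorithm from the paper; the function name `propagate_preferences`, the `inject` blending weight, and the toy graph are all assumptions made for illustration.

```python
def propagate_preferences(edges, seeds, steps=10, inject=0.5):
    """Simplified label propagation (illustrative; not the paper's algorithm).

    edges: dict node -> dict neighbor -> edge weight (undirected graph)
    seeds: dict node -> dict label -> weight, the labels injected at that node
    """
    labels = {node: dict(seeds.get(node, {})) for node in edges}
    for _ in range(steps):
        updated = {}
        for node, nbrs in edges.items():
            total = sum(nbrs.values())
            # Weighted average of the neighbors' current label scores.
            mixed = {}
            for nbr, w in nbrs.items():
                for label, score in labels[nbr].items():
                    mixed[label] = mixed.get(label, 0.0) + (w / total) * score
            if node in seeds:
                # Seed nodes keep re-injecting their own labels (assumed blend).
                keys = set(mixed) | set(seeds[node])
                mixed = {
                    k: inject * seeds[node].get(k, 0.0)
                    + (1.0 - inject) * mixed.get(k, 0.0)
                    for k in keys
                }
            updated[node] = mixed
        labels = updated
    return labels


# Hypothetical toy data: u1 watched v1 and v2, u2 watched v2 and v3, u3 watched v3.
edges = {
    "u1": {"v1": 1.0, "v2": 1.0},
    "u2": {"v2": 1.0, "v3": 1.0},
    "u3": {"v3": 1.0},
    "v1": {"u1": 1.0},
    "v2": {"u1": 1.0, "u2": 1.0},
    "v3": {"u2": 1.0, "u3": 1.0},
}
seeds = {v: {v: 1.0} for v in ("v1", "v2", "v3")}
scores = propagate_preferences(edges, seeds)
```

In this toy graph, u1 ends up with a non-zero score for v3 even though u1 never watched it, because the label travels through the shared video v2 and the co-viewer u2.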

Yushi Jing, Shumeet Baluja, Henry Rowley, Canonical Image Selection from the Web,  International Conference on Image and Video Retrieval (CIVR), 2007
The vast majority of the features used in today’s commercially deployed image search systems employ techniques that are largely indistinguishable from text-document search – the images returned in response to a query are based on the text of the web pages from which they are linked. Unfortunately, depending on the query type, the quality of this approach can be inconsistent. Several recent studies have demonstrated the effectiveness of using image features to refine search results. However, it is not clear whether (or how much) an image-based approach can generalize to larger samples of web queries. Also, the previously used global features often only capture a small part of the image information, which in many cases does not correspond to the distinctive characteristics of the category. This paper explores the use of local features in the concrete task of finding the single canonical images for a collection of commonly searched-for products. Through large-scale user testing, the canonical images found by using only local image features significantly outperformed the top results from Yahoo, Microsoft and Google, highlighting the importance of having these image features as an integral part of future image search engines.

Computer Vision / ML:

Meng Wang, Konrad, Ishwar, Yushi Jing, Henry Rowley, Image Saliency: from Local to Global Context, Computer Vision and Pattern Recognition (CVPR), 2011
We propose a novel framework for automatic saliency estimation in natural images. We consider saliency to be an anomaly with respect to a given context that can be global or local. In the case of global context, we estimate saliency in the whole image relative to a large dictionary of images. Unlike in some prior methods, this dictionary is not annotated, i.e., saliency is assumed unknown. In the case of local context, we partition the image into patches and estimate saliency in each patch relative to a large dictionary of unannotated patches from the rest of the image. We propose a unified framework that applies to both cases in three steps. First, given an input (image or patch) we extract k nearest neighbors from the dictionary. Then, we geometrically warp each neighbor to match the input. Finally, we derive the saliency map from the mean absolute error between the input and all its warped neighbors. This algorithm is not only easy to implement but also outperforms state-of-the-art methods.
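
The three steps in the abstract (nearest-neighbor lookup, warping, mean absolute error) can be sketched roughly as follows. This is a minimal illustration under strong assumptions: patches are flat lists of intensities, neighbors are found by squared Euclidean distance, and the geometric warping step is omitted entirely. The function name `saliency_map` and the toy data are hypothetical.

```python
def saliency_map(patch, dictionary, k=3):
    """Per-pixel saliency as mean absolute error against k nearest neighbors.

    patch: flat list of pixel intensities
    dictionary: list of unannotated patches with the same length as `patch`
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Step 1: retrieve the k nearest neighbors from the dictionary.
    neighbors = sorted(dictionary, key=lambda cand: sq_dist(patch, cand))[:k]

    # Step 2 (geometric warping of each neighbor) is omitted in this sketch.

    # Step 3: mean absolute error between the input and its neighbors.
    return [
        sum(abs(value - nb[i]) for nb in neighbors) / len(neighbors)
        for i, value in enumerate(patch)
    ]


# Hypothetical toy data: a mostly flat patch with one bright pixel at index 3.
dictionary = [
    [0.20, 0.20, 0.20, 0.20],
    [0.25, 0.20, 0.20, 0.20],
    [0.20, 0.20, 0.25, 0.20],
    [0.90, 0.90, 0.90, 0.90],
]
patch = [0.20, 0.20, 0.20, 0.90]
saliency = saliency_map(patch, dictionary)
most_salient = saliency.index(max(saliency))
```

The bright pixel differs most from its nearest neighbors, so it receives the highest saliency value, which matches the view of saliency as an anomaly relative to context.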

David Tsai, Yushi Jing, Henry Rowley, Yi Liu, Large-scale Image Annotation using Visual Synset, International Conference on Computer Vision (ICCV), 2011
We address the problem of large-scale annotation of web images. Our approach is based on the concept of visual synset, which is an organization of images which are visually-similar and semantically-related. Each visual synset represents a single prototypical visual concept, and has an associated set of weighted annotations. Linear SVMs are utilized to predict the visual synset membership for unseen image examples, and a weighted voting rule is used to construct a ranked list of predicted annotations from a set of visual synsets. We demonstrate that visual synsets lead to better performance than standard methods on a new annotation database containing more than 200 million images and 300 thousand annotations, which is the largest ever reported.
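
A weighted voting rule of the kind described above might look like the sketch below. The classifier scores are taken as given (no SVM is trained here), and the specific rule (score times annotation weight, ignoring synsets with non-positive scores) is an assumption for illustration, as are the function name `rank_annotations` and the toy synsets.

```python
def rank_annotations(synset_scores, synset_annotations, top_n=3):
    """Rank annotations by accumulating weighted votes from matching synsets.

    synset_scores: dict synset -> classifier score for one image (assumed given)
    synset_annotations: dict synset -> dict annotation -> weight
    """
    votes = {}
    for synset, score in synset_scores.items():
        if score <= 0:
            continue  # assumed rule: only positively scored synsets vote
        for tag, weight in synset_annotations[synset].items():
            votes[tag] = votes.get(tag, 0.0) + score * weight
    # Highest vote first; ties broken alphabetically for determinism.
    return sorted(votes, key=lambda tag: (-votes[tag], tag))[:top_n]


# Hypothetical toy data.
synset_annotations = {
    "s_beach": {"beach": 0.7, "ocean": 0.3},
    "s_sunset": {"sunset": 0.6, "ocean": 0.4},
    "s_car": {"car": 1.0},
}
synset_scores = {"s_beach": 1.2, "s_sunset": 0.8, "s_car": -0.5}
ranked = rank_annotations(synset_scores, synset_annotations)
```

Here "ocean" outranks "sunset" because it collects votes from two synsets, even though neither synset weights it highest.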

(Journal) Yushi Jing, Shumeet Baluja, VisualRank: Applying PageRank to Large-Scale Image Search, Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2008
Because of the relative ease in understanding and processing text, commercial image-search systems often rely on techniques that are largely indistinguishable from text search. Recently, academic studies have demonstrated the effectiveness of employing image-based features to provide either alternative or additional signals to use in this process. However, it remains uncertain whether such techniques will generalize to a large number of popular Web queries and whether the potential improvement to search quality warrants the additional computational cost. In this work, we cast the image-ranking problem into the task of identifying “authority” nodes on an inferred visual similarity graph and propose VisualRank to analyze the visual link structures among images. The images found to be “authorities” are chosen as those that answer the image-queries well. To understand the performance of such an approach in a real system, we conducted a series of large-scale experiments based on the task of retrieving images for 2,000 of the most popular product queries. Our experimental results show significant improvement, in terms of user satisfaction and relevancy, in comparison to the most recent Google Image Search results. Maintaining modest computational cost is vital to ensuring that this procedure can be used in practice; we describe the techniques required to make this system practical for large-scale deployment in commercial search engines.


Yushi Jing, Shumeet Baluja,  PageRank for Product Image Search,  World Wide Web (WWW), 2008
In this paper, we cast the image-ranking problem into the task of identifying "authority" nodes on an inferred visual similarity graph and propose an algorithm to analyze the visual link structures that can be created among a set of images.   Through an iterative procedure based on the PageRank computation, a numerical weight is assigned to each image;  this measures its relative importance to other images being considered.   The incorporation of visual signals in this process differs from the majority of large-scale commercial search engines in use today.   Commercial search engines often rely solely on the text clues of the pages in which images are embedded to rank images, and often entirely ignore the content of the images themselves as a ranking signal.   To quantify the performance of our approach in a real-world system, we conducted a series of experiments based on the task of retrieving images for 2000 of the most popular product queries.   Our experimental results show significant improvement, in terms of user satisfaction and relevancy, in comparison to the most recent Google Image Search results.
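
An iterative PageRank-style computation over a visual similarity graph can be sketched as below. This is a minimal power iteration under stated assumptions: the pairwise similarities are given as a dense matrix, the damping factor of 0.85 and the uniform teleport vector are conventional PageRank defaults rather than values taken from the paper, and the function name `visual_rank` and toy matrix are hypothetical.

```python
def visual_rank(similarity, damping=0.85, iterations=50):
    """Power iteration for a PageRank-style score on a similarity matrix.

    similarity[i][j] is a non-negative visual similarity between images i and j.
    """
    n = len(similarity)
    # Column sums, used so each image splits its score among its neighbors.
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for i in range(n):
            flow = sum(
                similarity[i][j] / col_sums[j] * rank[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            # Uniform teleport term plus damped flow from similar images.
            new_rank.append((1.0 - damping) / n + damping * flow)
        rank = new_rank
    return rank


# Hypothetical toy data: images 0-2 resemble each other, image 3 is an outlier.
toy_similarity = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.1],
    [0.8, 0.7, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
scores = visual_rank(toy_similarity)
lowest = scores.index(min(scores))
```

Images that are similar to many other images accumulate score, while the outlier image receives the lowest weight, which is the "authority" intuition described in the abstract.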

(Journal) Yushi Jing, Vladimir Pavlovic, James Rehg, Boosted Bayesian Network Classifier,  Machine Learning Journal, 2008
The use of Bayesian networks for classification problems has received significant recent attention. Although computationally efficient, the standard maximum likelihood learning method tends to be suboptimal due to the mismatch between its optimization criteria (data likelihood) and the actual goal of classification (label prediction accuracy). Recent approaches to optimizing classification performance during parameter or structure learning show promise, but lack the favorable computational properties of maximum likelihood learning. In this paper we present Boosted Bayesian Network Classifiers, a framework to combine discriminative data-weighting with generative training of intermediate models. We show that Boosted Bayesian network Classifiers encompass the basic generative models in isolation, but improve their classification performance when the model structure is suboptimal. This framework can be easily extended to temporal Bayesian network models including HMM and DBN. On a large suite of benchmark data-sets, this approach outperforms generative graphical models such as naive Bayes, TAN, unrestricted Bayesian network and DBN in classification accuracy. Boosted Bayesian network classifiers have comparable or better performance in comparison to other discriminatively trained graphical models including ELR-NB, ELR-TAN, BNC-2P, BNC-MDL and CRF. Furthermore, boosted Bayesian networks require significantly less training time than all of the competing methods.

Yushi Jing, Vladimir Pavlovic, James Rehg, Efficient Discriminative Learning of Bayesian Network Classifier, International Conference on Machine Learning (ICML), 2005 -- Best Student Paper
The use of Bayesian Networks for classification problems has received significant recent attention.  Although computationally efficient, the standard maximum likelihood learning method tends to be suboptimal due to the mismatch between its optimization criteria (data likelihood) and the actual goal for classification (label prediction).  Recent approaches to optimizing the classification performance during parameter or structure learning show promise, but lack the favorable computational properties of maximum likelihood learning.  In this paper we present the Boosted Augmented Naive Bayes (BAN) classifier.   We show that a combination of discriminative data-weighting with generative training of intermediate models can yield a computationally efficient method for discriminative parameter learning and structure selection.

Henry Rowley, Yushi Jing, Shumeet Baluja, Large-scale Image-based Adult-content Filtering, International Conference on Computer Vision Theory, 2005
As more people start using the Internet and more content is placed online, the chances that individuals will encounter inappropriate or adult-oriented content increase. Search engines can exacerbate this problem by aggregating content from many sites and summarizing it into a single result page. Many existing methods for detecting adult-content currently attempt to classify web pages based on their text content. If the text content of a page is classified as adult-content, this information can be propagated to linked images and pages. However, keyword and other text-based approaches have significant limitations. First, they are language specific and require a tremendous amount of manual work to construct (either directly, or by labeling training data for all languages). Second, many adult-content pages do not contain enough text for reliable classification. Third, the text on the page may be intentionally obfuscated (i.e. encoded in an image).  This paper looks at practical ways to detect adult content in the images themselves, on a scale which can be applied to a search engine covering a large fraction of the images on the WWW. The focus is on efficient and robust techniques, such as color classification and face detection, which together can detect many pornographic images with little computational cost.