Yushi Jing (Kevin Jing)

Bio

I currently lead the visual-search team at Pinterest. Before that, I was the co-founder and CEO of VisualGraph, a cloud-based visual-shopping platform.

Previously I was a Senior Research Scientist and tech lead at Google. I helped develop the first image-processing application at Google in 2004 [paper | scientific american], and became a Senior Research Scientist in 2011. I worked in the area of content-based image retrieval ("visual search") as the tech lead for three projects: content-based image ranking [paper | new york times | blog], image discovery and visualization [paper | wall street journal | blog], and visual search for third-party enterprises.

My work at Google resulted in more than 35 patent applications (12 granted) in the area of media search and advertising. I was a visiting scientist at Tokyo University in 2012. I received my BS, MS, and PhD from the Georgia Institute of Technology, and studied Statistics and Engineering Management at Stanford University. I received the best student paper award at the International Conference on Machine Learning (ICML) in 2005.

Google Research
Tokyo University
Georgia Institute of Technology
Stanford University



Publications



Yusuke Matsui, Kiyoharu Aizawa, Yushi Jing, Sketch2Manga: Sketch-based Manga Retrieval, International Conference on Image Processing (ICIP), 2013
Despite the widespread attention given to content-based image retrieval systems, the question of how to retrieve manga images, whose characteristics differ from those of naturalistic images, has been little studied. We propose a sketch-based method for manga image retrieval, in which users draw sketches in a Web browser to automatically retrieve similar images from a database of manga titles.


Yushi Jing, David Tsai, James Rehg, Michele Covell, Learning Query-specific Distance Functions for Large-scale Image Search, submitted to Transactions on Multimedia (TMM), 2012
Current Google image search adopts a hybrid search approach in which a text-based query (e.g. "Paris Landmarks") is used to retrieve a set of relevant images, which are then refined by the user (e.g. by re-ranking the retrieved images based on similarity to a selected example). We conjecture that, given such hybrid image search engines, learning per-query distance functions over image features can improve the estimation of image similarity. We propose scalable solutions to learning query-specific distance functions by 1) adopting a simple large-margin learning framework, and 2) using the query logs of text-based image search engines to train the distance functions used in content-based systems. We evaluate the feasibility and efficacy of the proposed system through comprehensive human evaluation, and compare the results with the state-of-the-art distance function used by Google similar image search.
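
As a rough illustration of the large-margin idea above: learn a per-query diagonal weighting of feature differences so that, for a given query example, relevant images land closer than irrelevant ones by a unit margin. The feature dimensionality, learning rate, and data below are illustrative stand-ins, not the paper's.

    import numpy as np

    def learn_query_distance(anchor, positives, negatives, epochs=50, lr=0.01):
        """Learn a per-query, non-negative diagonal weighting w so that
        d_w(anchor, positive) + 1 <= d_w(anchor, negative), hinge-style."""
        w = np.ones(anchor.shape[0])              # start from plain squared L2
        for _ in range(epochs):
            for p in positives:
                for n in negatives:
                    dp = (anchor - p) ** 2        # element-wise squared diffs
                    dn = (anchor - n) ** 2
                    if w @ dn - w @ dp < 1.0:     # margin violated
                        w += lr * (dn - dp)       # push the pair apart
            w = np.maximum(w, 0.0)                # keep the metric valid
        return w

    # toy usage on random 16-d features (hypothetical data)
    rng = np.random.default_rng(0)
    q = rng.normal(size=16)
    pos = q + 0.1 * rng.normal(size=(5, 16))      # near-duplicates as positives
    neg = rng.normal(size=(5, 16))
    w = learn_query_distance(q, pos, neg)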

Yushi Jing, Henry Rowley, Meng Wang, David Tsai, Chuck Rosenberg, Michele Covell, Google Image Swirl: A Large-scale Content-based Image Visualization System, World Wide Web, 2012
Web image retrieval systems, such as Google or Bing image search, present search results as a relevance-ordered list. Although alternative browsing models (e.g. results as clusters or hierarchies) have been proposed in the past, it remains to be seen whether such models can be applied to large-scale image search. This work presents Google Image Swirl, a large-scale, publicly available, hierarchical image-browsing system that automatically groups search results based on visual and semantic similarity. This paper describes the methods used to build such a system and shares findings from two years' worth of user feedback and usage statistics.
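
A minimal sketch of the grouping step, assuming generic image descriptors and off-the-shelf agglomerative clustering; Image Swirl's actual features and hierarchy construction are more involved than this.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    features = np.random.rand(100, 64)                     # stand-in image descriptors
    dists = pdist(features, metric="cosine")               # condensed pairwise distances
    tree = linkage(dists, method="average")                # agglomerative hierarchy
    top_level = fcluster(tree, t=8, criterion="maxclust")  # 8 top-level groups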

Meng Wang, Janusz Konrad, Prakash Ishwar, Yushi Jing, Henry Rowley, Image Saliency: From Local to Global Context, Computer Vision and Pattern Recognition (CVPR), 2011
We propose a novel framework for automatic saliency estimation in natural images. We consider saliency to be an anomaly with respect to a given context that can be global or local. In the case of global context, we estimate saliency in the whole image relative to a large dictionary of images. Unlike in some prior methods, this dictionary is not annotated, i.e., saliency is assumed unknown. In the case of local context, we partition the image into patches and estimate saliency in each patch relative to a large dictionary of unannotated patches from the rest of the image. We propose a unified framework that applies to both cases in three steps. First, given an input (image or patch) we extract k nearest neighbors from the dictionary. Then, we geometrically warp each neighbor to match the input. Finally, we derive the saliency map from the mean absolute error between the input and all its warped neighbors. This algorithm is not only easy to implement but also outperforms state-of-the-art methods.
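
A stripped-down sketch of the local-context case, with patches compared directly to their k nearest neighbours; the geometric warping step described above is omitted here for brevity.

    import numpy as np

    def patch_saliency(patches, k=5):
        """patches: (n, d) array of flattened image patches."""
        n = patches.shape[0]
        saliency = np.zeros(n)
        for i in range(n):
            errs = np.abs(patches - patches[i]).mean(axis=1)  # MAE to every patch
            errs[i] = np.inf                                  # exclude the patch itself
            saliency[i] = np.sort(errs)[:k].mean()            # high error even among the
        return saliency                                       # nearest neighbours => salient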

David Tsai, Yushi Jing, Henry Rowley, Yi Liu, Large-scale Image Annotation using Visual Synset, International Conference on Computer Vision (ICCV), 2011
We address the problem of large-scale annotation of web images. Our approach is based on the concept of a visual synset: an organization of images that are visually similar and semantically related. Each visual synset represents a single prototypical visual concept and has an associated set of weighted annotations. Linear SVMs are used to predict visual-synset membership for unseen image examples, and a weighted voting rule is used to construct a ranked list of predicted annotations from a set of visual synsets. We demonstrate that visual synsets lead to better performance than standard methods on a new annotation database containing more than 200 million images and 300 thousand annotations, which is the largest ever reported.
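
A hedged sketch of the prediction side: one linear SVM per synset scores membership, and each predicted synset casts weighted votes for its annotations. The synsets, weights, and training data below are invented for illustration.

    import numpy as np
    from collections import defaultdict
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    def toy_synset(annotations, dim=32):
        """Train a stand-in membership classifier on random data."""
        X, y = rng.random((40, dim)), rng.integers(0, 2, 40)
        return LinearSVC().fit(X, y), annotations

    synsets = [
        toy_synset({"eiffel tower": 0.9, "paris": 0.6}),
        toy_synset({"paris": 0.8, "landmark": 0.5}),
    ]

    def annotate(x):
        votes = defaultdict(float)
        for clf, annotations in synsets:
            score = clf.decision_function(x.reshape(1, -1))[0]
            if score > 0:                                # predicted synset member
                for label, weight in annotations.items():
                    votes[label] += weight * score       # weighted voting rule
        return sorted(votes.items(), key=lambda kv: -kv[1])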

Yushi Jing, Shumeet Baluja, VisualRank: Applying PageRank to Large-Scale Image Search, Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2008
Because of the relative ease in understanding and processing text, commercial image-search systems often rely on techniques that are largely indistinguishable from text search. Recently, academic studies have demonstrated the effectiveness of employing image-based features to provide either alternative or additional signals to use in this process. However, it remains uncertain whether such techniques will generalize to a large number of popular Web queries and whether the potential improvement to search quality warrants the additional computational cost. In this work, we cast the image-ranking problem into the task of identifying "authority" nodes on an inferred visual similarity graph and propose VisualRank to analyze the visual link structures among images. The images found to be "authorities" are chosen as those that answer the image queries well. To understand the performance of such an approach in a real system, we conducted a series of large-scale experiments based on the task of retrieving images for 2,000 of the most popular product queries. Our experimental results show significant improvement, in terms of user satisfaction and relevancy, in comparison to the most recent Google Image Search results. Maintaining modest computational cost is vital to ensuring that this procedure can be used in practice; we describe the techniques required to make this system practical for large-scale deployment in commercial search engines.
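
The core computation is close enough to PageRank that a compact sketch fits in a few lines: power iteration over a column-normalized visual-similarity matrix S (S[i, j] = similarity of images i and j). The damping value follows the usual PageRank convention; the paper's similarity measure itself is not reproduced here.

    import numpy as np

    def visualrank(S, damping=0.85, iters=100):
        n = S.shape[0]
        col_sums = S.sum(axis=0)
        col_sums[col_sums == 0] = 1.0          # guard against empty columns
        P = S / col_sums                       # column-stochastic transitions
        r = np.full(n, 1.0 / n)                # uniform starting rank
        for _ in range(iters):
            r = damping * P @ r + (1 - damping) / n
        return r                               # high rank = visual "authority"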

Yushi Jing, Vladimir Pavlovic, James Rehg, Boosted Bayesian Network Classifiers, Machine Learning Journal, 2008
The use of Bayesian networks for classification problems has received significant recent attention. Although computationally efficient, the standard maximum likelihood learning method tends to be suboptimal due to the mismatch between its optimization criteria (data likelihood) and the actual goal of classification (label prediction accuracy). Recent approaches to optimizing classification performance during parameter or structure learning show promise, but lack the favorable computational properties of maximum likelihood learning. In this paper we present Boosted Bayesian Network Classifiers, a framework to combine discriminative data-weighting with generative training of intermediate models. We show that Boosted Bayesian Network Classifiers encompass the basic generative models in isolation, but improve their classification performance when the model structure is suboptimal. This framework can be easily extended to temporal Bayesian network models including HMM and DBN. On a large suite of benchmark datasets, this approach outperforms generative graphical models such as naive Bayes, TAN, unrestricted Bayesian network and DBN in classification accuracy. Boosted Bayesian network classifiers have comparable or better performance in comparison to other discriminatively trained graphical models including ELR-NB, ELR-TAN, BNC-2P, BNC-MDL and CRF. Furthermore, boosted Bayesian networks require significantly less training time than all of the competing methods.
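
A rough approximation of the idea using off-the-shelf parts: AdaBoost supplies the discriminative data-weighting while a generatively trained naive Bayes model serves as the intermediate learner. This is a stand-in for the paper's framework, not its implementation, and it assumes scikit-learn 1.2+ for the estimator argument.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    # each boosting round fits the generative model on re-weighted data
    model = AdaBoostClassifier(estimator=GaussianNB(), n_estimators=25)
    model.fit(X, y)
    print(model.score(X, y))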

Yushi Jing, Shumeet Baluja, PageRank for Product Image Search, World Wide Web (WWW), 2008
In this paper, we cast the image-ranking problem into the task of identifying "authority" nodes on an inferred visual similarity graph and propose an algorithm to analyze the visual link structures that can be created among a set of images. Through an iterative procedure based on the PageRank computation, a numerical weight is assigned to each image; this measures its relative importance to the other images being considered. The incorporation of visual signals in this process differs from the majority of large-scale commercial search engines in use today, which often rely solely on textual cues from the pages in which images are embedded and entirely ignore the content of the images themselves as a ranking signal. To quantify the performance of our approach in a real-world system, we conducted a series of experiments based on the task of retrieving images for 2,000 of the most popular product queries. Our experimental results show significant improvement, in terms of user satisfaction and relevancy, in comparison to the most recent Google Image Search results.

Shumeet Baluja, Rohan Seth, D. Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, Mohamed Aly, Video Suggestion and Discovery for YouTube, World Wide Web (WWW), 2008
The rapid growth of the number of videos in YouTube provides enormous potential for users to find content of interest to them. Unfortunately, given the difficulty of searching videos, the size of the video repository also makes the discovery of new content a daunting task. In this paper, we present a novel method based upon the analysis of the entire user–video graph to provide personalized video suggestions for users. The resulting algorithm, termed Adsorption, provides a simple method to efficiently propagate preference information through a variety of graphs. We extensively test the results of the recommendations on a three month snapshot of live data from YouTube.
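
A toy sketch of the propagation step on a made-up user-video graph: each node repeatedly averages its neighbours' label distributions while seed nodes keep re-injecting their own labels. The real Adsorption algorithm parameterizes this with per-node injection, continuation, and abandonment probabilities, which are collapsed into a single mixing constant here.

    from collections import defaultdict

    # hypothetical watch graph: users link to videos and vice versa
    edges = {
        "user_a": ["video_1", "video_2"],
        "user_b": ["video_2", "video_3"],
        "video_1": ["user_a"],
        "video_2": ["user_a", "user_b"],
        "video_3": ["user_b"],
    }
    seeds = {v: {v: 1.0} for v in ("video_1", "video_2", "video_3")}

    labels = {node: dict(seeds.get(node, {})) for node in edges}
    for _ in range(10):                                   # propagation rounds
        updated = {}
        for node, nbrs in edges.items():
            acc = defaultdict(float)
            for nbr in nbrs:
                for label, w in labels[nbr].items():
                    acc[label] += w / len(nbrs)           # average neighbour labels
            for label, w in seeds.get(node, {}).items():
                acc[label] = 0.5 * acc[label] + 0.5 * w   # re-inject seed labels
            updated[node] = dict(acc)
        labels = updated
    # labels["user_a"] now scores unseen videos as suggestion candidates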

Yushi Jing, Shumeet Baluja, Henry Rowley, Canonical Image Selection from the Web,  International Conference on Image and Video Retrieval (CIVR), 2007
The vast majority of the features used in today's commercially deployed image search systems employ techniques that are largely indistinguishable from text-document search: the images returned in response to a query are based on the text of the web pages from which they are linked. Unfortunately, depending on the query type, the quality of this approach can be inconsistent. Several recent studies have demonstrated the effectiveness of using image features to refine search results. However, it is not clear whether (or how much) image-based approaches can generalize to larger samples of web queries. Also, the previously used global features often capture only a small part of the image information, which in many cases does not correspond to the distinctive characteristics of the category. This paper explores the use of local features in the concrete task of finding the single canonical image for a collection of commonly searched-for products. Through large-scale user testing, the canonical images found by using only local image features significantly outperformed the top results from Yahoo, Microsoft and Google, highlighting the importance of having these image features as an integral part of future image search engines.
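
A simplified sketch of the selection step: score each image by its summed similarity to the rest of the set and pick the most central one. Cosine similarity over generic descriptors stands in for the local-feature matching the paper actually uses.

    import numpy as np

    def canonical_index(features):
        """features: (n, d) array of per-image descriptors; returns the
        index of the most 'central' image in the collection."""
        normed = features / np.linalg.norm(features, axis=1, keepdims=True)
        sim = normed @ normed.T             # pairwise cosine similarities
        np.fill_diagonal(sim, 0.0)          # ignore self-similarity
        return int(np.argmax(sim.sum(axis=1)))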

Yushi Jing, Vladimir Pavlovic, James Rehg, Efficient Discriminative Learning of Bayesian Network Classifier, International Conference on Machine Learning (ICML), 2005 -- Best Student Paper
The use of Bayesian Networks for classification problems has received significant recent attention.  Although computationally efficient, the standard maximum likelihood learning method tends to be suboptimal due to the mismatch between its optimization criteria (data likelihood) and the actual goal for classification (label prediction).  Recent approaches to optimizing the classification performance during parameter or structure learning show promise, but lack the favorable computational properties of maximum likelihood learning.  In this paper we present the Boosted Augmented Naive Bayes (BAN) classifier.   We show that a combination of discriminative data-weighting with generative training of intermediate models can yield a computationally efficient method for discriminative parameter learning and structure selection.

Henry Rowley, Yushi Jing, Shumeet Baluja, Large-scale Image-based Adult-content Filtering, International Conference on Computer Vision Theory, 2005
As more people start using the Internet and more content is placed online, the chances that individuals will encounter inappropriate or adult-oriented content increase. Search engines can exacerbate this problem by aggregating content from many sites and summarizing it into a single result page. Many existing methods for detecting adult content attempt to classify web pages based on their text content. If the text content of a page is classified as adult content, this information can be propagated to linked images and pages. However, keyword and other text-based approaches have significant limitations. First, they are language specific and require a tremendous amount of manual work to construct (either directly, or by labeling training data for all languages). Second, many adult-content pages do not contain enough text for reliable classification. Third, the text on the page may be intentionally obfuscated (i.e. encoded in an image). This paper looks at practical ways to detect adult content in the images themselves, on a scale which can be applied to a search engine covering a large fraction of the images on the WWW. The focus is on efficient and robust techniques, such as color classification and face detection, which together can detect many pornographic images with little computational cost.
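
As a toy illustration of the cheap colour cue mentioned above: flag an image when the fraction of skin-toned pixels crosses a threshold. The RGB rule and threshold are generic heuristics from the skin-detection literature, not the paper's trained classifier, and a real system would combine this with face detection and other signals.

    import numpy as np

    def skin_fraction(rgb):
        """rgb: (h, w, 3) uint8 image; returns fraction of skin-toned pixels."""
        r = rgb[..., 0].astype(int)
        g = rgb[..., 1].astype(int)
        b = rgb[..., 2].astype(int)
        skin = ((r > 95) & (g > 40) & (b > 20) &
                (r > g) & (r > b) & (np.abs(r - g) > 15))
        return skin.mean()

    def flag_for_review(rgb, threshold=0.4):
        return skin_fraction(rgb) > threshold   # cheap first-pass filter only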