Google Research Blog

The latest news on Google Research.

Updated: 2017-03-24T08:18:19.164-07:00


Adding Sound Effect Information to YouTube Captions


Posted by Sourish Chaudhuri, Software Engineer, Sound Understanding

The effect of audio on our perception of the world can hardly be overstated. Its importance as a communication medium via speech is obviously the most familiar, but there is also significant information conveyed by ambient sounds. These ambient sounds create context that we instinctively respond to, like getting startled by sudden commotion, the use of music as a narrative element, or how laughter is used as an audience cue in sitcoms.

Since 2009, YouTube has provided automatic caption tracks for videos, focusing heavily on speech transcription in order to make the content hosted more accessible. However, without similar descriptions of the ambient sounds in videos, much of the information and impact of a video is not captured by speech transcription alone. To address this, we announced the addition of sound effect information to the automatic caption track in YouTube videos, enabling greater access to the richness of all the audio content.

In this post, we discuss the backend system developed for this effort, a collaboration among the Accessibility, Sound Understanding and YouTube teams that used machine learning (ML) to enable the first ever automatic sound effect captioning system for YouTube.

Click the CC button to see the sound effect captioning system in action.

The application of ML – in this case, a Deep Neural Network (DNN) model – to the captioning task presented unique challenges. While the process of analyzing the time-domain audio signal of a video to detect various ambient sounds is similar to other well known classification problems (such as object detection in images), in a product setting the solution faces additional difficulties. In particular, given an arbitrary segment of audio, we need our models to be able to 1) detect the desired sounds, 2) temporally localize the sound in the segment and 3) effectively integrate it in the caption track, which may have parallel and independent speech recognition results.

A DNN Model for Ambient Sound

The first challenge we faced in developing the model was the task of obtaining enough labeled data suitable for training our neural network. While labeled ambient sound information is difficult to come by, we were able to generate a large enough dataset for training using weakly labeled data. But of all the ambient sounds in a given video, which ones should we train our DNN to detect?

For the initial launch of this feature, we chose [APPLAUSE], [MUSIC] and [LAUGHTER], prioritized based upon our analysis of human-created caption tracks that indicates that they are among the most frequent sounds that are manually captioned. While the sound space is obviously far richer and provides even more contextually relevant information than these three classes, the semantic information conveyed by these sound effects in the caption track is relatively unambiguous, as opposed to sounds like [RING] which raises the question of “what was it that rang – a bell, an alarm, a phone?”

Much of our initial work on detecting these ambient sounds also included developing the infrastructure and analysis frameworks to enable scaling for future work, including both the detection of sound events and their integration into the automatic caption track. Investing in the development of this infrastructure has the added benefit of allowing us to easily incorporate more sound types in the future, as we expand our algorithms to understand a wider vocabulary of sounds (e.g. [RING], [KNOCK], [BARK]). In doing so, we will be able to incorporate the detected sounds into the narrative to provide more relevant information (e.g. [PIANO MUSIC], [RAUCOUS APPLAUSE]) to viewers.

Dense Detections to Captions

When a video is uploaded to YouTube, the sound effect recognition pipeline runs on t[...]
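
The three requirements named in this post (detect, temporally localize, integrate into the caption track) can be illustrated with a small sketch. It assumes a classifier has already produced one probability per fixed-length audio frame for a single sound class. The frame length, threshold, minimum duration, and the function names are all hypothetical, and this is not a description of the YouTube pipeline.

```python
from typing import List, Tuple


def frames_to_segments(
    probs: List[float],
    frame_sec: float = 1.0,   # hypothetical frame length in seconds
    threshold: float = 0.5,   # hypothetical decision threshold
    min_frames: int = 2,      # hypothetical: drop detections shorter than this
) -> List[Tuple[float, float]]:
    """Merge runs of above-threshold frames into (start, end) spans in seconds."""
    segments = []
    start = None  # index of the first frame of the current run, if any
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            if i - start >= min_frames:
                segments.append((start * frame_sec, i * frame_sec))
            start = None
    # Close a run that is still open at the end of the audio.
    if start is not None and len(probs) - start >= min_frames:
        segments.append((start * frame_sec, len(probs) * frame_sec))
    return segments


def to_captions(label: str, segments: List[Tuple[float, float]]) -> List[str]:
    """Render each span as a plain-text caption cue."""
    return [f"{start:.1f}-{end:.1f} [{label}]" for start, end in segments]


applause_probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.6, 0.9, 0.95]
for cue in to_captions("APPLAUSE", frames_to_segments(applause_probs)):
    print(cue)
```

The single high frame at index 4 is discarded by the minimum-duration rule, which is one simple way to keep brief spurious detections out of a caption track.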

Distill: Supporting Clarity in Machine Learning


Posted by Shan Carter, Software Engineer and Chris Olah, Research Scientist, Google Brain Team

Science isn't just about discovering new results. It’s also about human understanding. Scientists need to develop notations, analogies, visualizations, and explanations of ideas. This human dimension of science isn't a minor side project. It's deeply tied to the heart of science.

That’s why, in collaboration with OpenAI, DeepMind, YC Research, and others, we’re excited to announce the launch of Distill, a new open science journal and ecosystem supporting human understanding of machine learning. Distill is an independent organization, dedicated to fostering a new segment of the research community.

Modern web technology gives us powerful new tools for expressing this human dimension of science. We can create interactive diagrams and user interfaces that enable intuitive exploration of research ideas. Over the last few years we've seen many incredible demonstrations of this kind of work.

An interactive diagram explaining the Neural Turing Machine from Olah & Carter, 2016.

Unfortunately, while there are a plethora of conferences and journals in machine learning, there aren’t any research venues that are dedicated to publishing this kind of work. This is partly an issue of focus, and partly because traditional publication venues can't, by virtue of their medium, support interactive visualizations. Without a venue to publish in, many significant contributions don’t count as “real academic contributions” and their authors can’t access the academic support structure.

That’s why Distill aims to build an ecosystem to support this kind of work, starting with three pieces: a research journal, prizes recognizing outstanding work, and tools to facilitate the creation of interactive articles.

Distill is an ecosystem to support clarity in Machine Learning.

Led by a diverse steering committee of leaders from the machine learning and user interface communities, we are very excited to see where Distill will go. To learn more about Distill, see the overview page or read the latest articles. [...]

Announcing Guetzli: A New Open Source JPEG Encoder


Posted by Robert Obryk and Jyrki Alakuijala, Software Engineers, Google Research Europe

(Cross-posted on the Google Open Source Blog)

At Google, we care about giving users the best possible online experience, both through our own services and products and by contributing new tools and industry standards for use by the online community. That’s why we’re excited to announce Guetzli, a new open source algorithm that creates high quality JPEG images with file sizes 35% smaller than currently available methods, enabling webmasters to create webpages that can load faster and use even less data.

Guetzli [guɛtsli] (cookie in Swiss German) is a JPEG encoder for digital images and web graphics that can enable faster online experiences by producing smaller JPEG files while still maintaining compatibility with existing browsers, image processing applications and the JPEG standard. From the practical viewpoint this is very similar to our Zopfli algorithm, which produces smaller PNG and gzip files without needing to introduce a new format, and different than the techniques used in RNN-based image compression, RAISR, and WebP, which all need client changes for compression gains at internet scale.

The visual quality of JPEG images is directly correlated to its multi-stage compression process: color space transform, discrete cosine transform, and quantization. Guetzli specifically targets the quantization stage in which the more visual quality loss is introduced, the smaller the resulting file. Guetzli strikes a balance between minimal loss and file size by employing a search algorithm that tries to overcome the difference between the psychovisual modeling of JPEG's format, and Guetzli’s psychovisual model, which approximates color perception and visual masking in a more thorough and detailed way than what is achievable by simpler color transforms and the discrete cosine transform. However, while Guetzli creates smaller image file sizes, the tradeoff is that these search algorithms take significantly longer to create compressed images than currently available methods.

Figure 1. 16x16 pixel synthetic example of a phone line hanging against a blue sky, traditionally a case where JPEG compression algorithms suffer from artifacts. Uncompressed original is on the left. Guetzli (on the right) shows fewer ringing artifacts than libjpeg (middle) and has a smaller file size.

And while Guetzli produces smaller image file sizes without sacrificing quality, we additionally found that in experiments where compressed image file sizes are kept constant, human raters consistently preferred the images Guetzli produced over libjpeg images, even when the libjpeg files were the same size or even slightly larger. We think this makes the slower compression a worthy tradeoff.

Figure 2. 20x24 pixel zoomed areas from a picture of a cat’s eye. Uncompressed original on the left. Guetzli (on the right) shows fewer ringing artifacts than libjpeg (middle) without requiring a larger file size.

It is our hope that webmasters and graphic designers will find Guetzli useful and apply it to their photographic content, making users’ experience smoother on image-heavy websites in addition to reducing load times and bandwidth costs for mobile users. Last, we hope that the new explicitly psychovisual approach in Guetzli will inspire further image and video compression research. [...]
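
The trade-off in the quantization stage described in this post (more loss, smaller file) can be seen in a toy sketch. It applies a one-dimensional 8-point discrete cosine transform to a single row of made-up pixel values and quantizes the coefficients with one uniform step size. Real JPEG uses 8x8 blocks and per-frequency quantization tables, and Guetzli's search and psychovisual model are not reproduced here; the sample values and step sizes are assumptions chosen only for illustration.

```python
import math

N = 8  # JPEG works on 8x8 blocks; one 8-sample row keeps the sketch short


def _scale(k: int) -> float:
    """Normalization factor that makes the DCT-II orthonormal."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct(samples):
    """Orthonormal 1-D DCT-II of N samples."""
    return [
        _scale(k) * sum(samples[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        for k in range(N)
    ]


def idct(coeffs):
    """Inverse of dct()."""
    return [
        sum(_scale(k) * coeffs[k] * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N))
        for n in range(N)
    ]


def quantize(coeffs, step):
    """Uniform quantization: a larger step discards more detail."""
    return [round(c / step) for c in coeffs]


def dequantize(levels, step):
    return [level * step for level in levels]


def summarize(samples, step):
    """Return (number of zero coefficients, worst-case pixel error) for one step size."""
    levels = quantize(dct(samples), step)
    restored = idct(dequantize(levels, step))
    zeros = sum(1 for level in levels if level == 0)
    max_error = max(abs(a - b) for a, b in zip(samples, restored))
    return zeros, max_error


row = [52, 55, 61, 66, 70, 61, 64, 73]  # made-up pixel intensities
for step in (4, 32):
    zeros, max_error = summarize(row, step)
    print(f"step={step}: {zeros} zero coefficients, max error {max_error:.2f}")
```

Coefficients that quantize to zero cost very little to entropy-code, so a coarser step tends to shrink the file while raising the reconstruction error. An encoder's job is choosing where that error is least visible.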

An Upgrade to SyntaxNet, New Models and a Parsing Competition


Posted by David Weiss and Slav Petrov, Research Scientists

At Google, we continuously improve the language understanding capabilities used in applications ranging from generation of email responses to translation. Last summer, we open-sourced SyntaxNet, a neural-network framework for analyzing and understanding the grammatical structure of sentences. Included in our release was Parsey McParseface, a state-of-the-art model that we had trained for analyzing English, followed quickly by a collection of pre-trained models for 40 additional languages, which we dubbed Parsey's Cousins. While we were excited to share our research and to provide these resources to the broader community, building machine learning systems that work well for languages other than English remains an ongoing challenge. We are excited to announce a few new research resources, available now, that address this problem.

SyntaxNet Upgrade

We are releasing a major upgrade to SyntaxNet. This upgrade incorporates nearly a year’s worth of our research on multilingual language understanding, and is available to anyone interested in building systems for processing and understanding text. At the core of the upgrade is a new technology that enables learning of richly layered representations of input sentences. More specifically, the upgrade extends TensorFlow to allow joint modeling of multiple levels of linguistic structure, and to allow neural-network architectures to be created dynamically during processing of a sentence or document.

Our upgrade makes it, for example, easy to build character-based models that learn to compose individual characters into words (e.g. ‘c-a-t’ spells ‘cat’). By doing so, the models can learn that words can be related to each other because they share common parts (e.g. ‘cats’ is the plural of ‘cat’ and shares the same stem; ‘wildcat’ is a type of ‘cat’). Parsey and Parsey’s Cousins, on the other hand, operated over sequences of words. As a result, they were forced to memorize words seen during training and relied mostly on the context to determine the grammatical function of previously unseen words.

As an example, consider the following (meaningless but grammatically correct) sentence:

The gostak distims the doshes.

This sentence was originally coined by Andrew Ingraham who explained: “You do not know what this means; nor do I. But if we assume that it is English, we know that the doshes are distimmed by the gostak. We know too that one distimmer of doshes is a gostak.” Systematic patterns in morphology and syntax allow us to guess the grammatical function of words even when they are completely novel: we understand that ‘doshes’ is the plural of the noun ‘dosh’ (similar to the ‘cats’ example above) or that ‘distims’ is the third person singular of the verb ‘distim’. Based on this analysis we can then derive the overall structure of this sentence even though we have never seen the words before.

ParseySaurus

To showcase the new capabilities provided by our upgrade to SyntaxNet, we are releasing a set of new pretrained models called ParseySaurus. These models use the character-based input representation mentioned above and are thus much better at predicting the meaning of new words based both on their spelling and how they are used in context. The ParseySaurus models are far more accurate than Parsey’s Cousins (reducing errors by as much as 25%), particularly for morphologically-rich languages like Russian, or agglutinative languages like Turkish and Hungarian. In those languages there can be dozens of forms for each word and many of these forms might never be observed during training, even in a very large corpus.

Consider the following fictitious Russian sentence, where again the stems are meaningless, but the suffixes define an unambiguous interpretation of the sentence structure:

Even though our Russian ParseySaurus model has never seen these words, it can correctly analyze the sentence by inspecting the character sequences which constitute each [...]
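
To make the contrast between word-level and character-level input concrete, here is a minimal sketch. It compares words by the character trigrams they share, which is a deliberately simple stand-in for the learned character-based representations the post describes; it is not how SyntaxNet or ParseySaurus work internally, and the function names are hypothetical.

```python
def char_ngrams(word: str, n: int = 3) -> set:
    """Character n-grams of a word, with < and > marking its boundaries."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two words' character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A word-level model sees 'cat' and 'cats' as two unrelated vocabulary entries.
# At the character level the shared stem is visible, even for invented words.
for pair in [("cat", "cats"), ("dosh", "doshes"), ("cat", "dog")]:
    print(pair, round(overlap(*pair), 2))
```

A model that reads characters can pick up on regularities like the plural suffix in ‘doshes’ without ever having seen the word, which is the property the fictitious-sentence examples are meant to show.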

Quick Access in Drive: Using Machine Learning to Save You Time


Posted by Sandeep Tata, Software Engineer, Google Research

At Google, we research cutting-edge machine learning (ML) techniques that allow us to provide products and services aimed at helping you focus on what’s important. From providing language translations to understanding images to helping you respond to emails, it is our goal to help you save time, making life, and work, a little more convenient.

Recent studies have shown that finding information is second only to managing email as a drain on workplace productivity. To help address this, last year we launched Quick Access, a feature in Google Drive that uses ML to surface the most relevant documents as soon as you visit the Google Drive home screen. Originally available only for G Suite customers on Android, Quick Access is now available for anyone who uses Google Drive (on the Web, Android, and iOS), saving you from having to enter a search or to browse through your folders. Our metrics show that Quick Access takes you to the documents you need in half the time compared to manually navigating or searching.

Quick Access uses deep neural networks to determine patterns from various signals, such as activity in Drive, meetings on your Calendar, and more, to anticipate your needs and show the appropriate documents on the Drive home screen. Traditional ML approaches require domain experts to derive complex features from data, which are in turn used to train the model. For Quick Access, however, we constructed thousands of simple features from the various signals above (for instance, the timestamps of the last 20 edit events on a document would constitute 20 simple input features), and combined them with the power of deep neural networks to learn from the aggregated activity of our users. By using deep neural networks we were able to develop accurate predictive models with simpler features and less feature engineering effort.

Quick Access suggestions on the top row in Drive on a desktop browser.

The model computes a relevance score for each of the documents in Drive and the top scoring documents are presented on the home screen. For example, if you have a Calendar entry for a meeting with a coworker in the next few minutes, Quick Access might predict that the presentation you’ve been working on with that coworker is more relevant compared to your monthly budget spreadsheet or the photos you uploaded last week. If you’ve been updating a spreadsheet every weekend, then next weekend, Quick Access will likely display that spreadsheet ahead of the other documents you viewed during the week.

We hope Quick Access helps you use Drive more effectively, allowing you to save time and be more productive. To learn more, watch this talk from Google Cloud Next ‘17 that dives into more details on the ML behind Quick Access.

Acknowledgements

Thanks to Alexandrin Popescul and Marc Najork for contributions that made this application of machine learning technology possible. This work was in close collaboration with several engineers on the Drive team including Sean Abraham, Brian Calaci, Mike Colagrosso, Mike Procopio, Jesse Sterr, and Timothy Vis. [...]
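
The "simple features, then score, then show the top documents" flow described in this post can be sketched as follows. The only detail taken from the post is the use of the last 20 edit timestamps as 20 input features. The padding value, the hand-written scoring function (a stand-in for the trained deep neural network), and all names are assumptions made for illustration.

```python
from typing import Dict, List

NUM_EVENTS = 20   # from the post: timestamps of the last 20 edit events
PAD = -1.0        # hypothetical filler for documents with fewer edits


def edit_features(edit_times: List[float], now: float) -> List[float]:
    """One feature per recent edit: seconds elapsed since it, newest first."""
    recent = sorted(edit_times, reverse=True)[:NUM_EVENTS]
    ages = [now - t for t in recent]
    return ages + [PAD] * (NUM_EVENTS - len(ages))


def toy_score(features: List[float]) -> float:
    """Stand-in for the learned model: recent edits contribute more."""
    return sum(1.0 / (1.0 + age / 3600.0) for age in features if age >= 0)


def top_documents(docs: Dict[str, List[float]], now: float, k: int = 3) -> List[str]:
    """Score every document and return the k highest-scoring names."""
    scores = {name: toy_score(edit_features(times, now)) for name, times in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


now = 1_000_000.0
docs = {
    "slides": [now - 60, now - 120],      # edited minutes ago
    "budget": [now - 86400 * 20],         # edited about three weeks ago
    "photos": [now - 86400 * 7],          # uploaded last week
}
print(top_documents(docs, now, k=2))
```

In the system the post describes, the scoring function is learned from aggregated user activity rather than written by hand, which is what removes most of the feature engineering.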

Assisting Pathologists in Detecting Cancer with Deep Learning


Posted by Martin Stumpe, Technical Lead, and Lily Peng, Product Manager

A pathologist’s report after reviewing a patient’s biological tissue samples is often the gold standard in the diagnosis of many diseases. For cancer in particular, a pathologist’s diagnosis has a profound impact on a patient’s therapy. The reviewing of pathology slides is a very complex task, requiring years of training to gain the expertise and experience to do well.

Even with this extensive training, there can be substantial variability in the diagnoses given by different pathologists for the same patient, which can lead to misdiagnoses. For example, agreement in diagnosis for some forms of breast cancer can be as low as 48%, and similarly low for prostate cancer. The lack of agreement is not surprising given the massive amount of information that must be reviewed in order to make an accurate diagnosis. Pathologists are responsible for reviewing all the biological tissues visible on a slide. However, there can be many slides per patient, each of which is 10+ gigapixels when digitized at 40X magnification. Imagine having to go through a thousand 10 megapixel (MP) photos, and having to be responsible for every pixel. Needless to say, this is a lot of data to cover, and often time is limited.

To address these issues of limited time and diagnostic variability, we are investigating how deep learning can be applied to digital pathology, by creating an automated detection algorithm that can naturally complement pathologists’ workflow. We used images (graciously provided by the Radboud University Medical Center) which have also been used for the 2016 ISBI Camelyon Challenge to train algorithms that were optimized for localization of breast cancer that has spread (metastasized) to lymph nodes adjacent to the breast.

The results? Standard “off-the-shelf” deep learning approaches like Inception (aka GoogLeNet) worked reasonably well for both tasks, although the tumor probability prediction heatmaps produced were a bit noisy. After additional customization, including training networks to examine the image at different magnifications (much like what a pathologist does), we showed that it was possible to train a model that either matched or exceeded the performance of a pathologist who had unlimited time to examine the slides.

Left: Images from two lymph node biopsies. Middle: earlier results of our deep learning tumor detection. Right: our current results. Notice the visibly reduced noise (potential false positives) between the two versions.

In fact, the prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset. Even more exciting for us was that our model generalized very well, even to images that were acquired from a different hospital using different scanners. For full details, see our paper “Detecting Cancer Metastases on Gigapixel Pathology Images”.

A closeup of a lymph node biopsy. The tissue contains a breast cancer metastasis as well as macrophages, which look similar to tumor but are benign normal tissue. Our algorithm successfully identifies the tumor region (bright green) and is not confused by the macrophages.

While these results are promising, there are a few important caveats to consider.

Like most metrics, the FROC localization score is not perfect. Here, the FROC score is defined as the sensitivity (percentage of tumors detected) at a few pre-defined average false positives per slide. It is pretty rare for a pathologist to make a false positive call (mistaking normal cells for tumor). For example, the score of 73% mentioned above corresponds to a 73% sensitivity and zero false positives. By contrast, our algorithm[...]
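
The FROC definition given in this post (sensitivity at a few pre-defined average false positives per slide) can be written out as a short sketch. The data layout, the operating points, and the function names are assumptions for illustration, tied scores are not handled specially, and this is not the challenge's evaluation code.

```python
from typing import Dict, List, Optional, Tuple

# Each detection is (confidence score, id of the tumor it hits, or None for a false positive).
Detection = Tuple[float, Optional[str]]


def sensitivity_at_fp_rates(
    detections: List[Detection],
    num_tumors: int,
    num_slides: int,
    fp_rates: List[float],
) -> Dict[float, float]:
    """Best sensitivity reachable without exceeding each average-FP-per-slide budget."""
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    found = set()      # distinct tumors detected so far
    false_pos = 0      # false positives accepted so far
    best = {rate: 0.0 for rate in fp_rates}
    # Lowering the score threshold one detection at a time traces the FROC curve.
    for _score, tumor_id in ordered:
        if tumor_id is None:
            false_pos += 1
        else:
            found.add(tumor_id)
        avg_fp = false_pos / num_slides
        sensitivity = len(found) / num_tumors
        for rate in fp_rates:
            if avg_fp <= rate:
                best[rate] = max(best[rate], sensitivity)
    return best


def froc_score(per_rate: Dict[float, float]) -> float:
    """Average the sensitivities over the chosen operating points."""
    return sum(per_rate.values()) / len(per_rate)


detections = [
    (0.9, "a"), (0.8, None), (0.7, "b"), (0.6, None),
    (0.5, "c"), (0.4, None), (0.3, None),
]
per_rate = sensitivity_at_fp_rates(detections, num_tumors=4, num_slides=2,
                                   fp_rates=[0.0, 0.5, 1.0])
print(per_rate, froc_score(per_rate))
```

The entry at zero false positives per slide is the regime the post attributes to the pathologist, while the other entries show how sensitivity can rise when some false positives are tolerated.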

Google Research Awards 2016


We’ve just completed another round of the Google Research Awards, our annual open call for proposals on computer science and related topics including machine learning, machine perception, natural language processing, and security. Our grants cover tuition for a graduate student and provide both faculty and students the opportunity to work directly with Google researchers and engineers.

This round we received 876 proposals covering 44 countries and over 300 universities. After expert reviews and committee discussions, we decided to fund 143 projects. Here are a few observations from this round:

Congratulations to the well-deserving recipients of this round’s awards. If you are interested in applying for the next round (deadline is September 30th), please visit our website for more information.

Preprocessing for Machine Learning with tf.Transform


Posted by Kester Tong, David Soergel, and Gus Katsiapis, Software Engineers

When applying machine learning to real world datasets, a lot of effort is required to preprocess data into a format suitable for standard machine learning models, such as neural networks. This preprocessing takes a variety of forms, from converting between formats, to tokenizing and stemming text and forming vocabularies, to performing a variety of numerical operations such as normalization.

Today we are announcing tf.Transform, a library for TensorFlow that allows users to define preprocessing pipelines and run these using large scale data processing frameworks, while also exporting the pipeline in a way that can be run as part of a TensorFlow graph. Users define a pipeline by composing modular Python functions, which tf.Transform then executes with Apache Beam, a framework for large-scale, efficient, distributed data processing. Apache Beam pipelines can be run on Google Cloud Dataflow with planned support for running with other frameworks. The TensorFlow graph exported by tf.Transform enables the preprocessing steps to be replicated when the trained model is used to make predictions, such as when serving the model with TensorFlow Serving.

A common problem encountered when running machine learning models in production is "training-serving skew", where the data seen at serving time differs in some way from the data used to train the model, leading to reduced prediction quality. tf.Transform ensures that no skew can arise during preprocessing, by guaranteeing that the serving-time transformations are exactly the same as those performed at training time, in contrast to when training-time and serving-time preprocessing are implemented separately in two different environments (e.g., Apache Beam and TensorFlow, respectively).

In addition to facilitating preprocessing, tf.Transform allows users to compute summary statistics for their datasets. Understanding the data is very important in every machine learning project, as subtle errors can arise from making wrong assumptions about what the underlying data look like. By making the computation of summary statistics easy and efficient, tf.Transform allows users to check their assumptions about both raw and preprocessed data.

tf.Transform allows users to define a preprocessing pipeline. Users can materialize the preprocessed data for use in TensorFlow training, and also export a tf.Transform graph that encodes the transformations as a TensorFlow graph. This transformation graph can then be incorporated into the model graph used for inference.

We’re excited to be releasing this latest addition to the TensorFlow ecosystem, and we hope users will find it useful for preprocessing and understanding their data.

Acknowledgements

We wish to thank the following members of the tf.Transform team for their contributions to this project: Clemens Mewald, Robert Bradshaw, Rajiv Bharadwaja, Elmer Garduno, Afshin Rostamizadeh, Neoklis Polyzotis, Abhi Rao, Joe Toth, Neda Mirian, Dinesh Kulkarni, Robbie Haertel, Cyril Bortolato and Slaven Bilac. We also wish to thank the TensorFlow, TensorFlow Serving and Cloud Dataflow teams for their support. [...]
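
The idea behind avoiding training-serving skew can be shown without any TensorFlow code. The sketch below is plain Python and deliberately does not use the tf.Transform or Apache Beam APIs: it only illustrates the pattern of a full-pass analysis over the training data that yields one transform function, which is then reused unchanged at serving time. The function names and the choice of z-score normalization are assumptions for illustration.

```python
import statistics
from typing import Callable, Dict, List


def analyze(training_values: List[float]) -> Dict[str, float]:
    """Full pass over the training data: compute the statistics the transform needs."""
    return {
        "mean": statistics.fmean(training_values),
        "std": statistics.pstdev(training_values),
    }


def make_transform(stats: Dict[str, float]) -> Callable[[float], float]:
    """Freeze the training-time statistics into a reusable per-example function."""
    mean, std = stats["mean"], stats["std"]

    def transform(value: float) -> float:
        # Guard against a constant column, where the deviation is zero.
        return (value - mean) / std if std else 0.0

    return transform


training = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalize = make_transform(analyze(training))

# Training time: materialize the preprocessed data.
train_inputs = [normalize(v) for v in training]

# Serving time: the very same function object handles a new request, so the
# constants cannot drift from the ones used in training.
print(normalize(9.0))
```

Skew typically creeps in when the serving path recomputes or hard-codes these statistics separately. Exporting the analysis result together with the transform, as the post describes, removes that second implementation.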

Headset “Removal” for Virtual and Mixed Reality


Posted by Vivek Kwatra, Research Scientist and Christian Frueh, Avneesh Sud, Software Engineers

Virtual Reality (VR) enables remarkably immersive experiences, offering new ways to view the world and the ability to explore novel environments, both real and imaginary. However, compared to physical reality, sharing these experiences with others can be difficult, as VR headsets make it challenging to create a complete picture of the people participating in the experience.

Some of this disconnect is alleviated by Mixed Reality (MR), a related medium that shares the virtual context of a VR user in a two-dimensional video format allowing other viewers to get a feel for the user’s virtual experience. Even though MR facilitates sharing, the headset continues to block facial expressions and eye gaze, presenting a significant hurdle to a fully engaging experience and complete view of the person in VR.

Google Machine Perception researchers, in collaboration with Daydream Labs and YouTube Spaces, have been working on solutions to address this problem wherein we reveal the user’s face by virtually “removing” the headset and create a realistic see-through effect.

VR user captured in front of a green-screen is blended with the virtual environment to generate the MR output: Traditional MR output has the user face occluded, while our result reveals the face. Note how the headset is modified with a marker to aid tracking.

Our approach uses a combination of 3D vision, machine learning and graphics techniques, and is best explained in the context of enhancing Mixed Reality video (also discussed in the Google-VR blog). It consists of three main components:

Dynamic face model capture

The core idea behind our technique is to use a 3D model of the user’s face as a proxy for the hidden face. This proxy is used to synthesize the face in the MR video, thereby creating an impression of the headset being removed. First, we capture a personalized 3D face model for the user with what we call gaze-dependent dynamic appearance. This initial calibration step requires the user to sit in front of a color+depth camera and a monitor, and then track a marker on the monitor with their eyes. We use this one-time calibration procedure, which typically takes less than a minute, to acquire a 3D face model of the user, and learn a database that maps appearance images (or textures) to different eye-gaze directions and blinks. This gaze database (i.e. the face model with textures indexed by eye-gaze) allows us to dynamically change the appearance of the face during synthesis and generate any desired eye-gaze, thus making the synthesized face look natural and alive.

On the left, the user’s face is captured by a camera as she tracks a marker on the monitor with her eyes. On the right, we show the dynamic nature of reconstructed 3D face model: by moving or clicking on the mouse, we are able to simulate both apparent eye gaze and blinking.

Calibration and Alignment

Creating a Mixed Reality video requires a specialized setup consisting of an external camera, calibrated and time-synced with the headset. The camera captures a video stream of the VR user in front of a green screen and then composites a cutout of the user with the virtual world to create the final MR video. An important step here is to accurately estimate the calibration (the fixed 3D transformation) between the camera and headset coordinate systems. These calibration techniques typically involve significant manual intervention and are done in multiple steps. We simplify the process by adding a physical marker to the front of the headset and tracking it visually in 3D, which allows us to optimize for the calibration parameters automatically from the VR session. For headset “removal”, we need to align the 3D face model with the visible portion of the face in the camera stream, so t[...]
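
The gaze database described under "Dynamic face model capture" is, at its simplest, a lookup from an eye-gaze direction to a stored texture. The sketch below assumes gaze is recorded as a (yaw, pitch) pair in degrees and that the closest stored entry is used; both choices, along with every name, are assumptions made for illustration, since the post only states that textures are indexed by eye gaze.

```python
import math
from typing import List, Tuple

# One calibration sample: ((yaw_degrees, pitch_degrees), texture identifier).
GazeEntry = Tuple[Tuple[float, float], str]


def nearest_texture(gaze_db: List[GazeEntry], yaw: float, pitch: float) -> str:
    """Return the texture captured at the gaze direction closest to (yaw, pitch)."""
    closest = min(
        gaze_db,
        key=lambda entry: math.hypot(entry[0][0] - yaw, entry[0][1] - pitch),
    )
    return closest[1]


# Hypothetical entries recorded while the user followed the on-screen marker.
gaze_db = [
    ((-20.0, 0.0), "look_left.png"),
    ((0.0, 0.0), "look_center.png"),
    ((20.0, 0.0), "look_right.png"),
    ((0.0, 15.0), "look_up.png"),
]

print(nearest_texture(gaze_db, yaw=17.0, pitch=2.0))
```

At synthesis time the queried direction would come from eye tracking, and the chosen texture would be applied to the 3D face model before compositing.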

The CS Capacity Program - New Tools and SIGCSE 2017


Posted by Chris Stephenson, Head of Computer Science Education Strategy

The CS Capacity program was launched in March of 2015 to help address a dramatic increase in undergraduate computer science enrollments that is creating serious resource and pedagogical challenges for many colleges and universities. Over the last two years, a diverse group of universities have been working to develop successful strategies that support the expansion of high-quality CS programs at the undergraduate level. Their work focuses on innovations in teaching and technologies that support scaling while ensuring the engagement of women and underrepresented students. These innovations could provide assistance to many other institutions that are challenged to provide a high-quality educational experience to an increasing number of introductory-level students.

The cohort of CS Capacity institutions includes George Mason University, Mount Holyoke College, Rutgers University, and the University of California, Berkeley, which are working individually, and Duke University, North Carolina State University, the University of Florida, and the University of North Carolina, which are working together. Each of these institutions brings a unique approach to addressing CS capacity challenges. Two years into the program, we're sharing an update on some of the great projects and ideas to emerge so far.

At George Mason, for example, computer science professor Jeff Offutt and his team have developed an online system to provide self-paced learning for CS1 and CS2 classes that allows learners to move through the learning materials more quickly or slowly depending on their needs. The system, called SPARC, includes course content, practice and assessment exercises (including automated testing), mini-lectures, and daily inspirations. This team has also launched a program to recruit and train undergraduate tutorial assistants to increase learning support. For more information on SPARC, contact Jeff Offutt.

The MaGE Peer Mentor program at Mount Holyoke College is addressing its increasing CS student enrollment by preparing undergraduate peer mentors to provide effective feedback on coding assignments and contribute to an inclusive learning environment. One of the major elements of this program is an online course that helps to recruit and train students to be undergraduate peer mentors. Mount Holyoke has made their entire online course curriculum for the peer mentor program available so that other institutions can incorporate all or part of it to assist with preparing their own student tutors. For more information on the MaGE curriculum, contact Heather Pon-Barry.

Program Students and Faculty from Mount Holyoke College

At University of California, Berkeley, the CS Capacity team is focused on providing access to increased and better tutoring. They’ve instituted a small-group tutoring program that includes weekend mastery learning sessions, increased office hours support, designated discussion sections, project checkpoint deadlines, exam/homework/lab/discussion walkthrough videos, and a new office hours app that tracks student satisfaction with office hours. For more information on Berkeley’s interventions, contact Josh Hug.

The CS Capacity team at Rutgers has been exploring the gender gap at multiple levels using a longitudinal study across four required CS classes (paper to be published in the proceedings of the SIGCSE 2017 Technical Symposium). They’re investigating several factors that may impact the retention of women and underrepresented student populations, including intention to major in CS, grades, and prior experience. They’ve also been defining an additional feature set to improve their use of Autolab (a course management system with automated grading). T[...]

An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my!


Posted by Paul Natsev, Software Engineer
Last September, we released the YouTube-8M dataset, which spans millions of videos labeled with thousands of classes, in order to spur innovation and advancement in large-scale video understanding. More recently, other teams at Google have released datasets such as Open Images and YouTube-BoundingBoxes that, along with YouTube-8M, can be used to accelerate image and video understanding. To further these goals, today we are releasing an update to the YouTube-8M dataset, and in collaboration with Google Cloud Machine Learning and [...], we are also organizing a video understanding competition and an affiliated CVPR’17 Workshop.
An Updated YouTube-8M
The new and improved YouTube-8M includes cleaner and more verbose labels (twice as many labels per video, on average), a cleaned-up set of videos, and, for the first time, pre-computed audio features, based on a state-of-the-art audio modeling architecture, in addition to the previously released visual features. The audio and visual features are synchronized in time, at 1-second temporal granularity, which makes YouTube-8M a large-scale multi-modal dataset, and opens up opportunities for exciting new research on joint audio-visual (temporal) modeling. Key statistics on the new version are illustrated below (more details here).
A tree-map visualization of the updated YouTube-8M dataset, organized into 24 high-level verticals, including the top-200 most frequent entities, plus the top-5 entities for each vertical.
Sample videos from the top-18 high-level verticals in the YouTube-8M dataset.
The Google Cloud & YouTube-8M Video Understanding Challenge
We are also excited to announce the Google Cloud & YouTube-8M Video Understanding Challenge, in partnership with Google Cloud and [...]. The challenge invites participants to build audio-visual content classification models using YouTube-8M as training data, and to then label ~700K unseen test videos.
It will be hosted as a Kaggle competition, sponsored by Google Cloud, and will feature a $100,000 prize pool for the top performers (details here). In order to enable wider participation in the competition, Google Cloud is also offering credits so participants can optionally do model training and exploration using Google Cloud Machine Learning. Open-source TensorFlow code, implementing a few baseline classification models for YouTube-8M, along with training and evaluation scripts, is available at GitHub. For details on getting started with local or cloud-based training, please see our README and the getting started guide on Kaggle.
The CVPR 2017 Workshop on YouTube-8M Large-Scale Video Understanding
We will announce the results of the challenge and host invited talks by distinguished researchers at the 1st YouTube-8M Workshop, to be held July 26, 2017, at the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) in Honolulu, Hawaii. The workshop will also feature presentations by top-performing challenge participants and a selected set of paper submissions. We invite researchers to submit papers describing novel research, experiments, or applications based on the YouTube-8M dataset, including papers summarizing their participation in the above challenge.
We designed this dataset with scale and diversity in mind, and hope lessons learned here will generalize to many video domains (YouTube-8M captures over 20 diverse video domains). We believe the challenge can also accelerate research by enabling researchers without access to big data or compute clusters to explore and innovate at unprecedented scale. Please join us in advancing video understanding!
Acknowledgements
This post reflects the work of many others within Machine Perception at Google Research, including Sam[...]

Announcing TensorFlow 1.0


Posted by Amy McDonald Sandjideh, Technical Program Manager, TensorFlow
In just its first year, TensorFlow has helped researchers, engineers, artists, students, and many others make progress with everything from language translation to early detection of skin cancer and preventing blindness in diabetics. We’re excited to see people using TensorFlow in over 6000 open-source repositories online. Today, as part of the first annual TensorFlow Developer Summit, hosted in Mountain View and livestreamed around the world, we’re announcing TensorFlow 1.0:
It’s faster: TensorFlow 1.0 is incredibly fast! XLA lays the groundwork for even more performance improvements in the future, and now includes tips & tricks for tuning your models to achieve maximum speed. We’ll soon publish updated implementations of several popular models to show how to take full advantage of TensorFlow 1.0 - including a 7.3x speedup on 8 GPUs for Inception v3 and 58x speedup for distributed Inception v3 training on 64 GPUs!
It’s more flexible: TensorFlow 1.0 introduces a high-level API for TensorFlow, with tf.layers, tf.metrics, and tf.losses modules. We’ve also announced the inclusion of a new tf.keras module that provides full compatibility with Keras, another popular high-level neural networks library.
It’s more production-ready than ever: TensorFlow 1.0 promises Python API stability (details here), making it easier to pick up new features without worrying about breaking your existing code.
Other highlights from TensorFlow 1.0:
Python APIs have been changed to resemble NumPy more closely. 
For this and other backwards-incompatible changes made to support API stability going forward, please use our handy migration guide and conversion script.
Experimental APIs for Java and Go
Higher-level API modules tf.layers, tf.metrics, and tf.losses - brought over from tf.contrib.learn after incorporating skflow and TF Slim
Experimental release of XLA, a domain-specific compiler for TensorFlow graphs, that targets CPUs and GPUs. XLA is rapidly evolving - expect to see more progress in upcoming releases.
Introduction of the TensorFlow Debugger (tfdbg), a command-line interface and API for debugging live TensorFlow programs.
New Android demos for object detection and localization, and camera-based image stylization.
Installation improvements: Python 3 docker images have been added, and TensorFlow’s pip packages are now PyPI compliant. This means TensorFlow can now be installed with a simple invocation of pip install tensorflow.
We’re thrilled to see the pace of development in the TensorFlow community around the world. To hear more about TensorFlow 1.0 and how it’s being used, you can watch the TensorFlow Developer Summit talks on YouTube, covering recent updates from higher-level APIs to TensorFlow on mobile to our new XLA compiler, as well as the exciting ways that TensorFlow is being used:
Click here for a link to the livestream and video playlist (individual talks will be posted online later in the day).
The TensorFlow ecosystem continues to grow with new techniques like Fold for dynamic batching and tools like the Embedding Projector along with updates to our existing tools like TensorFlow Serving. We’re incredibly grateful to the community of contributors, educators, and researchers who have made advances in deep learning available to everyone. We look forward to working with you on forums like GitHub issues, Stack Overflow, @TensorFlow, the group and at future events. [...]

On-Device Machine Intelligence


Posted by Sujith Ravi, Staff Research Scientist, Google Research
To build the cutting-edge technologies that enable conversational understanding and image recognition, we often apply combinations of machine learning technologies such as deep neural networks and graph-based machine learning. However, the machine learning systems that power most of these applications run in the cloud and are computationally intensive and have significant memory requirements. What if you want machine intelligence to run on your personal phone or smartwatch, or on IoT devices, regardless of whether they are connected to the cloud?
Yesterday, we announced the launch of Android Wear 2.0, along with brand new wearable devices, that will run Google's first entirely “on-device” ML technology for powering smart messaging. This on-device ML system, developed by the Expander research team, enables technologies like Smart Reply to be used for any application, including third-party messaging apps, without ever having to connect with the cloud, so now you can respond to incoming chat messages directly from your watch, with a tap.
The research behind this began last year while our team was developing the machine learning systems that enable conversational understanding capability in Allo and Inbox. The Android Wear team reached out to us and was interested to know whether it would be possible to deploy this Smart Reply technology directly onto a smart device. Because of the limited computing power and memory on smart devices, we quickly realized that it was not possible to do so. Our product manager, Patrick McGregor, realized that this presented a unique challenge and an opportunity for the Expander team to return to the drawing board to design a completely new, lightweight, machine learning architecture, not only to enable Smart Reply on Android Wear, but also to power a wealth of other on-device mobile applications. 
Together with Tom Rudick, Nathan Beach, and other colleagues from the Android Wear team, we set out to build the new system.
Learning with Projections
A simple strategy to build lightweight conversational models might be to create a small dictionary of common rules (input → reply mappings) on the device and use a naive look-up strategy at inference time. This can work for simple prediction tasks involving a small set of classes using a handful of features (such as binary sentiment classification from text, e.g. “I love this movie” conveys a positive sentiment whereas the sentence “The acting was horrible” is negative). But it does not scale to complex natural language tasks involving rich vocabularies and the wide language variability observed in chat messages. On the other hand, machine learning models like recurrent neural networks (such as LSTMs), in conjunction with graph learning, have proven to be extremely powerful tools for complex sequence learning in natural language understanding tasks, including Smart Reply. However, compressing such rich models to fit in device memory and produce robust predictions at low computation cost (rapidly on-demand) is extremely challenging. Early experiments with restricting the model to predict only a small handful of replies or using other techniques like quantization or character-level models did not produce useful results.
Instead, we built a different solution for the on-device ML system. We first use a fast, efficient mechanism to group similar incoming messages and project them to similar (“nearby”) bit vector representations. While there are several ways to perform this projection step, such as using word embeddings or encoder networks, we employ a modified version of locality sensitive hashing (LSH) to reduce dimension from millions of unique words [...]
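To make the projection idea concrete, here is a minimal Python sketch of a sign-based hash in the spirit of locality sensitive hashing. It is not the modified LSH used by the Expander team, whose details are not given above; the function names, the 16-bit width, and the MD5-based token hashing are all assumptions chosen for illustration. Each message is treated as a bag of lowercase tokens, every token casts a +1 or -1 vote per bit, and the sign of the total gives the bit, so messages that share most of their words tend to land on nearby bit vectors.

```python
import hashlib


def token_vote(token: str, bit_index: int) -> int:
    """Deterministic +1/-1 vote for a (token, bit) pair (hypothetical helper)."""
    digest = hashlib.md5(f"{bit_index}:{token}".encode("utf-8")).digest()
    return 1 if digest[0] & 1 else -1


def project(message: str, num_bits: int = 16) -> tuple:
    """Project a message onto a short bit vector by summing token votes."""
    tokens = message.lower().split()
    bits = []
    for i in range(num_bits):
        total = sum(token_vote(t, i) for t in tokens)
        bits.append(1 if total >= 0 else 0)  # the sign of the sum decides the bit
    return tuple(bits)


def hamming(a, b) -> int:
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))
```

Comparing two `project(...)` outputs with `hamming` gives a cheap similarity signal that needs no vocabulary table on the device, which is the property that makes this family of techniques attractive when memory is tight.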

Announcing TensorFlow Fold: Deep Learning With Dynamic Computation Graphs


Posted by Moshe Looks, Marcello Herreshoff and DeLesley Hutchins, Software Engineers
In much of machine learning, data used for training and inference undergoes a preprocessing step, where multiple inputs (such as images) are scaled to the same dimensions and stacked into batches. This lets high-performance deep learning libraries like TensorFlow run the same computation graph across all the inputs in the batch in parallel. Batching exploits the SIMD capabilities of modern GPUs and multi-core CPUs to speed up execution. However, there are many problem domains where the size and structure of the input data varies, such as parse trees in natural language understanding, abstract syntax trees in source code, DOM trees for web pages and more. In these cases, the different inputs have different computation graphs that don't naturally batch together, resulting in poor processor, memory, and cache utilization. Today we are releasing TensorFlow Fold to address these challenges. TensorFlow Fold makes it easy to implement deep-learning models that operate over data of varying size and structure. Furthermore, TensorFlow Fold brings the benefits of batching to such models, resulting in a speedup of more than 10x on CPU, and more than 100x on GPU, over alternative implementations. This is made possible by dynamic batching, introduced in our paper Deep Learning with Dynamic Computation Graphs.
This animation shows a recursive neural network run with dynamic batching. Operations with the same color are batched together, which lets TensorFlow run them faster. The Embed operation converts words to vector representations. The fully connected (FC) operation combines word vectors to form vector representations of phrases. The output of the network is a vector representation of an entire sentence. 
Although only a single parse tree of a sentence is shown, the same network can run, and batch together operations, over multiple parse trees of arbitrary shapes and sizes.
The TensorFlow Fold library will initially build a separate computation graph from each input.
Because the individual inputs may have different sizes and structures, the computation graphs may as well. Dynamic batching then automatically combines these graphs to take advantage of opportunities for batching, both within and across inputs, and inserts additional instructions to move data between the batched operations (see our paper for technical details). To learn more, head over to our GitHub site. We hope that TensorFlow Fold will be useful for researchers and practitioners implementing neural networks with dynamic computation graphs in TensorFlow.
Acknowledgements
This work was done under the supervision of Peter Norvig. [...]
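The batching idea can be illustrated without TensorFlow. The sketch below is not the TensorFlow Fold API; it is a hypothetical scheduler that assumes parse trees are written as nested tuples with word strings as leaves. It labels each node with an operation (`embed` for leaves, `fc` for internal nodes, borrowing the names from the animation caption) and groups nodes that share a depth and an operation, since those could in principle be run as one batched call both within and across inputs.

```python
from collections import defaultdict


def schedule(trees):
    """Group tree nodes by (depth, operation) across all input trees.

    Leaves are strings and sit at depth 0 with operation "embed";
    internal nodes are (left, right) tuples with operation "fc".
    Returns a dict mapping (depth, op) to a list of (tree_id, node).
    """
    batches = defaultdict(list)

    def visit(node, tree_id):
        if isinstance(node, str):
            depth, op = 0, "embed"
        else:
            left, right = node
            # A node can only run after both of its children have run.
            depth = 1 + max(visit(left, tree_id), visit(right, tree_id))
            op = "fc"
        batches[(depth, op)].append((tree_id, node))
        return depth

    for tree_id, tree in enumerate(trees):
        visit(tree, tree_id)
    # Sorting by key yields an execution order: shallow batches first.
    return dict(sorted(batches.items()))
```

Calling `schedule` on two trees of different shapes, for example `("the", "cat")` and `(("a", "dog"), "barks")`, puts all five word lookups into a single `embed` batch even though the trees do not line up structurally.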

Advancing Research on Video Understanding with the YouTube-BoundingBoxes Dataset


Posted by Esteban Real, Vincent Vanhoucke, Jonathon Shlens, Google Brain team and Stefano Mazzocchi, Google Research
One of the most challenging research areas in machine learning today is enabling computers to understand what a scene is about. For example, while humans know that a ball that disappears behind a wall only to reappear a moment later is very likely the same object, this is not at all obvious to an algorithm. Understanding this requires not only a global picture of what objects are contained in each frame of a video, but also where those objects are located within the frame and their locations over time. Just last year we published YouTube-8M, a dataset consisting of automatically labelled YouTube videos. And while this helps further progress in the field, it is only one piece of the puzzle. Today, in order to facilitate progress in video understanding research, we are introducing YouTube-BoundingBoxes, a dataset consisting of 5 million bounding boxes spanning 23 object categories, densely labeling segments from 210,000 YouTube videos. To date, this is the largest manually annotated video dataset containing bounding boxes, which track objects in temporally contiguous frames. The dataset is designed to be large enough to train large-scale models, and be representative of videos captured in natural settings. Importantly, the human-labelled annotations contain objects as they appear in the real world with partial occlusions, motion blur and natural lighting.
Summary of dataset statistics. Bar Chart: Relative number of detections in existing image (red) and video (blue) data sets. The YouTube BoundingBoxes dataset (YT-BB) is at the bottom. Table: The three columns are counts for: classification annotations, bounding boxes, and unique videos with bounding boxes. Full details on the dataset can be found in the preprint.
A key feature of this dataset is that bounding box annotations are provided for entire video segments. 
These bounding box annotations may be used to train models that explicitly leverage this temporal information to identify, localize and track objects over time. In a video, individual annotated objects might become entirely occluded and later return in subsequent frames. These annotations of individual objects are sometimes not recognizable from individual frames, but can be understood and recognized in the context of the video if the objects are localized and tracked accurately.
Three video segments, sampled at 1 frame per second. The final frame of each example shows how it is visually challenging to recognize the bounded object, due to blur or occlusion (train example, blue arrow). However, temporally-related frames, where the object has been more clearly identified, can allow object classes to be inferred. Note how only visible parts are included in the box: the orange arrow in the bear example (middle row) points to the hidden head. The dog example illustrates tight bounding boxes that track the tail (orange arrows) and foot (blue arrows). The airplane example illustrates how partial objects are annotated (first frame) and tracked across changes in perspective, occlusions and camera cuts.
We hope that this dataset might ultimately aid the computer vision and machine learning community and lead to new methods for analyzing and understanding real world vision problems. You can learn more about the dataset in this associated preprint.
Acknowledgements
This work was greatly helped along by Xin Pan, Thomas Silva, Mir Shabber Ali Khan, Ashwin Kakarla and many others, as well as support and advice from Manfred Georg, Sami Abu-El-Haija, Susanna Ricco and George Toderici. [...]

Using Machine Learning to predict parking difficulty


Posted by James Cook, Yechen Li, Software Engineers and Ravi Kumar, Research Scientist
"When Solomon said there was a time and a place for everything he had not encountered the problem of parking his automobile." -Bob Edwards, Broadcast Journalist
Much of driving is spent either stuck in traffic or looking for parking. With products like Google Maps and Waze, it is our long-standing goal to help people navigate the roads easily and efficiently. But until now, there wasn’t a tool to address the all-too-common parking woes.
Last week, we launched a new feature for Google Maps for Android across 25 US cities that offers predictions about parking difficulty close to your destination so you can plan accordingly. Providing this feature required addressing some significant challenges:
Parking availability is highly variable, based on factors like the time, day of week, weather, special events, holidays, and so on. Compounding the problem, there is almost no real-time information about free parking spots.
Even in areas with internet-connected parking meters providing information on availability, this data doesn’t account for those who park illegally, park with a permit, or depart early from still-paid meters.
Roads form a mostly-planar graph, but parking structures may be more complex, with traffic flows across many levels, possibly with different layouts.
Both the supply and the demand for parking are in constant flux, so even the best system is at risk of being outdated as soon as it’s built.
To face these challenges, we used a unique combination of crowdsourcing and machine learning (ML) to build a system that can provide you with parking difficulty information for your destination, and even help you decide what mode of travel to take. In a pre-launch experiment, we saw a significant increase in clicks on the transit travel mode button, indicating that users with additional knowledge of parking difficulty were more likely to consider public transit rather than driving.
Three technical pieces were required to build the algorithms behind the parking difficulty feature: good ground truth data from crowdsourcing, an appropriate ML model and a robust set of features to train the model on.
Ground Truth Data
Gathering high-quality ground truth data is often a key challenge in building any ML solution. We began by asking individuals at a diverse set of locations and times if they found the parking difficult. But we learned that answers to subjective questions like this produce inconsistent results - for a given location and time, one person may answer that it was “easy” to find parking while another found it “difficult.” Switching to objective questions like “How long did it take to find parking?” led to an increase in answer confidence, enabling us to crowdsource a high-quality set of ground truth data with over 100K responses.
Model Features
With this data available, we began to determine features we could train a model on. Fortunately, we were able to turn to the wisdom of the crowd, and utilize anonymous aggregated information from users who opt to share their location data, which already is a vital source of information for estimates of live traffic or popular times and visit durations. We quickly discovered that even with this data, some unique challenges remain. For example, our system shouldn’t be fooled into thinking parking is plentiful if someone is parking in a gated or private lot. Users arriving by taxi might look like a sign of abundant parking at the front door, and similarly, public-transit users might seem to park at bus stops. These false positives, and many others, all have the potential to m[...]

App Discovery with Google Play, Part 3: Machine Learning to Fight Spam and Abuse at Scale


Posted by Hsu-Chieh Lee, Xing Chen, Software Engineers, and Qian An, Analyst
In Part 1 and Part 2 of this series on app discovery, we discussed using machine learning to gain a deeper understanding of the topics associated with an app, and a deep learning framework to provide personalized recommendations. In this post, we discuss a machine learning approach to fight spam and abuse on the apps section of the Google Play Store, making it a safe and trusted app platform for more than a billion Android users. With apps becoming an increasingly important part of people’s professional and personal lives, we realize that it is critical to make sure that 1) the apps found on Google Play are safe, and 2) the information presented to you about the apps is both authentic and unbiased. With more than 1 million apps in our catalog, and a significant number of new apps introduced every day, we needed to develop scalable methods to identify bad actors accurately and swiftly. To tackle this problem, we take a two-pronged approach, both employing various machine learning techniques to help fight against spam and abuse at scale.
Identifying and blocking ‘bad’ apps from entering Google Play platform
As mentioned in Google Play Developer Policy, we don’t allow listing of malicious, offensive, or illegal apps. Despite this policy, there are always a small number of bad actors who attempt to publish apps that prey on users. Finding the apps that violate our policy among the vast app catalog is not a trivial problem, especially when there are tens of thousands of apps being submitted each day. This is why we embraced machine learning techniques in assessing policy violations and potential risks an app may pose to its potential users.
We use various techniques such as text analysis with word embedding with large probabilistic networks, image understanding with Google Brain, and static and dynamic analysis of the APK binary. 
These individual techniques aim to detect specific violations (e.g., restricted content, privacy and security, intellectual property, user deception), in a more systematic and reliable way compared to manual reviews. Apps that are flagged by our algorithms are either sent back to the developers to address the detected issues, or ‘quarantined’ until we can verify their safety and/or clear them of potential violations. Because of this app review process combining analyses by human experts and algorithms, developers can take necessary actions (e.g., iterate or publish) within a few hours of app submission.
Visualization of word embedding of samples of offensive content policy violating apps (red dots) and policy compliant apps (green dots), visualized with t-SNE (t-Distributed Stochastic Neighbor Embedding).
Preventing manipulation of app ratings and rankings
While an app may itself be legitimate, some bad actors may attempt to create fake engagements in order to manipulate an app’s ratings and rankings. In order to provide our users with an accurate reflection of the app’s perceived quality, we work to nullify these attempts. However, as we place countermeasures against these efforts, the actors behind the manipulation attempts change and adapt their behaviors to bypass our countermeasures, thereby presenting us with an adversarial problem.
As such, instead of using a conventional supervised learning approach (as we did in the ‘Part 1’ or ‘Part 2’ of this series, which are more ‘stationary’ problems), we needed to develop a repeatable process that allowed us the same (if not more) agility that bad actors have. We achieved this by using a hybrid strategy that utilizes uns[...]

Facilitating the discovery of public datasets


Posted by Natasha Noy, Google Research and Dan Brickley, Open Source Programs Office
There are many hundreds of data repositories on the Web, providing access to tens of thousands (or millions) of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others’ work, and providing data journalists easier access to information and its provenance. For these reasons, many publishers and funding agencies now require that scientists make their research data available publicly.
However, due to the volume of data repositories available on the Web, it can be extremely difficult to determine not only where to find the dataset that has the information you are looking for, but also the veracity or provenance of that information. Yet, there is no reason why searching for datasets shouldn’t be as easy as searching for recipes, or jobs, or movies. These types of searches are often open-ended ones, where some structure over the search space makes the exploration and serendipitous discovery possible. To provide better discovery and rich content for books, movies, events, recipes, reviews and a number of other content categories with Google Search, we rely on structured data that content providers embed in their sites using vocabulary. To facilitate similar capabilities for datasets, we have recently published new guidelines to help data providers describe their datasets in a structured way, enabling Google and others to link this structured metadata with information describing locations, scientific publications, or even Knowledge Graph, facilitating data discovery for others. 
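As an illustration of what such structured metadata can look like, the sketch below builds a small JSON-LD description of a dataset in Python. The vocabulary name is missing from the text above, so the use of schema.org and its Dataset type here is an assumption, and the helper name, the choice of properties, and the example values are hypothetical. The published guidelines remain the authority on which properties are expected.

```python
import json


def dataset_jsonld(name, description, url, keywords=()):
    """Return a JSON-LD string describing a dataset (illustrative helper).

    Assumes the schema.org vocabulary and its Dataset type; only a few
    basic properties are included.
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
    }
    if keywords:
        record["keywords"] = list(keywords)
    return json.dumps(record, indent=2)


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(dataset_jsonld(
        name="Example rainfall measurements",
        description="Daily rainfall totals from a hypothetical weather station.",
        url="https://example.org/datasets/rainfall",
        keywords=["rainfall", "weather"],
    ))
```

A publisher would typically embed the resulting JSON in the dataset's landing page so that crawlers can read the description alongside the human-readable content.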
We hope that this metadata will help us improve the discovery and reuse of public datasets on the Web for everybody.
The approach for describing datasets is based on an effort recently standardized at W3C (the Data Catalog Vocabulary), which we expect will be a foundation for future elaborations and improvements to dataset description. While these industry discussions are evolving, we are confident that the standards that already exist today provide a solid basis for building a data ecosystem.
Technical Challenges
While we have released the guidelines on publishing the metadata, many technical challenges remain before search for data becomes as seamless as we feel it should be. These challenges include:
Defining more consistently what constitutes a dataset: For example, is a single table a dataset? What about a collection of related tables? What about a protein sequence? A set of images? An API that provides access to data? We hope that a better understanding of what a dataset is will emerge as we gain more experience with how data providers define, describe, and use data.
Identifying datasets: Ideally, datasets should have permanent identifiers conforming to some well known scheme that enables us to identify them uniquely, but often they don’t. Is a URL for the metadata page a good identifier? Can there be multiple identifiers? Is there a primary one?
Relating datasets to each other: When are two records describing a dataset “the same” (for instance, if one repository copies metadata from another)? What if an aggregator provides more metadata about the same dataset or cleans the data in some useful way? We are working on clarifying and defining these relationships, but it is likely tha[...]

A Large Corpus for Supervised Word-Sense Disambiguation


Posted by Colin Evans and Dayu Yuan, Software Engineers
Understanding the various meanings of a particular word in text is key to understanding language. For example, in the sentence “he will receive stock in the reorganized company”, we know that “stock” refers to “the capital raised by a business or corporation through the issue and subscription of shares” as defined in the New Oxford American Dictionary (NOAD), based on the context. However, there are more than 10 other definitions for “stock” in NOAD, ranging from “goods in a store” to “a medieval device for punishment”. For a computer algorithm, distinguishing between these meanings is so difficult that it has been described as “AI-complete” in the past (Navigli, 2009; Ide and Veronis 1998; Mallery 1988).
In order to help further progress on this challenge, we’re happy to announce the release of word-sense annotations on the popular MASC and SemCor datasets, manually annotated with senses from the NOAD. We’re also releasing mappings from the NOAD senses to English Wordnet, which is more commonly used by the research community. This is one of the largest releases of fully sense-annotated English corpora.
Supervised Word-Sense Disambiguation
Humans distinguish between meanings of words in text easily because we have access to an enormous amount of common-sense knowledge about how the world works, and how this connects to language. For an example of the difficulty, “[stock] in a business” implies the financial sense, but “[stock] in a bodega” is more likely to refer to goods on the shelves of a store, even though a bodega is a kind of business. 
Acquiring sufficient knowledge in a form that a machine can use, and then applying it to understanding the words in text, is a challenge.
Supervised word-sense disambiguation (WSD) is the problem of building a machine-learned system using human-labeled data that can assign a dictionary sense to all words used in text (in contrast to entity disambiguation, which focuses on nouns, mostly proper). Building a supervised model that performs better than just assigning the most frequent sense of a word without considering the surrounding text is difficult, but supervised models can perform well when supplied with significant amounts of training data (Navigli, 2009).
By releasing this dataset, it is our hope that the research community will be able to further the advance of algorithms that allow machines to understand language better, allowing applications such as:
Facilitating the automatic construction of databases from text in order to answer questions and connect knowledge in documents. For example, understanding that a “hemi engine” is a kind of automotive machinery, and a “locomotive engine” is a kind of train, or that “Kanye West is a star” implies that he is a celebrity, but “Sirius is a star” implies that it is an astronomical object.
Disambiguating words in queries, so that results for “date palm” and “date night” or “web spam” and “spam recipe” can have distinct interpretations for different senses, and documents returned from a query have the same meaning that is implied by the query.
Manual Annotation
In the manually labeled data sets that we are releasing, each sense annotation is labeled by five raters. To ensure high quality of the sense annotation, raters are first trained with gold annotations, which were labeled by experienced linguists in a separate pilot study before the annotation task. The figure below shows an example of a rater’s work page in our annotation to[...]
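The most-frequent-sense baseline mentioned above is simple enough to sketch in a few lines of Python. This is only an illustration of that baseline, not code from the released corpus or its tooling; the function names and the short sense labels in the usage example are hypothetical, and real annotations would use dictionary sense identifiers.

```python
from collections import Counter, defaultdict


def train_most_frequent_sense(labeled_tokens):
    """Learn, for each word, the sense it was labeled with most often.

    labeled_tokens is an iterable of (word, sense) pairs taken from
    human-annotated text. Context is ignored entirely, which is what
    makes this a baseline rather than a real disambiguation model.
    """
    counts = defaultdict(Counter)
    for word, sense in labeled_tokens:
        counts[word][sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def predict(model, word, default=None):
    """Return the stored sense for a word, or default if the word is unseen."""
    return model.get(word, default)
```

Given training pairs in which "stock" is labeled with a financial sense twice and a goods sense once, `predict` returns the financial sense for every occurrence of "stock", including "[stock] in a bodega". That error is exactly the kind a context-aware supervised model is meant to fix.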

The Google Brain team — Looking Back on 2016


Posted by Jeff Dean, Google Senior Fellow, on behalf of the entire Google Brain team

The Google Brain team's long-term goal is to create more intelligent software and systems that improve people's lives, which we pursue through both pure and applied research in a variety of different domains. And while this is obviously a long-term goal, we would like to take a step back and look at some of the progress our team has made over the past year, and share what we feel may be in store for 2017.

Research Publications

One important way in which we assess the quality of our research is through publications in top tier international machine learning venues like ICML, NIPS, and ICLR. Last year our team had a total of 27 accepted papers at these venues, covering a wide-ranging set of topics including program synthesis, knowledge transfer from one network to another, distributed training of machine learning models, generative models for language, unsupervised learning for robotics, automated theorem proving, better theoretical understanding of neural networks, algorithms for improved reinforcement learning, and many others. We also had numerous other papers accepted at conferences in fields such as natural language processing (ACL, CoNLL), speech (ICASSP), vision (CVPR), robotics (ISER), and computer systems (OSDI). Our group has also submitted 34 papers to the upcoming ICLR 2017, a top venue for cutting-edge deep learning research. You can learn more about our work in our list of papers.

Natural Language Understanding

Allowing computers to better understand human language is one key area for our research. In late 2014, three Brain team researchers published a paper on Sequence to Sequence Learning with Neural Networks, and demonstrated that the approach could be used for machine translation. In 2015, we showed that this approach could also be used for generating captions for images, parsing sentences, and solving computational geometry problems.
In 2016, this previous research (plus many enhancements) culminated when Brain team members worked closely with members of the Google Translate team to wholly replace the translation algorithms powering Google Translate with a completely end-to-end learned system (research paper). This new system closed the gap between the old system and human quality translations by up to 85% for some language pairs. A few weeks later, we showed how the system could do “zero-shot translation”, learning to translate between languages for which it had never seen example sentence pairs (research paper). This system is now deployed on the production Google Translate service for a growing number of language pairs, giving our users higher quality translations and allowing people to communicate more effectively across language barriers. Gideon Lewis-Kraus documented this translation effort (along with the history of deep learning and the history of the Google Brain team) in “The Great A.I. Awakening”, an in-depth article that appeared in The NY Times Magazine in December 2016.

Robotics

Traditional robotics control algorithms are carefully and painstakingly hand-programmed, and therefore endowing robots with new capabilities is often a very laborious process. We believe that having robots automatically learn to acquire new skills through machine learning is a better approach. Last year, we collaborated with researchers at [X] to demonstrate how robotic arms could learn hand-eye coordination, pooling their experiences to teach themselves more quickly (research paper). Our robo[...]

Google Brain Residency Program - 7 months in and looking ahead


Posted by Jeff Dean, Google Senior Fellow and Leslie Phillips, Google Brain Residency Program Manager

“Beyond being incredibly instructive, the Google Brain Residency program has been a truly affirming experience. Working alongside people who truly love what they do--and are eager to help you develop your own passion--has vastly increased my confidence in my interests, my ability to explore them, and my plans for the near future.”
-Akosua Busia, B.S. Mathematical and Computational Science, Stanford University ‘16
2016 Google Brain Resident

In October 2015 we launched the Google Brain Residency, a 12-month program focused on jumpstarting a career for those interested in machine learning and deep learning research. This program is an opportunity to get hands-on experience using the state-of-the-art infrastructure available at Google, and offers the chance to work alongside top researchers within the Google Brain team.

Our first group of residents arrived in June 2016, working with researchers on problems at the forefront of machine learning. The wide array of topics studied by residents reflects the diversity of the residents themselves: some come to the program as new graduates with degrees ranging from BAs to PhDs, in fields from computer science to physics, mathematics, biology and neuroscience, while other residents come with years of industry experience under their belts. They have all come with a passion for learning how to conduct machine learning research.

The breadth of research being done by the Google Brain Team, along with resident-mentorship pairing flexibility, ensures that residents with interests in machine learning algorithms and reinforcement learning, natural language understanding, robotics, neuroscience, genetics and more are able to find good mentors to help them pursue their ideas and publish interesting work. And just seven months into the program, the Residents are already making an impact in the research field.
To date, Google Brain Residents have submitted a total of 21 papers to leading machine learning conferences, spanning topics from enhancing low resolution images to building neural networks that in turn design novel, task-specific neural network architectures. Of those 21 papers, 5 were accepted at the recent BayLearn Conference (two of which, “Mean Field Neural Networks” and “Regularizing Neural Networks by Penalizing Their Output Distribution”, were presented in oral sessions), 2 were accepted at the NIPS 2016 Adversarial Training workshop, and another at ISMIR 2016 (see the full list of papers, including the 14 submissions to ICLR 2017, after the figures below).

An LSTM Cell (Left) and a state of the art RNN Cell found using a neural network (Right). This is an example of a novel architecture found using the approach presented in “Neural Architecture Search with Reinforcement Learning” (B. Zoph and Q. V. Le, submitted to ICLR 2017). This paper uses a neural network to generate novel RNN cell architectures that outperform the widely used LSTM on a variety of different tasks.

The training accuracy for neural networks, colored from black (random chance) to red (high accuracy). Overlaid in white dashed lines are the theoretical predictions showing the boundary between trainable and untrainable networks. (a) Networks with no dropout. (b)-(d) Networks with dropout rates of 0.01, 0.02, 0.06 respectively. This research explores whether theoretical calculations can replace large hyperparameter searches. For more details, read “Deep Informatio[...]

Get moving with the new Motion Stills


Posted by Matthias Grundmann and Ken Conley, Machine Perception

Last June, we released Motion Stills, an iOS app that uses our video stabilization technology to create easily shareable GIFs from Apple Live Photos. Since then, we integrated Motion Stills into Google Photos for iOS and thought of ways to improve it, taking into account your ideas for new features.

Today, we are happy to announce a major new update to the Motion Stills app that will help you create even more beautiful videos and fun GIFs using motion-tracked text overlays, super-resolution videos, and automatic cinemagraphs.

Motion Text

We’ve added motion text so you can create moving text effects, similar to what you might see in movies and TV shows, directly on your phone. With Motion Text, you can easily position text anywhere over your video to get the exact result you want. It only takes a second to initialize while you type, and it tracks at 1000 FPS throughout the whole Live Photo, so the process feels instantaneous.

To make this possible, we took the motion tracking technology that we run on YouTube servers for “Privacy Blur”, and made it run even faster on your device. How? We first create motion metadata for your video by leveraging machine learning to classify foreground/background features as well as to model temporally coherent camera motion. We then take this metadata and use it as input to an algorithm that can track individual objects while discriminating each one from the others. The algorithm models each object’s state, which includes its motion in space, an implicit appearance model (described as a set of its moving parts), and its centroid and extent, as shown in the figure below.

Enhance! your videos with better detail and loops

Last month, we published the details of our state-of-the-art RAISR technology, which employs machine learning to create super-resolution detail in images. This technology is now available in Motion Stills, automatically sharpening every video you export.
We are also going beyond stabilization to bring you fully automatic cinemagraphs. After freezing the background into a still photo, we analyze our result to optimize for the perfect loop transition. By considering a range of start and end frames, we build a matrix of transition scores between frame pairs. A significant minimum in this matrix reflects the perfect transition, resulting in an endless loop of motion stillness.

Continuing to improve the experience

Thanks to your feedback, we’ve additionally rebuilt our navigation and added more tutorials. We’ve also added Apple’s 3D touch to let you “peek and pop” clips in your stream and movie tray. Lots more is coming to address your top requests, so please download the new release of Motion Stills and keep sending us feedback with #motionstills on your favorite social media. [...]
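
The loop-transition idea described above (score every candidate start/end frame pair, then pick the minimum) can be sketched as follows. This is a toy illustration, not the Motion Stills implementation: the mean squared pixel difference used as the transition score, the `min_length` parameter, and the representation of frames as flat lists of intensities are all assumptions made for the example.

```python
def frame_distance(a, b):
    """Mean squared difference between two frames given as flat pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def transition_matrix(frames):
    """scores[i][j] is the cost of cutting from frame j back to frame i."""
    n = len(frames)
    return [[frame_distance(frames[i], frames[j]) for j in range(n)] for i in range(n)]


def best_loop(frames, min_length=2):
    """Return (score, start, end) for the cheapest loop at least min_length frames long.

    Raises ValueError if the clip is too short to contain any candidate pair.
    """
    scores = transition_matrix(frames)
    candidates = [
        (scores[start][end], start, end)
        for start in range(len(frames))
        for end in range(start + min_length, len(frames))
    ]
    return min(candidates)


# Six tiny two-pixel "frames"; frames 2 and 5 are identical, so looping
# between them produces no visible jump.
frames = [[0, 0], [5, 5], [9, 9], [5, 6], [0, 1], [9, 9]]
print(best_loop(frames))
```

A real system would likely use a perceptual score and prefer longer loops, but the structure stays the same: fill the matrix, then search it for a clear minimum.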

App Discovery with Google Play, Part 2: Personalized Recommendations with Related Apps


Posted by Ananth Balashankar & Levent Koc, Software Engineers, and Norberto Guimaraes, Product Manager

In Part 1 of this series on app discovery, we discussed using machine learning to gain a deeper understanding of the topics associated with an app, in order to provide a better search and discovery experience on the Google Play Apps Store. In this post, we discuss a deep learning framework to provide personalized recommendations to users based on their previous app downloads and the context in which they are used.

Providing useful and relevant app recommendations to visitors of the Google Play Apps Store is a key goal of our apps discovery team. An understanding of the topics associated with an app, however, is only one part of creating a system that best serves the user. In order to create a better overall experience, one must also take into account the tastes of the user and provide personalized recommendations. If one didn’t, the “You might also like” recommendation would look the same for everyone!

Discovering these nuances requires both an understanding of what an app does, and also the context of the app with respect to the user. For example, to an avid sci-fi gamer, similar game recommendations may be of interest, but if a user installs a fitness app, recommending a health recipe app may be more relevant than five more fitness apps. As users may be more interested in downloading an app or game that complements one they already have installed, we provide recommendations based on how apps relate to each other (“You might also like”), in addition to providing recommendations based on the topic associated with an app (“Similar apps”).

Suggestions of similar apps and apps that you also might like, shown both before making an install decision (left) and while the current install is in progress (right).

One particularly strong contextual signal is app relatedness, based on previous installs and search query clicks.
As an example, a user who has searched for and plays a lot of graphics-heavy games likely has a preference for apps which are also graphically intense rather than apps with simpler graphics. So, when this user installs a car racing game, the “You might also like” suggestions rank apps which relate to the “seed” app (because they are graphically intense racing games) higher than racing apps with simpler graphics. This allows for a finer level of personalization where the characteristics of the apps are matched with the preferences of the user.

To incorporate this app relatedness in our recommendations, we take a two-pronged approach: (a) offline candidate generation, i.e. the generation of the potential related apps that other users have downloaded in addition to the app in question, and (b) online personalized re-ranking, where we re-rank these candidates using a personalized ML model.

Offline Candidate Generation

The problem of finding related apps can be formulated as a nearest neighbor search problem. Given an app X, we want to find the k nearest apps. In the case of “you might also like”, a naive approach would be one based on counting, where if many people installed apps X and Y, then the app Y would be used as a candidate for seed app X. However, this approach is intractable as it is difficult to learn and generalize effectively in the huge problem space. Given that there are over a million apps on Google Play, the total number of possible app pairs is roughly 10^12. To solve this, w[...]
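
To make the naive counting approach and the size of the pair space concrete, here is a minimal sketch. It is not the system described in the post: the function names, the toy install histories, and the alphabetical tie-breaking rule are assumptions for illustration only.

```python
from collections import Counter


def ordered_pairs(num_apps):
    """Number of (seed app, candidate app) pairs with seed != candidate."""
    return num_apps * (num_apps - 1)


def co_install_candidates(install_histories, seed_app, k=3):
    """Rank apps by how many users installed them together with seed_app."""
    counts = Counter()
    for apps in install_histories:
        unique = set(apps)
        if seed_app in unique:
            for other in unique - {seed_app}:
                counts[other] += 1
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [app for app, _ in ranked[:k]]


# Hypothetical install histories, one list per user.
histories = [
    ["racer", "racer_hd", "notes"],
    ["racer", "racer_hd"],
    ["racer", "kart"],
    ["notes", "kart"],
]
print(co_install_candidates(histories, "racer", k=2))
print(ordered_pairs(1_000_000))  # on the order of 10^12
```

With a million apps, most of the roughly 10^12 pairs are never observed together even once, which is why raw counts generalize poorly and a learned approach is attractive.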

Open sourcing the Embedding Projector: a tool for visualizing high dimensional data


Posted by Daniel Smilkov and the Big Picture group

Recent advances in Machine Learning (ML) have shown impressive results, with applications ranging from image recognition and language translation to medical diagnosis and more. With the widespread adoption of ML systems, it is increasingly important for research scientists to be able to explore how the data is being interpreted by the models. However, one of the main challenges in exploring this data is that it often has hundreds or even thousands of dimensions, requiring special tools to investigate the space.

To enable a more intuitive exploration process, we are open-sourcing the Embedding Projector, a web application for interactive visualization and analysis of high-dimensional data recently shown as an A.I. Experiment, as part of TensorFlow. We are also releasing a standalone version, where users can visualize their high-dimensional data without the need to install and run TensorFlow.

Exploring Embeddings

The data needed to train machine learning systems comes in a form that computers don't immediately understand. To translate the things we understand naturally (e.g. words, sounds, or videos) to a form that the algorithms can process, we use embeddings, a mathematical vector representation that captures different facets (dimensions) of the data. For example, in this language embedding, similar words are mapped to points that are close to each other.

With the Embedding Projector, you can navigate through views of data in either a 2D or a 3D mode, zooming, rotating, and panning using natural click-and-drag gestures. Below is a figure showing the nearest points to the embedding for the word “important” after training a TensorFlow model using the word2vec tutorial. Clicking on any point (which represents the learned embedding for a given word) in this visualization brings up a list of nearest points and distances, which shows which words the algorithm has learned to be semantically related.
This type of interaction represents an important way in which one can explore how an algorithm is performing.

Methods of Dimensionality Reduction

The Embedding Projector offers three commonly used methods of data dimensionality reduction, which allow easier visualization of complex data: PCA, t-SNE and custom linear projections. PCA is often effective at exploring the internal structure of the embeddings, revealing the most influential dimensions in the data. t-SNE, on the other hand, is useful for exploring local neighborhoods and finding clusters, allowing developers to make sure that an embedding preserves the meaning in the data (e.g. in the MNIST dataset, seeing that the same digits are clustered together). Finally, custom linear projections can help discover meaningful "directions" in data sets - such as the distinction between a formal and casual tone in a language generation model - which would allow the design of more adaptable ML systems.

A custom linear projection of the 100 nearest points of "See attachments." onto the "yes" - "yeah" vector (“yes” is right, “yeah” is left) of a corpus of 35k frequently used phrases in emails.

The Embedding Projector website includes a few datasets to play with. We’ve also made it easy for users to publish and share their embeddings with others (just click on the “Publish” button on the left pane). It is our hope that the Embedding Projector will be a useful tool to help the research co[...]
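
Two of the operations described in this post, listing the nearest points to a chosen embedding and projecting points onto a custom direction such as "yes" - "yeah", can be sketched in plain Python. This is not the Embedding Projector's code: the use of cosine distance, the scaling that places the two anchor points at 0 and 1, the function names, and the toy two-dimensional vectors are all assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; small values mean similar directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def nearest_points(embeddings, query, k=3):
    """Return the k labels closest to `query`, with their distances."""
    q = embeddings[query]
    dists = sorted(
        (cosine_distance(q, vec), label)
        for label, vec in embeddings.items()
        if label != query
    )
    return [(label, dist) for dist, label in dists[:k]]


def project_onto_direction(points, left, right):
    """Coordinate of each point along the left -> right vector.

    The `left` point maps to 0.0 and the `right` point maps to 1.0.
    """
    origin = points[left]
    direction = [r - l for l, r in zip(origin, points[right])]
    norm_sq = sum(d * d for d in direction)
    return {
        label: sum((x - o) * d for x, o, d in zip(vec, origin, direction)) / norm_sq
        for label, vec in points.items()
    }


# Toy 2-D "embeddings"; real ones have hundreds of dimensions.
words = {
    "important": [1.0, 0.9],
    "significant": [0.9, 1.0],
    "crucial": [1.0, 0.7],
    "banana": [-1.0, 0.2],
}
print([label for label, _ in nearest_points(words, "important", k=2)])

phrases = {
    "yeah": [0.0, 0.0],
    "yes": [2.0, 0.0],
    "sure thing": [0.5, 1.0],
    "certainly": [1.5, -1.0],
}
print(project_onto_direction(phrases, "yeah", "yes"))
```

In the second example, phrases with a coordinate near 0 sit toward the "yeah" end of the axis and phrases near 1 sit toward the "yes" end, which mirrors the left/right layout in the figure caption above.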

NIPS 2016 & Research at Google


Posted by Doug Eck, Research Scientist, Google Brain Team

This week, Barcelona hosts the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), a machine learning and computational neuroscience conference that includes invited talks, demonstrations and oral and poster presentations of some of the latest in machine learning research. Google will have a strong presence at NIPS 2016, with over 280 Googlers attending in order to contribute to and learn from the broader academic research community by presenting technical talks and posters, in addition to hosting workshops and tutorials.

Research at Google is at the forefront of innovation in Machine Intelligence, actively exploring virtually all aspects of machine learning including classical algorithms as well as cutting-edge techniques such as deep learning. Focusing on both theory and application, much of our work on language understanding, speech, translation, visual processing, ranking, and prediction relies on Machine Intelligence. In all of those tasks and many others, we gather large volumes of direct or indirect evidence of relationships of interest, and develop learning approaches to understand and generalize.

If you are attending NIPS 2016, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people, and to see demonstrations of some of the exciting research we pursue. You can also learn more about our work being presented at NIPS 2016 in the list below (Googlers highlighted in blue).

Google is a Platinum Sponsor of NIPS 2016.

Organizing Committee

Executive Board includes: Corinna Cortes, Fernando Pereira
Advisory Board includes: John C. Platt
Area Chairs include: John Shlens, Moritz Hardt, Navdeep Jaitly, Hugo Larochelle, Honglak Lee, Sanjiv Kumar, Gal Chechik

Invited Talk

Dynamic Legged Robots
  Marc Raibert

Accepted Papers:

Boosting with Abstention
  Corinna Cortes, Giulia DeSalvo, Mehryar Mohri
Community Detection on Evolving Graphs
  Stefano Leonardi, Aris Anagnostopoulos, Jakub Łącki, Silvio Lattanzi, Mohammad Mahdian
Linear Relaxations for Finding Diverse Elements in Metric Spaces
  Aditya Bhaskara, Mehrdad Ghadiri, Vahab Mirrokni, Ola Svensson
Nearly Isometric Embedding by Relaxation
  James McQueen, Marina Meila, Dominique Joncas
Optimistic Bandit Convex Optimization
  Mehryar Mohri, Scott Yang
Reward Augmented Maximum Likelihood for Neural Structured Prediction
  Mohammad Norouzi, Samy Bengio, Zhifeng Chen, Navdeep Jaitly, Mike Schuster, Yonghui Wu, Dale Schuurmans
Stochastic Gradient MCMC with Stale Gradients
  Changyou Chen, Nan Ding, Chunyuan Li, Yizhe Zhang, Lawrence Carin
Unsupervised Learning for Physical Interaction through Video Prediction
  Chelsea Finn*, Ian Goodfellow, Sergey Levine
Using Fast Weights to Attend to the Recent Past
  Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Leibo, Catalin Ionescu
A Credit Assignment Compiler for Joint Prediction
  Kai-Wei Chang, He He, Stephane Ross, Hal Daumé III
A Neural Transducer
  Navdeep Jaitly, Quoc Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, Samy Bengio
Attend, Infer, Repeat: Fast Scene Understanding with Generative Models
  S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Koray Kavukcuoglu, Geoffrey Hinton
Bi-Objective Online Matching and Submodular Allocations
  Hossein Esfandia[...]