Subscribe: Google Research Blog
http://googleresearch.blogspot.com/atom.xml
Language: English
Tags:
data  google  images  learning  machine learning  machine  model  models  neural  new  posted  research  tensorflow  video 

Google Research Blog



The latest news on Google Research.



Updated: 2017-11-21T12:22:20.087-08:00

 



Understanding Medical Conversations

2017-11-21T12:19:00.832-08:00

Posted by Katherine Chou, Product Manager, and Chung-Cheng Chiu, Software Engineer, Google Brain Team

Good documentation helps create good clinical care by communicating a doctor's thinking, their concerns, and their plans to the rest of the team. Unfortunately, physicians routinely spend more time doing documentation than doing what they love most — caring for patients. Part of the reason is that doctors spend ~6 hours of an 11-hour workday on documentation in the Electronic Health Records (EHR).1 Consequently, one study found that more than half of surveyed doctors report at least one symptom of burnout.2

In order to help offload note-taking, many doctors have started using medical scribes as a part of their workflow. These scribes listen to the patient-doctor conversations and create notes for the EHR. According to a recent study, introducing scribes not only improved physician satisfaction, but also medical chart quality and accuracy.3 But the number of doctor-patient conversations that need a scribe is far beyond the capacity of the people available for medical scribing. We wondered: could the voice recognition technologies already available in Google Assistant, Google Home, and Google Translate be used to document patient-doctor conversations and help doctors and scribes summarize notes more quickly?

In "Speech Recognition for Medical Conversations", we show that it is possible to build Automatic Speech Recognition (ASR) models for transcribing medical conversations. While most current ASR solutions in the medical domain focus on transcribing doctor dictations (i.e., single-speaker speech consisting of predictable medical terminology), our research shows that it is possible to build an ASR model that can handle multi-speaker conversations covering everything from weather to complex medical diagnoses.

Using this technology, we will start working with physicians and researchers at Stanford University, who have done extensive research on how scribes can improve physician satisfaction, to understand how deep learning techniques such as ASR can facilitate the scribing process of physician notes. In our pilot study, we investigate what types of clinically relevant information can be extracted from medical conversations to assist physicians in reducing their interactions with the EHR. The study is fully patient-consented and the content of the recordings will be de-identified to protect patient privacy. We hope these technologies will not only help return joy to practice by assisting doctors and scribes with their everyday workload, but also help patients get more dedicated and thorough medical attention, ideally leading to better care.

1 http://www.annfammed.org/content/15/5/419.full
2 http://www.mayoclinicproceedings.org/article/S0025-6196%2815%2900716-8/abstract
3 http://www.annfammed.org/content/15/5/427.full
[...]



SLING: A Natural Language Frame Semantic Parser

2017-11-21T12:22:20.110-08:00

Posted by Michael Ringgaard, Software Engineer, and Rahul Gupta, Research Scientist

Until recently, most practical natural language understanding (NLU) systems used a pipeline of analysis stages, from part-of-speech tagging and dependency parsing to steps that computed a semantic representation of the input text. While this facilitated easy modularization of different analysis stages, errors in earlier stages would have cascading effects in later stages and the final representation, and the intermediate stage outputs might not be relevant on their own. For example, a typical pipeline might perform the task of dependency parsing in an early stage and the task of coreference resolution towards the end. If one was only interested in the output of coreference resolution, it would be affected by cascading effects of any errors during dependency parsing.

Today we are announcing SLING, an experimental system for parsing natural language text directly into a representation of its meaning as a semantic frame graph. The output frame graph directly captures the semantic annotations of interest to the user, while avoiding the pitfalls of pipelined systems by not running any intermediate stages, additionally preventing unnecessary computation. SLING uses a special-purpose recurrent neural network model to compute the output representation of input text through incremental editing operations on the frame graph. The frame graph, in turn, is flexible enough to capture many semantic tasks of interest (more on this below). SLING's parser is trained using only the input words, bypassing the need for producing any intermediate annotations (e.g. dependency parses).

SLING provides fast parsing at inference time by providing (a) an efficient and scalable frame store implementation and (b) a JIT compiler that generates efficient code to execute the recurrent neural network. Although SLING is experimental, it achieves a parsing speed of >2,500 tokens/second on a desktop CPU, thanks to its efficient frame store and neural network compiler. SLING is implemented in C++ and it is available for download on GitHub. The entire system is described in detail in a technical report as well.

Frame Semantic Parsing
Frame Semantics [1] represents the meaning of text — such as a sentence — as a set of formal statements. Each formal statement is called a frame, which can be seen as a unit of knowledge or meaning that also contains interactions with concepts or other frames typically associated with it. SLING organizes each frame as a list of slots, where each slot has a name (role) and a value, which could be a literal or a link to another frame. As an example, consider the sentence:

"Many people now claim to have predicted Black Monday."

The figure below illustrates SLING recognizing mentions of entities (e.g. people, places, or events), measurements (e.g. dates or distances), and other concepts (e.g. verbs), and placing them in the correct semantic roles for the verbs in the input. The word predicted evokes the most dominant sense of the verb "predict", denoted as a PREDICT-01 frame. Additionally, this frame also has interactions (slots) with who made the prediction (denoted via the ARG0 slot, which points to the PERSON frame for people) and what was being predicted (denoted via ARG1, which links to the EVENT frame for Black Monday).

Frame semantic parsing is the task of producing a directed graph of such frames linked through slots. Although the example above is fairly simple, frame graphs are powerful enough to model a variety of complex semantic annotation tasks. For starters, frames provide a convenient way to bring together language-internal and external information types (e.g. knowledge bases). This can then be used to address complex language understanding problems such as reference, metaphor, metonymy, and perspective. The frame graphs for these tasks only differ in the inventory of frame types, roles, and any linking constraints.

SLING
SLING trains a recurrent neural network by optimizing for the semantic frames of inte[...]
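
To make the frame-and-slot terminology concrete, here is a minimal, self-contained Python sketch of a frame graph for the example sentence. The Frame class, the slot names, and the frame types used here are illustrative only; they are not SLING's actual API or type inventory.

  # Illustrative only — NOT SLING's API: a tiny frame-graph representation
  # for "Many people now claim to have predicted Black Monday."

  class Frame:
      """A unit of meaning: a typed frame with named slots (roles)."""
      def __init__(self, frame_type, **slots):
          self.frame_type = frame_type
          self.slots = dict(slots)  # role name -> literal value or another Frame

      def __repr__(self):
          return "%s(%s)" % (self.frame_type, ", ".join(self.slots))

  person = Frame("PERSON", text="Many people")
  event = Frame("EVENT", name="Black Monday")
  predict = Frame("PREDICT-01", ARG0=person, ARG1=event)  # who predicted what
  claim = Frame("CLAIM-01", ARG0=person, ARG1=predict)    # who claims what

  # The frame graph is simply the set of frames linked through their slots.
  frame_graph = [person, event, predict, claim]
  print(predict.slots["ARG1"].slots["name"])  # -> Black Monday

SLING's actual frame store is an efficient, scalable implementation of this idea (see the technical report for details); the sketch above only shows the shape of the data.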



On-Device Conversational Modeling with TensorFlow Lite

2017-11-14T13:00:15.300-08:00

Posted by Sujith Ravi, Research Scientist, Google Expander Team

Earlier this year, we launched Android Wear 2.0, which featured the first "on-device" machine learning technology for smart messaging. This enabled cloud-based technologies like Smart Reply, previously available in Gmail, Inbox and Allo, to be used directly within any application for the first time, including third-party messaging apps, without ever having to connect to the cloud. So you can respond to incoming chat messages on the go, directly from your smartwatch.

Today, we announce TensorFlow Lite, TensorFlow's lightweight solution for mobile and embedded devices. This framework is optimized for low-latency inference of machine learning models, with a focus on small memory footprint and fast performance. As part of the library, we have also released an on-device conversational model and a demo app that provides an example of a natural language application powered by TensorFlow Lite, in order to make it easier for developers and researchers to build new machine intelligence features powered by on-device inference. This model generates reply suggestions to input conversational chat messages, with efficient inference that can be easily plugged in to your chat application to power on-device conversational intelligence.

The on-device conversational model we have released uses a new ML architecture for training compact neural networks (as well as other machine learning models) based on a joint optimization framework, originally presented in ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections. This architecture can run efficiently on mobile devices with limited computing power and memory, by using efficient "projection" operations that transform any input to a compact bit vector representation — similar inputs are projected to nearby vectors that are dense or sparse depending on the type of projection. For example, the messages "hey, how's it going?" and "How's it going buddy?" might be projected to the same vector representation.

Using this idea, the conversational model combines these efficient operations at low computation and memory footprint. We trained this on-device model end-to-end using an ML framework that jointly trains two types of models — a compact projection model (as described above) combined with a trainer model. The two models are trained in a joint fashion, where the projection model learns from the trainer model — the trainer is characteristic of an expert and modeled using larger and more complex ML architectures, whereas the projection model resembles a student that learns from the expert. During training, we can also stack other techniques such as quantization or distillation to achieve further compression or selectively optimize certain portions of the objective function. Once trained, the smaller projection model is able to be used directly for inference on device.

For inference, the trained projection model is compiled into a set of TensorFlow Lite operations that have been optimized for fast execution on mobile platforms and executed directly on device. The TensorFlow Lite inference graph for the on-device conversational model is shown here.

TensorFlow Lite execution for the On-Device Conversational Model.

The open-source conversational model released today (along with code) was trained end-to-end using the joint ML architecture described above. Today's release also includes a demo app, so you can easily download and try out one-touch smart replies on your mobile device.

The architecture enables easy configuration for model size and prediction quality based on application needs. You can find a list of sample messages where this model does well here. The system can also fall back to suggesting replies from a fixed set that was learned and compiled from popular response intents observed in chat conversations. The underlying model is different from the ones Google uses for Smart Reply responses in its apps.1

Beyond Conversational Models
Int[...]
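
To give a feel for the "projection" idea described above, here is a small, self-contained sketch of an LSH-style projection that hashes character n-grams of a message into a compact bit vector, so that similar messages share many bits. This is only an illustration of the concept; it is not the projection operation used by the released TensorFlow Lite model, and the n-gram length and hashing scheme are made up for the example.

  import hashlib
  import numpy as np

  def projection_bits(text, num_bits=64, num_shifts=4):
      """Illustrative LSH-style projection (not the released TF Lite op):
      hash character trigrams of the input into a compact bit vector so
      that similar messages tend to share many set bits."""
      text = text.lower()
      ngrams = {text[i:i + 3] for i in range(len(text) - 2)}
      bits = np.zeros(num_bits, dtype=np.int32)
      for gram in ngrams:
          h = int(hashlib.md5(gram.encode()).hexdigest(), 16)
          for s in range(num_shifts):
              bits[(h >> (s * 8)) % num_bits] = 1
      return bits

  a = projection_bits("hey, how's it going?")
  b = projection_bits("How's it going buddy?")
  print("shared bits:", int(np.sum(a & b)))  # overlapping n-grams -> shared bits

In the released model, projected representations of this flavor feed a compact network that is trained jointly with a larger "trainer" network, as described above.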



Fused Video Stabilization on the Pixel 2 and Pixel 2 XL

2017-11-13T09:48:13.435-08:00

Posted by Chia-Kai Liang, Senior Staff Software Engineer, and Fuhao Shi, Android Camera Team

One of the most important aspects of current smartphones is easily capturing and sharing videos. With the Pixel 2 and Pixel 2 XL smartphones, the videos you capture are smoother and clearer than ever before, thanks to our Fused Video Stabilization technique based on both optical image stabilization (OIS) and electronic image stabilization (EIS). Fused Video Stabilization delivers highly stable footage with minimal artifacts, and the Pixel 2 is currently rated as the leader in DxO's video ranking (also earning the highest overall rating for a smartphone camera). But how does it work?

A key principle in videography is keeping the camera motion smooth and steady. A stable video is free of distraction, so the viewer can focus on the subject of interest. But videos taken with smartphones are subject to many conditions that make taking a high-quality video a significant challenge:

Camera Shake
Most people hold their mobile phones in their hands to record videos — you pull the phone from your pocket, record the video, and the video is ready to share right after recording. However, that means your videos shake as much as your hands do — and they shake a lot! Moreover, if you are walking or running while recording, the camera motion can make videos almost unwatchable:

[Embedded video: https://www.youtube.com/embed/BtAWLQK2dUI]

Motion Blur
If the camera or the subject moves during exposure, the resulting photo or video will appear blurry. Even if we stabilize the motion in between consecutive frames, the motion blur in each individual frame cannot be easily restored in practice, especially on a mobile device. One typical video artifact due to motion blur is sharpness inconsistency: the video may rapidly alternate between blurry and sharp, which is very distracting even after the video is stabilized:

[Embedded video: https://www.youtube.com/embed/cao9yrsbcDA]

Rolling Shutter
The CMOS image sensor collects one row of pixels, or "scanline", at a time, and it takes tens of milliseconds to go from the top scanline to the bottom. Therefore, anything moving during this period can appear distorted. This is called rolling shutter distortion. Even if you have a steady hand, the rolling shutter distortion will appear when you move quickly:

[Embedded video: https://www.youtube.com/embed/ra6cvns9NPY]
A simulated rendering of a video with global (left) and rolling (right) shutter.

Focus Breathing
When there are objects of varying distance in a video, the angle of view can change significantly due to objects "jumping" in and out of the foreground. As a result, everything shrinks or expands like the video below, which professionals call "breathing":

[Embedded video: https://www.youtube.com/embed/-v5ru0vI7Gg]

A good stabilization system should address all of these issues: the video should look sharp, the motion should be smooth, and the rolling shutter and focus breathing should be corrected.

Many professionals mount the camera on a mechanical stabilizer to entirely isolate hand motion. These devices actively sense and compensate for the camera's movement to remove all unwanted motions. However, they are usually expensive and cumbersome; you wouldn't want to carry one every day. The[...]
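
As background for how electronic stabilization addresses camera shake, here is a generic sketch (not the Fused Video Stabilization pipeline itself) of the usual EIS idea: low-pass filter the camera's motion trajectory, then warp each frame by the difference between the real and the smoothed trajectory. The 1-D rotation signal standing in for gyroscope data here is simulated, and the moving-average filter is just one simple choice of smoother.

  import numpy as np

  def smooth_trajectory(angles, window=31):
      """Illustrative electronic-stabilization step (not the Pixel 2 pipeline):
      low-pass filter a 1-D camera rotation trajectory (radians per frame)."""
      kernel = np.ones(window) / window
      pad = window // 2
      padded = np.pad(angles, pad, mode="edge")
      return np.convolve(padded, kernel, mode="valid")

  # Simulated hand shake: a slow pan plus high-frequency jitter.
  t = np.arange(300)
  raw = 0.001 * t + 0.01 * np.random.randn(300)
  smoothed = smooth_trajectory(raw)

  # Each frame would then be warped (rotated and cropped) by the correction.
  correction = smoothed - raw
  print("jitter removed per frame (std):", correction.std())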



Seamless Google Street View Panoramas

2017-11-09T10:30:00.753-08:00

Posted by Mike Krainin, Software Engineer, and Ce Liu, Research Scientist, Machine Perception

In 2007, we introduced Google Street View, enabling you to explore the world through panoramas of neighborhoods, landmarks, museums and more, right from your browser or mobile device. The creation of these panoramas is a complicated process, involving capturing images from a multi-camera rig called a rosette, and then using image blending techniques to carefully stitch them all together. However, many things can thwart the creation of a "successful" panorama, such as mis-calibration of the rosette camera geometry, timing differences between adjacent cameras, and parallax. And while we attempt to address these issues by using approximate scene geometry to account for parallax and frequent camera re-calibration, visible seams in image overlap regions can still occur.

Left: A Street View car carrying a multi-camera rosette. Center: A close-up of the rosette, which is made up of 15 cameras. Right: A visualization of the spatial coverage of each camera. Overlap between adjacent cameras is shown in darker gray.

Left: The Sydney Opera House with stitching seams along its iconic shells. Right: The same Street View panorama after optical flow seam repair.

In order to provide more seamless Street View images, we've developed a new algorithm based on optical flow to help solve these challenges. The idea is to subtly warp each input image such that the image content lines up within regions of overlap. This needs to be done carefully to avoid introducing new types of visual artifacts. The approach must also be robust to varying scene geometry, lighting conditions, calibration quality, and many other conditions. To simplify the task of aligning the images and to satisfy computational requirements, we've broken it into two steps.

Optical Flow
The first step is to find corresponding pixel locations for each pair of images that overlap. Using techniques described in our PhotoScan blog post, we compute optical flow from one image to the other. This provides a smooth and dense correspondence field. We then downsample the correspondences for computational efficiency. We also discard correspondences where there isn't enough visual structure to be confident in the results of optical flow.

The boundaries of a pair of constituent images from the rosette camera rig that need to be stitched together.
An illustration of optical flow within the pair's overlap region.
Extracted correspondences in the pair of images. For each colored dot in the overlap region of the left image, there is an equivalently-colored dot in the overlap region of the right image, indicating how the optical flow algorithm has aligned the point. These pairs of corresponding points are used as input to the global optimization stage. Notice that the overlap covers only a small portion of each image.

Global Optimization
The second step is to warp the rosette's images to simultaneously align all of the corresponding points from overlap regions (as seen in the figure above). When stitched into a panorama, the set of warped images will then properly align. This is challenging because the overlap regions cover only a small fraction of each image, resulting in an under-constrained problem. To generate visually pleasing results across the whole image, we formulate the warping as a spline-based flow field with spatial regularization. The spline parameters are solved for in a non-linear optimization using Google's open source Ceres Solver.

A visualization of the final warping process. Left: A section of the panorama covering 180 degrees horizontally. Notice that the overall effect of warping is intentionally quite subtle. Right: A close-up, highlighting how warping repairs the seams.

Our approach has many similarities to previously published work by Shum & Szeliski on "deghosting" panoramas. Key differences include that our approach estimates dense, smooth correspondences (rather[...]
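
For readers who want to experiment with the general recipe — dense flow, downsampled correspondences, confidence filtering — here is a rough stand-in using OpenCV's Farneback optical flow. The actual Street View pipeline uses the flow technique from the PhotoScan post and solves the warp with Ceres, so treat the function below purely as an illustration; the thresholds, stride, and function name are invented for the example.

  import cv2
  import numpy as np

  def overlap_correspondences(img_a, img_b, stride=32, min_texture=10.0):
      """Illustrative stand-in (not Google's pipeline): dense optical flow from
      img_a to img_b, downsampled to a sparse set of point pairs, discarding
      low-texture regions where the flow is unreliable."""
      gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
      gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
      flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                          0.5, 3, 15, 3, 5, 1.2, 0)
      pairs = []
      h, w = gray_a.shape
      for y in range(0, h, stride):
          for x in range(0, w, stride):
              patch = gray_a[max(0, y - 8):y + 8, max(0, x - 8):x + 8]
              if patch.std() < min_texture:      # not enough visual structure
                  continue
              dx, dy = flow[y, x]
              pairs.append(((x, y), (x + dx, y + dy)))
      return pairs  # input to a global warp optimization (e.g. spline flow field)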



Feature Visualization

2017-11-07T10:38:59.994-08:00

Posted by Christopher Olah, Research Scientist, Google Brain Team, and Alex Mordvintsev, Research Scientist, Google Research

Have you ever wondered what goes on inside neural networks? Feature visualization is a powerful tool for digging into neural networks and seeing how they work. Our new article, published in Distill, does a deep exploration of feature visualization, introducing a few new tricks along the way!

Building on our work in DeepDream, and lots of work by others since, we are able to visualize what every neuron in a strong vision model (GoogLeNet [1]) detects. Over the course of multiple layers, it gradually builds up abstractions: first it detects edges, then it uses those edges to detect textures, the textures to detect patterns, and the patterns to detect parts of objects….

But neurons don't understand the world by themselves — they work together. So we also need to understand how they interact with each other. One approach is to explore interpolations between them. What images can make them both fire, to different extents? In the article we interpolate from a neuron that seems to detect artistic patterns to a neuron that seems to detect lizard eyes, and we let you try adding different pairs of neurons together, to explore the possibilities for yourself.

In addition to allowing you to play around with visualizations, we explore a variety of techniques for getting feature visualization to work, and let you experiment with using them.

Techniques for visualizing and understanding neural networks are becoming more powerful. We hope our article will help other researchers apply these techniques, and give people a sense of their potential. Check it out on Distill.

Acknowledgement
We're extremely grateful to our co-author, Ludwig Schurbert, who made incredible contributions to our paper and especially to the interactive visualizations.
[...]
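
At its core, feature visualization by optimization starts from a noise image and performs gradient ascent on the activation of a chosen neuron or channel. The sketch below shows only that core loop, in TensorFlow 1.x style; `model_fn` is a hypothetical helper assumed to build the network on a given input tensor and return a dict of named layer activations, and the many regularization and transformation-robustness tricks discussed in the Distill article are omitted.

  import numpy as np
  import tensorflow as tf

  def visualize_channel(model_fn, layer_name, channel, steps=256, lr=0.05):
      """Minimal feature-visualization-by-optimization loop (illustrative only)."""
      graph = tf.Graph()
      with graph.as_default():
          # Start from a low-contrast noise image and treat it as the variable.
          image = tf.Variable(tf.random_uniform([1, 224, 224, 3], 0.4, 0.6))
          activations = model_fn(image)            # hypothetical helper
          objective = tf.reduce_mean(activations[layer_name][..., channel])
          # Gradient *ascent* on the activation = descent on its negative.
          step = tf.train.GradientDescentOptimizer(lr).minimize(
              -objective, var_list=[image])
          with tf.Session() as sess:
              sess.run(tf.global_variables_initializer())
              for _ in range(steps):
                  sess.run(step)
              return np.clip(sess.run(image)[0], 0.0, 1.0)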



Tangent: Source-to-Source Debuggable Derivatives

2017-11-06T13:50:13.888-08:00

Posted by Alex Wiltschko, Research Scientist, Google Brain Team

Tangent is a new, free, and open-source Python library for automatic differentiation. In contrast to existing machine learning libraries, Tangent is a source-to-source system, consuming a Python function f and emitting a new Python function that computes the gradient of f. This allows much better user visibility into gradient computations, as well as easy user-level editing and debugging of gradients. Tangent comes with many more features for debugging and designing machine learning models:

- Easily debug your backward pass
- Fast gradient surgery
- Forward-mode automatic differentiation
- Efficient Hessian-vector products
- Code optimizations

This post gives an overview of the Tangent API. It covers how to use Tangent to generate gradient code in Python that is easy to interpret, debug and modify.

Neural networks (NNs) have led to great advances in machine learning models for images, video, audio, and text. The fundamental abstraction that lets us train NNs to perform well at these tasks is a 30-year-old idea called reverse-mode automatic differentiation (also known as backpropagation), which comprises two passes through the NN. First, we run a "forward pass" to calculate the output value of each node. Then we run a "backward pass" to calculate a series of derivatives to determine how to update the weights to increase the model's accuracy.

Training NNs, and doing research on novel architectures, requires us to compute these derivatives correctly, efficiently, and easily. We also need to be able to debug these derivatives when our model isn't training well, or when we're trying to build something new that we do not yet understand. Automatic differentiation, or just "autodiff," is a technique to calculate the derivatives of computer programs that denote some mathematical function, and nearly every machine learning library implements it.

Existing libraries implement automatic differentiation by tracing a program's execution (at runtime, like TF Eager, PyTorch and Autograd) or by building a dynamic data-flow graph and then differentiating the graph (ahead-of-time, like TensorFlow). In contrast, Tangent performs ahead-of-time autodiff on the Python source code itself, and produces Python source code as its output. As a result, you can finally read your automatic derivative code just like the rest of your program. Tangent is useful to researchers and students who not only want to write their models in Python, but also read and debug automatically-generated derivative code without sacrificing speed and flexibility.

You can easily inspect and debug your models written in Tangent, without special tools or indirection. Tangent works on a large and growing subset of Python, provides extra autodiff features other Python ML libraries don't have, is high-performance, and is compatible with TensorFlow and NumPy.

Automatic differentiation of Python code
How do we automatically generate derivatives of plain Python code? Math functions like tf.exp or tf.log have derivatives, which we can compose to build the backward pass. Similarly, pieces of syntax, such as subroutines, conditionals, and loops, also have backward-pass versions. Tangent contains recipes for generating derivative code for each piece of Python syntax, along with many NumPy and TensorFlow function calls.

Tangent has a one-function API: tangent.grad. (You can also ask Tangent to print out the derivative code it generates.) Under the hood, tangent.grad first grabs the source code of the Python function you pass it. Tangent has a large library of recipes for the derivatives of Python syntax, as well as TensorFlow Eager functions. The function tangent.grad then walks your code in reverse order, looks up the matching backward-pass recipe, and adds it to the end[...]
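
Here is a small usage sketch of the one-function API on a plain NumPy function. The `verbose` option for printing the generated derivative source follows Tangent's documentation of the time and should be treated as an assumption; check the repository for the exact signature.

  import numpy as np
  import tangent

  def loss(w, x):
      # A tiny model: squared error of an elementwise prediction.
      d = w * x - 1.0
      return np.sum(d * d)

  # Consume a Python function, emit a new Python function that computes the
  # gradient with respect to the first argument.
  dloss_dw = tangent.grad(loss)
  print(dloss_dw(np.array([0.1, 0.2]), np.array([1.0, 2.0])))

  # Also print the generated backward-pass source code (the `verbose` option
  # name is an assumption based on the Tangent docs of the time).
  dloss_dw = tangent.grad(loss, verbose=1)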



AutoML for large scale image classification and object detection

2017-11-06T10:41:49.096-08:00

Posted by Barret Zoph, Vijay Vasudevan, Jonathon Shlens and Quoc Le, Research Scientists, Google Brain Team

A few months ago, we introduced our AutoML project, an approach that automates the design of machine learning models. While we found that AutoML can design small neural networks that perform on par with neural networks designed by human experts, those results were constrained to small academic datasets like CIFAR-10 and Penn Treebank. We became curious how this method would perform on larger, more challenging datasets, such as ImageNet image classification and COCO object detection. Many state-of-the-art machine learning architectures have been invented by humans to tackle these datasets in academic competitions.

In Learning Transferable Architectures for Scalable Image Recognition, we apply AutoML to the ImageNet image classification and COCO object detection datasets — two of the most respected large-scale academic datasets in computer vision. These two datasets are a great challenge for us because they are orders of magnitude larger than the CIFAR-10 and Penn Treebank datasets. For instance, naively applying AutoML directly to ImageNet would require many months of training our method. To be able to apply our method to ImageNet, we altered the AutoML approach to be more tractable for large-scale datasets:

- We redesigned the search space so that AutoML could find the best layer, which can then be stacked many times in a flexible manner to create a final network.
- We performed architecture search on CIFAR-10 and transferred the best learned architecture to ImageNet image classification and COCO object detection.

With this method, AutoML was able to find layers that not only work well on CIFAR-10 but also work well on ImageNet classification and COCO object detection. These two layers are combined to form a novel architecture, which we called "NASNet".

Our NASNet architecture is composed of two types of layers: Normal Layer (left), and Reduction Layer (right). These two layers are designed by AutoML.

On ImageNet image classification, NASNet achieves a prediction accuracy of 82.7% on the validation set, surpassing all previous Inception models that we built [2, 3, 4]. Additionally, NASNet performs 1.2% better than all previously published results and is on par with the best unpublished result reported on arxiv.org [5]. Furthermore, NASNet may be resized to produce a family of models that achieve good accuracies while having very low computational costs. For example, a small version of NASNet achieves 74% accuracy, which is 3.1% better than equivalently-sized, state-of-the-art models for mobile platforms. The large NASNet achieves state-of-the-art accuracy while halving the computational cost of the best reported result on arxiv.org (i.e., SENet) [5].

Accuracies of NASNet and state-of-the-art, human-invented models at various model sizes on ImageNet image classification.

We also transferred the learned features from ImageNet to object detection. In our experiments, combining the features learned from ImageNet classification with the Faster-RCNN framework [6] surpassed previously published, state-of-the-art predictive performance on the COCO object detection task for both the largest as well as mobile-optimized models. Our largest model achieves 43.1% mAP, which is 4% better than the previous published state-of-the-art.

Example object detection using Faster-RCNN with NASNet.

We suspect that the image features learned by NASNet on ImageNet and COCO may be reused for many computer vision applications. Thus, we have open-sourced NASNet for inference on image classification and for object detection in the Slim and Object Detection TensorFlow repositories. We hope that the larger machine learning community will be able to build on these models to address the multitude of computer vision problems we have not yet imagined.

Special thanks to J[...]
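
As a rough sketch of how the open-sourced checkpoints might be used for inference, the snippet below builds the large NASNet graph with the TF-Slim models repository of the time. The module path, function names, and the 331x331 input size follow that release but should be treated as assumptions to verify against the repository you download.

  import tensorflow as tf
  from nets.nasnet import nasnet  # from the TF-Slim models repository (assumed path)

  # Illustrative inference graph for the released large NASNet checkpoint.
  images = tf.placeholder(tf.float32, [None, 331, 331, 3])
  with tf.contrib.slim.arg_scope(nasnet.nasnet_large_arg_scope()):
      logits, end_points = nasnet.build_nasnet_large(
          images, num_classes=1001, is_training=False)
  predictions = tf.argmax(logits, axis=1)
  # A tf.train.Saver would then restore the published checkpoint before running.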



Latest Innovations in TensorFlow Serving

2017-11-07T09:49:36.300-08:00

Posted by Chris Olston, Research Scientist, and Noah Fiedel, Software Engineer, TensorFlow Serving

Since initially open-sourcing TensorFlow Serving in February 2016, we've made some major enhancements. Let's take a look back at where we started, review our progress, and share where we are headed next.

Before TensorFlow Serving, users of TensorFlow inside Google had to create their own serving system from scratch. Although serving might appear easy at first, one-off serving solutions quickly grow in complexity. Machine Learning (ML) serving systems need to support model versioning (for model updates with a rollback option) and multiple models (for experimentation via A/B testing), while ensuring that concurrent models achieve high throughput on hardware accelerators (GPUs and TPUs) with low latency. So we set out to create a single, general TensorFlow Serving software stack.

We decided to make it open-sourceable from the get-go, and development started in September 2015. Within a few months, we created the initial end-to-end working system, and our open-source release followed in February 2016.

Over the past year and a half, with the help of our users and partners inside and outside our company, TensorFlow Serving has advanced performance, best practices, and standards:

- Out-of-the-box optimized serving and customizability: We now offer a pre-built canonical serving binary, optimized for modern CPUs with AVX, so developers don't need to assemble their own binary from our libraries unless they have exotic needs. At the same time, we added a registry-based framework, allowing our libraries to be used for custom (or even non-TensorFlow) serving scenarios.
- Multi-model serving: Going from one model to multiple concurrently-served models presents several performance obstacles. We serve multiple models smoothly by (1) loading in isolated thread pools to avoid incurring latency spikes on other models taking traffic; (2) accelerating initial loading of all models in parallel upon server start-up; (3) multi-model batch interleaving to multiplex hardware accelerators (GPUs/TPUs).
- Standardized model format: We added SavedModel to TensorFlow 1.0, giving the community a single standard model format that works across training and serving.
- Easy-to-use inference APIs: We released easy-to-use APIs for common inference tasks (classification, regression) that we know work for a wide swathe of our applications. To support more advanced use-cases we support a lower-level tensor-based API (predict) and a new multi-inference API that enables multi-task modeling.

All of our work has been informed by close collaborations with: (a) Google's ML SRE team, which helps ensure we are robust and meet internal SLAs; (b) other Google machine learning infrastructure teams, including ads serving and TFX; (c) application teams such as Google Play; (d) our partners at the UC Berkeley RISE Lab, who explore complementary research problems with the Clipper serving system; and (e) our open-source user base and contributors.

TensorFlow Serving is currently handling tens of millions of inferences per second for 1100+ of our own projects, including Google's Cloud ML Prediction. Our core serving code is available to all via our open-source releases.

Looking forward, our work is far from done and we are exploring several avenues of innovation. Today we are excited to share early progress in two experimental areas:

- Granular batching: A key technique we employ to achieve high throughput on specialized hardware (GPUs and TPUs) is "batching": processing multiple examples jointly for efficiency. We are developing technology and best practices to improve batching to: (a) enable batching to target just the GPU/TPU portion of the computation, for maximum efficiency; (b) enable batching within recursive neural networks, used to process sequence data, e.g. text and event sequences. W[...]
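
To connect this to the SavedModel format mentioned above, here is a minimal sketch of exporting a toy TensorFlow 1.x graph into a versioned SavedModel directory of the kind TensorFlow Serving loads; the model, tensor names, and export path are placeholders for the example.

  import tensorflow as tf

  export_dir = "/tmp/my_model/1"   # version subdirectory enables updates/rollback
  builder = tf.saved_model.builder.SavedModelBuilder(export_dir)

  with tf.Session(graph=tf.Graph()) as sess:
      # A trivial stand-in model: a single matmul.
      x = tf.placeholder(tf.float32, [None, 4], name="x")
      w = tf.Variable(tf.zeros([4, 2]), name="w")
      scores = tf.matmul(x, w, name="scores")
      sess.run(tf.global_variables_initializer())

      # Describe the inference signature so clients know the input/output tensors.
      signature = tf.saved_model.signature_def_utils.predict_signature_def(
          inputs={"x": x}, outputs={"scores": scores})
      builder.add_meta_graph_and_variables(
          sess,
          tags=[tf.saved_model.tag_constants.SERVING],
          signature_def_map={"predict": signature})
  builder.save()

The serving binary can then be pointed at the parent directory (for example via tensorflow_model_server's --model_base_path flag) so that new version subdirectories are picked up as they appear.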



Eager Execution: An imperative, define-by-run interface to TensorFlow

2017-10-31T11:57:16.102-07:00

Posted by Asim Shankar and Wolff Dobson, Google Brain Team

Today, we introduce eager execution for TensorFlow. Eager execution is an imperative, define-by-run interface where operations are executed immediately as they are called from Python. This makes it easier to get started with TensorFlow, and can make research and development more intuitive.

The benefits of eager execution include:

- Fast debugging with immediate run-time errors and integration with Python tools
- Support for dynamic models using easy-to-use Python control flow
- Strong support for custom and higher-order gradients
- Almost all of the available TensorFlow operations

Eager execution is available now as an experimental feature, so we're looking for feedback from the community to guide our direction.

To understand this all better, let's look at some code. This gets pretty technical; familiarity with TensorFlow will help.

Using Eager Execution
When you enable eager execution, operations execute immediately and return their values to Python without requiring a Session.run(). For example, to multiply two matrices together, we write this:

  import tensorflow as tf
  import tensorflow.contrib.eager as tfe

  tfe.enable_eager_execution()

  x = [[2.]]
  m = tf.matmul(x, x)

It's straightforward to inspect intermediate results with print or the Python debugger.

  print(m)
  # The 1x1 matrix [[4.]]

Dynamic models can be built with Python flow control. Here's an example of the Collatz conjecture using TensorFlow's arithmetic operations:

  a = tf.constant(12)
  counter = 0
  while not tf.equal(a, 1):
    if tf.equal(a % 2, 0):
      a = a / 2
    else:
      a = 3 * a + 1
    print(a)

Here, the use of the tf.constant(12) Tensor object will promote all math operations to tensor operations, and as such all return values will be tensors.

Gradients
Most TensorFlow users are interested in automatic differentiation. Because different operations can occur during each call, we record all forward operations to a tape, which is then played backwards when computing gradients. After we've computed the gradients, we discard the tape.

If you're familiar with the autograd package, the API is very similar. For example:

  def square(x):
    return tf.multiply(x, x)

  grad = tfe.gradients_function(square)

  print(square(3.))  # [9.]
  print(grad(3.))    # [6.]

The gradients_function call takes a Python function square() as an argument and returns a Python callable that computes the partial derivatives of square() with respect to its inputs. So, to get the derivative of square() at 3.0, invoke grad(3.0), which is 6.

The same gradients_function call can be used to get the second derivative of square:

  gradgrad = tfe.gradients_function(lambda x: grad(x)[0])
  print(gradgrad(3.))  # [2.]

As we noted, control flow can cause different operations to run, such as in this example:

  def abs(x):
    return x if x > 0. else -x

  grad = tfe.gradients_function(abs)

  print(grad(2.0))   # [1.]
  print(grad(-2.0))  # [-1.]

Custom Gradients
Users may want to define custom gradients for an operation, or for a function. This may be useful for multiple reasons, including providing a more efficient or more numerically stable gradient for a sequence of operations.

Here is an example that illustrates the use of custom gradients. Let's start by looking at the function log(1 + e^x), which commonly occurs in the computation of cross entropy and log likelihoods.

  def log1pexp(x):
    return tf.log(1 + tf.exp(x))

  grad_log1pexp = tfe.gradients_function(log1pexp)

  # The gradient computation works fine at x = 0.
  print(grad_log1pexp(0.))
  # [0.5]

  # However it returns a `nan` at x = 100 due to numerical instability.
  print(grad_log1pexp(100.))
  # [nan]

We can use a custom gradient for the above function that analytically simplifies the gradient expression. Notice how the gradient function implementation below reuses an expression (tf.exp(x)) that was computed during the forward pass, [...]
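
The custom-gradient snippet itself is cut off above, so here is a sketch of how such a numerically stable version could be written. The tfe.custom_gradient decorator name follows the tf.contrib.eager API of the time and should be treated as an assumption; the analytic simplification used is that d/dx log(1 + e^x) = 1 - 1/(1 + e^x).

  import tensorflow as tf
  import tensorflow.contrib.eager as tfe

  tfe.enable_eager_execution()  # as in the earlier snippets

  # Sketch (decorator name per the tf.contrib.eager API of the time): a custom
  # gradient for log(1 + exp(x)) that reuses tf.exp(x) from the forward pass
  # and simplifies the gradient analytically so it stays numerically stable.
  @tfe.custom_gradient
  def log1pexp(x):
      e = tf.exp(x)
      def grad(dy):
          return dy * (1.0 - 1.0 / (1.0 + e))   # = dy * sigmoid(x)
      return tf.log(1.0 + e), grad

  grad_log1pexp = tfe.gradients_function(log1pexp)
  print(grad_log1pexp(0.))    # [0.5]
  print(grad_log1pexp(100.))  # [1.0], no longer nan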



Closing the Simulation-to-Reality Gap for Deep Robotic Learning

2017-11-06T10:30:00.670-08:00

Posted by Konstantinos Bousmalis, Senior Research Scientist, and Sergey Levine, Faculty Advisor, Google Brain Team

Each of us can learn remarkably complex skills that far exceed the proficiency and robustness of even the most sophisticated robots, when it comes to basic sensorimotor skills like grasping. However, we also draw on a lifetime of experience, learning over the course of multiple years how to interact with the world around us. Requiring such a lifetime of experience for a learning-based robot system is quite burdensome: the robot would need to operate continuously, autonomously, and initially at a low level of proficiency before it can become useful. Fortunately, robots have a powerful tool at their disposal: simulation.

Simulating many years of robotic interaction is quite feasible with modern parallel computing, physics simulation, and rendering technology. Moreover, the resulting data comes with automatically-generated annotations, which is particularly important for tasks where success is hard to infer automatically. The challenge with simulated training is that even the best available simulators do not perfectly capture reality. Models trained purely on synthetic data fail to generalize to the real world, as there is a discrepancy between simulated and real environments, in terms of both visual and physical properties. In fact, the more we increase the fidelity of our simulations, the more effort we have to expend in order to build them, both in terms of implementing complex physical phenomena and in terms of creating the content (e.g., objects, backgrounds) to populate these simulations. This difficulty is compounded by the fact that powerful optimization methods based on deep learning are exceptionally proficient at exploiting simulator flaws: the more powerful the machine learning algorithm, the more likely it is to discover how to "cheat" the simulator to succeed in ways that are infeasible in the real world. The question then becomes: how can a robot utilize simulation to enable it to perform useful tasks in the real world?

The difficulty of transferring simulated experience into the real world is often called the "reality gap." The reality gap is a subtle but important discrepancy between reality and simulation that prevents simulated robotic experience from directly enabling effective real-world performance. Visual perception often constitutes the widest part of the reality gap: while simulated images continue to improve in fidelity, the peculiar and pathological regularities of synthetic pictures, and the wide, unpredictable diversity of real-world images, make bridging the reality gap particularly difficult when the robot must use vision to perceive the world, as is the case for example in many manipulation tasks. Recent advances in closing the reality gap with deep learning in computer vision for tasks such as object classification and pose estimation provide promising solutions. For example, Shrivastava et al. and Bousmalis et al. explored pixel-level domain adaptation, while Ganin et al. and Bousmalis and Trigeorgis et al. focus on feature-level domain adaptation. These advances required a rethinking of the approaches used to solve the simulation-to-reality domain shift problem for robotic manipulation as well.

Although a number of recent works have sought to address the reality gap in robotics, through techniques such as machine learning-based domain adaptation (Tzeng et al.) and randomization of simulated environments (Sadeghi and Levine), effective transfer in robotic manipulation has been limited to relatively simple tasks, such as grasping rectangular, brightly-colored objects (Tobin et al. and James et al.) and free-space motion (Christiano et al.). In t[...]



Announcing OpenFermion: The Open Source Chemistry Package for Quantum Computers

2017-10-26T09:50:50.960-07:00

Posted by Ryan Babbush and Jarrod McClean, Quantum Software Engineers, Quantum AI Team

"The underlying physical laws necessary for the mathematical theory of a large part of physics and the whole of chemistry are thus completely known, and the difficulty is only that the exact application of these laws leads to equations much too complicated to be soluble."
— Paul Dirac, Quantum Mechanics of Many-Electron Systems (1929)

In this passage, physicist Paul Dirac laments that while quantum mechanics accurately models all of chemistry, exactly simulating the associated equations appears intractably complicated. Not until 1982 would Richard Feynman suggest that instead of surrendering to the complexity of quantum mechanics, we might harness it as a computational resource. Hence, the original motivation for quantum computing: by operating a computer according to the laws of quantum mechanics, one could efficiently unravel exact simulations of nature. Such simulations could lead to breakthroughs in areas such as photovoltaics, batteries, new materials, pharmaceuticals and superconductivity. And while we do not yet have a quantum computer large enough to solve classically intractable problems in these areas, rapid progress is being made. Last year, Google published this paper detailing the first quantum computation of a molecule using a superconducting qubit quantum computer. Building on that work, the quantum computing group at IBM scaled the experiment to larger molecules, which made the cover of Nature last month.

Today, we announce the release of OpenFermion, the first open source platform for translating problems in chemistry and materials science into quantum circuits that can be executed on existing platforms. OpenFermion is a library for simulating the systems of interacting electrons (fermions) which give rise to the properties of matter. Prior to OpenFermion, quantum algorithm developers would need to learn a significant amount of chemistry and write a large amount of code hacking apart other codes to put together even the most basic quantum simulations. While the project began at Google, collaborators at ETH Zurich, Lawrence Berkeley National Labs, University of Michigan, Harvard University, Oxford University, Dartmouth College, Rigetti Computing and NASA all contributed to alpha releases. You can learn more details about this release in our paper, OpenFermion: The Electronic Structure Package for Quantum Computers.

One way to think of OpenFermion is as a tool for generating and compiling physics equations which describe chemical and material systems into representations which can be interpreted by a quantum computer.1 The most effective quantum algorithms for these problems build upon and extend the power of classical quantum chemistry packages used and developed by research chemists across government, industry and academia. Accordingly, we are also releasing OpenFermion-Psi4 and OpenFermion-PySCF, which are plugins for using OpenFermion in conjunction with the classical electronic structure packages Psi4 and PySCF.

The core OpenFermion library is designed in a quantum programming framework agnostic way to ensure compatibility with various platforms being developed by the community. This allows OpenFermion to support external packages which compile quantum assembly language specifications for diverse hardware platforms. We hope this decision will help establish OpenFermion as a community standard for putting quantum chemistry on quantum computers. To see how OpenFermion is used with diverse quantum programming frameworks, take a look at OpenFermion-ProjectQ and Forest-OpenFermion — plugins which link OpenFermion to the externally developed circuit simulation and compilation platforms known as ProjectQ and Forest.

The following wo[...]
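
As a small taste of the library, the snippet below writes a two-orbital fermionic hopping term and maps it to qubit operators with the Jordan-Wigner transform. The module paths follow the alpha releases described above, so verify them against the version you install.

  from openfermion.ops import FermionOperator
  from openfermion.transforms import jordan_wigner

  # A hopping term between spin-orbitals 0 and 2, plus its Hermitian conjugate;
  # "2^ 0" means: create a fermion in orbital 2, annihilate one in orbital 0.
  hopping = FermionOperator("2^ 0", 1.0) + FermionOperator("0^ 2", 1.0)

  # Map the fermionic operator to qubit operators (sums of Pauli strings)
  # that a quantum circuit can implement.
  qubit_hamiltonian = jordan_wigner(hopping)
  print(qubit_hamiltonian)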



Announcing AVA: A Finely Labeled Video Dataset for Human Action Understanding

2017-10-19T11:42:39.610-07:00

Posted by Chunhui Gu & David Ross, Software Engineers

Teaching machines to understand human actions in videos is a fundamental research problem in Computer Vision, essential to applications such as personal video search and discovery, sports analysis, and gesture interfaces. Despite exciting breakthroughs made over the past years in classifying and finding objects in images, recognizing human actions still remains a big challenge. This is due to the fact that actions are, by nature, less well-defined than objects in videos, making it difficult to construct a finely labeled action video dataset. And while many benchmarking datasets, e.g., UCF101, ActivityNet and DeepMind's Kinetics, adopt the labeling scheme of image classification and assign one label to each video or video clip in the dataset, no dataset exists for complex scenes containing multiple people who could be performing different actions.

In order to facilitate further research into human action recognition, we have released AVA, coined from "atomic visual actions", a new dataset that provides multiple action labels for each person in extended video sequences. AVA consists of URLs for publicly available videos from YouTube, annotated with a set of 80 atomic actions (e.g. "walk", "kick (an object)", "shake hands") that are spatio-temporally localized, resulting in 57.6k video segments, 96k labeled humans performing actions, and a total of 210k action labels. You can browse the website to explore the dataset and download annotations, and read our arXiv paper that describes the design and development of the dataset.

Compared with other action datasets, AVA possesses the following key characteristics:

- Person-centric annotation. Each action label is associated with a person rather than a video or clip. Hence, we are able to assign different labels to multiple people performing different actions in the same scene, which is quite common.
- Atomic visual actions. We limit our action labels to fine temporal scales (3 seconds), where actions are physical in nature and have clear visual signatures.
- Realistic video material. We use movies as the source of AVA, drawing from a variety of genres and countries of origin. As a result, a wide range of human behaviors appear in the data.

Examples of 3-second video segments (from Video Source) with their bounding box annotations in the middle frame of each segment. (For clarity, only one bounding box is shown for each example.)

To create AVA, we first collected a diverse set of long-form content from YouTube, focusing on the "film" and "television" categories, featuring professional actors of many different nationalities. We analyzed a 15-minute clip from each video, and uniformly partitioned it into 300 non-overlapping 3-second segments. The sampling strategy preserved sequences of actions in a coherent temporal context.

Next, we manually labeled all bounding boxes of persons in the middle frame of each 3-second segment. For each person in the bounding box, annotators selected a variable number of labels from a pre-defined atomic action vocabulary (with 80 classes) that describe the person's actions within the segment. These actions were divided into three groups: pose/movement actions, person-object interactions, and person-person interactions. Because we exhaustively labeled all people performing all actions, the frequencies of AVA's labels followed a long-tail distribution, as summarized below.

Distribution of AVA's atomic action labels. Labels displayed in the x-axis are only a partial set of our vocabulary.

The unique design of AVA allows us to derive some interesting statistics that are not available in other existing datasets. For example, given the large number of persons wit[...]
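
The sampling scheme described above is easy to reproduce: 15 minutes of analyzed video divided into non-overlapping 3-second windows yields exactly 300 segments, each annotated on its middle frame. The helper below only illustrates that arithmetic; it is not code from the AVA release.

  # Illustration of the AVA sampling scheme (not code from the AVA release).
  CLIP_SECONDS = 15 * 60      # 900 s analyzed per video
  SEGMENT_SECONDS = 3

  def segments(clip_start=0.0):
      """Yield (start, end, middle) times for each non-overlapping 3-s segment."""
      num_segments = CLIP_SECONDS // SEGMENT_SECONDS   # = 300
      for i in range(num_segments):
          start = clip_start + i * SEGMENT_SECONDS
          yield start, start + SEGMENT_SECONDS, start + SEGMENT_SECONDS / 2.0

  print(sum(1 for _ in segments()))  # 300 segments per analyzed clip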



Portrait mode on the Pixel 2 and Pixel 2 XL smartphones

2017-10-17T17:41:30.394-07:00

Posted by Marc Levoy, Principal Engineer, and Yael Pritch, Software Engineer

Portrait mode, a major feature of the new Pixel 2 and Pixel 2 XL smartphones, allows anyone to take professional-looking shallow depth-of-field images. This feature helped both devices earn DxO's highest mobile camera ranking, and works with both the rear-facing and front-facing cameras, even though neither is dual-camera (normally required to obtain this effect). Today we discuss the machine learning and computational photography techniques behind this feature.

HDR+ picture without (left) and with (right) portrait mode. Note how portrait mode's synthetic shallow depth of field helps suppress the cluttered background and focus attention on the main subject. Click on these links in the caption to see full resolution versions. Photo by Matt Jones

What is a shallow depth-of-field image?
A single-lens reflex (SLR) camera with a big lens has a shallow depth of field, meaning that objects at one distance from the camera are sharp, while objects in front of or behind that "in-focus plane" are blurry. Shallow depth of field is a good way to draw the viewer's attention to a subject, or to suppress a cluttered background. Shallow depth of field is what gives portraits captured using SLRs their characteristic artistic look.

The amount of blur in a shallow depth-of-field image depends on depth; the farther objects are from the in-focus plane, the blurrier they appear. The amount of blur also depends on the size of the lens opening. A 50mm lens with an f/2.0 aperture has an opening 50mm/2 = 25mm in diameter. With such a lens, objects that are even a few inches away from the in-focus plane will appear soft.

One other parameter worth knowing about depth of field is the shape taken on by blurred points of light. This shape is called bokeh, and it depends on the physical structure of the lens's aperture. Is the bokeh circular? Or is it a hexagon, due to the six metal leaves that form the aperture inside some lenses? Photographers debate tirelessly about what constitutes good or bad bokeh.

Synthetic shallow depth of field images
Unlike SLR cameras, mobile phone cameras have a small, fixed-size aperture, which produces pictures with everything more or less in focus. But if we knew the distance from the camera to points in the scene, we could replace each pixel in the picture with a blur. This blur would be an average of the pixel's color with its neighbors, where the amount of blur depends on the distance of that scene point from the in-focus plane. We could also control the shape of this blur, meaning the bokeh.

How can a cell phone estimate the distance to every point in the scene? The most common method is to place two cameras close to one another — so-called dual-camera phones. Then, for each patch in the left camera's image, we look for a matching patch in the right camera's image. The position in the two images where this match is found gives the depth of that scene feature through a process of triangulation. This search for matching features is called a stereo algorithm, and it works pretty much the same way our two eyes do.

A simpler version of this idea, used by some single-camera smartphone apps, involves separating the image into two layers — pixels that are part of the foreground (typically a person) and pixels that are part of the background. This separation, sometimes called semantic segmentation, lets you blur the background, but it has no notion of depth, so it can't tell you how much to blur it. Also, if there is an object in front of the person, i.e. very close to the camera, it won't be blurred out, even though a real camera would do this.

Whether done using stereo or segmentation, artificially blurring pix[...]
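
The depth-dependent blur described above can be prototyped with a toy renderer: blur each pixel in proportion to how far its disparity (1/depth) is from the in-focus plane's disparity, approximating the spatially varying blur by blending a few uniformly blurred copies of the image. This is purely illustrative, not the Pixel 2's rendering pipeline, and the parameter values are made up.

  import numpy as np
  from scipy.ndimage import gaussian_filter

  def synthetic_shallow_dof(image, depth, focus_depth, strength=30.0, levels=8):
      """Toy synthetic shallow depth-of-field rendering (illustrative only)."""
      disparity = 1.0 / np.maximum(depth, 1e-3)
      focus_disparity = 1.0 / focus_depth
      radius = strength * np.abs(disparity - focus_disparity)  # per-pixel blur

      # Approximate spatially-varying blur by blending several uniformly
      # blurred versions according to each pixel's desired blur radius.
      max_r = radius.max() + 1e-6
      output = np.zeros_like(image, dtype=np.float64)
      weight = np.zeros(depth.shape)
      for level in range(levels):
          sigma = max_r * level / (levels - 1)
          blurred = gaussian_filter(image.astype(np.float64),
                                    sigma=(sigma, sigma, 0))
          w = np.maximum(0.0, 1.0 - np.abs(radius - sigma) / (max_r / levels))
          output += blurred * w[..., None]
          weight += w
      return output / np.maximum(weight, 1e-6)[..., None]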



TensorFlow Lattice: Flexibility Empowered by Prior Knowledge

2017-10-11T10:00:00.156-07:00

Posted by Maya Gupta, Research Scientist, Jan Pfeifer, Software Engineer, and Seungil You, Software Engineer
(Cross-posted on the Google Open Source Blog)

Machine learning has made huge advances in many applications including natural language processing, computer vision and recommendation systems by capturing complex input/output relationships using highly flexible models. However, a remaining challenge is problems with semantically meaningful inputs that obey known global relationships, like "the estimated time to drive a road goes up if traffic is heavier, and all else is the same." Flexible models like DNNs and random forests may not learn these relationships, and then may fail to generalize well to examples drawn from a different sampling distribution than the examples the model was trained on.

Today we present TensorFlow Lattice, a set of prebuilt TensorFlow Estimators that are easy to use, and TensorFlow operators to build your own lattice models. Lattices are multi-dimensional interpolated look-up tables (for more details, see [1--5]), similar to the look-up tables in the back of a geometry textbook that approximate a sine function. We take advantage of the look-up table's structure, which can be keyed by multiple inputs to approximate an arbitrarily flexible relationship, to satisfy monotonic relationships that you specify in order to generalize better. That is, the look-up table values are trained to minimize the loss on the training examples, but in addition, adjacent values in the look-up table are constrained to increase along given directions of the input space, which makes the model outputs increase in those directions. Importantly, because they interpolate between the look-up table values, the lattice models are smooth and the predictions are bounded, which helps to avoid spurious large or small predictions at testing time.

[Embedded video: https://www.youtube.com/embed/kaPheQxIsPY]

How Lattice Models Help You
Suppose you are designing a system to recommend nearby coffee shops to a user. You would like the model to learn, "if two cafes are the same, prefer the closer one." Below we show a flexible model (pink) that accurately fits some training data for users in Tokyo (purple), where there are many coffee shops nearby. The pink flexible model overfits the noisy training examples, and misses the overall trend that a closer cafe is better. If you used this pink model to rank test examples from Texas (blue), where businesses are spread farther out, you would find it acted strangely, sometimes preferring farther cafes!

Slice through a model's feature space where all the other inputs stay the same and only distance changes. A flexible function (pink) that is accurate on training examples from Tokyo (purple) predicts that a cafe 10km away is better than the same cafe if it were 5km away. This problem becomes more evident at test time if the data distribution has shifted, as shown here with blue examples from Texas, where cafes are spread out more.

A monotonic flexible function (green) is both accurate on training examples and can generalize for Texas examples, compared to the non-monotonic flexible function (pink) from the previous figure.

In contrast, a lattice model, trained over the same examples from Tokyo, can be constrained to satisfy such a monotonic relationship and results in a monotonic flexible function (green). The green line also accurately fits the Tokyo training examples, but also generalizes well to Texas, never preferring farther cafes.

In ge[...]
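
To make the "interpolated look-up table" idea concrete, here is a toy NumPy lattice: a small 2-D table whose output is the bilinear interpolation of its values, with monotonicity along one input imposed by forcing adjacent table values to be non-decreasing along that axis. This is an illustration of the concept, not the TensorFlow Lattice API.

  import numpy as np

  def lattice_interpolate(params, x):
      """Toy 2-D lattice (not the TensorFlow Lattice API): `params` is a K x K
      look-up table over [0, 1]^2; the output is the bilinear interpolation of
      the table values at input x = (x0, x1)."""
      k = params.shape[0]
      pos = np.clip(np.asarray(x), 0.0, 1.0) * (k - 1)
      i0, j0 = np.floor(pos).astype(int)
      i1, j1 = min(i0 + 1, k - 1), min(j0 + 1, k - 1)
      di, dj = pos[0] - i0, pos[1] - j0
      return ((1 - di) * (1 - dj) * params[i0, j0] +
              di * (1 - dj) * params[i1, j0] +
              (1 - di) * dj * params[i0, j1] +
              di * dj * params[i1, j1])

  # Monotonicity along the first input (say, "closer is better") can be imposed
  # by constraining adjacent look-up values to be non-decreasing along that axis.
  params = np.random.rand(5, 5)
  params = np.maximum.accumulate(params, axis=0)   # project onto monotone tables
  print(lattice_interpolate(params, (0.2, 0.7)))

Because the output is always an interpolation of table values, predictions stay bounded between the smallest and largest table entries, matching the "smooth and bounded" property described above.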



The Google Brain Team’s Approach to Research

2017-09-13T08:42:22.830-07:00

Posted by Jeff Dean, Google Senior Fellow

About a year ago, the Google Brain team first shared our mission “Make machines intelligent. Improve people’s lives.” In that time, we’ve shared updates on our work to infuse machine learning across Google products that hundreds of millions of users access every day, including Translate, Maps, and more. Today, I’d like to share more about how we approach this mission both through advancement in the fundamental theory and understanding of machine learning, and through research in the service of product.

Five years ago, our colleagues Alfred Spector, Peter Norvig, and Slav Petrov published a blog post and paper explaining Google’s hybrid approach to research, an approach that always allowed for varied balances between curiosity-driven and application-driven research. The biggest challenges in machine learning that the Brain team is focused on require the broadest exploration of new ideas, which is why our researchers set their own agendas, with much of our team focusing specifically on advancing the state of the art in machine learning. In doing so, we have published hundreds of papers over the last several years in conferences such as NIPS, ICML and ICLR, with acceptance rates significantly above conference averages.

Critical to achieving our mission is contributing new and fundamental research in machine learning. To that end, we’ve built a thriving team that conducts long-term, open research to advance science. In pursuing research across fields such as visual and auditory perception, natural language understanding, art and music generation, and systems architecture and algorithms, we regularly collaborate with researchers at external institutions, with fully one-third of our papers in 2017 having one or more cross-institutional authors. Additionally, we host collaborators from academic institutions to enhance our own work and strengthen our connection to the external scientific community.

We also believe in the importance of clear and understandable explanations of the concepts in modern machine learning. Distill.pub is an online technical journal providing a forum for this purpose, launched by Brain team members Chris Olah and Shan Carter. TensorFlow Playground is an in-browser experimental venue created by the Google Brain team’s visualization experts to give people insight into how neural networks behave on simple problems, and PAIR’s deeplearn.js is an open source WebGL-accelerated JavaScript library for machine learning that runs entirely in your browser, with no installations and no backend.

In addition to working with the best minds in academia and industry, the Brain team, like many other teams at Google, believes in fostering the development of the next generation of scientists. Our team hosts more than 50 interns every year, with the goal of publishing their work in top machine learning venues (roughly 25% of our group’s publications so far in 2017 have intern co-authors, usually as primary authors). Additionally, in 2016, we welcomed the first cohort of the Google Brain Residency Program, a one-year program for people who want to learn to do machine learning research. In its inaugural year, 27 residents conducted research alongside and under the mentorship of Brain team members, and authored more than 40 papers that were accepted in top research conferences.
Our second group of 36 residents started their one-year residency in our group in July, and they are already involved in a wide variety of projects.

Along with other teams within Google Research, we enjoy the freedom to both contribute fundamental advances in machine learning, and separately conduct p[...]



Highlights from the Annual Google PhD Fellowship Summit, and Announcing the 2017 Google PhD Fellows

2017-09-12T10:04:02.680-07:00

Posted by Susie Kim, Program Manager, University Relations

In 2009, Google created the PhD Fellowship Program to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. Now in its ninth year, our Fellowships have helped support over 300 graduate students in Australia, China and East Asia, India, North America, Europe and the Middle East who seek to shape and influence the future of technology.

Recently, Google PhD Fellows from around the globe converged on our Mountain View campus for the second annual Global PhD Fellowship Summit. VP of Education and University Programs Maggie Johnson welcomed the Fellows and went over Google's approach to research and its impact across our products and services. The students heard talks from researchers like Ed Chi, Douglas Eck, Úlfar Erlingsson, Dina Papagiannaki, Viren Jain, Ian Goodfellow, Kevin Murphy and Galen Andrew, and got a glimpse into some of the state-of-the-art research pursued across Google.

Google Fellows attending the 2017 Global PhD Fellowship Summit

The event included a panel discussion with Domagoj Babic, Kathryn McKinley, Nina Taft, Roy Want and Sunny Consolvo about their unique career paths in academia and industry. Fellows also had the chance to connect one-on-one with Googlers to discuss their research, as well as receive feedback from leaders in their fields in smaller deep dives and a poster event.

Fellows share their work with Google researchers during the poster session

Our PhD Fellows represent some of the best and brightest young researchers around the globe in Computer Science, and it is our ongoing goal to support them as they make their mark on the world.

We’d additionally like to announce the complete list of our 2017 Google PhD Fellows, including the latest recipients from China and East Asia, India, and Australia.
We look forward to seeing each of them at next year’s summit!

2017 Google PhD Fellows

Algorithms, Optimizations and Markets
Chiu Wai Sam Wong, University of California, Berkeley
Eric Balkanski, Harvard University
Haifeng Xu, University of Southern California

Human-Computer Interaction
Motahhare Eslami, University of Illinois, Urbana-Champaign
Sarah D'Angelo, Northwestern University
Sarah Mcroberts, University of Minnesota - Twin Cities
Sarah Webber, The University of Melbourne

Machine Learning
Aude Genevay, Fondation Sciences Mathématiques de Paris
Dustin Tran, Columbia University
Jamie Hayes, University College London
Jin-Hwa Kim, Seoul National University
Ling Luo, The University of Sydney
Martin Arjovsky, New York University
Sayak Ray Chowdhury, Indian Institute of Science
Song Zuo, Tsinghua University
Taco Cohen, University of Amsterdam
Yuhuai Wu, University of Toronto
Yunhe Wang, Peking University
Yunye Gong, Cornell University

Machine Perception, Speech Technology and Computer Vision
Avijit Dasgupta, International Institute of Information Technology - Hyderabad
Franziska Müller, Saarland University - Saarbrücken GSCS and Max Planck Institute for Informatics
George Trigeorgis, Imperial College London
Iro Armeni, Stanford University
Saining Xie, University of California, San Diego
Yu-Chuan Su, University of Texas, Austin

Mobile Computing
Sangeun Oh, Korea Advanced Institute of Science and Technology
Shuo Yang, Shanghai Jiao Tong University

Natural Language Processing
Bidisha Samanta, Indian Institute of Technology Kharagpur
Ekaterina Vylomova, The University of Melbourne
Jianpeng Cheng, The University of Edinburgh
Kevin Clark, Stanford University
Meng Zhang, Tsinghua University
Preksha Nama, Indian Institute of Technology Madras
Tim Rocktaschel, University College London

Privacy and S[...]



Build your own Machine Learning Visualizations with the new TensorBoard API

2017-09-11T12:21:50.311-07:00

Posted by Chi Zeng and Justine Tunney, Software Engineers, Google Brain Team

When we open-sourced TensorFlow in 2015, it included TensorBoard, a suite of visualizations for inspecting and understanding your TensorFlow models and runs. TensorBoard included a small, predetermined set of visualizations that are generic and applicable to nearly all deep learning applications, such as observing how loss changes over time or exploring clusters in high-dimensional spaces. However, in the absence of reusable APIs, adding new visualizations to TensorBoard was prohibitively difficult for anyone outside of the TensorFlow team, leaving out a long tail of potentially creative, beautiful and useful visualizations that could be built by the research community.

To allow the creation of new and useful visualizations, we announce the release of a consistent set of APIs that allows developers to add custom visualization plugins to TensorBoard. We hope that developers use this API to extend TensorBoard and ensure that it covers a wider variety of use cases.

We have updated the existing dashboards (tabs) in TensorBoard to use the new API, so they serve as examples for plugin creators. For the current listing of plugins included within TensorBoard, you can explore the tensorboard/plugins directory on GitHub. For instance, observe the new plugin that generates precision-recall curves.

The plugin demonstrates the three parts of a standard TensorBoard plugin:
A TensorFlow summary op used to collect data for later visualization. [GitHub]
A Python backend that serves custom data. [GitHub]
A dashboard within TensorBoard built with TypeScript and Polymer. [GitHub]

Additionally, like other plugins, the “pr_curves” plugin provides a demo that (1) users can look over in order to learn how to use the plugin and (2) the plugin author can use to generate example data during development. To further clarify how plugins work, we’ve also created a barebones TensorBoard “Greeter” plugin. This simple plugin collects greetings (simple strings preceded by “Hello, ”) during model runs and displays them. We recommend starting by exploring (or forking) the Greeter plugin as well as other existing plugins.

A notable example of how contributors are already using the TensorBoard API is Beholder, which was recently created by Chris Anderson while working on his master’s degree. Beholder shows a live video feed of data (e.g. gradients and convolution filters) as a model trains. You can watch the demo video here. We look forward to seeing what innovations will come out of the community. If you plan to contribute a plugin to TensorBoard’s repository, you should get in touch with us first through the issue tracker with your idea so that we can help out and possibly guide you.

Acknowledgements
Dandelion Mané and William Chargin played crucial roles in building this API. [...]
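As an illustration of the first ingredient in the plugin structure listed above (a summary op that collects data during training), here is a minimal TensorFlow 1.x-style sketch using the standard scalar summary. A custom plugin such as pr_curves supplies its own summary op, so treat the names and log directory below as generic placeholders rather than the plugin's API.

```python
import tensorflow as tf  # TensorFlow 1.x-style API

# A toy scalar we want to visualize over training steps.
loss = tf.placeholder(tf.float32, name="loss")
tf.summary.scalar("training_loss", loss)   # summary op: collects data for TensorBoard
merged = tf.summary.merge_all()

with tf.Session() as sess:
    # The FileWriter serializes summaries into an event file that
    # TensorBoard dashboards (built-in or plugin) read and render.
    writer = tf.summary.FileWriter("/tmp/demo_logs", sess.graph)
    for step in range(100):
        summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})
        writer.add_summary(summary, global_step=step)
    writer.close()
```

Running `tensorboard --logdir /tmp/demo_logs` then renders the logged values; a plugin's Python backend and TypeScript dashboard play the analogous serving and rendering roles for its own summary data.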



Seminal Ideas from 2007

2017-09-06T10:00:21.684-07:00

Posted by Anna Ukhanova, Technical Program Manager, Google Research Europe

It is not every day we have the chance to pause and think about how previous work has led to current successes, how it influenced other advances, and how to reinterpret it in today’s context. That’s what the ICML Test-of-Time Award is meant to achieve, and this year it was given to the work of Sylvain Gelly, now a researcher on the Google Brain team in our Zurich office, and David Silver, now at DeepMind and lead researcher on AlphaGo, for their 2007 paper Combining Online and Offline Knowledge in UCT. This paper presented new approaches to incorporate knowledge, learned offline or created online on the fly, into a search algorithm to augment its effectiveness.

The Game of Go is an ancient Chinese board game, which has tremendous popularity with millions of players worldwide. Since the success of Deep Blue in the game of Chess in the late ’90s, Go has been considered the next benchmark for machine learning and games. Indeed, it has simple rules, can be efficiently simulated, and progress can be measured objectively. However, due to the vast search space of possible moves, making an ML system capable of playing Go well represented a considerable challenge. Over the last two years, DeepMind’s AlphaGo has pushed the limit of what is possible with machine learning in games, bringing many innovations and technological advances in order to successfully defeat some of the best players in the world [1], [2], [3].

A little more than 10 years before the success of AlphaGo, the classical tree search techniques that were so successful in Chess were reigning in computer Go programs, but only reaching weak amateur level for human Go players. Thanks to Monte-Carlo Tree Search — a (then) new type of search algorithm based on sampling possible outcomes of the game from a position, and incrementally improving the search tree from the results of those simulations — computers were able to search much deeper in the game. This is important because it made it possible to incorporate less human knowledge in the programs — a task which is very hard to do right. Indeed, any missing knowledge that a human expert either cannot express or did not think about may create errors in the computer evaluation of the game position, and lead to blunders*.

In 2007, Sylvain and David augmented the Monte Carlo Tree Search techniques by exploring two types of knowledge incorporation: (i) online, where the decision for the next move is taken from the current position, using compute resources at the time when the next move is needed, and (ii) offline, where the learning process happens entirely before the game starts, and is summarized into a model that can be applied to all possible positions of a game (even though not all possible positions have been seen during the learning process). This ultimately led to the computer program MoGo, which showed an improvement in performance over previous Go algorithms.

(Video: https://www.youtube.com/embed/Bm7zah_LrmE)

For the online part, they adapted the simple idea that some actions don’t necessarily depend on each other. For example, if you need to book a vacation, the choice of the hotel, flight and car rental is obviously dependent on the choice of your destination.
However, once given a destination, these things can be chosen (mostly) independently of each other. The same idea c[...]
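For readers unfamiliar with Monte-Carlo Tree Search, here is a compact, generic sketch of its core loop (selection with a UCT-style score, expansion, random rollout, and backpropagation of results). It illustrates the general technique only, not the MoGo program or the paper's exact algorithm, and all names are made up for the example; player-perspective handling during backpropagation is deliberately omitted to keep it short.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = []          # expanded child nodes
        self.wins, self.visits = 0.0, 0

    def uct_score(self, c=1.4):
        # Balance exploitation (win rate) and exploration (rarely visited moves).
        if self.visits == 0:
            return float("inf")
        return self.wins / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def mcts(root_state, legal_moves, apply_move, rollout, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        # 1. Selection: descend the tree, always picking the best UCT score.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct_score)
        # 2. Expansion: add children for the legal moves of this position.
        for move in legal_moves(node.state):
            node.children.append(Node(apply_move(node.state, move), parent=node))
        if node.children:
            node = random.choice(node.children)
        # 3. Simulation: play random moves to the end and score the outcome.
        result = rollout(node.state)
        # 4. Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.wins += result
            node = node.parent
    # Recommend the most-visited move from the root.
    return max(root.children, key=lambda n: n.visits)
```

The offline and online knowledge the paper studies enters this loop as better priors for selecting moves and better policies for the rollout step, instead of the uniform random choices shown here.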



Transformer: A Novel Neural Network Architecture for Language Understanding

2017-09-01T09:37:11.137-07:00

Posted by Jakob Uszkoreit, Software Engineer, Natural Language Understanding

Neural networks, in particular recurrent neural networks (RNNs), are now at the core of the leading approaches to language understanding tasks such as language modeling, machine translation and question answering. In Attention Is All You Need we introduce the Transformer, a novel neural network architecture based on a self-attention mechanism that we believe to be particularly well-suited for language understanding. In our paper, we show that the Transformer outperforms both recurrent and convolutional models on academic English to German and English to French translation benchmarks. On top of higher translation quality, the Transformer requires less computation to train and is a much better fit for modern machine learning hardware, speeding up training by up to an order of magnitude.

BLEU scores (higher is better) of single models on the standard WMT newstest2014 English to German translation benchmark.

BLEU scores (higher is better) of single models on the standard WMT newstest2014 English to French translation benchmark.

Accuracy and Efficiency in Language Understanding

Neural networks usually process language by generating fixed- or variable-length vector-space representations. After starting with representations of individual words or even pieces of words, they aggregate information from surrounding words to determine the meaning of a given bit of language in context. For example, deciding on the most likely meaning and appropriate representation of the word “bank” in the sentence “I arrived at the bank after crossing the…” requires knowing if the sentence ends in “... road.” or “... river.”

RNNs have in recent years become the typical network architecture for translation, processing language sequentially in a left-to-right or right-to-left fashion. Because they read one word at a time, RNNs must perform multiple steps to make decisions that depend on words far away from each other. Processing the example above, an RNN could only determine that “bank” is likely to refer to the bank of a river after reading each word between “bank” and “river” step by step. Prior research has shown that, roughly speaking, the more such steps decisions require, the harder it is for a recurrent network to learn how to make those decisions.

The sequential nature of RNNs also makes it more difficult to fully take advantage of modern fast computing devices such as TPUs and GPUs, which excel at parallel and not sequential processing. Convolutional neural networks (CNNs) are much less sequential than RNNs, but in CNN architectures like ByteNet or ConvS2S the number of steps required to combine information from distant parts of the input still grows with increasing distance.

The Transformer

In contrast, the Transformer only performs a small, constant number of steps (chosen empirically). In each step, it applies a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their respective position. In the earlier example “I arrived at the bank after crossing the river”, to determine that the word “bank” refers to the shore of a river and not a financial institution, the Transformer can learn to immediately attend to the word “river” and make this decision in a single step. In fact, in our English-French translation model we observe exactly this behavior.

More specifically, to compute the next representation for a given word - “bank” for example - the Transformer compar[...]
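The self-attention step described above can be summarized in a few lines of NumPy: each word's vector is compared (via dot products) to every other word's, the scores are softmaxed into attention weights, and the word's new representation is the weighted average of all value vectors. This is a generic sketch of scaled dot-product attention for intuition, not the full multi-head Transformer from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (num_words, d) arrays of query, key and value vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # compare every word with every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sentence
    return weights @ V                               # weighted average of value vectors

# Toy example: 4 "words" with 8-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
# In the Transformer, Q, K and V are learned linear projections of X;
# here we reuse X itself to keep the sketch short.
print(scaled_dot_product_attention(X, X, X).shape)   # (4, 8)
```

Because every word attends to every other word in one matrix multiplication, the distance between "bank" and "river" costs a single step rather than a chain of recurrent updates.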



Exploring and Visualizing an Open Global Dataset

2017-08-25T10:00:30.473-07:00

Posted by Reena Jana, Creative Lead, Business Inclusion, and Josh Lovejoy, UX Designer, Google Research

Machine learning systems are increasingly influencing many aspects of everyday life, and are used by both the hardware and software products that serve people globally. As such, researchers and designers seeking to create products that are useful and accessible for everyone often face the challenge of finding data sets that reflect the variety and backgrounds of users around the world. In order to train these machine learning systems, open, global — and growing — datasets are needed.

Over the last six months, we’ve seen such a dataset emerge from users of Quick, Draw!, Google’s latest approach to helping wide, international audiences understand how neural networks work. A group of Googlers designed Quick, Draw! as a way for anyone to interact with a machine learning system in a fun way, drawing everyday objects like trees and mugs. The system will try to guess what their drawing depicts, within 20 seconds. While the goal of Quick, Draw! was simply to create a fun game that runs on machine learning, it has resulted in 800 million drawings from twenty million people in 100 nations, from Brazil to Japan to the U.S. to South Africa.

And now we are releasing an open dataset based on these drawings so that people around the world can contribute to, analyze, and inform product design with this data. The dataset currently includes 50 million drawings Quick, Draw! players have generated (we will continue to release more of the 800 million drawings over time).

It’s a considerable amount of data, and it’s also a fascinating lens into how to engage a wide variety of people to participate in (1) training machine learning systems, no matter what their technical background; and (2) the creation of open data sets that reflect a wide spectrum of cultures and points of view.

Seeing national — and global — patterns in one glance

To understand visual patterns within the dataset quickly and efficiently, we worked with artist Kyle McDonald to overlay thousands of drawings from around the world. This helped us create composite images and identify trends in each nation, as well as across all nations. We made animations of 1,000 layered international drawings of cats and chairs, below, to share how we searched for visual trends with this data:

Cats, made from 1,000 drawings from around the world:

Chairs, made from 1,000 drawings from around the world:

Doodles of naturally recurring objects, like cats (or trees, rainbows, or skulls) often look alike across cultures:

However, for objects that might be familiar to some cultures, but not others, we saw notable differences. Sandwiches took defined forms or were a jumbled set of lines; mug handles pointed in opposite directions; and chairs were drawn facing forward or sideways, depending on the nation or region of the world:

One size doesn’t fit all

These composite drawings, we realized, could reveal how perspectives and preferences differ between audiences from different regions, from the type of bread used in sandwiches to the shape of a coffee cup, to the aesthetic of how to depict objects so they are visually appealing. For example, a more straightforward, head-on view was more consistent in some nations; side angles in others.

Overlaying the images also revealed how we can improve the way we train neural networks when we lack a variety of data — even within a large, open, and international data set. For example, when we analyzed 115,000+ drawings of shoes in the Quic[...]
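As a hedged illustration of how one might start exploring the released drawings, the sketch below assumes the simplified newline-delimited JSON distribution of the dataset, where each line is a record with fields such as "word", "countrycode" and "drawing" (a list of strokes). Those field names and the local filename are assumptions for the example; check the dataset's documentation for the authoritative format.

```python
import json
from collections import Counter

def load_drawings(path, max_records=10000):
    """Read records from an ndjson file of drawings (assumed format)."""
    drawings = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= max_records:
                break
            drawings.append(json.loads(line))   # one JSON record per line
    return drawings

cats = load_drawings("full_simplified_cat.ndjson")   # hypothetical local filename

# Count where the cat drawings came from, to compare styles across countries.
by_country = Counter(d.get("countrycode", "unknown") for d in cats)
print(by_country.most_common(10))
```

Grouping records by country code like this is the simplest starting point for the kind of cross-cultural comparisons described above.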



Launching the Speech Commands Dataset

2017-08-24T10:06:05.326-07:00

Posted by Pete Warden, Software Engineer, Google Brain Team

At Google, we’re often asked how to get started using deep learning for speech and other audio recognition problems, like detecting keywords or commands. And while there are some great open source speech recognition systems like Kaldi that can use neural networks as a component, their sophistication makes them tough to use as a guide to simpler tasks. Perhaps more importantly, there aren’t many free and openly available datasets ready to be used for a beginner’s tutorial (many require preprocessing before a neural network model can be built on them) or that are well suited for simple keyword detection.

To solve these problems, the TensorFlow and AIY teams have created the Speech Commands Dataset, and used it to add training* and inference sample code to TensorFlow. The dataset has 65,000 one-second-long utterances of 30 short words, by thousands of different people, contributed by members of the public through the AIY website. It’s released under a Creative Commons BY 4.0 license, and will continue to grow in future releases as more contributions are received. The dataset is designed to let you build basic but useful voice interfaces for applications, with common words like “Yes”, “No”, digits, and directions included. The infrastructure we used to create the data has been open sourced too, and we hope to see it used by the wider community to create their own versions, especially to cover underserved languages and applications.

To try it out for yourself, download the prebuilt set of the TensorFlow Android demo applications and open up “TF Speech”. You’ll be asked for permission to access your microphone, and then see a list of ten words, each of which should light up as you say them.

The results will depend on whether your speech patterns are covered by the dataset, so it may not be perfect — commercial speech recognition systems are a lot more complex than this teaching example. But we’re hoping that as more accents and variations are added to the dataset, and as the community contributes improved models to TensorFlow, we’ll continue to see improvements and extensions.

You can also learn how to train your own version of this model through the new audio recognition tutorial on TensorFlow.org. With the latest development version of the framework and a modern desktop machine, you can download the dataset and train the model in just a few hours. You’ll also see a wide variety of options to customize the neural network for different problems, and to make different latency, size, and accuracy tradeoffs to run on different platforms.

We are excited to see what new applications people are able to build with the help of this dataset and tutorial, so I hope you get a chance to dive in and start recognizing!

* The architecture this network is based on is described in Convolutional Neural Networks for Small-footprint Keyword Spotting, presented at Interspeech 2015.↩ [...]
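As a rough illustration of the kind of preprocessing that keyword-spotting models typically start from (not the tutorial's exact pipeline), the sketch below loads a one-second WAV clip and computes a log spectrogram with SciPy. The filename is a placeholder, and the window sizes are chosen only for illustration.

```python
import numpy as np
from scipy.io import wavfile
from scipy import signal

# Load a one-second, 16 kHz utterance such as "yes_0001.wav" (placeholder name).
sample_rate, samples = wavfile.read("yes_0001.wav")

# Short-time Fourier analysis: ~30 ms windows with 50% overlap.
freqs, times, spec = signal.spectrogram(
    samples.astype(np.float32),
    fs=sample_rate,
    nperseg=480,
    noverlap=240,
)

# Log-compress so quieter frequency bands are still visible to the model.
log_spec = np.log(spec + 1e-10)
print(log_spec.shape)   # (num_frequency_bins, num_time_frames)
```

A small convolutional network can then treat this time-frequency image much like any other 2D input, which is the general idea behind the small-footprint keyword-spotting architecture referenced in the footnote.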



Google at KDD’17: Graph Mining and Beyond

2017-09-12T09:29:03.923-07:00

Posted by Bryan Perozzi, Research Scientist, NYC Algorithms and Optimization Team

The 23rd ACM conference on Knowledge Discovery and Data Mining (KDD’17), a main venue for academic and industry research in data science, information retrieval, data mining and machine learning, was held last week in Halifax, Canada. Google has historically been an active participant in KDD, and this year was no exception, with Googlers contributing numerous papers and participating in workshops.

In addition to our overall participation, we are happy to congratulate fellow Googler Bryan Perozzi for receiving the SIGKDD 2017 Doctoral Dissertation Award, which serves to recognize excellent research by doctoral candidates in the field of data mining and knowledge discovery. This award was given in recognition of his thesis on the topic of machine learning on graphs, performed at Stony Brook University under the advisement of Steven Skiena. Part of his thesis was developed during his internships at Google. The thesis dealt with using a restricted set of local graph primitives (such as ego-networks and truncated random walks) to effectively exploit the information around each vertex for classification, clustering, and anomaly detection. Most notably, the work introduced the random-walk paradigm for graph embedding with neural networks in DeepWalk.

DeepWalk: Online Learning of Social Representations, originally presented at KDD'14, outlines a method for using a series of local information obtained from truncated random walks to learn latent representations of nodes in a graph (e.g. users in a social network). The core idea was to treat each segment of a random walk as a sentence “in the language of the graph.” These segments could then be used as input for neural network models to learn representations of the graph’s nodes, using sequence modeling methods like word2vec (which had just been developed at the time). This research continues at Google, most recently with Learning Edge Representations via Low-Rank Asymmetric Projections.

The full list of Google contributions at KDD’17 is below (Googlers highlighted in blue).

Organizing Committee
Panel Chair: Andrew Tomkins
Research Track Program Chair: Ravi Kumar
Applied Data Science Track Program Chair: Roberto J. Bayardo
Research Track Program Committee: Sergei Vassilvitskii, Alex Beutel, Abhimanyu Das, Nan Du, Alessandro Epasto, Alex Fabrikant, Silvio Lattanzi, Kristen Lefevre, Bryan Perozzi, Karthik Raman, Steffen Rendle, Xiao Yu
Applied Data Science Program Track Committee: Edith Cohen, Ariel Fuxman, D. Sculley, Isabelle Stanton, Martin Zinkevich, Amr Ahmed, Azin Ashkan, Michael Bendersky, James Cook, Nan Du, Balaji Gopalan, Samuel Huston, Konstantinos Kollias, James Kunz, Liang Tang, Morteza Zadimoghaddam

Awards
Doctoral Dissertation Award: Bryan Perozzi, for Local Modeling of Attributed Graphs: Algorithms and Applications.
Doctoral Dissertation Runner-up Award: Alex Beutel, for User Behavior Modeling with Large-Scale Graph Analysis.

Papers
Ego-Splitting Framework: from Non-Overlapping to Overlapping Clusters
Alessandro Epasto, Silvio Lattanzi, Renato Paes Leme
HyperLogLog Hyperextended: Sketches for Concave Sublinear Frequency Statistics
Edith Cohen
Google Vizier: A Service for Black-Box Optimization
Daniel Golovin, Benjamin Solnik, Subhodeep Moitra,[...]
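To make the "random walks as sentences" idea concrete, here is a small, generic sketch of the DeepWalk recipe using networkx and gensim (4.x-style Word2Vec arguments): generate truncated random walks from every node, then feed the walks to word2vec as if each walk were a sentence. This illustrates the paradigm only, not the authors' implementation, and the hyperparameters are arbitrary.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_length=20):
    """Generate truncated random walks; each walk is a list of node ids as strings."""
    walks = []
    for _ in range(walks_per_node):
        for start in graph.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(node) for node in walk])   # word2vec expects string tokens
    return walks

# Toy social network; in practice this would be a much larger graph.
graph = nx.karate_club_graph()
walks = random_walks(graph)

# Treat each walk as a "sentence in the language of the graph".
model = Word2Vec(sentences=walks, vector_size=64, window=5, min_count=1, sg=1)
print(model.wv["0"][:5])   # learned embedding for node 0
```

Nodes that co-occur on many walks end up with nearby embeddings, which is what makes the vectors useful for downstream classification and clustering.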



Announcing the NYC Algorithms and Optimization Site

2017-08-21T10:11:26.652-07:00

Posted by Vahab Mirrokni, Principal Research Scientist and Xerxes Dotiwalla, Product Manager, NYC Algorithms and Optimization Team

New York City is home to several Google algorithms research groups. We collaborate closely with the teams behind many Google products and work on a wide variety of algorithmic challenges, like optimizing infrastructure, protecting privacy, improving friend suggestions and much more.

Today, we’re excited to provide more insights into the research done in the Big Apple with the launch of the NYC Algorithms and Optimization Team page. The NYC Algorithms and Optimization Team comprises multiple overlapping research groups working on large-scale graph mining, large-scale optimization and market algorithms.

Large-scale Graph Mining

The Large-scale Graph Mining Group is tasked with building the most scalable library for graph algorithms and analysis and applying it to a multitude of Google products. We formalize data mining and machine learning challenges as graph algorithms problems and perform fundamental research in those fields, leading to publications in top venues.

Our projects include:
Large-scale Similarity Ranking: Our research in pairwise similarity ranking has produced a number of innovative methods, which we have published in top venues such as WWW, ICML, and VLDB, e.g., improving friend suggestion using ego-networks and computing similarity rankings in large-scale multi-categorical bipartite graphs.
Balanced Partitioning: Balanced partitioning is often a crucial first step in solving large-scale graph optimization problems. As our paper shows, we are able to achieve a 15-25% reduction in cut size compared to state-of-the-art algorithms in the literature.
Clustering and Connected Components: We have state-of-the-art implementations of many different algorithms including hierarchical clustering, overlapping clustering, local clustering, spectral clustering, and connected components. Our methods are 10-30x faster than the best previously studied algorithms and can scale to graphs with trillions of edges.
Public-private Graph Computation: Our research on novel models of graph computation based on a personal view of private data preserves the privacy of each user.

Large-scale Optimization

The Large-scale Optimization Group’s mission is to develop large-scale optimization techniques and use them to improve the efficiency and robustness of infrastructure at Google. We apply techniques from areas such as combinatorial optimization, online algorithms, and control theory to make Google’s massive computational infrastructure do more with less. We combine online and offline optimizations to achieve such goals as increasing throughput, decreasing latency, minimizing resource contention, maximizing the efficacy of caches, and eliminating unnecessary work in distributed systems. Our research is used in critical infrastructure that supports core products:
Consistent Hashing: We designed memoryless balanced allocation algorithms to assign a dynamic set of clients to a dynamic set of servers such that the load on each server is bounded, and the allocation does not change by much for every update operation. This technique is currently implemented in Google Cloud Pub/Sub and externally in the open-source haproxy.
Distributed Optimization Based on Core-sets: Composable core-sets provide an effective method for solving optimization problems on massive datasets. This technique can be used for severa[...]
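As a rough sketch of the consistent-hashing-with-bounded-loads idea mentioned under Consistent Hashing above (an illustration of the general technique, not Google's production implementation or the haproxy code), clients are hashed onto a ring of servers, and a client that would push its server past a capacity bound is forwarded to the next server on the ring with spare room. The capacity rule used here is a simplifying assumption for the example.

```python
import bisect
import hashlib
import math

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class BoundedLoadRing:
    """Consistent hashing where no server exceeds roughly capacity_factor times the average load."""

    def __init__(self, servers, capacity_factor=1.25):
        self.ring = sorted((_hash(s), s) for s in servers)   # hash ring of servers
        self.capacity_factor = capacity_factor
        self.loads = {s: 0 for s in servers}

    def _capacity(self, total_clients):
        # Assumed bound: ceil(c * (n + 1) / m) with n clients already assigned, m servers.
        return math.ceil(self.capacity_factor * (total_clients + 1) / len(self.loads))

    def assign(self, client: str) -> str:
        cap = self._capacity(sum(self.loads.values()))
        # Start at the first server clockwise from the client's hash position...
        idx = bisect.bisect(self.ring, (_hash(client), ""))
        for step in range(len(self.ring)):
            server = self.ring[(idx + step) % len(self.ring)][1]
            # ...and forward to the next server if this one is already at capacity.
            if self.loads[server] + 1 <= cap:
                self.loads[server] += 1
                return server
        raise RuntimeError("no server with spare capacity")

ring = BoundedLoadRing(["server-a", "server-b", "server-c"])
for i in range(9):
    print(ring.assign(f"client-{i}"))
```

Because each client still starts from its own hash position, most assignments stay stable as servers or clients come and go, while the capacity check keeps any single server from being overloaded.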



Making Visible Watermarks More Effective

2017-08-18T11:00:07.131-07:00

Posted by Tali Dekel and Michael Rubinstein, Research Scientists

Whether you are a photographer, a marketing manager, or a regular Internet user, chances are you have encountered visible watermarks many times. Visible watermarks are those logos and patterns that are often overlaid on digital images provided by stock photography websites, marking the image owners while allowing viewers to perceive the underlying content so that they can license the images that fit their needs. It is the most common mechanism for protecting the copyrights of the hundreds of millions of photographs and stock images that are offered online daily.

It’s standard practice to use watermarks on the assumption that they prevent consumers from accessing the clean images, ensuring there will be no unauthorized or unlicensed use. However, in “On The Effectiveness Of Visible Watermarks”, recently presented at the 2017 Computer Vision and Pattern Recognition Conference (CVPR 2017), we show that a computer algorithm can get past this protection and remove watermarks automatically, giving users unobstructed access to the clean images the watermarks are intended to protect.

Left: example watermarked images from popular stock photography websites. Right: watermark-free version of the images on the left, produced automatically by a computer algorithm. More results are available below and on our project page. Image sources: Adobe Stock, 123RF.

As often done with vulnerabilities discovered in operating systems, applications or protocols, we want to disclose this vulnerability and propose solutions in order to help the photography and stock image communities adapt and better protect their copyrighted content and creations. From our experiments, much of the world’s stock imagery is currently susceptible to this circumvention. As such, in our paper we also propose ways to make visible watermarks more robust to such manipulations.

(Video: https://www.youtube.com/embed/02Ywt87OpS4)

The Vulnerability of Visible Watermarks

Visible watermarks are often designed to contain complex structures such as thin lines and shadows in order to make them harder to remove. Indeed, given a single image, it is extremely difficult for a computer to detect automatically which visual structures belong to the watermark and which belong to the underlying image. Manually, the task of removing a watermark from an image is tedious, and even with state-of-the-art editing tools it may take a Photoshop expert several minutes to remove a watermark from one image.

However, a fact that has been overlooked so far is that watermarks are typically added in a consistent manner to many images. We show that this consistency can be used to invert the watermarking process — that is, estimate the watermark image and its opacity, and recover the original, watermark-free image underneath. This can all be done automatically, without any user intervention or prior information about the watermark, and by only observing watermarked image collections publicly available online.

The consistency of a watermark across many images allows it to be removed automatically at mass scale. Left: input collection marked by the same watermark, middle: computed watermark and its opacity, right: recovered[...]
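As a simplified, per-pixel illustration of the inversion described above: once a watermark estimate and its opacity are known, a watermarked image modeled as J = alpha*W + (1-alpha)*I can be solved for the clean image I. The real algorithm additionally has to estimate W and alpha jointly from many images and handle per-image alignment and variations, which this sketch deliberately omits; all names are illustrative.

```python
import numpy as np

def recover_image(watermarked, watermark, alpha, eps=1e-6):
    """Invert the standard compositing model J = alpha*W + (1-alpha)*I.

    watermarked, watermark: float arrays in [0, 1] with the same shape.
    alpha: scalar or per-pixel opacity in [0, 1).
    """
    clean = (watermarked - alpha * watermark) / np.maximum(1.0 - alpha, eps)
    return np.clip(clean, 0.0, 1.0)

# Toy check with a synthetic image and watermark.
rng = np.random.default_rng(0)
image = rng.uniform(size=(64, 64, 3))
watermark = rng.uniform(size=(64, 64, 3))
alpha = 0.35
composited = alpha * watermark + (1 - alpha) * image
print(np.allclose(recover_image(composited, watermark, alpha), image, atol=1e-6))  # True
```

The same arithmetic also shows why the paper's proposed defenses (e.g. varying the watermark across images) help: if W or alpha differs per image, a single shared estimate no longer inverts the compositing cleanly.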