
Google Research Blog

The latest news on Google Research.

Updated: 2018-02-19T10:33:30.393-08:00


Assessing Cardiovascular Risk Factors with Computer Vision


Posted by Lily Peng MD PhD, Product Manager, Google Brain Team

Heart attacks, strokes and other cardiovascular (CV) diseases continue to be among the top public health issues. Assessing this risk is a critical first step toward reducing the likelihood that a patient suffers a CV event in the future. To do this assessment, doctors take into account a variety of risk factors, some genetic (like age and sex) and some with lifestyle components (like smoking and blood pressure). While most of these factors can be obtained by simply asking the patient, other factors, like cholesterol, require a blood draw. Doctors also take into account whether or not a patient has another disease, such as diabetes, which is associated with significantly increased risk of CV events.

Recently, we’ve seen many examples [1–4] of how deep learning techniques can help to increase the accuracy of diagnoses for medical imaging, especially for diabetic eye disease. In “Prediction of Cardiovascular Risk Factors from Retinal Fundus Photographs via Deep Learning,” published in Nature Biomedical Engineering, we show that in addition to detecting eye disease, images of the eye can very accurately predict other indicators of CV health. This discovery is particularly exciting because it suggests we might discover even more ways to diagnose health issues from retinal images.

Using deep learning algorithms trained on data from 284,335 patients, we were able to predict CV risk factors from retinal images with surprisingly high accuracy for patients from two independent datasets of 12,026 and 999 patients. For example, our algorithm could distinguish the retinal images of a smoker from those of a non-smoker 71% of the time.
In addition, while doctors can typically distinguish between the retinal images of patients with severe high blood pressure and normal patients, our algorithm could go further to predict the systolic blood pressure within 11 mmHg on average for patients overall, including those with and without high blood pressure.

LEFT: image of the back of the eye showing the macula (dark spot in the middle), optic disc (bright spot at the right), and blood vessels (dark red lines arcing out from the bright spot on the right). RIGHT: retinal image in gray, with the pixels used by the deep learning algorithm to make predictions about the blood pressure highlighted in shades of green (heatmap). We found that each CV risk factor prediction uses a distinct pattern, such as blood vessels for blood pressure, and optic disc for other predictions.

In addition to predicting the various risk factors (age, gender, smoking, blood pressure, etc.) from retinal images, our algorithm was fairly accurate at predicting the risk of a CV event directly. Our algorithm used the entire image to quantify the association between the image and the risk of heart attack or stroke. Given the retinal image of one patient who later (within 5 years) experienced a major CV event (such as a heart attack) and the image of another patient who did not, our algorithm could pick out the patient who had the CV event 70% of the time. This performance approaches the accuracy of other CV risk calculators that require a blood draw to measure cholesterol.

More importantly, we opened the “black box” by using attention techniques to look at how the algorithm was making its prediction. These techniques allow us to generate a heatmap that shows which pixels were the most important for predicting a specific CV risk factor. For example, the algorithm paid more attention to blood vessels for making predictions about blood pressure, as shown in the image above.
Explaining how the algorithm is making its prediction gives doctors more confidence in the algorithm itself. In addition, this technique could help generate hypotheses for future scientific investigations into CV risk and the retina.

At the broadest level, we are excited about this work because it may represent a new method of scientific discovery. Traditionally, medical discoveries are often made through a sophisticated form of guess and test [...]
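The "70% of the time" figure quoted in this post describes a pairwise comparison: given one patient who had a CV event and one who did not, how often does the model give the higher risk score to the first? The sketch below shows one common way to compute such a pairwise concordance. It is an illustration only, not the authors' evaluation code; the function name and the risk scores are made up.

```python
from itertools import product


def pairwise_concordance(event_scores, no_event_scores):
    """Fraction of (event, no-event) pairs in which the event patient
    received the higher risk score. Ties count as half a win."""
    pairs = list(product(event_scores, no_event_scores))
    if not pairs:
        raise ValueError("need at least one patient in each group")
    wins = sum(1.0 if e > n else 0.5 if e == n else 0.0 for e, n in pairs)
    return wins / len(pairs)


# Hypothetical risk scores, for illustration only.
had_event = [0.9, 0.6, 0.4]
no_event = [0.5, 0.3, 0.7, 0.2]

concordance = pairwise_concordance(had_event, no_event)
print(f"concordance = {concordance:.2f}")
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ordering of every pair.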

Introducing the HDR+ Burst Photography Dataset


Posted by Sam Hasinoff, Software Engineer, Machine Perception

Burst photography is the key idea underlying the HDR+ software on Google's recent smartphones, and a fundamental computational photography technique for improving image quality. Every photo taken with HDR+ is actually a composite, generated by capturing and merging a short burst of full-resolution photos. HDR+ has helped the Pixel and the Pixel 2 earn DxO's highest mobile camera ranking for two years in a row. The new portrait mode on the Pixel 2 also relies on HDR+, both for its basic image quality and to improve the quality of its depth estimation.

Today we're pleased to announce the public release of an archive of image bursts to the research community. This provides a way for others to compare their methods to the results of Google's HDR+ software running on the same input images. This dataset consists of 3,640 bursts of full-resolution raw images, made up of 28,461 individual images, along with HDR+ intermediate and final results for comparison. The images cover a wide range of photographic situations, including variation in subject, level of motion, brightness, and dynamic range.

Using bursts to improve image quality. HDR+ starts from a burst of full-resolution raw images (left). Depending on conditions, between 2 and 10 images are aligned and merged into an intermediate raw image (middle). This merged image has reduced noise and increased dynamic range, leading to a higher quality final result (right).

Better Images with Burst Photography

Burst photography provides the benefits associated with collecting more light, including reduced noise and improved dynamic range, but it avoids the motion blur that would come from increasing exposure times. This is particularly important for small smartphone cameras, whose size otherwise limits the amount of light they can capture. Since HDR+ was first released on Nexus 5 and 6, we've been busy improving the system.
As described in our recent SIGGRAPH Asia paper, HDR+ now starts from raw images, which helps improve image quality. This also means that the image processing pipeline is fully implemented using our software. Next, we eliminated shutter lag, which makes photography feel instantaneous. The HDR+ photo you get corresponds to the moment the button was pressed. Finally, we improved processing times and power consumption by implementing HDR+ on accelerators like the Qualcomm Hexagon DSP and the new Pixel Visual Core.

Mosaic of thumbnails illustrating the size and diversity of the HDR+ dataset. Putting a computational photography system like HDR+ into production, where users capture millions of photos per day, means that odd photographic corner cases must be handled in a robust way.

Using the Dataset

The scale and diversity of the HDR+ dataset also open up the opportunity to apply modern machine learning methods. Our dataset has already been incorporated in a recent research paper which uses a neural network to approximate part of the HDR+ pipeline, constrained to a representation suitable for fast image processing. Several more papers that apply learning to the HDR+ dataset are currently under review.

Inspired by the Middlebury archive of stereo data, our hope is that a shared dataset will enable the community to concentrate on comparing results. This approach is intrinsically more efficient than expecting researchers to configure and run competing techniques themselves, or to implement them from scratch if the code is proprietary. The HDR+ dataset is released under a Creative Commons license (CC-BY-SA). This license is largely unencumbered; however, our main intention is that the dataset be used for scientific purposes. For information about how to cite the dataset, please see the detailed description. We look forward to seeing what else researchers can do with the HDR+ dataset!

Acknowledgments

Special thanks to the photographers and subjects of the HDR+ dataset. [...]
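The noise benefit of merging a burst, described earlier in this post, can be illustrated with a toy simulation. The sketch below is not the HDR+ merge algorithm: it assumes perfectly aligned frames, a flat gray patch, and independent Gaussian noise, and it merges by plain averaging. All names and parameter values are hypothetical. Under those assumptions, averaging N frames should reduce the noise standard deviation by roughly a factor of sqrt(N).

```python
import random
import statistics


def simulate_burst(true_value, noise_sigma, num_frames, num_pixels, rng):
    """Return num_frames noisy copies of a flat patch with num_pixels pixels."""
    return [
        [true_value + rng.gauss(0.0, noise_sigma) for _ in range(num_pixels)]
        for _ in range(num_frames)
    ]


def merge_by_averaging(frames):
    """Average already-aligned frames pixel by pixel (a stand-in for a real merge)."""
    return [sum(pixel_values) / len(frames) for pixel_values in zip(*frames)]


rng = random.Random(0)  # fixed seed so the run is repeatable
frames = simulate_burst(true_value=100.0, noise_sigma=10.0,
                        num_frames=8, num_pixels=20000, rng=rng)

single_noise = statistics.pstdev(frames[0])
merged_noise = statistics.pstdev(merge_by_averaging(frames))

print(f"noise in one frame:      {single_noise:.2f}")
print(f"noise after merging 8:   {merged_noise:.2f}")
print(f"improvement factor:      {single_noise / merged_noise:.2f} (sqrt(8) is about 2.83)")
```

A real burst also needs alignment and a merge that rejects moving content, which this sketch deliberately leaves out.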

The Instant Motion Tracking Behind Motion Stills AR


Posted by Jianing Wei and Tyler Mullen, Software Engineers, Google Research

Last summer, we launched Motion Stills on Android, which delivered a great video capture and viewing experience on a wide range of Android phones. Then, we refined our Motion Stills technology further to enable the new motion photos feature in Pixel 2.

Today, we are excited to announce the new Augmented Reality (AR) mode in Motion Stills for Android. With the new AR mode, a user simply touches the viewfinder to place fun, virtual 3D objects on static or moving horizontal surfaces (e.g. tables, floors, or hands), allowing them to seamlessly interact with a dynamic real-world environment. You can also record and share the clips as GIFs and videos.

Motion Stills with instant motion tracking in action

AR mode is powered by instant motion tracking, a six-degree-of-freedom tracking system built upon the technology that powers Motion Text in Motion Stills iOS and the privacy blur on YouTube to accurately track static and moving objects. We refined and enhanced this technology to enable fun AR experiences that can run on any Android device with a gyroscope.

When you touch the viewfinder, Motion Stills AR “sticks” a 3D virtual object to that location, making it look as if it’s part of the real-world scene. By assuming that the tracked surface is parallel to the ground plane, and using the device’s accelerometer sensor to provide the initial orientation of the phone with respect to the ground plane, one can track the six degrees of freedom of the camera (3 for translation and 3 for rotation). This allows us to accurately transform and render the virtual object within the scene.

When the phone is approximately steady, the accelerometer sensor provides the acceleration due to the Earth’s gravity.
For horizontal planes the gravity vector is parallel to the normal of the tracked plane and can accurately provide the initial orientation of the phone.

Instant Motion Tracking

The core idea behind instant motion tracking is to decouple the camera’s translation and rotation estimation, treating them instead as independent optimization problems. First, we determine the 3D camera translation solely from the visual signal of the camera. To do this, we observe the target region's apparent 2D translation and relative scale across frames. A simple pinhole camera model relates both translation and scale of a box in the image plane with the final 3D translation of the camera.

The translation and the change in size (relative scale) of the box in the image plane can be used to determine 3D translation between two camera positions C1 and C2. However, as our camera model doesn’t assume the focal length of the camera lens, we do not know the true distance/depth of the tracked plane.

To account for this, we added scale estimation to our existing tracker (the one used in Motion Text) as well as region tracking outside the field of view of the camera. When the camera gets closer to the tracked surface, the virtual content scales accurately, which is consistent with perception of real-world objects. When you pan outside the field of view of the target region and back, the virtual object will reappear in approximately the same spot.

Independent translation (from visual signal only as shown by red box) and rotation tracking (from gyro; not shown)

After all this, we obtain the device’s 3D rotation (roll, pitch and yaw) using the phone’s built-in gyroscope. The estimated 3D translation combined with the 3D rotation provides us with the ability to render the virtual content correctly in the viewfinder.
And because we treat rotation and translation separately, our instant motion tracking approach is calibration-free and works on any Android device with a gyroscope.

Augmented chicken family with Motion Stills AR mode

We are excited to bring this new mode to Motion Stills for Android, and we hope you’ll enjoy it. Please download the new release of Motion Stills and keep sending us feedback with #motionstills on your favorite social media. [...]
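The pinhole relationship mentioned in the Instant Motion Tracking section can be sketched in a few lines. This is a simplified illustration, not the Motion Stills tracker. It assumes a static target, no camera rotation between the two frames, and box centers given in normalized image coordinates (pixel offset from the image center divided by the focal length). Because the true depth is unknown, the initial depth is set to an arbitrary 1.0, so the result is only defined up to scale. The function and parameter names are hypothetical.

```python
def camera_translation(center1, center2, relative_scale, depth1=1.0):
    """Estimate camera translation C2 - C1 from a tracked box in two frames.

    center1, center2: (x, y) box centers in normalized image coordinates.
    relative_scale:   box size in frame 2 divided by box size in frame 1.
    depth1:           assumed target depth in frame 1 (arbitrary units).
    """
    if relative_scale <= 0:
        raise ValueError("relative_scale must be positive")
    # Under a pinhole model, apparent size is inversely proportional to depth.
    depth2 = depth1 / relative_scale
    # Back-project the box center to a 3D point in each camera frame.
    p1 = (center1[0] * depth1, center1[1] * depth1, depth1)
    p2 = (center2[0] * depth2, center2[1] * depth2, depth2)
    # With no rotation, a static point P satisfies p = P - C, so C2 - C1 = p1 - p2.
    return tuple(a - b for a, b in zip(p1, p2))


# The box doubles in size while staying centered: the camera moved forward.
forward = camera_translation((0.0, 0.0), (0.0, 0.0), relative_scale=2.0)
print(forward)

# The box keeps its size but shifts right in the image: the camera moved left.
sideways = camera_translation((0.0, 0.0), (0.1, 0.0), relative_scale=1.0)
print(sideways)
```

In the approach the post describes, rotation comes separately from the gyroscope; this sketch covers only the translation half.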

The Google Brain Team — Looking Back on 2017 (Part 2 of 2)


Posted by Jeff Dean, Google Senior Fellow, on behalf of the entire Google Brain Team

The Google Brain team works to advance the state of the art in artificial intelligence by research and systems engineering, as one part of the overall Google AI effort. In Part 1 of this blog post, we shared some of our work in 2017 related to our broader research, from designing new machine learning algorithms and techniques to understanding them, as well as sharing data, software, and hardware with the community. In this post, we’ll dive into the research we do in some specific domains such as healthcare, robotics, creativity, fairness and inclusion, as well as share a little more about us.

Healthcare

We feel there is enormous potential for the application of machine learning techniques to healthcare. We are doing work across many different kinds of problems, including assisting pathologists in detecting cancer, understanding medical conversations to assist doctors and patients, and using machine learning to tackle a wide variety of problems in genomics, including an open-source release of a highly accurate variant calling system based on deep learning.

A lymph node biopsy, where our algorithm correctly identifies the tumor and not the benign macrophage.

We have continued our work on early detection of diabetic retinopathy (DR) and macular edema, building on the research paper we published in December 2016 in the Journal of the American Medical Association (JAMA). In 2017, we moved this project from research project to actual clinical impact. We partnered with Verily (a life sciences company within Alphabet) to guide this work through the regulatory process, and together we are incorporating this technology into Nikon's line of Optos ophthalmology cameras. In addition, we are working to deploy this system in India, where there is a shortage of 127,000 eye doctors and as a result, almost half of patients are diagnosed too late, after the disease has already caused vision loss.
As a part of a pilot, we’ve launched this system to help graders at Aravind Eye Hospitals to better diagnose diabetic eye disease. We are also working with our partners to understand the human factors affecting diabetic eye care, from ethnographic studies of patients and healthcare providers, to investigations on how eye care clinicians interact with the AI-enabled system.

First patient screened (top) and Iniya Paramasivam, a trained grader, viewing the output of the system (bottom).

We have also teamed up with researchers at leading healthcare organizations and medical centers including Stanford, UCSF, and University of Chicago to demonstrate the effectiveness of using machine learning to predict medical outcomes from de-identified medical records (i.e. given the current state of a patient, we believe we can predict the future for a patient by learning from millions of other patients’ journeys, as a way of helping healthcare professionals make better decisions). We’re very excited about this avenue of work and we look forward to telling you more about it in 2018.

Robotics

Our long-term goal in robotics is to design learning algorithms to allow robots to operate in messy, real-world environments and to quickly acquire new skills and capabilities via learning, rather than the carefully-controlled conditions and the small set of hand-programmed tasks that characterize today’s robots. One thrust of our research is on developing techniques for physical robots to use their own experience and that of other robots to build new skills and capabilities, pooling the shared experiences in order to learn collectively. We are also exploring ways in which we can combine computer-based simulations of robotic tasks with physical robotic experience to learn new tasks more rapidly.
While the physics of the simulator don’t entirely match up with the real world, we have observed that for robotics, simulated experience plus a small amount of real-world experience gives significantly bet[...]

The Google Brain Team — Looking Back on 2017 (Part 1 of 2)


Posted by Jeff Dean, Google Senior Fellow, on behalf of the entire Google Brain Team

The Google Brain team works to advance the state of the art in artificial intelligence by research and systems engineering, as one part of the overall Google AI effort. Last year we shared a summary of our work in 2016. Since then, we’ve continued to make progress on our long-term research agenda of making machines intelligent, and have collaborated with a number of teams across Google and Alphabet to use the results of our research to improve people’s lives. This first of two posts will highlight some of our work in 2017, including some of our basic research work, as well as updates on open source software, datasets, and new hardware for machine learning. In the second post we’ll dive into the research we do in specific domains where machine learning can have a large impact, such as healthcare, robotics, and some areas of basic science, as well as cover our work on creativity, fairness and inclusion and tell you a bit more about who we are.

Core Research

A significant focus of our team is pursuing research that advances our understanding and improves our ability to solve new problems in the field of machine learning. Below are several themes from our research last year.

AutoML

The goal of automating machine learning is to develop techniques for computers to solve new machine learning problems automatically, without the need for human machine learning experts to intervene on every new problem. If we’re ever going to have truly intelligent systems, this is a fundamental capability that we will need. We developed new approaches for designing neural network architectures using both reinforcement learning and evolutionary algorithms, scaled this work to state-of-the-art results on ImageNet classification and detection, and also showed how to learn new optimization algorithms and effective activation functions automatically.
We are actively working with our Cloud AI team to bring this technology into the hands of Google customers, as well as continuing to push the research in many directions.

Convolutional architecture discovered by Neural Architecture Search

Object detection with a network discovered by AutoML

Speech Understanding and Generation

Another theme is developing new techniques that improve the ability of our computing systems to understand and generate human speech, including our collaboration with the speech team at Google to develop a number of improvements for an end-to-end approach to speech recognition, which reduces the word error rate by a relative 16% compared with Google’s production speech recognition system. One nice aspect of this work is that it required many separate threads of research to come together (which you can find on arXiv: 1, 2, 3, 4, 5, 6, 7, 8, 9).

Components of the Listen-Attend-Spell end-to-end model for speech recognition

We also collaborated with our research colleagues on Google’s Machine Perception team to develop a new approach for performing text-to-speech generation (Tacotron 2) that dramatically improves the quality of the generated speech. This model achieves a mean opinion score (MOS) of 4.53 compared to a MOS of 4.58 for professionally recorded speech like you might find in an audiobook, and 4.34 for the previous best computer-generated speech system.
You can listen for yourself.

Tacotron 2’s model architecture

New Machine Learning Algorithms and Approaches

We continued to develop novel machine learning algorithms and approaches, including work on capsules (which explicitly look for agreement in activated features as a way of evaluating many different noisy hypotheses when performing visual tasks), sparsely-gated mixtures of experts (which enable very large models that are still computationally efficient), hypernetworks (which use the weights of one model to generate weights for another model), new kinds of multi-modal models (which perform multi-task learning across audio, visual, [...]

Introducing the CVPR 2018 Learned Image Compression Challenge


Posted by Michele Covell, Research Scientist, Google Research

Edit 17/01/2018: Due to popular request, the CLIC competition submission deadline has been extended to April 22. Please see for more details.

Image compression is critical to digital photography: without it, a 12 megapixel image would take 36 megabytes of storage, making most websites prohibitively large. While the signal-processing community has significantly improved image compression beyond JPEG (which was introduced in the 1980s) with modern image codecs (e.g., BPG, WebP), many of the techniques used in these modern codecs still use the same family of pixel transforms as are used in JPEG. Multiple recent Google projects improve the field of image compression with end-to-end machine learning, compression through superresolution and creating perceptually improved JPEG images, but we believe that even greater improvements to image compression can be obtained by bringing this research challenge to the attention of the larger machine learning community.

To encourage progress in this field, Google, in collaboration with ETH and Twitter, is sponsoring the Workshop and Challenge on Learned Image Compression (CLIC) at the upcoming 2018 Computer Vision and Pattern Recognition conference (CVPR 2018). The workshop will bring together established contributors to traditional image compression with early contributors to the emerging field of learning-based image compression systems.
Our invited speakers include image and video compression experts Jim Bankoski (Google) and Jens Ohm (RWTH Aachen University), as well as computer vision and machine learning experts with experience in video and image compression, Oren Rippel (WaveOne) and Ramin Zabih (Google, on leave from Cornell).

Training set of 1,633 uncompressed images from both the Mobile and Professional datasets, available on compression.cc

A database of copyright-free, high-quality images will be made available both for this challenge and in an effort to accelerate research in this area: Dataset P (“professional”) and Dataset M (“mobile”). The datasets are collected to be representative of images commonly used in the wild, containing thousands of images. While the challenge will allow participants to train neural networks or other methods on any amount of data (and we expect participants to have access to additional data, such as ImageNet and the Open Images Dataset), it should be possible to train on the datasets provided.

The first large-image compression systems using neural networks were published in 2016 [Toderici2016, Ballé2016] and were only just matching JPEG performance. More recent systems have made rapid advances, to the point that they match or exceed the performance of modern industry-standard image compression [Ballé2017, Theis2017, Agustsson2017, Santurkar2017, Rippel2017]. This rapid advance in the quality of neural-network-based compression systems, based on the work of a comparatively small number of research labs, leads us to expect even more impressive results when the area is explored by a larger portion of the machine-learning community.

We hope to get your help advancing the state-of-the-art in this important application area, and we encourage you to participate if you are planning to attend CVPR this year! Please see for more details about the new datasets and important workshop deadlines. Training data is already available on that site.
The test set will be released on February 15 and the deadline for submitting the compressed versions of the test set is February 22. Edit 17/01/2018: Due to popular request, the CLIC competition submission deadline has been extended to April 22. Please see for more details. [...]
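The storage figure at the start of this post (a 12 megapixel image taking 36 megabytes) follows from simple arithmetic. The sketch below assumes 3 bytes per pixel (8 bits for each of the red, green and blue channels) and 1 megabyte = 1,000,000 bytes; neither assumption is stated in the post, and the function name is made up for illustration.

```python
def uncompressed_size_bytes(megapixels, bytes_per_pixel=3):
    """Size of an uncompressed image, assuming a fixed number of bytes per pixel."""
    pixels = int(megapixels * 1_000_000)
    return pixels * bytes_per_pixel


size_bytes = uncompressed_size_bytes(12)
size_megabytes = size_bytes / 1_000_000
print(f"12 megapixels x 3 bytes per pixel = {size_megabytes:.0f} MB")
```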

Evaluation of Speech for the Google Assistant


Posted by Enrique Alfonseca, Staff Research Scientist, Google Assistant

Voice interactions with technology are becoming a key part of our lives, from asking your phone for traffic conditions to work to using a smart device at home to turn on the lights or play music. The Google Assistant is designed to provide help and information across a variety of platforms, and is built to bring together a number of products, including Google Maps, Search, Google Photos, third party services, and more. For some of these products, we have released specific evaluation guidelines, like Search Quality Rating Guidelines. However, the Google Assistant needs its own guidelines in place, as many of its interactions utilize what is called “eyes-free technology,” when there is no screen as part of the experience.

In the past we have received requests to see our evaluation guidelines from academics who are researching improvements in voice interactions, question answering and voice-guided exploration. To facilitate their evaluations, we are publishing some of the first Google Assistant guidelines. It is our hope that making these guidelines public will help the research community build and evaluate their own systems.

Creating the Guidelines

For many queries, responses are presented on the display (like a phone) with a graph, a table, or an interactive element, like you’d see for [weather this weekend]. But spoken responses are very different from display results, as what’s on screen needs to be translated into useful speech. Furthermore, the contents of the voice response are sometimes sourced from the web, and in those cases it’s important to provide the user with a link to the original source. While users looking at their mobile device can click through to read the original web page, an eyes-free solution presents unique challenges.
In order to generate the optimal audio response, we use a combination of explicit linguistic knowledge and deep learning solutions that allow us to keep answers grammatical, fluent and concise.

How do we ensure that we consistently meet user expectations on quality, across all answer types and languages? One of the tools we use to measure that is human evaluation. In these evaluations, we ask raters to make sure that answers are satisfactory across several dimensions:

Information Satisfaction: the content of the answer should meet the information needs of the user.
Length: when a displayed answer is too long, users can quickly scan it visually and locate the relevant information. For voice answers, that is not possible. It is much more important to ensure that we provide a helpful amount of information, neither too much nor too little. Some of our previous work is currently in use for identifying the most relevant fragments of answers.
Formulation: it is much easier to understand a badly formulated written answer than an ungrammatical spoken answer, so more care has to be placed in ensuring grammatical correctness.
Elocution: spoken answers must have proper pronunciation and prosody. Improvements in text-to-speech generation, such as WaveNet and Tacotron 2, are quickly reducing the gap with human performance.

The current version of the guidelines can be found here. Of course, guidelines are often updated, and these are just a snapshot of something that is a living, changing, always-work-in-progress evaluation! [...]

Tacotron 2: Generating Human-like Speech from Text


Posted by Jonathan Shen and Ruoming Pang, Software Engineers, on behalf of the Google Brain and Machine Perception Teams

Generating very natural sounding speech from text (text-to-speech, TTS) has been a research goal for decades. There has been great progress in TTS research over the last few years and many individual pieces of a complete TTS system have greatly improved. Incorporating ideas from past work such as Tacotron and WaveNet, we added more improvements to end up with our new system, Tacotron 2. Our approach does not use complex linguistic and acoustic features as input. Instead, we generate human-like speech from text using neural networks trained using only speech examples and corresponding text transcripts.

A full description of our new system can be found in our paper “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.” In a nutshell it works like this: We use a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only pronunciation of words, but also various subtleties of human speech, including volume, speed and intonation. Finally these features are converted to a 24 kHz waveform using a WaveNet-like architecture.

A detailed look at Tacotron 2's model architecture. The lower half of the image describes the sequence-to-sequence model that maps a sequence of letters to a spectrogram. For technical details, please refer to the paper.

You can listen to some of the Tacotron 2 audio samples that demonstrate the results of our state-of-the-art TTS system. In an evaluation where we asked human listeners to rate the naturalness of the generated speech, we obtained a score that was comparable to that of professional recordings.

While our samples sound great, there are still some difficult problems to be tackled.
For example, our system has difficulties pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises. Also, our system cannot yet generate audio in realtime. Furthermore, we cannot yet control the generated speech, such as directing it to sound happy or sad. Each of these is an interesting research problem on its own.

Acknowledgements

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu, Sound Understanding team, TTS Research team, and TensorFlow team. [...]
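The numbers quoted in this post (80-dimensional spectrogram frames every 12.5 milliseconds, converted to a 24 kHz waveform) imply how much work the final stage has to do per frame. The short calculation below uses only those three figures; the variable names are illustrative and nothing here comes from the Tacotron 2 code.

```python
SAMPLE_RATE_HZ = 24_000      # output waveform rate, from the post
FRAME_PERIOD_MS = 12.5       # one spectrogram frame every 12.5 ms, from the post
FEATURES_PER_FRAME = 80      # spectrogram dimensionality, from the post

frames_per_second = 1000 / FRAME_PERIOD_MS
samples_per_frame = SAMPLE_RATE_HZ * FRAME_PERIOD_MS / 1000
features_per_second = frames_per_second * FEATURES_PER_FRAME

print(f"spectrogram frames per second of audio: {frames_per_second:.0f}")
print(f"waveform samples generated per frame:   {samples_per_frame:.0f}")
print(f"spectrogram values per second of audio: {features_per_second:.0f}")
```

In other words, the waveform stage expands each 80-value frame into several hundred audio samples.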

Introducing NIMA: Neural Image Assessment


Posted by Hossein Talebi, Software Engineer, and Peyman Milanfar, Research Scientist, Machine Perception

Quantification of image quality and aesthetics has been a long-standing problem in image processing and computer vision. While technical quality assessment deals with measuring pixel-level degradations such as noise, blur, compression artifacts, etc., aesthetic assessment captures semantic level characteristics associated with emotions and beauty in images. Recently, deep convolutional neural networks (CNNs) trained with human-labelled data have been used to address the subjective nature of image quality for specific classes of images, such as landscapes. However, these approaches can be limited in their scope, as they typically categorize images into two classes of low and high quality. Our proposed method predicts the distribution of ratings. This leads to a more accurate quality prediction with higher correlation to the ground truth ratings, and is applicable to general images.

In “NIMA: Neural Image Assessment” we introduce a deep CNN that is trained to predict which images a typical user would rate as looking good (technically) or attractive (aesthetically). NIMA relies on the success of state-of-the-art deep object recognition networks, building on their ability to understand general categories of objects despite many variations. Our proposed network can be used not only to score images reliably and with high correlation to human perception, but also for a variety of labor intensive and subjective tasks such as intelligent photo editing, optimizing visual quality for increased user engagement, or minimizing perceived visual errors in an imaging pipeline.

Background

In general, image quality assessment can be categorized into full-reference and no-reference approaches. If a reference “ideal” image is available, full-reference image quality metrics such as PSNR and SSIM can be used.
When a reference image is not available, “blind” (or no-reference) approaches rely on statistical models to predict image quality. The main goal of both approaches is to predict a quality score that correlates well with human perception. In a deep CNN approach to image quality assessment, weights are initialized by training on object classification related datasets (e.g. ImageNet), and then fine-tuned on annotated data for perceptual quality assessment tasks.NIMATypical aesthetic prediction methods categorize images as low/high quality. This is despite the fact that each image in the training data is associated to a histogram of human ratings, rather than a single binary score. A histogram of ratings is an indicator of overall quality of an image, as well as agreements among raters. In our approach, instead of classifying images a low/high score or regressing to the mean score, the NIMA model produces a distribution of ratings for any given image — on a scale of 1 to 10, NIMA assigns likelihoods to each of the possible scores. This is more directly in line with how training data is typically captured, and it turns out to be a better predictor of human preferences when measured against other approaches (more details are available in our paper). Various functions of the NIMA vector score (such as the mean) can then be used to rank photos aesthetically. Some test photos from the large-scale database for Aesthetic Visual Analysis (AVA) dataset, as ranked by NIMA, are shown below. Each AVA photo is scored by an average of 200 people in response to photography contests. After training, the aesthetic ranking of these photos by NIMA closely matches the mean scores given by human raters. We find that NIMA performs equally well on other datasets, with predicted quality scores close to human ratings.Ranking some examples labelled with the “landscape” tag from AVA dataset using NIMA. Predicted NIMA (and ground tr[...]
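As a rough illustration of how a function of the score distribution (here the mean) can rank photos, the following sketch computes the mean and standard deviation of a 10-bucket rating distribution. The function name `rating_stats` and the two example distributions are made up for illustration; they are not output from the NIMA model.

```python
import numpy as np

def rating_stats(probs):
    """Mean and standard deviation of a distribution over ratings 1..N."""
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()              # normalize defensively
    scores = np.arange(1, len(probs) + 1)    # possible ratings: 1, 2, ..., N
    mean = float((scores * probs).sum())
    std = float(np.sqrt((((scores - mean) ** 2) * probs).sum()))
    return mean, std

# Hypothetical predicted distributions for two photos (illustrative values only).
photo_a = [0.0, 0.0, 0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05, 0.0]
photo_b = [0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05, 0.0, 0.0, 0.0]

# Rank photos by mean predicted rating, highest first.
ranked = sorted(
    {"a": photo_a, "b": photo_b}.items(),
    key=lambda item: rating_stats(item[1])[0],
    reverse=True,
)
print([name for name, _ in ranked])  # ['a', 'b']
```

The standard deviation is kept alongside the mean because the spread of the distribution reflects how much raters would be expected to disagree, which a single binary label cannot express.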

Improving End-to-End Models For Speech Recognition


Posted by Tara N. Sainath, Research Scientist, Speech Team and Yonghui Wu, Software Engineer, Google Brain Team

Traditional automatic speech recognition (ASR) systems, used for a variety of voice search applications at Google, comprise an acoustic model (AM), a pronunciation model (PM) and a language model (LM), all of which are independently trained, and often manually designed, on different datasets [1]. AMs take acoustic features and predict a set of subword units, typically context-dependent or context-independent phonemes. Next, a hand-designed lexicon (the PM) maps a sequence of phonemes produced by the acoustic model to words. Finally, the LM assigns probabilities to word sequences. Training independent components creates added complexity and is suboptimal compared to training all components jointly. Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate components jointly as a single system. While these end-to-end models have shown promising results in the literature [2, 3], it is not yet clear if such approaches can improve on current state-of-the-art conventional systems.

Today we are excited to share “State-of-the-art Speech Recognition With Sequence-to-Sequence Models [4],” which describes a new end-to-end model that surpasses the performance of a conventional production system [1]. We show that our end-to-end system achieves a word error rate (WER) of 5.6%, which corresponds to a 16% relative improvement over a strong conventional system which achieves a 6.7% WER. Additionally, the end-to-end model used to output the initial word hypothesis, before any hypothesis rescoring, is 18 times smaller than the conventional model, as it contains no separate LM and PM.

Our system builds on the Listen-Attend-Spell (LAS) end-to-end architecture, first presented in [2]. The LAS architecture consists of 3 components. The listener encoder component, which is similar to a standard AM, takes a time-frequency representation of the input speech signal, x, and uses a set of neural network layers to map the input to a higher-level feature representation, h^enc. The output of the encoder is passed to an attender, which uses h^enc to learn an alignment between input features x and predicted subword units {y_n, …, y_0}, where each subword is typically a grapheme or wordpiece. Finally, the output of the attention module is passed to the speller (i.e., decoder), similar to an LM, that produces a probability distribution over a set of hypothesized words.

Components of the LAS End-to-End Model.

All components of the LAS model are trained jointly as a single end-to-end neural network, instead of as separate modules like conventional systems, making it much simpler. Additionally, because the LAS model is fully neural, there is no need for external, manually designed components such as finite state transducers, a lexicon, or text normalization modules. Finally, unlike conventional models, end-to-end models do not require bootstrapping from decision trees or time alignments generated from a separate system, and can be trained given pairs of text transcripts and the corresponding acoustics.

In [4], we introduce a variety of novel structural improvements, including improving the attention vectors passed to the decoder and training with longer subword units (i.e., wordpieces). In addition, we introduce numerous optimization improvements for training, including the use of minimum word error rate training [5]. These structural and optimization improvements account for the 16% relative improvement over the conventional model.

Another exciting potential application for this research is multi-dialect and multi-lingual systems, where the simplicity of optimizing a single neural [...]
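The 16% figure quoted above follows directly from the two reported WERs. As a small worked example, the sketch below computes that relative improvement and also shows the standard word-level edit-distance definition of WER. The function names and the sample sentences are illustrative assumptions; this is not the evaluation code used in the paper.

```python
def relative_improvement(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]                            # i deletions to reach an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# WERs reported in the post: 6.7% (conventional) and 5.6% (end-to-end).
print(f"{relative_improvement(6.7, 5.6):.1%}")  # 16.4%
```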

A Summary of the First Conference on Robot Learning


Posted by Vincent Vanhoucke, Principal Scientist, Google Brain Team and Melanie Saldaña, Program Manager, University Relations

Whether in the form of autonomous vehicles, home assistants or disaster rescue units, robotic systems of the future will need to be able to operate safely and effectively in human-centric environments. In contrast to their industrial counterparts, they will require a very high level of perceptual awareness of the world around them, and will need to adapt to continuous changes in both their goals and their environment. Machine learning is a natural answer to both the problems of perception and generalization to unseen environments, and with the recent rapid progress in computer vision and learning capabilities, applying these new technologies to the field of robotics is becoming a central research question.

This past November, Google helped kickstart and host the first Conference on Robot Learning (CoRL) at our campus in Mountain View. The goal of CoRL was to bring machine learning and robotics experts together for the first time in a single-track conference, in order to foster new research avenues between the two disciplines. The sold-out conference attracted 350 researchers from many institutions worldwide, who collectively presented 74 original papers, along with 5 keynotes by some of the most innovative researchers in the field.

Prof. Sergey Levine, CoRL 2017 co-chair, answering audience questions.

Sayna Ebrahimi (UC Berkeley) presenting her research.

Videos of the inaugural CoRL are available on the conference website. Additionally, we are delighted to announce that next year, CoRL moves to Europe! CoRL 2018 will be chaired by Professor Aude Billard from the École Polytechnique Fédérale de Lausanne, and will tentatively be held in the Eidgenössische Technische Hochschule (ETH) in Zürich on October 29th-31st, 2018. Looking forward to seeing you there!

Prof. Ken Goldberg, CoRL 2017 co-chair, and Jeffrey Mahler (UC Berkeley) during a break. [...]

TFGAN: A Lightweight Library for Generative Adversarial Networks


Posted by Joel Shor, Senior Software Engineer, Machine Perception

(Crossposted on the Google Open Source Blog)

Training a neural network usually involves defining a loss function, which tells the network how close or far it is from its objective. For example, image classification networks are often given a loss function that penalizes them for giving wrong classifications; a network that mislabels a dog picture as a cat will get a high loss. However, not all problems have easily-defined loss functions, especially if they involve human perception, such as image compression or text-to-speech systems. Generative Adversarial Networks (GANs), a machine learning technique that has led to improvements in a wide range of applications including generating images from text, superresolution, and helping robots learn to grasp, offer a solution. However, GANs introduce new theoretical and software engineering challenges, and it can be difficult to keep up with the rapid pace of GAN research.

A video of a generator improving over time. It begins by producing random noise, and eventually learns to generate MNIST digits.

In order to make GANs easier to experiment with, we’ve open sourced TFGAN, a lightweight library designed to make it easy to train and evaluate GANs. It provides the infrastructure to easily train a GAN, provides well-tested losses and evaluation metrics, and gives easy-to-use examples that highlight the expressiveness and flexibility of TFGAN. We’ve also released a tutorial that includes a high-level API to quickly get a model trained on your data.

This demonstrates the effect of an adversarial loss on image compression. The top row shows image patches from the ImageNet dataset. The middle row shows the results of compressing and uncompressing an image through an image compression neural network trained on a traditional loss. The bottom row shows the results from a network trained with a traditional loss and an adversarial loss. The GAN-loss images are sharper and more detailed, even if they are less like the original.

TFGAN supports experiments in a few important ways. It provides simple function calls that cover the majority of GAN use-cases so you can get a model running on your data in just a few lines of code, but is built in a modular way to cover more exotic GAN designs as well. You can use just the modules you want: loss, evaluation, features, training, etc. are all independent. TFGAN’s lightweight design also means you can use it alongside other frameworks, or with native TensorFlow code. GAN models written using TFGAN will easily benefit from future infrastructure improvements, and you can select from a large number of already-implemented losses and features without having to rewrite your own. Lastly, the code is well-tested, so you don’t have to worry about numerical or statistical mistakes that are easily made with GAN libraries.

Most neural text-to-speech (TTS) systems produce over-smoothed spectrograms. When applied to the Tacotron TTS system, a GAN can recreate some of the realistic texture, which reduces artifacts in the resulting audio.

When you use TFGAN, you’ll be using the same infrastructure that many Google researchers use, and you’ll have access to the cutting-edge improvements that we develop with the library. Anyone can contribute to the GitHub repositories, which we hope will facilitate code-sharing among ML researchers and users. [...]

Introducing Appsperiments: Exploring the Potentials of Mobile Photography


Posted by Alex Kauffmann, Interaction Researcher, Google Research

Each of the world's approximately two billion smartphone owners is carrying a camera capable of capturing photos and video of a tonal richness and quality unimaginable even five years ago. Until recently, those cameras behaved mostly as optical sensors, capturing light and operating on the resulting image's pixels. The next generation of cameras, however, will have the capability to blend hardware and computer vision algorithms that also operate on an image's semantic content, enabling radically new creative mobile photo and video applications.

Today, we're launching the first installment of a series of photography appsperiments: usable and useful mobile photography experiences built on experimental technology. Our "appsperimental" approach was inspired in part by Motion Stills, an app developed by researchers at Google that converts short videos into cinemagraphs and time lapses using experimental stabilization and rendering technologies. Our appsperiments replicate this approach by building on other technologies in development at Google. They rely on object recognition, person segmentation, stylization algorithms, efficient image encoding and decoding technologies, and perhaps most importantly, fun!

Storyboard

Storyboard (Android) transforms your videos into single-page comic layouts, entirely on device. Simply shoot a video and load it in Storyboard. The app automatically selects interesting video frames, lays them out, and applies one of six visual styles. Save the comic or pull down to refresh and instantly produce a new one. There are approximately 1.6 trillion different possibilities!

Selfissimo!

Selfissimo! (iOS, Android) is an automated selfie photographer that snaps a stylish black and white photo each time you pose. Tap the screen to start a photoshoot. The app encourages you to pose and captures a photo whenever you stop moving. Tap the screen to end the session and review the resulting contact sheet, saving individual images or the entire shoot.

Scrubbies

Scrubbies (iOS) lets you easily manipulate the speed and direction of video playback to produce delightful video loops that highlight actions, capture funny faces, and replay moments. Shoot a video in the app and then remix it by scratching it like a DJ. Scrubbing with one finger plays the video. Scrubbing with two fingers captures the playback so you can save or share it.

Try them out and tell us what you think using the in-app feedback links. The feedback and ideas we get from the new and creative ways people use our appsperiments will help guide some of the technology we develop next.

Acknowledgements

These appsperiments represent a collaboration across many teams at Google. We would like to thank the core contributors Andy Dahley, Ashley Ma, Dexter Allen, Ignacio Garcia Dorado, Madison Le, Mark Bowers, Pascal Getreuer, Robin Debreuil, Suhong Jin, and William Lindmeier. We also wish to give special thanks to Buck Bourdon, Hossein Talebi, Kanstantsin Sokal, Karthik Raveendran, Matthias Grundmann, Peyman Milanfar, Suril Shah, Tomas Izo, Tyler Mullen, and Zheng Sun. [...]

Introducing a New Foveation Pipeline for Virtual/Mixed Reality


Posted by Behnam Bastani, Software Engineer Manager and Eric Turner, Software Engineer, Daydream

Virtual Reality (VR) and Mixed Reality (MR) offer a novel way to immerse people into new and compelling experiences, from gaming to professional training. However, current VR/MR technologies present a fundamental challenge: presenting images at the extremely high resolution required for immersion places enormous demands on the rendering engine and transmission process. Headsets often have insufficient display resolution, which can limit the field of view, worsening the experience. But, to drive a higher resolution headset, the traditional rendering pipeline requires significant processing power that even high-end mobile processors cannot achieve. As research continues to deliver promising new techniques to increase display resolution, the challenges of driving those displays will continue to grow.

In order to further improve the visual experience in VR and MR, we introduce a pipeline that takes advantage of the characteristics of human visual perception to enable an amazing visual experience at low compute and power cost. The pipeline proposed in this article considers the full system dependency, including the rendering engine, memory bandwidth and the capability of the display module itself. We determined that the current limitation is not just in the content creation, but may also be in transmitting data, handling latency and enabling interaction with real objects (mixed reality applications). The pipeline consists of:

1. Foveated Rendering, with a focus on reducing compute per pixel.
2. Foveated Image Processing, with a focus on the reduction of visual artifacts.
3. Foveated Transmission, with a focus on bits per pixel transmitted.

Foveated Rendering

In the human visual system, the fovea centralis allows us to see at high fidelity in the center of our vision, allowing our brain to pay less attention to things in our peripheral vision. Foveated rendering takes advantage of this characteristic to improve the performance of the rendering engine by reducing the spatial or bit-depth resolution of objects in our peripheral vision. To make this work, the location of the High Acuity (HA) region needs to be updated with eye-tracking to align with eye saccades, which preserves the perception of a constant high resolution across the field of view. In contrast, systems with no eye-tracking may need to render a much larger HA region.

The left image is rendered at full resolution. The right image uses two layers of foveation: one rendered at high resolution (inside the yellow region) and one at lower resolution (outside).

A traditional foveation technique may divide a frame buffer into multiple spatial resolution regions. Aliasing introduced by rendering to lower spatial resolution may cause perceptible temporal artifacts when there is motion in the content due to head motion or animation. Below we show an example of temporal artifacts introduced by head rotation.

A smooth full rendering (image on the left). The image on the right shows temporal artifacts introduced by motion in foveated region.

In the following sections, we present two different methods aimed at reducing these artifacts: Phase-Aligned Foveated Rendering and Conformal Foveated Rendering. Each of these methods provides different benefits for visual quality during rendering and is useful under different conditions.

Phase-Aligned Rendering

Aliasing occurs in the Low-Acuity (LA) region during foveated rendering due to the subsampling of rendered content. In the traditional foveated rendering discussed above, these aliasing artifacts flicker from frame to frame, since the display pixel grid moves across the virtual scene as the use[...]

DeepVariant: Highly Accurate Genomes With Deep Neural Networks


Posted by Mark DePristo and Ryan Poplin, Google Brain Team

(Crossposted on the Google Open Source Blog)

Across many scientific disciplines, but in particular in the field of genomics, major breakthroughs have often resulted from new technologies. From Sanger sequencing, which made it possible to sequence the human genome, to the microarray technologies that enabled the first large-scale genome-wide experiments, new instruments and tools have allowed us to look ever more deeply into the genome and apply the results broadly to health, agriculture and ecology. One of the most transformative new technologies in genomics was high-throughput sequencing (HTS), which first became commercially available in the early 2000s. HTS allowed scientists and clinicians to produce sequencing data quickly, cheaply, and at scale. However, the output of HTS instruments is not the genome sequence for the individual being analyzed (for humans, this is 3 billion paired bases of guanine, cytosine, adenine and thymine, organized into 23 pairs of chromosomes). Instead, these instruments generate ~1 billion short sequences, known as reads. Each read represents just 100 of the 3 billion bases, and per-base error rates range from 0.1-10%. Processing the HTS output into a single, accurate and complete genome sequence is a major outstanding challenge. The importance of this problem, for biomedical applications in particular, has motivated efforts such as the Genome in a Bottle Consortium (GIAB), which produces high confidence human reference genomes that can be used for validation and benchmarking, as well as the precisionFDA community challenges, which are designed to foster innovation that will improve the quality and accuracy of HTS-based genomic tests.

For any given location in the genome, there are multiple reads among the ~1 billion that include a base at that position. Each read is aligned to a reference, and then each of the bases in the read is compared to the base of the reference at that location. When a read includes a base that differs from the reference, it may indicate a variant (a difference in the true sequence), or it may be an error.

Today, we announce the open source release of DeepVariant, a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods. This work is the product of more than two years of research by the Google Brain team, in collaboration with Verily Life Sciences. DeepVariant transforms the task of variant calling, as this reconstruction problem is known in genomics, into an image classification problem well-suited to Google's existing technology and expertise.

Each of the four images above is a visualization of actual sequencer reads aligned to a reference genome. A key question is how to use the reads to determine whether there is a variant on both chromosomes, on just one chromosome, or on neither chromosome. There is more than one type of variant, with SNPs and insertions/deletions being the most common. A: a true SNP on one chromosome pair, B: a deletion on one chromosome, C: a deletion on both chromosomes, D: a false variant caused by errors. It's easy to see that these look quite distinct when visualized in this manner.

We started with GIAB reference genomes, for which there is high-quality ground truth (or the closest approximation currently possible). Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produc[...]
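To make the read-versus-reference comparison described above concrete, here is a toy sketch that piles aligned reads onto a short reference and reports positions where some reads disagree with it. It assumes gap-free alignments given as (start, sequence) pairs, and the reference string, reads and function name are invented for illustration. It is not DeepVariant's method, which encodes the pileup as a multi-channel tensor and classifies it with a neural network.

```python
from collections import Counter

def pileup_mismatches(reference, reads):
    """Return {position: (mismatching_reads, depth)} for positions with mismatches.

    reads is a list of (start, sequence) pairs, assumed already aligned
    to the reference without insertions or deletions.
    """
    counts = [Counter() for _ in reference]   # base counts observed at each position
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                counts[pos][base] += 1

    report = {}
    for pos, base_counts in enumerate(counts):
        depth = sum(base_counts.values())
        mismatches = depth - base_counts[reference[pos]]
        if mismatches:
            report[pos] = (mismatches, depth)
    return report

# Hypothetical data: two of the four reads covering position 4 carry T instead of A.
reference = "ACGTACGT"
reads = [(0, "ACGTA"), (2, "GTTCG"), (3, "TTCGT"), (1, "CGTAC")]
print(pileup_mismatches(reference, reads))  # {4: (2, 4)}
```

A mismatch seen in about half of the reads at a position is consistent with a variant on one chromosome, while a mismatch in a single read out of many is more likely a sequencing error; deciding between such cases is the classification task the post describes.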

Google at NIPS 2017


Posted by Christian Howard, Editor-in-Chief, Research Communications

This week, Long Beach, California hosts the 31st annual Conference on Neural Information Processing Systems (NIPS 2017), a machine learning and computational neuroscience conference that includes invited talks, demonstrations and presentations of some of the latest in machine learning research. Google will have a strong presence at NIPS 2017, with over 450 Googlers attending to contribute to, and learn from, the broader academic research community via technical talks and posters, workshops, competitions and tutorials.

Google is at the forefront of machine learning, actively exploring virtually all aspects of the field from classical algorithms to deep learning and more. Focusing on both theory and application, much of our work on language understanding, speech, translation, visual processing, and prediction relies on state-of-the-art techniques that push the boundaries of what is possible. In all of those tasks and many others, we develop learning approaches to understand and generalize, providing us with new ways of looking at old problems and helping transform how we work and live.

If you are attending NIPS 2017, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving interesting problems for billions of people, and to see demonstrations of some of the exciting research we pursue. You can also learn more about our work being presented in the list below.

Google is a Platinum Sponsor of NIPS 2017.

Organizing Committee

Program Chair: Samy Bengio
Senior Area Chairs include: Corinna Cortes, Dale Schuurmans, Hugo Larochelle
Area Chairs include: Afshin Rostamizadeh, Amir Globerson, Been Kim, D. Sculley, Dumitru Erhan, Gal Chechik, Hartmut Neven, Honglak Lee, Ian Goodfellow, Jasper Snoek, John Wright, Jon Shlens, Lihong Li, Maya Gupta, Moritz Hardt, Navdeep Jaitly, Ryan Adams, Sally Goldman, Sanjiv Kumar, Surya Ganguli, Tara Sainath, Umar Syed, Viren Jain, Vitaly Kuznetsov

Invited Talk

Powering the next 100 years
John Platt

Accepted Papers

A Meta-Learning Perspective on Cold-Start Recommendations for Items
Manasi Vartak, Hugo Larochelle, Arvind Thiagarajan

AdaGAN: Boosting Generative Models
Ilya Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, Bernhard Schölkopf

Deep Lattice Networks and Partial Monotonic Functions
Seungil You, David Ding, Kevin Canini, Jan Pfeifer, Maya Gupta

From which world is your graph
Cheng Li, Varun Kanade, Felix MF Wong, Zhenming Liu

Hiding Images in Plain Sight: Deep Steganography
Shumeet Baluja

Improved Graph Laplacian via Geometric Self-Consistency
Dominique Joncas, Marina Meila, James McQueen

Model-Powered Conditional Independence Test
Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros Dimakis, Sanjay Shakkottai

Nonlinear random matrix theory for deep learning
Jeffrey Pennington, Pratik Worah

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice
Jeffrey Pennington, Samuel Schoenholz, Surya Ganguli

SGD Learns the Conjugate Kernel Class of the Network
Amit Daniely

SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability
Maithra Raghu, Justin Gilmer, Jason Yosinski, Jascha Sohl-Dickstein

Learning Hierarchical Information Flow with Recurrent Neural Modules
Danijar Hafner, Alexander Irpan, James Davidson, Nicolas Heess

Online Learning with Transductive Regret
Scott Yang, Mehryar Moh[...]

Understanding Bias in Peer Review


Posted by Andrew Tomkins, Director of Engineering and William D. Heavlin, Statistician, Google Research

In the 1600s, a series of practices came into being known collectively as the “scientific method.” These practices encoded verifiable experimentation as a path to establishing scientific fact. Scientific literature arose as a mechanism to validate and disseminate findings, and standards of scientific peer review developed as a means to control the quality of entrants into this literature. Over the course of development of peer review, one key structural question remains unresolved to the current day: should the reviewers of a piece of scientific work be made aware of the identity of the authors? Those in favor argue that such additional knowledge may allow the reviewer to set the work in perspective and evaluate it more completely. Those opposed argue instead that the reviewer may form an opinion based on past performance rather than the merit of the work at hand.

Existing academic literature on this subject describes specific forms of bias that may arise when reviewers are aware of the authors. In 1968, Merton proposed the Matthew effect, whereby credit goes to the best established researchers. More recently, Knobloch-Westerwick et al. proposed a Matilda effect, whereby papers from male-first authors were considered to have greater scientific merit than those from female-first authors. But with the exception of one classical study performed by Rebecca Blank in 1991 at the American Economic Review, there have been few controlled experimental studies of such effects on reviews of academic papers.

Last year we had the opportunity to explore this question experimentally, resulting in “Reviewer bias in single- versus double-blind peer review,” a paper that just appeared in the Proceedings of the National Academy of Sciences. Working with Professor Min Zhang of Tsinghua University, we performed an experiment during the peer review process of the 10th ACM Web Search and Data Mining Conference (WSDM 2017) to compare the behavior of reviewers under single-blind and double-blind review. Our experiment ran as follows:

1. We invited a number of experts to join the conference Program Committee (PC).
2. We randomly split these PC members into a single-blind cadre and a double-blind cadre.
3. We asked all PC members to “bid” for papers they were qualified to review, but only the single-blind cadre had access to the names and institutions of the paper authors.
4. Based on the resulting bids, we then allocated two single-blind and two double-blind PC members to each paper.
5. Each PC member read his or her assigned papers and entered reviews, again with only single-blind PC members able to see the authors and institutions.

At this point, we closed our experiment and performed the remainder of the conference reviewing process under the single-blind model. As a result, we were able to assess the difference in bidding and reviewing behavior of single-blind and double-blind PC members on the same papers. We discovered a number of surprises.

Our first finding shows that compared to their double-blind counterparts, single-blind PC members tend to enter higher scores for papers from top institutions (the finding holds for both universities and companies) and for papers written by well-known authors. This suggests that a paper authored by an up-and-coming researcher might be reviewed more negatively (by a single-blind PC member) than exactly the same paper written by an established star of the field.

Digging a little deeper, we show some additional findings related to the “bidding process,” in which PC memb[...]
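The random split and the two-plus-two allocation in the experiment design above can be sketched as follows. This is a simplified illustration: the member and paper names are hypothetical, and reviewers are drawn at random here, whereas the actual experiment allocated reviewers based on their bids.

```python
import random

def split_cadres(members, seed=0):
    """Randomly split PC members into single-blind and double-blind cadres."""
    rng = random.Random(seed)
    shuffled = list(members)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def assign_reviewers(papers, single_blind, double_blind, seed=0):
    """Give every paper two reviewers from each cadre (random, ignoring bids)."""
    rng = random.Random(seed)
    return {
        paper: {
            "single_blind": rng.sample(single_blind, 2),
            "double_blind": rng.sample(double_blind, 2),
        }
        for paper in papers
    }

# Hypothetical committee and submissions.
members = [f"pc_member_{i}" for i in range(10)]
papers = ["paper_A", "paper_B", "paper_C"]

single_blind, double_blind = split_cadres(members)
assignment = assign_reviewers(papers, single_blind, double_blind)
```

Because every paper is read by reviewers from both cadres, any systematic difference in scores between the cadres can be attributed to knowledge of the authors rather than to differences between papers.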

Interpreting Deep Neural Networks with SVCCA


Posted by Maithra Raghu, Google Brain Team

Deep Neural Networks (DNNs) have driven unprecedented advances in areas such as vision, language understanding and speech recognition. But these successes also bring new challenges. In particular, contrary to many previous machine learning methods, DNNs can be susceptible to adversarial examples in classification, catastrophic forgetting of tasks in reinforcement learning, and mode collapse in generative modelling. In order to build better and more robust DNN-based systems, it is critically important to be able to interpret these models. In particular, we would like a notion of representational similarity for DNNs: can we effectively determine when the representations learned by two neural networks are the same?

In our paper, “SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability,” we introduce a simple and scalable method to address these points. Two specific applications of this that we look at are comparing the representations learned by different networks, and interpreting representations learned by hidden layers in DNNs. Furthermore, we are open sourcing the code so that the research community can experiment with this method.

Key to our setup is the interpretation of each neuron in a DNN as an activation vector. As shown in the figure below, the activation vector of a neuron is the vector of scalar outputs it produces on the input data. For example, for 50 input images, a neuron in a DNN will output 50 scalar values, encoding how much it responds to each input. These 50 scalar values then make up an activation vector for the neuron. (Of course, in practice, we take many more than 50 inputs.)

Here a DNN is given three inputs, x1, x2, x3. Looking at a neuron inside the DNN (bolded in red, right pane), this neuron produces a scalar output zi corresponding to each input xi. These values form the activation vector of the neuron.

With this basic observation and a little more formulation, we introduce Singular Vector Canonical Correlation Analysis (SVCCA), a technique for taking in two sets of neurons and outputting aligned feature maps learned by both of them. Critically, this technique accounts for superficial differences such as permutations in neuron orderings (crucial for comparing different networks), and can detect similarities where other, more straightforward comparisons fail.

As an example, consider training two convolutional neural nets (net1 and net2, below) on CIFAR-10, a medium scale image classification task. To visualize the results of our method, we compare activation vectors of neurons with the aligned features output by SVCCA. Recall that the activation vector of a neuron is its raw scalar outputs on the input images. The x-axis of the plot consists of images sorted by class (gray dotted lines showing class boundaries), and the y-axis shows the output value of the neuron.

On the left pane, we show the two highest-activation (largest Euclidean norm) neurons in net1 and net2. Examining the highest-activation neurons has been a popular method to interpret DNNs in computer vision, but in this case, the highest-activation neurons in net1 and net2 have no clear correspondence, despite both being trained on the same task. However, after applying SVCCA (right pane), we see that the latent representations learned by both networks do indeed share some very similar features. Note that the top two rows representing aligned feature maps are close to identical, as are the second highest aligned feature maps (bottom two rows). Furthermore, these aligned mappings in the right pane also show a clear co[...]
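The two-step idea named by SVCCA (a singular value decomposition to keep the dominant directions of each layer's activation vectors, followed by canonical correlation analysis between the two reduced subspaces) can be sketched with NumPy as below. This is a minimal sketch under stated assumptions, not the authors' open-sourced code: the function name `svcca`, the 99% variance threshold and the random test data are all illustrative choices.

```python
import numpy as np

def svcca(acts1, acts2, keep=0.99):
    """Canonical correlations between two sets of activation vectors.

    acts1, acts2: arrays of shape (neurons, datapoints); each row is the
    activation vector of one neuron over the same inputs.
    """
    def top_subspace(acts):
        centered = acts - acts.mean(axis=1, keepdims=True)
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        energy = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = int(np.searchsorted(energy, keep)) + 1   # directions explaining `keep` of variance
        return vt[:k]                                # orthonormal basis over datapoints

    q1 = top_subspace(acts1)
    q2 = top_subspace(acts2)
    # With orthonormal bases, the canonical correlations are the
    # singular values of the cross-product of the two bases.
    corr = np.linalg.svd(q1 @ q2.T, compute_uv=False)
    return np.clip(corr, 0.0, 1.0)

rng = np.random.default_rng(0)
layer = rng.normal(size=(5, 200))        # 5 hypothetical neurons, 200 inputs
permuted = layer[::-1]                   # same neurons in a different order
unrelated = rng.normal(size=(5, 200))    # independent random activations

print(svcca(layer, permuted).round(3))   # all correlations are close to 1
print(svcca(layer, unrelated).round(3))  # correlations are much lower
```

The permuted case shows the property highlighted in the post: reordering neurons leaves the spanned subspace unchanged, so the correlations stay at 1 even though a neuron-by-neuron comparison would find no match.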

Understanding Medical Conversations


Posted by Katherine Chou, Product Manager and Chung-Cheng Chiu, Software Engineer, Google Brain Team

Good documentation helps create good clinical care by communicating a doctor's thinking, their concerns, and their plans to the rest of the team. Unfortunately, physicians routinely spend more time doing documentation than doing what they love most: caring for patients. Doctors often spend ~6 hours in an 11-hour workday in the Electronic Health Records (EHR) on documentation [1]. Consequently, one study found that more than half of surveyed doctors report at least one symptom of burnout [2].

In order to help offload note-taking, many doctors have started using medical scribes as a part of their workflow. These scribes listen to the patient-doctor conversations and create notes for the EHR. According to a recent study, introducing scribes not only improved physician satisfaction, but also medical chart quality and accuracy [3]. But the number of doctor-patient conversations that need a scribe is far beyond the capacity of people who are available for medical scribing. We wondered: could the voice recognition technologies already available in Google Assistant, Google Home, and Google Translate be used to document patient-doctor conversations and help doctors and scribes summarize notes more quickly?

In “Speech Recognition for Medical Conversations”, we show that it is possible to build Automatic Speech Recognition (ASR) models for transcribing medical conversations.
While most of the current ASR solutions in the medical domain focus on transcribing doctor dictations (i.e., single-speaker speech consisting of predictable medical terminology), our research shows that it is possible to build an ASR model which can handle multiple-speaker conversations covering everything from weather to complex medical diagnosis.

Using this technology, we will start working with physicians and researchers at Stanford University, who have done extensive research on how scribes can improve physician satisfaction, to understand how deep learning techniques such as ASR can facilitate the scribing process of physician notes. In our pilot study, we investigate what types of clinically relevant information can be extracted from medical conversations to assist physicians in reducing their interactions with the EHR. The study is fully patient-consented and the content of the recording will be de-identified to protect patient privacy.

We hope these technologies will not only help return joy to practice by helping doctors and scribes with their everyday workload, but also help the patients get more dedicated and thorough medical attention, ideally leading to better care. [...]

SLING: A Natural Language Frame Semantic Parser


Posted by Michael Ringgaard, Software Engineer and Rahul Gupta, Research Scientist

Until recently, most practical natural language understanding (NLU) systems used a pipeline of analysis stages, from part-of-speech tagging and dependency parsing to steps that computed a semantic representation of the input text. While this facilitated easy modularization of different analysis stages, errors in earlier stages would have cascading effects in later stages and the final representation, and the intermediate stage outputs might not be relevant on their own. For example, a typical pipeline might perform the task of dependency parsing in an early stage and the task of coreference resolution towards the end. If one were only interested in the output of coreference resolution, it would be affected by cascading effects of any errors during dependency parsing.

Today we are announcing SLING, an experimental system for parsing natural language text directly into a representation of its meaning as a semantic frame graph. The output frame graph directly captures the semantic annotations of interest to the user, while avoiding the pitfalls of pipelined systems by not running any intermediate stages, additionally preventing unnecessary computation. SLING uses a special-purpose recurrent neural network model to compute the output representation of input text through incremental editing operations on the frame graph. The frame graph, in turn, is flexible enough to capture many semantic tasks of interest (more on this below). SLING's parser is trained using only the input words, bypassing the need for producing any intermediate annotations (e.g. dependency parses).

SLING provides fast parsing at inference time by providing (a) an efficient and scalable frame store implementation and (b) a JIT compiler that generates efficient code to execute the recurrent neural network.
Although SLING is experimental, it achieves a parsing speed of >2,500 tokens/second on a desktop CPU, thanks to its efficient frame store and neural network compiler. SLING is implemented in C++ and it is available for download on GitHub. The entire system is described in detail in a technical report as well.

Frame Semantic Parsing

Frame Semantics [1] represents the meaning of text, such as a sentence, as a set of formal statements. Each formal statement is called a frame, which can be seen as a unit of knowledge or meaning, that also contains interactions with concepts or other frames typically associated with it. SLING organizes each frame as a list of slots, where each slot has a name (role) and a value which could be a literal or a link to another frame. As an example, consider the sentence:

“Many people now claim to have predicted Black Monday.”

The figure below illustrates SLING recognizing mentions of entities (e.g. people, places, or events), measurements (e.g. dates or distances), and other concepts (e.g. verbs), and placing them in the correct semantic roles for the verbs in the input. The word predicted evokes the most dominant sense of the verb "predict", denoted as a PREDICT-01 frame. This frame also has interactions (slots) with who made the prediction (denoted via the ARG0 slot, which points to the PERSON frame for people) and what was being predicted (denoted via ARG1, which links to the EVENT frame for Black Monday). Frame semantic parsing is the task of producing a directed graph of such frames linked through slots.

Although the example above is fairly simple, frame graphs are powerful enough to model a variety of complex seman[...]
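The "frame as a list of slots" idea can be sketched in a few lines of Python. This is only an illustration of the data structure described in the post, not SLING's actual API (SLING itself is written in C++): the `Frame` class and its `add` and `get` methods are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class Frame:
    """A hypothetical frame: a type plus an ordered list of (role, value) slots."""
    type: str
    slots: List[Tuple[str, Union[str, "Frame"]]] = field(default_factory=list)

    def add(self, role: str, value: Union[str, "Frame"]) -> "Frame":
        # A slot value is either a literal or a link to another frame.
        self.slots.append((role, value))
        return self

    def get(self, role: str) -> Optional[Union[str, "Frame"]]:
        # Return the value of the first slot with this role, if any.
        for name, value in self.slots:
            if name == role:
                return value
        return None


# Frame graph for: "Many people now claim to have predicted Black Monday."
people = Frame("PERSON")
black_monday = Frame("EVENT")
predict = Frame("PREDICT-01").add("ARG0", people).add("ARG1", black_monday)

for role, value in predict.slots:
    print(role, "->", value.type)
```

Linking frames through slot values is what turns the individual frames into the directed graph that the parser is asked to produce.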

On-Device Conversational Modeling with TensorFlow Lite


Posted by Sujith Ravi, Research Scientist, Google Expander Team

Earlier this year, we launched Android Wear 2.0, which featured the first "on-device" machine learning technology for smart messaging. This enabled cloud-based technologies like Smart Reply, previously available in Gmail, Inbox and Allo, to be used directly within any application for the first time, including third-party messaging apps, without ever having to connect to the cloud. So you can respond to incoming chat messages on the go, directly from your smartwatch.

Today, we announce TensorFlow Lite, TensorFlow’s lightweight solution for mobile and embedded devices. This framework is optimized for low-latency inference of machine learning models, with a focus on small memory footprint and fast performance. As part of the library, we have also released an on-device conversational model and a demo app that provides an example of a natural language application powered by TensorFlow Lite, in order to make it easier for developers and researchers to build new machine intelligence features powered by on-device inference. This model generates reply suggestions to input conversational chat messages, with efficient inference that can be easily plugged in to your chat application to power on-device conversational intelligence.

The on-device conversational model we have released uses a new ML architecture for training compact neural networks (as well as other machine learning models) based on a joint optimization framework, originally presented in ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections. This architecture can run efficiently on mobile devices with limited computing power and memory, by using efficient “projection” operations that transform any input to a compact bit vector representation; similar inputs are projected to nearby vectors that are dense or sparse depending on the type of projection.
For example, the messages “hey, how's it going?” and “How's it going buddy?” might be projected to the same vector representation.

Using this idea, the conversational model combines these efficient operations with a low computation and memory footprint. We trained this on-device model end-to-end using an ML framework that jointly trains two types of models: a compact projection model (as described above) combined with a trainer model. The two models are trained in a joint fashion, where the projection model learns from the trainer model. The trainer is characteristic of an expert and modeled using larger and more complex ML architectures, whereas the projection model resembles a student that learns from the expert. During training, we can also stack other techniques such as quantization or distillation to achieve further compression or selectively optimize certain portions of the objective function. Once trained, the smaller projection model can be used directly for inference on device.

For inference, the trained projection model is compiled into a set of TensorFlow Lite operations that have been optimized for fast execution on mobile platforms and executed directly on device. The TensorFlow Lite inference graph for the on-device conversational model is shown here.

TensorFlow Lite execution for the On-Device Conversational Model.

The open-source conversational model released today (along with code) was trained end-to-end using the joint ML architecture described above. Today’s release also includes a demo app, so you can easily download and try out one-touch smart replies on y[...]
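As a rough illustration of how a projection can send similar messages to nearby bit vectors, the sketch below uses a generic random-hyperplane hash over character trigrams. This is a stand-in technique, not the released model's projection operations: the bucket count, bit length, trigram features, and all function names are assumptions made for the example. In this sketch the two greetings land on nearby, not necessarily identical, bit vectors.

```python
import random
import zlib

NUM_BUCKETS = 512   # size of the hashed feature space (assumed)
NUM_BITS = 256      # length of the output bit vector (assumed)

_rng = random.Random(0)
# One random hyperplane per output bit.
HYPERPLANES = [[_rng.gauss(0.0, 1.0) for _ in range(NUM_BUCKETS)]
               for _ in range(NUM_BITS)]


def trigram_counts(text):
    """Hash character trigrams of the lowercased text into count buckets."""
    text = text.lower()
    counts = {}
    for i in range(len(text) - 2):
        bucket = zlib.crc32(text[i:i + 3].encode("utf-8")) % NUM_BUCKETS
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


def project(text):
    """Map text to a tuple of NUM_BITS bits: the side of each hyperplane it falls on."""
    counts = trigram_counts(text)
    return tuple(
        1 if sum(plane[b] * c for b, c in counts.items()) >= 0 else 0
        for plane in HYPERPLANES
    )


def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


a = project("hey, how's it going?")
b = project("How's it going buddy?")
c = project("The quarterly report is due on Friday.")

print("similar messages  :", hamming(a, b), "of", NUM_BITS, "bits differ")
print("unrelated messages:", hamming(a, c), "of", NUM_BITS, "bits differ")
```

The projection itself needs no learned vocabulary or embedding table, which is one reason this family of techniques suits devices with little memory.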

Fused Video Stabilization on the Pixel 2 and Pixel 2 XL


Posted by Chia-Kai Liang, Senior Staff Software Engineer and Fuhao Shi, Android Camera Team

One of the most important aspects of current smartphones is easily capturing and sharing videos. With the Pixel 2 and Pixel 2 XL smartphones, the videos you capture are smoother and clearer than ever before, thanks to our Fused Video Stabilization technique based on both optical image stabilization (OIS) and electronic image stabilization (EIS). Fused Video Stabilization delivers highly stable footage with minimal artifacts, and the Pixel 2 is currently rated as the leader in DxO's video ranking (also earning the highest overall rating for a smartphone camera). But how does it work?

A key principle in videography is keeping the camera motion smooth and steady. A stable video is free of distraction, so the viewer can focus on the subject of interest. But videos taken with smartphones are subject to many conditions that make taking a high-quality video a significant challenge:

Camera Shake

Most people hold their mobile phones in their hands to record videos: you pull the phone from your pocket, record the video, and the video is ready to share right after recording. However, that means your videos shake as much as your hands do, and they shake a lot! Moreover, if you are walking or running while recording, the camera motion can make videos almost unwatchable.

Motion Blur

If the camera or the subject moves during exposure, the resulting photo or video will appear blurry. Even if we stabilize the motion in between consecutive frames, the motion blur in each individual frame cannot be easily restored in practice, especially on a mobile device.
One typical video artifact due to motion blur is sharpness inconsistency: the video may rapidly alternate between blurry and sharp, which is very distracting even after the video is stabilized.

Rolling Shutter

The CMOS image sensor collects one row of pixels, or “scanline”, at a time, and it takes tens of milliseconds to go from the top scanline to the bottom. Therefore, anything moving during this period can appear distorted. This is called the rolling shutter distortion. Even if you have a steady hand, the rolling shutter distortion will appear when you move quickly.

A simulated rendering of a video with global (left) and rolling (right) shutter.

Focus Breathing

When there are objects of varying distance in a video, the angle of view can change significantly due to objects “jumping” in and out of the foreground. As a result, everything shrinks or expands like the video below, which professionals call “breathing”. [...]

Seamless Google Street View Panoramas


Posted by Mike Krainin, Software Engineer and Ce Liu, Research Scientist, Machine Perception

In 2007, we introduced Google Street View, enabling you to explore the world through panoramas of neighborhoods, landmarks, museums and more, right from your browser or mobile device. The creation of these panoramas is a complicated process, involving capturing images from a multi-camera rig called a rosette, and then using image blending techniques to carefully stitch them all together. However, many things can thwart the creation of a "successful" panorama, such as mis-calibration of the rosette camera geometry, timing differences between adjacent cameras, and parallax. And while we attempt to address these issues by using approximate scene geometry to account for parallax and frequent camera re-calibration, visible seams in image overlap regions can still occur.

Left: A Street View car carrying a multi-camera rosette. Center: A close-up of the rosette, which is made up of 15 cameras. Right: A visualization of the spatial coverage of each camera. Overlap between adjacent cameras is shown in darker gray.

Left: The Sydney Opera House with stitching seams along its iconic shells. Right: The same Street View panorama after optical flow seam repair.

In order to provide more seamless Street View images, we’ve developed a new algorithm based on optical flow to help solve these challenges. The idea is to subtly warp each input image such that the image content lines up within regions of overlap. This needs to be done carefully to avoid introducing new types of visual artifacts. The approach must also be robust to varying scene geometry, lighting conditions, calibration quality, and many other conditions. To simplify the task of aligning the images and to satisfy computational requirements, we’ve broken it into two steps.

Optical Flow

The first step is to find corresponding pixel locations for each pair of images that overlap.
Using techniques described in our PhotoScan blog post, we compute optical flow from one image to the other. This provides a smooth and dense correspondence field. We then downsample the correspondences for computational efficiency. We also discard correspondences where there isn’t enough visual structure to be confident in the results of optical flow.

The boundaries of a pair of constituent images from the rosette camera rig that need to be stitched together.

An illustration of optical flow within the pair’s overlap region.

Extracted correspondences in the pair of images. For each colored dot in the overlap region of the left image, there is an equivalently colored dot in the overlap region of the right image, indicating how the optical flow algorithm has aligned the point. These pairs of corresponding points are used as input to the global optimization stage. Notice that the overlap covers only a small portion of each image.

Global Optimization

The second step is to warp the rosette’s images to simultaneously align all of the corresponding points from overlap regions (as seen in the figure above). When stitched into a panorama, the set of warped images will then properly align. This is challenging because the overlap regions cover only a small fraction of each image, resulting in an under-constrained problem. To generate visually pleasing results across the whole image, we formulate the warping as a spline-based flow field with spatial regularization. The spline parameters are solved for in a non-linear optimization using [...]

Feature Visualization


Posted by Christopher Olah, Research Scientist, Google Brain Team and Alex Mordvintsev, Research Scientist, Google Research

Have you ever wondered what goes on inside neural networks? Feature visualization is a powerful tool for digging into neural networks and seeing how they work.

Our new article, published in Distill, does a deep exploration of feature visualization, introducing a few new tricks along the way!

Building on our work in DeepDream, and lots of work by others since, we are able to visualize what every neuron in a strong vision model (GoogLeNet [1]) detects. Over the course of multiple layers, it gradually builds up abstractions: first it detects edges, then it uses those edges to detect textures, the textures to detect patterns, and the patterns to detect parts of objects….

But neurons don’t understand the world by themselves; they work together. So we also need to understand how they interact with each other. One approach is to explore interpolations between them. What images can make them both fire, to different extents?

Here we interpolate from a neuron that seems to detect artistic patterns to a neuron that seems to detect lizard eyes:

We can also let you try adding different pairs of neurons together, to explore the possibilities for yourself:

In addition to allowing you to play around with visualizations, we explore a variety of techniques for getting feature visualization to work, and let you experiment with using them.

Techniques for visualizing and understanding neural networks are becoming more powerful. We hope our article will help other researchers apply these techniques, and give people a sense of their potential. Check it out on Distill.

Acknowledgement

We're extremely grateful to our co-author, Ludwig Schubert, who made incredible contributions to our paper and especially to the interactive visualizations. [...]

Tangent: Source-to-Source Debuggable Derivatives


Posted by Alex Wiltschko, Research Scientist, Google Brain Team

Tangent is a new, free, and open-source Python library for automatic differentiation. In contrast to existing machine learning libraries, Tangent is a source-to-source system, consuming a Python function f and emitting a new Python function that computes the gradient of f. This allows much better user visibility into gradient computations, as well as easy user-level editing and debugging of gradients. Tangent comes with many more features for debugging and designing machine learning models:

- Easily debug your backward pass
- Fast gradient surgery
- Forward mode automatic differentiation
- Efficient Hessian-vector products
- Code optimizations

This post gives an overview of the Tangent API. It covers how to use Tangent to generate gradient code in Python that is easy to interpret, debug and modify.

Neural networks (NNs) have led to great advances in machine learning models for images, video, audio, and text. The fundamental abstraction that lets us train NNs to perform well at these tasks is a 30-year-old idea called reverse-mode automatic differentiation (also known as backpropagation), which comprises two passes through the NN. First, we run a “forward pass” to calculate the output value of each node. Then we run a “backward pass” to calculate a series of derivatives to determine how to update the weights to increase the model’s accuracy.

Training NNs, and doing research on novel architectures, requires us to compute these derivatives correctly, efficiently, and easily. We also need to be able to debug these derivatives when our model isn’t training well, or when we’re trying to build something new that we do not yet understand. Automatic differentiation, or just “autodiff,” is a technique to calculate the derivatives of computer programs that denote some mathematical function, and nearly every machine learning library implements it.
Existing libraries implement automatic differentiation by tracing a program’s execution (at runtime, like TF Eager, PyTorch and Autograd) or by building a dynamic data-flow graph and then differentiating the graph (ahead-of-time, like TensorFlow). In contrast, Tangent performs ahead-of-time autodiff on the Python source code itself, and produces Python source code as its output.

As a result, you can finally read your automatic derivative code just like the rest of your program. Tangent is useful to researchers and students who not only want to write their models in Python, but also read and debug automatically-generated derivative code without sacrificing speed and flexibility.

You can easily inspect and debug your models written in Tangent, without special tools or indirection. Tangent works on a large and growing subset of Python, provides extra autodiff features other Python ML libraries don’t have, is high-performance, and is compatible with TensorFlow and NumPy.

Automatic differentiation of Python code

How do we automatically generate derivatives of plain Python code? Math functions like tf.exp or tf.log have derivatives, which we can compose to build the backward pass. Similarly, pieces of syntax, such as subroutines, conditionals, and loops, also have backward-pass versions. Tangent contains recipes for generating derivative code for each piece of Python syntax, along with many NumPy and TensorFlow function calls.

Tangent has a one-function API:

Here’s an[...]
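To give a feel for what readable, source-level derivative code looks like, here is a small hand-written sketch of the forward-pass and backward-pass structure described above. It does not call Tangent and is not Tangent's output: the functions `f` and `df`, and the `b`-prefixed names for derivative variables, are assumptions made for illustration, using only the standard library.

```python
import math


def f(x):
    """Forward computation: y = exp(x * x)."""
    a = x * x
    y = math.exp(a)
    return y


def df(x, by=1.0):
    """Hand-written gradient of f, laid out as a forward then a backward pass."""
    # Forward pass: recompute and keep the intermediate values.
    a = x * x
    y = math.exp(a)
    # Backward pass: visit the statements in reverse order, applying the
    # derivative recipe for each operation.
    ba = by * y          # derivative of exp(a) with respect to a is exp(a)
    bx = ba * 2.0 * x    # derivative of x * x with respect to x is 2 * x
    return bx


x0 = 0.5
analytic = df(x0)

# Check against a central finite difference.
eps = 1e-6
numeric = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
print("hand-written gradient:", analytic)
print("finite difference    :", numeric)
```

Because the derivative is ordinary Python, it can be stepped through in a debugger or edited by hand, for example by printing or clipping `ba` inside the backward pass.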