Subscribe: Google Research Blog
Added By: Feedage Forager Feedage Grade A rated
Language: English
data  google  images  learning  machine learning  machine  model  models  network  neural  new  quantum  research  time  video 
Rate this Feed
Rate this feedRate this feedRate this feedRate this feedRate this feed
Rate this feed 1 starRate this feed 2 starRate this feed 3 starRate this feed 4 starRate this feed 5 star

Comments (0)

Feed Details and Statistics Feed Statistics
Preview: Google Research Blog

Google Research Blog

The latest news on Google Research.

Updated: 2018-04-21T02:43:58.889-07:00


Introducing the CVPR 2018 On-Device Visual Intelligence Challenge


Posted by Bo Chen, Software Engineer and Jeffrey M. Gilbert, Member of Technical Staff, Google ResearchOver the past year, there have been exciting innovations in the design of deep networks for vision applications on mobile devices, such as the MobileNet model family and integer quantization. Many of these innovations have been driven by performance metrics that focus on meaningful user experiences in real-world mobile applications, requiring inference to be both low-latency and accurate. While the accuracy of a deep network model can be conveniently estimated with well established benchmarks in the computer vision community, latency is surprisingly difficult to measure and no uniform metric has been established. This lack of measurement platforms and uniform metrics have hampered the development of performant mobile applications.Today, we are happy to announce the On-device Visual Intelligence Challenge (OVIC), part of the Low-Power Image Recognition Challenge Workshop at the 2018 Computer Vision and Pattern Recognition conference (CVPR2018). A collaboration with Purdue University, the University of North Carolina and IEEE, OVIC is a public competition for real-time image classification that uses state-of-the-art Google technology to significantly lower the barrier to entry for mobile development. OVIC provides two key features to catalyze innovation: a unified latency metric and an evaluation platform.A Unified MetricOVIC focuses on the establishment of a unified metric aligned directly with accurate and performant operation on mobile devices. The metric is defined as the number of correct classifications within a specified per-image average time limit of 33ms. This latency limit allows every frame in a live 30 frames-per-second video to be processed, thus providing a seamless user experience1. Prior to OVIC, it was tricky to enforce such a limit due to the difficulty in accurately and uniformly measuring latency as would be experienced in real-world applications on real-world devices. Without a repeatable mobile development platform, researchers have relied primarily on approximate metrics for latency that are convenient to compute, such as the number of multiply-accumulate operations (MACs). The intuition is that multiply-accumulate constitutes the most time-consuming operation in a deep neural network, so their count should be indicative of the overall latency. However, these metrics are often poor predictors of on-device latency due to many aspects of the models that can impact the average latency of each MAC in typical implementations.Even though the number of multiply-accumulate operations (# MACs) is the most commonly used metric to approximate on-device latency, it is a poor predictor of latency. Using data from various quantized and floating point MobileNet V1 and V2 based models, this graph plots on-device latency on a common reference device versus the number of MACs. It is clear that models with similar latency can have very different MACs, and vice versa. The graph above shows that while the number of MACs is correlated with the inference latency, there is significant variation in the mapping. Thus number of MACs is a poor proxy for latency, and since latency directly affects users’ experiences, we believe it is paramount to optimize latency directly rather than focusing on limiting the number of MACs as a proxy.An Evaluation PlatformAs mentioned above, a primary issue with latency is that it has previously been challenging to measure reliably and repeatably, due to variations in implementation, running environment and hardware architectures. Recent successes in mobile development overcome these challenges with the help of a convenient mobile development platform, including optimized kernels for mobile CPUs, light-weight portable model formats, increasingly capable mobile devices, and more. However, these various platforms have traditionally required resources and development capabilities that are only available to larger universities and industry. With that in mind, we are releasing OVIC’s evaluation[...]

DeepVariant Accuracy Improvements for Genetic Datatypes


Posted by Pi-Chuan Chang, Software Engineer and Lizzie Dorfman, Technical Program Manager, Google Brain Team Last December we released DeepVariant, a deep learning model that has been trained to analyze genetic sequences and accurately identify the differences, known as variants, that make us all unique. Our initial post focused on how DeepVariant approaches “variant calling” as an image classification problem, and is able to achieve greater accuracy than previous methods. Today we are pleased to announce the launch of DeepVariant v0.6, which includes some major accuracy improvements. In this post we describe how we train DeepVariant, and how we were able to improve DeepVariant's accuracy for two common sequencing scenarios, whole exome sequencing and polymerase chain reaction sequencing, simply by adding representative data into DeepVariant's training process. Many Types of Sequencing DataApproaches to genomic sequencing vary depending on the type of DNA sample (e.g., from blood or saliva), how the DNA was processed (e.g., amplification techniques), which technology was used to sequence the data (e.g., instruments can vary even within the same manufacturer) and what section or how much of the genome was sequenced. These differences result in a very large number of sequencing "datatypes". Typically, variant calling tools have been tuned for one specific datatype and perform relatively poorly on others. Given the extensive time and expertise involved in tuning variant callers for new datatypes, it seemed infeasible to customize each tool for every one. In contrast, with DeepVariant we are able to improve accuracy for new datatypes simply by including representative data in the training process, without negatively impacting overall performance.Truth Sets for Variant CallingDeep learning models depend on having high quality data for training and evaluation. In the field of genomics, the Genome in a Bottle (GIAB) consortium, which is hosted by the National Institute of Standards and Technology (NIST), produces human genomes for use in technology development, evaluation, and optimization. The benefit of working with GIAB benchmarking genomes is that their true sequence is known (at least to the extent currently possible). To achieve this, GIAB takes a single person's DNA and repeatedly sequences it using a wide variety of laboratory methods and sequencing technologies (i.e. many datatypes) and analyzes the resulting data using many different variant calling tools. A tremendous amount of work then follows to evaluate and adjudicate discrepancies to produce a high-confidence "truth set" for each genome. The majority of DeepVariant’s training data is from the first benchmarking genome released by GIAB, HG001. The sample, from a woman of northern European ancestry, was made available as part of the International HapMap Project, the first large-scale effort to identify common patterns of human genetic variation. Because DNA from HG001 is commercially available and so well characterized, it is often the first sample used to test new sequencing technologies and variant calling tools. By using many replicates and different datatypes of HG001, we can generate millions of training examples which helps DeepVariant learn to accurately classify many datatypes, and even generalize to datatypes it has never seen before. Improved Exome Model in v0.5In the v0.5 release we formalized a benchmarking-compatible training strategy to withhold from training a complete sample, HG002, as well as any data from chromosome 20. HG002, the second benchmarking genome released by GIAB, is from a male of Ashkenazi Jewish ancestry. Testing on this sample, which differs in both sex and ethnicity from HG001, helps to ensure that DeepVariant is performing well for diverse populations. Additionally reserving chromosome 20 for testing guarantees that we can evaluate DeepVariant's accuracy for any datatype that has truth data available. In v0.5 we also focused on exome data, which is the subset of the genome that directly codes for[...]

An Augmented Reality Microscope for Cancer Detection


Posted by Martin Stumpe, Technical Lead and Craig Mermel, Product Manager, Google Brain TeamApplications of deep learning to medical disciplines including ophthalmology, dermatology, radiology, and pathology have recently shown great promise to increase both the accuracy and availability of high-quality healthcare to patients around the world. At Google, we have also published results showing that a convolutional neural network is able to detect breast cancer metastases in lymph nodes at a level of accuracy comparable to a trained pathologist. However, because direct tissue visualization using a compound light microscope remains the predominant means by which a pathologist diagnoses illness, a critical barrier to the widespread adoption of deep learning in pathology is the dependence on having a digital representation of the microscopic tissue.Today, in a talk delivered at the Annual Meeting of the American Association for Cancer Research (AACR), with an accompanying paper “An Augmented Reality Microscope for Real-time Automated Detection of Cancer” (under review), we describe a prototype Augmented Reality Microscope (ARM) platform that we believe can possibly help accelerate and democratize the adoption of deep learning tools for pathologists around the world. The platform consists of a modified light microscope that enables real-time image analysis and presentation of the results of machine learning algorithms directly into the field of view. Importantly, the ARM can be retrofitted into existing light microscopes found in hospitals and clinics around the world using low-cost, readily-available components, and without the need for whole slide digital versions of the tissue being analyzed. allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="" frameborder="0" height="360" src="" width="640">Modern computational components and deep learning models, such as those built upon TensorFlow, will allow a wide range of pre-trained models to run on this platform. As in a traditional analog microscope, the user views the sample through the eyepiece. A machine learning algorithm projects its output back into the optical path in real-time. This digital projection is visually superimposed on the original (analog) image of the specimen to assist the viewer in localizing or quantifying features of interest. Importantly, the computation and visual feedback updates quickly — our present implementation runs at approximately 10 frames per second, so the model output updates seamlessly as the user scans the tissue by moving the slide and/or changing magnification.Left: Schematic overview of the ARM. A digital camera captures the same field of view (FoV) as the user and passes the image to an attached compute unit capable of running real-time inference of a machine learning model. The results are fed back into a custom AR display which is inline with the ocular lens and projects the model output on the same plane as the slide. Right: A picture of our prototype which has been retrofitted into a typical clinical-grade light microscope.In principle, the ARM can provide a wide variety of visual feedback, including text, arrows, contours, heatmaps, or animations, and is capable of running many types of machine learning algorithms aimed at solving different problems such as object detection, quantification, or classification. As a demonstration of the potential utility of the ARM, we configured it to run two different cancer detection algorithms: one that detects breast cancer metastases in lymph node specimens, and another that detects prostate cancer in prostatectomy specimens. These models can run at magnifications between 4-40x, and the result of a given model is displayed by outlining detected tumor regions with a green contour. These contours help draw the pathologist’s attention to areas of interest without obscuring the underlying tumor cell appear[...]

Introducing Semantic Experiences with Talk to Books and Semantris


Posted by Ray Kurzweil, Director of Engineering and Rachel Bernstein, Product Manager, Google ResearchNatural language understanding has evolved substantially in the past few years, in part due to the development of word vectors that enable algorithms to learn about the relationships between words, based on examples of actual language usage. These vector models map semantically similar phrases to nearby points based on equivalence, similarity or relatedness of ideas and language. Last year, we used hierarchical vector models of language to make improvements to Smart Reply for Gmail. More recently, we’ve been exploring other applications of these methods. Today, we are proud to share Semantic Experiences, a website showing two examples of how these new capabilities can drive applications that weren’t possible before. Talk to Books is an entirely new way to explore books by starting at the sentence level, rather than the author or topic level. Semantris is a word association game powered by machine learning, where you type out words associated with a given prompt. We have also published “Universal Sentence Encoder”, which describes the models used for these examples in more detail. Lastly, we’ve provided a pretrained semantic TensorFlow module for the community to experiment with their own sentence and phrase encoding. Modeling approachOur approach extends the idea of representing language in a vector space by creating vectors for larger chunks of language such as full sentences and small paragraphs. Since language is composed of hierarchies of concepts, we create the vectors using a hierarchy of modules, each of which considers features that correspond to sequences at different temporal scales. Relatedness, synonymy, antonymy, meronymy, holonymy, and many other types of relationships may all be represented in vector space language models if we train them in the right way and then pose the right “questions”. We describe this method in our paper, “Efficient Natural Language Response for Smart Reply.” Talk to BooksWith Talk to Books, we provide an entirely new way to explore books. You make a statement or ask a question, and the tool finds sentences in books that respond, with no dependence on keyword matching. In a sense you are talking to the books, getting responses which can help you determine if you’re interested in reading them or not. Talk to BooksThe models driving this experience were trained on a billion conversation-like pairs of sentences, learning to identify what a good response might look like. Once you ask your question (or make a statement), the tools searches all the sentences in over 100,000 books to find the ones that respond to your input based on semantic meaning at the sentence level; there are no predefined rules bounding the relationship between what you put in and the results you get. This capability is unique and can help you find interesting books that a keyword search might not surface, but there’s still room for improvement. For example, this experiment works at the sentence level (rather than at the paragraph level, as in Smart Reply for Gmail) so a “good” matching sentence can still be taken out of context. You might find books and passages that you didn’t expect, or the reason a particular passage was highlighted might not be obvious. You may also notice that being well-known does not make a book sort to the top; this experiment looks only at how well the individual sentences match up. However, one benefit of this is that the tool may help people discover unexpected authors and titles, and surface books in a way that is fresh and innovative.SemantrisWe are also providing Semantris, a word association game that is powered by this technology. When you enter a word or phrase, the game ranks all of the words on-screen, scoring them based on how well they respond to what you typed. Again, similarity, opposites and neighboring concepts are all fair-game using this semantic model. Try it ou[...]

Seeing More with In Silico Labeling of Microscopy Images


Eric Christiansen, Senior Software Engineer, Google ResearchIn the fields of biology and medicine, microscopy allows researchers to observe details of cells and molecules which are unavailable to the naked eye. Transmitted light microscopy, where a biological sample is illuminated on one side and imaged, is relatively simple and well-tolerated by living cultures but produces images which can be difficult to properly assess. Fluorescence microscopy, in which biological objects of interest (such as cell nuclei) are specifically targeted with fluorescent molecules, simplifies analysis but requires complex sample preparation. With the increasing application of machine learning to the field of microscopy, including algorithms used to automatically assess the quality of images and assist pathologists diagnosing cancerous tissue, we wondered if we could develop a deep learning system that could combine the benefits of both microscopy techniques while minimizing the downsides. With “In Silico Labeling: Predicting Fluorescent Labels in Unlabeled Images”, appearing today in Cell, we show that a deep neural network can predict fluorescence images from transmitted light images, generating labeled, useful, images without modifying cells and potentially enabling longitudinal studies in unmodified cells, minimally invasive cell screening for cell therapies, and investigations using large numbers of simultaneous labels. We also open sourced our network, along with the complete training and test data, a trained model checkpoint, and example code.BackgroundTransmitted light microscopy techniques are easy to use, but can produce images in which it can be hard to tell what’s going on. An example is the following image from a phase-contrast microscope, in which the intensity of a pixel indicates the degree to which light was phase-shifted as it passed through the sample.Transmitted light (phase-contrast) image of a human motor neuron culture derived from induced pluripotent stem cells. Outset 1 shows a cluster of cells, possibly neurons. Outset 2 shows a flaw in the image obscuring underlying cells. Outset 3 shows neurites. Outset 4 shows what appear to be dead cells. Scale bar is 40 μm. Source images for this and the following figures come from the Finkbeiner lab at the Gladstone Institutes.In the above figure, it’s difficult to tell how many cells are in the cluster in Outset 1, or the locations and states of the cells in Outset 4 (hint: there’s a barely-visible flat cell in the upper-middle). It’s also difficult to get fine structures consistently in focus, such as the neurites in Outset 3.We can get more information out of transmitted light microscopy by acquiring images in z-stacks: sets of images registered in (x, y) where z (the distance from the camera) is systematically varied. This causes different parts of the cells to come in and out of focus, which provides information about a sample’s 3D structure. Unfortunately, it often takes a trained eye to make sense of the z-stack, and analysis of such z-stacks has largely defied automation. An example z-stack is shown below.A phase-contrast z-stack of the same cells. Note how the appearance changes as the focus is shifted. Now we can see that the fuzzy shape in the lower right of Outset 1 is a single oblong cell, and that the rightmost cell in Outset 4 is taller than the uppermost cell, possibly indicating that it has undergone programmed cell death.In contrast, fluorescence microscopy images are easier to analyze, because samples are prepared with carefully engineered fluorescent labels which light up just what the researchers want to see. For example, most human cells have exactly one nucleus, so a nuclear label (such as the blue one below) makes it possible for simple tools to find and count cells in an image.Fluorescence microscopy image of the same cells. The blue fluorescent label localizes to DNA, highlighting cell nuclei. The green fluorescent label locali[...]

Looking to Listen: Audio-Visual Speech Separation


Posted by Inbar Mosseri and Oran Lang, Software Engineers, Google ResearchPeople are remarkably good at focusing their attention on a particular person in a noisy environment, mentally “muting” all other voices and sounds. Known as the cocktail party effect, this capability comes natural to us humans. However, automatic speech separation — separating an audio signal into its individual speech sources — while a well-studied problem, remains a significant challenge for computers.In “Looking to Listen at the Cocktail Party”, we present a deep learning audio-visual model for isolating a single speech signal from a mixture of sounds such as other voices and background noise. In this work, we are able to computationally produce videos in which speech of specific people is enhanced while all other sounds are suppressed. Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context. We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking. allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="" frameborder="0" height="360" src="" width="640">A unique aspect of our technique is in combining both the auditory and visual signals of an input video to separate the speech. Intuitively, movements of a person’s mouth, for example, should correlate with the sounds produced as that person is speaking, which in turn can help identify which parts of the audio correspond to that person. The visual signal not only improves the speech separation quality significantly in cases of mixed speech (compared to speech separation using audio alone, as we demonstrate in our paper), but, importantly, it also associates the separated, clean speech tracks with the visible speakers in the video.The input to our method is a video with one or more people speaking, where the speech of interest is interfered by other speakers and/or background noise. The output is a decomposition of the input audio track into clean speech tracks, one for each person detected in the video.An Audio-Visual Speech Separation ModelTo generate training examples, we started by gathering a large collection of 100,000 high-quality videos of lectures and talks from YouTube. From these videos, we extracted segments with a clean speech (e.g. no mixed music, audience sounds or other speakers) and with a single speaker visible in the video frames. This resulted in roughly 2000 hours of video clips, each of a single person visible to the camera and talking with no background interference. We then used this clean data to generate “synthetic cocktail parties” -- mixtures of face videos and their corresponding speech from separate video sources, along with non-speech background noise we obtained from AudioSet.Using this data, we were able to train a multi-stream convolutional neural network-based model to split the synthetic cocktail mixture into separate audio streams for each speaker in the video. The input to the network are visual features extracted from the face thumbnails of detected speakers in each frame, and a spectrogram representation of the video’s soundtrack. During training, the network learns (separate) encodings for the visual and auditory signals, then it fuses them together to form a joint audio-visual representation. With that joint representation, the network learns to output a time-frequency mask for each speaker. The output masks are multiplied by the noisy input spectrogram and converted back to a time-domain waveform to obtain an isolated,[...]

Announcing the 2018 Google PhD Fellows for North America, Europe and the Middle East


Google created the PhD Fellowship program in 2009 to recognize and support outstanding graduate students doing exceptional research in Computer Science and related disciplines. Now in its ninth year, our fellowship program has supported hundreds of future faculty, industry researchers, innovators and entrepreneurs.

Reflecting our continuing commitment to supporting and building relationships with the academic community, we are excited to announce the 39 recipients from North America, Europe and the Middle East. We offer our sincere congratulations to the 2018 Google PhD Fellows.

Algorithms, Optimizations and Markets
Emmanouil Zampetakis, Massachusetts Institute of Technology
Manuela Fischer, ETH Zurich CS
Thodoris Lykouris, Cornell University
Yuan Deng, Duke University

Computational Neuroscience
Ella Batty, Columbia University
Neha Spenta Wadia, University of California, Berkeley
Reuben Feinman, New York University

Human-Computer Interaction
Gierad Laput, Carnegie Mellon University
Mike Schaekermann, University of Waterloo
Minsuk (Brian) Kahng, Georgia Tech

Machine Learning
Aditi Raghunathan, Stanford University
Lin Chen, Yale University
Qian Yu, University of Southern California
Ravid Shwartz-Ziv, The Hebrew University of Jerusalem
Shuang Liu, University of California, San Diego
Stephen Tu, University of California, Berkeley
Xinchen Yan, University of Michigan, Ann Arbor
Zelda Mariet, Massachusetts Institute of Technology

Mobile Computing
Shilin Zhu, University of California, San Diego

Machine Perception, Speech Technology and Computer Vision
Antoine Miech, INRIA
Arsha Nagrani, University of Oxford (ES)
Joseph Redmon, University of Washington
Raymond Yeh, University of Illinois, Urbana-Champaign
Shanmukha Ramakrishna Vedantam, Georgia Tech

Natural Language Processing
Anne Cocos, University of Pennsylvania
Jonathan Herzig, Tel-Aviv University
Rotem Dror, Technion - Israel Institute of Technology
Yang Liu, The University of Edinburgh
Yoon Kim, Harvard University

Privacy and Security
Aayush Jain, University of California, Los Angeles

Programming Technology and Software Engineering
Gowtham Kaki, Purdue University, West Lafayette
Reyhaneh Jabbarvand, University of California, Irvine
Victor Lanvin, Fondation Sciences Mathématiques de Paris

Quantum Computing
Erika Ye, California Institute of Technology

Structured Data and Database Management
Lingjiao Chen, University of Wisconsin

Systems and Networking
Andrea Lattuada, ETH Zurich CS
Lana Josipović, EPFL CS
Michael Schaarschmidt, University of Cambridge - Computer Laboratory
Rachee Singh, University of Massachusetts, Amherst(image)

MobileNetV2: The Next Generation of On-Device Computer Vision Networks


Posted by Mark Sandler and Andrew Howard, Google ResearchLast year we introduced MobileNetV1, a family of general purpose computer vision neural networks designed with mobile devices in mind to support classification, detection and more. The ability to run deep networks on personal mobile devices improves user experience, offering anytime, anywhere access, with additional benefits for security, privacy, and energy consumption. As new applications emerge allowing users to interact with the real world in real time, so does the need for ever more efficient neural networks.Today, we are pleased to announce the availability of MobileNetV2 to power the next generation of mobile vision applications. MobileNetV2 is a significant improvement over MobileNetV1 and pushes the state of the art for mobile visual recognition including classification, object detection and semantic segmentation. MobileNetV2 is released as part of TensorFlow-Slim Image Classification Library, or you can start exploring MobileNetV2 right away in Colaboratory. Alternately, you can download the notebook and explore it locally using Jupyter. MobileNetV2 is also available as modules on TF-Hub, and pretrained checkpoints can be found on github.MobileNetV2 builds upon the ideas from MobileNetV1 [1], using depthwise separable convolution as efficient building blocks. However, V2 introduces two new features to the architecture: 1) linear bottlenecks between the layers, and 2) shortcut connections between the bottlenecks1. The basic structure is shown below.Overview of MobileNetV2 Architecture. Blue blocks represent composite convolutional building blocks as shown above.The intuition is that the bottlenecks encode the model’s intermediate inputs and outputs while the inner layer encapsulates the model’s ability to transform from lower-level concepts such as pixels to higher level descriptors such as image categories. Finally, as with traditional residual connections, shortcuts enable faster training and better accuracy. You can learn more about the technical details in our paper, “MobileNet V2: Inverted Residuals and Linear Bottlenecks”.How does it compare to the first generation of MobileNets? Overall, the MobileNetV2 models are faster for the same accuracy across the entire latency spectrum. In particular, the new models use 2x fewer operations, need 30% fewer parameters and are about 30-40% faster on a Google Pixel phone than MobileNetV1 models, all while achieving higher accuracy.MobileNetV2 improves speed (reduced latency) and increased ImageNet Top 1 accuracyMobileNetV2 is a very effective feature extractor for object detection and segmentation. For example, for detection when paired with the newly introduced SSDLite [2] the new model is about 35% faster with the same accuracy than MobileNetV1. We have open sourced the model under the Tensorflow Object Detection API [4]. Model Params Multiply-Adds mAP Mobile CPU MobileNetV1 + SSDLite 5.1M 1.3B 22.2% 270ms MobileNetV2 + SSDLite 4.3M 0.8B 22.1% 200ms To enable on-device semantic segmentation, we employ MobileNetV2 as a feature extractor in a reduced form of DeepLabv3 [3], that was announced recently. On the semantic segmentation benchmark, PASCAL VOC 2012, our resulting model attains a similar performance as employing MobileNetV1 as feature extractor, but requires 5.3 times fewer parameters and 5.2 times fewer operations in terms of Multiply-Adds. Model Params Multiply-Adds mIOU MobileNetV1 + DeepLabV3 11.15M 14.25B 75.29% MobileNetV2 + DeepLabV3 2.11M 2.75B 75.32% As we have seen MobileNetV2 provides a very efficient mobile-oriented model that can be used as a base for many visual recognition tasks. We hope by sharing it with the broader academic and open-source community we can help to advance research and application development.Acknowledgements:We would like to acknowledge our core contributors Men[...]

Investing in France’s AI Ecosystem


Posted by Olivier Bousquet, Principal Engineer, Google ZürichRecently, we announced the launch of a new AI research team in our Paris office. And today DeepMind has also announced a new AI research presence in Paris. We are excited about expanding Google’s research presence in Europe, which bolsters the efforts of the existing groups in our Zürich and London offices. As strong supporters of academic research, we are also excited to foster collaborations with France’s vibrant academic ecosystem.Our research teams in Paris will focus on fundamental AI research, as well as important applications of these ideas to areas such as Health, Science or Arts. They will publish and open-source their results to advance the state-of-the-art in core areas such as Deep Learning and Reinforcement Learning.Our approach to research is based on building a strong connection with the academic community; contributing to training the next generation of scientists and establishing a bridge between academic and industrial research. We believe that both objectives are key to fostering a healthy research ecosystem that will flourish in the long term. These ideas are very much aligned with some of the recommendations that Fields Medalist and member of French Parliament Cédric Villani is putting forward in his report on AI to the French government.As we expand our teams in France, we have several initiatives that illustrate our commitment to these goals:We are sponsoring “Artificial Intelligence and Visual Computing” Chair at École Polytechnique (one of the leading higher education institutions in France) which will support their education initiatives in AIWe just established a partnership with INRIA for conducting collaborative research projectsWe are funding academic research with unrestricted grants mostly dedicated to the support of PhD and postdoc positions through our Faculty Research Awards and PhD Fellowship programs, as well as our Focused Research Awards. As one example, we have recently funded a project on large scale optimization of neural networks led by Francis Bach (INRIA and ENS) and Alexandre d’Aspremont (CNRS and ENS)We are working on offering CIFRE PhD positions (joint PhD positions between Google and an academic lab) as well as internships for PhD studentsAdditionally, we are pleased to announce that one of the world’s leading experts in computer vision, Cordelia Schmid, will begin a dual appointment at INRIA and Google Paris. These kind of appointments, together with our Visiting Faculty program, are a great way to share ideas and research challenges, and utilize Google's world-class computing infrastructure to explore new projects at industrial scale. France has a long tradition of research and educational excellence, and has a very dynamic and active machine learning community. This makes it a great place to pursue our goal of building AI-enabled technologies that can benefit everyone, through fundamental advances in machine learning and related fields. [...]

Using Machine Learning to Discover Neural Network Optimizers


Posted by Irwan Bello, Research Associate, Google Brain TeamDeep learning models have been deployed in numerous Google products, such as Search, Translate and Photos. The choice of optimization method plays a major role when training deep learning models. For example, stochastic gradient descent works well in many situations, but more advanced optimizers can be faster, especially for training very deep networks. Coming up with new optimizers for neural networks, however, is challenging due to to the non-convex nature of the optimization problem. On the Google Brain team, we wanted to see if it could be possible to automate the discovery of new optimizers, in a way that is similar to how AutoML has been used to discover new competitive neural network architectures.In “Neural Optimizer Search with Reinforcement Learning”, we present a method to discover optimization methods with a focus on deep learning architectures. Using this method we found two new optimizers, PowerSign and AddSign, that are competitive on a variety of different tasks and architectures, including ImageNet classification and Google’s neural machine translation system. To help others benefit from this work we have made the optimizers available in Tensorflow.Neural Optimizer Search makes use of a recurrent neural network controller which is given access to a list of simple primitives that are typically relevant for optimization. These primitives include, for example, the gradient or the running average of the gradient and lead to search spaces with over 1010 possible combinations. The controller then generates the computation graph for a candidate optimizer or update rule in that search space.In our paper, proposed candidate update rules (U) are used to train a child convolutional neural network on CIFAR10 for a few epochs and the final validation accuracy (R) is fed as a reward to the controller. The controller is trained with reinforcement learning to maximize the validation accuracies of the sampled update rules. This process is illustrated below.An overview of Neural Optimizer Search using an iterative process to discover new optimizers.Interestingly, the optimizers we have found are interpretable. For example, in the PowerSign optimizer we are releasing, each update compares the sign of the gradient and its running average, adjusting the step size according to whether those two values agree. The intuition behind this is that if these values agree, one is more confident in the direction of the update, and thus the step size can be larger. We also discovered a simple learning rate decay scheme, linear cosine decay, which we found can lead to faster convergence.Graph comparing learning rate decay functions for linear cosine decay, stepwise decay and cosine decay.Neural Optimizer Search found several optimizers that outperform commonly used optimizers on the small ConvNet model. Among the ones that transfer well to other tasks, we found that PowerSign and AddSign improve top-1 and top-5 accuracy of a state-of-the-art ImageNet mobile-sized model by up to 0.4%. They also work well on Google’s Neural Machine Translation system, giving an improvement of up to 0.7 using bilingual evaluation metrics (BLEU) on an English to German translation task.We are excited that Neural Optimizer Search can not only improve the performance of machine learning models but also potentially lead to new, interpretable equations and discoveries. It is our hope that open sourcing these optimizers in Tensorflow will be useful to machine learning practitioners. [...]

Expressive Speech Synthesis with Tacotron


Posted by Yuxuan Wang, Research Scientist and RJ Skerry-Ryan, Software Engineer, on behalf of the Machine Perception, Google Brain and TTS Research teamsAt Google, we're excited about the recent rapid progress of neural network-based text-to-speech (TTS) research. In particular, end-to-end architectures, such as the Tacotron systems we announced last year, can both simplify voice building pipelines and produce natural-sounding speech. This will help us build better human-computer interfaces, like conversational assistants, audiobook narration, news readers, or voice design software. To deliver a truly human-like voice, however, a TTS system must learn to model prosody, the collection of expressive factors of speech, such as intonation, stress, and rhythm. Most current end-to-end systems, including Tacotron, don't explicitly model prosody, meaning they can't control exactly how the generated speech should sound. This may lead to monotonous-sounding speech, even when models are trained on very expressive datasets like audiobooks, which often contain character voices with significant variation. Today, we are excited to share two new papers that address these problems.Our first paper, “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron”, introduces the concept of a prosody embedding. We augment the Tacotron architecture with an additional prosody encoder that computes a low-dimensional embedding from a clip of human speech (the reference audio).We augment Tacotron with a prosody encoder. The lower half of the image is the original Tacotron sequence-to-sequence model. For technical details, please refer to the paper.This embedding captures characteristics of the audio that are independent of phonetic information and idiosyncratic speaker traits — these are attributes like stress, intonation, and timing. At inference time, we can use this embedding to perform prosody transfer, generating speech in the voice of a completely different speaker, but exhibiting the prosody of the reference.Text: *Is* that Utah travel agency? Reference prosody (Australian) Synthesized without prosody embedding (American) Synthesized with prosody embedding (American) The embedding can also transfer fine time-aligned prosody from one phrase to a slightly different phrase, though this technique works best when the reference and target phrases are similar in length and structure.Reference Text: For the first time in her life she had been danced tired.Synthesized Text: For the last time in his life he had been handily embarrassed. Reference prosody (American) Synthesized without prosody embedding (American) Synthesized with prosody embedding (American) Excitingly, we observe prosody transfer even when the reference audio comes from a speaker whose voice is not in Tacotron's training data. Text: I've Swallowed a Pollywog. Reference prosody (Unseen American Speaker) Synthesized without prosody embedding (British) Synthesized with prosody embedding (British) This is a promising result, as it paves the way for voice interaction designers to use their own voice to customize speech synthesis. You can listen to the full set of audio demos for “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron” on this web page.Despite their ability to transfer prosody with high fidelity, the embeddings from the paper above don't completely disentangle prosody from the content of a reference audio clip. (This explains why they transfer prosody best to phrases of similar structure and length.) Furthermore, they require a clip of reference audio at inference time. A natural question then arises: can we develop a model of expressive speech that alle[...]

Reformulating Chemistry for More Efficient Quantum Computation


Posted by Ryan Babbush, Senior Research Scientist, Quantum AI TeamThe first known classical “computer” was the Antikythera mechanism, an analog machine used to simulate the classical mechanics governing dynamics of celestial bodies on an astronomical scale. Similarly, a major ambition of quantum computers is to simulate the quantum mechanics governing dynamics of particles on the atomic scale. These simulations are often classically intractable due to the complex quantum mechanics at play. Of particular interest is the simulation of electrons forming chemical bonds, which give rise to the properties of essentially all molecules, materials and chemical reactions.Left: The first known computing device, the Antikythera mechanism: a classical machine used to simulate classical mechanics. Right: Google’s 22 Xmon qubit “foxtail” chip arranged in a bilinear array on a wafer, the predecessor to Google’s new Bristlecone quantum processor with 72 qubits, a quantum machine we intend to use to simulate quantum mechanics, among other applications.Since the launch of the Quantum AI team in 2013, we have been developing practical algorithms for quantum processors. In 2015, we conducted the first quantum chemistry experiment on a superconducting quantum computing device, published in Physical Review X. More recently, our quantum simulation effort experimentally simulated exotic phases of matter and released the first software package for quantum computing chemistry, OpenFermion. Earlier this month, our hardware team announced the new Bristlecone quantum processor with 72 qubits. Today, we highlight two recent publications with theoretical advances that significantly reduce the cost of these quantum computations. Our results were presented at the Quantum Information Processing and IBM ThinkQ conferences.The first of these works, “Low-Depth Quantum Simulation of Materials,” published this week in Physical Review X, was a collaboration between researchers at Google, the group of Professor Garnet Chan at Caltech and the QuArC group at Microsoft. Our fundamental advance was to realize that by changing how molecules are represented on quantum computers, we can greatly simplify the quantum circuits required to solve the problem. Specifically, we specially design basis sets so that the equations describing the system energies (i.e. the Hamiltonian) become more straightforward to express for quantum computation.To do this, we focused on using basis sets related to functions (plane waves) used in classical electronic structure calculations to provide a periodic representation of the physical system. This enables one to go beyond the quantum simulation of single-molecules and instead use quantum computers to model realistic materials. For instance, instead of simulating a single lithium hydride molecule floating in free space, with our approach one can quantum simulate a crystal of lithium hydride, which is how the material appears in nature. With larger quantum computers one could study other important materials problems such as the degradation of battery cathodes, chemical reactions involving heterogeneous catalysts, or the unusual electrical properties of graphene and superconductors.In “Quantum Simulation of Electronic Structure with Linear Depth and Connectivity,” published last week in Physical Review Letters with the same collaborators and a Google intern from the Aspuru-Guzik group at Harvard, we leverage the structure introduced in the work above to design algorithms for near-term quantum computers with qubits laid out in a linear array. Whereas past methods required such quantum computers to run for time scaling as the fifth power of the number of simulated electrons for each dynamic step, our improved algorithm runs for t[...]

Google Faculty Research Awards 2017


We’ve just completed another round of the Google Faculty Research Awards, our annual open call for proposals on computer science and related topics such as machine learning, machine perception, natural language processing, and quantum computing. Our grants cover tuition for a graduate student and provide both faculty and students the opportunity to work directly with Google researchers and engineers.

This round we received 1033 proposals covering 46 countries and over 360 universities. After expert reviews and committee discussions, we decided to fund 152 projects. The subject areas that received the most support this year were human computer interaction, machine learning, machine perception, and systems. Here are a few observations from this round:
  • There was a 17% increase in the total number of proposals received
  • There was a 87% increase in the number of proposals from Asia Pacific universities
  • Proposals focused on Computational Neuroscience increased 53%
  • Proposals focused on Quantum Computing more than doubled this round
Congratulations to the well-deserving recipients of this round’s awards. If you are interested in applying for the next round (September 2018 deadline), please visit our website for more information. You can find award recipients from previous years here.(image)

Using Deep Learning to Facilitate Scientific Image Analysis


Posted by Samuel Yang, Research Scientist, Google Accelerated Science Team Many scientific imaging applications, especially microscopy, can produce terabytes of data per day. These applications can benefit from recent advances in computer vision and deep learning. In our work with biologists on robotic microscopy applications (e.g., to distinguish cellular phenotypes) we've learned that assembling high quality image datasets that separate signal from noise is a difficult but important task. We've also learned that there are many scientists who may not write code, but who are still excited to utilize deep learning in their image analysis work. A particular challenge we can help address involves dealing with out-of-focus images. Even with the autofocus systems on state-of-the-art microscopes, poor configuration or hardware incompatibility may result in image quality issues. Having an automated way to rate focus quality can enable the detection, troubleshooting and removal of such images.Deep Learning to the RescueIn “Assessing Microscope Image Focus Quality with Deep Learning”, we trained a deep neural network to rate the focus quality of microscopy images with higher accuracy than previous methods. We also integrated the pre-trained TensorFlow model with plugins in Fiji (ImageJ) and CellProfiler, two leading open source scientific image analysis tools that can be used with either a graphical user interface or invoked via scripts.A pre-trained TensorFlow model rates focus quality for a montage of microscope image patches of cells in Fiji (ImageJ). Hue and lightness of the borders denote predicted focus quality and prediction uncertainty, respectively.Our publication and source code (TensorFlow, Fiji, CellProfiler) illustrate the basics of a machine learning project workflow: assembling a training dataset (we synthetically defocused 384 in-focus images of cells, avoiding the need for a hand-labeled dataset), training a model using data augmentation, evaluating generalization (in our case, on unseen cell types acquired by an additional microscope) and deploying the pre-trained model. Previous tools for identifying image focus quality often require a user to manually review images for each dataset to determine a threshold between in and out-of-focus images; our pre-trained model requires no user set parameters to use, and can rate focus quality more accurately as well. To help improve interpretability, our model evaluates focus quality on 84×84 pixel patches which can be visualized with colored patch borders.What about Images without Objects?An interesting challenge we overcame was that there are often "blank" image patches with no objects, a scenario where no notion of focus quality exists. Instead of explicitly labeling these "blank" patches and teaching our model to recognize them as a separate category, we configured our model to predict a probability distribution across defocus levels, allowing it to learn to express uncertainty (dim borders in the figure) for these empty patches (e.g. predict equal probability in/out-of-focus).What's Next?Deep learning-based approaches for scientific image analysis will improve accuracy, reduce manual parameter tuning and may reveal new insights. Clearly, the sharing and availability of datasets and models, and implementation into tools that are proven to be useful within respective communities, will be important for widespread adoption.AcknowledgementsWe thank Claire McQuin, Allen Goodman, Anne Carpenter of the Broad Institute and Kevin Eliceiri of the University of Wisconsin at Madison for assistance with CellProfiler and Fiji integration, respectively. [...]

Using Evolutionary AutoML to Discover Neural Network Architectures


Posted by Esteban Real, Senior Software Engineer, Google Brain TeamThe brain has evolved over a long time, from very simple worm brains 500 million years ago to a diversity of modern structures today. The human brain, for example, can accomplish a wide variety of activities, many of them effortlessly — telling whether a visual scene contains animals or buildings feels trivial to us, for example. To perform activities like these, artificial neural networks require careful design by experts over years of difficult research, and typically address one specific task, such as to find what's in a photograph, to call a genetic variant, or to help diagnose a disease. Ideally, one would want to have an automated method to generate the right architecture for any given task.One approach to generate these architectures is through the use of evolutionary algorithms. Traditional research into neuro-evolution of topologies (e.g. Stanley and Miikkulainen 2002) has laid the foundations that allow us to apply these algorithms at scale today, and many groups are working on the subject, including OpenAI, Uber Labs, Sentient Labs and DeepMind. Of course, the Google Brain team has been thinking about AutoML too. In addition to learning-based approaches (eg. reinforcement learning), we wondered if we could use our computational resources to programmatically evolve image classifiers at unprecedented scale. Can we achieve solutions with minimal expert participation? How good can today's artificially-evolved neural networks be? We address these questions through two papers.In “Large-Scale Evolution of Image Classifiers,” presented at ICML 2017, we set up an evolutionary process with simple building blocks and trivial initial conditions. The idea was to "sit back" and let evolution at scale do the work of constructing the architecture. Starting from very simple networks, the process found classifiers comparable to hand-designed models at the time. This was encouraging because many applications may require little user participation. For example, some users may need a better model but may not have the time to become machine learning experts. A natural question to consider next was whether a combination of hand-design and evolution could do better than either approach alone. Thus, in our more recent paper, “Regularized Evolution for Image Classifier Architecture Search” (2018), we participated in the process by providing sophisticated building blocks and good initial conditions (discussed below). Moreover, we scaled up computation using Google's new TPUv2 chips. This combination of modern hardware, expert knowledge, and evolution worked together to produce state-of-the-art models on CIFAR-10 and ImageNet, two popular benchmarks for image classification.A Simple ApproachThe following is an example of an experiment from our first paper. In the figure below, each dot is a neural network trained on the CIFAR-10 dataset, which is commonly used to train image classifiers. Initially, the population consists of one thousand identical simple seed models (no hidden layers). Starting from simple seed models is important — if we had started from a high-quality model with initial conditions containing expert knowledge, it would have been easier to get a high-quality model in the end. Once seeded with the simple models, the process advances in steps. At each step, a pair of neural networks is chosen at random. The network with higher accuracy is selected as a parent and is copied and mutated to generate a child that is then added to the population, while the other neural network dies out. All other networks remain unchanged during the step. With the application of many such s[...]

Balanced Partitioning and Hierarchical Clustering at Scale


Posted by Hossein Bateni, Research Scientist and Kevin Aydin, Software Engineer, NYC Algorithms and Optimization Research TeamSolving large-scale optimization problems often starts with graph partitioning, which means partitioning the vertices of the graph into clusters to be processed on different machines. The need to make sure that clusters are of near equal size gives rise to the balanced graph partitioning problem. In simple terms, we need to partition the vertices of a given graph into k almost equal clusters, while we minimize the number of edges that are cut by the partition. This NP-hard problem is notoriously difficult in practice because the best approximation algorithms for small instances rely on semidefinite programming which is impractical for larger instances. This post presents the distributed algorithm we developed which is more applicable to large instances. We introduced this balanced graph-partitioning algorithm in our WSDM 2016 paper, and have applied this approach to several applications within Google. Our more recent NIPS 2017 paper provides more details of the algorithm via a theoretical and empirical study.Balanced Partitioning via Linear EmbeddingOur algorithm first embeds vertices of the graph onto a line, and then processes vertices in a distributed manner guided by the linear embedding order. We examine various ways to find the initial embedding, and apply four different techniques (such as local swaps and dynamic programming) to obtain the final partition. The best initial embedding is based on “affinity clustering”.Affinity Hierarchical ClusteringAffinity clustering is an agglomerative hierarchical graph clustering based on Borůvka’s classic Maximum-cost Spanning Tree algorithm. As discussed above, this algorithm is a critical part of our balanced partitioning tool. The algorithm starts by placing each vertex in a cluster of its own: v0, v1, and so on. Then, in each iteration, the highest-cost edge out of each cluster is selected in order to induce larger merged clusters: A0, A1, A2, etc. in the first round and B0, B1, etc. in the second round and so on. The set of merges naturally produces a hierarchical clustering, and gives rise to a linear ordering of the leaf vertices (vertices with degree one). The image below demonstrates this, with the numbers at the bottom corresponding to the ordering of the vertices.Our NIPS’17 paper explains how we run affinity clustering efficiently in the massively parallel computation (MPC) model, in particular using distributed hash tables (DHTs) to significantly reduce running time. This paper also presents a theoretical study of the algorithm. We report clustering results for graphs with tens of trillions of edges, and also observe that affinity clustering empirically beats other clustering algorithms such as k-means in terms of “quality of the clusters”. This video contains a summary of the result and explains how this parallel algorithm may produce higher-quality clusters even compared to a sequential single-linkage agglomerative algorithm.Comparison to Previous WorkIn comparing our algorithm to previous work in (distributed) balanced graph partitioning, we focus on FENNEL, Spinner, METIS, and a recent label propagation-based algorithm. We report results on several public social networks as well as a large private map graph. For a Twitter followership graph, we see a consistent improvement of 15–25% over previous results (Ugander and Backstrom, 2013), and for LiveJournal graph, our algorithm outperforms all the others for all cases except k = 2, where ours is slightly worse than FENNEL's.The following table presents the fraction of cu[...]

Behind the Motion Photos Technology in Pixel 2


Posted by Matthias Grundmann, Research Scientist and Jianing Wei, Software Engineer, Google Research One of the most compelling things about smartphones today is the ability to capture a moment on the fly. With motion photos, a new camera feature available on the Pixel 2 and Pixel 2 XL phones, you no longer have to choose between a photo and a video so every photo you take captures more of the moment. When you take a photo with motion enabled, your phone also records and trims up to 3 seconds of video. Using advanced stabilization built upon technology we pioneered in Motion Stills for Android, these pictures come to life in Google Photos. Let’s take a look behind the technology that makes this possible!Motion photos on the Pixel 2 in Google Photos. With the camera frozen in place the focus is put directly on the subjects. For more examples, check out this Google Photos album.Camera Motion Estimation by Combining Hardware and SoftwareThe image and video pair that is captured every time you hit the shutter button is a full resolution JPEG with an embedded 3 second video clip. On the Pixel 2, the video portion also contains motion metadata that is derived from the gyroscope and optical image stabilization (OIS) sensors to aid the trimming and stabilization of the motion photo. By combining software based visual tracking with the motion metadata from the hardware sensors, we built a new hybrid motion estimation for motion photos on the Pixel 2. Our approach aligns the background more precisely than the technique used in Motion Stills or the purely hardware sensor based approach. Based on Fused Video Stabilization technology, it reduces the artifacts from the visual analysis due to a complex scene with many depth layers or when a foreground object occupies a large portion of the field of view. It also improves the hardware sensor based approach by refining the motion estimation to be more accurate, especially at close distances.Motion photo as captured (left) and after freezing the camera by combining hardware and software For more comparisons, check out this Google Photos album.The purely software-based technique we introduced in Motion Stills uses the visual data from the video frames, detecting and tracking features over consecutive frames yielding motion vectors. It then classifies the motion vectors into foreground and background using motion models such as an affine transformation or a homography. However, this classification is not perfect and can be misled, e.g. by a complex scene or dominant foreground.Feature classification into background (green) and foreground (orange) by using the motion metadata from the hardware sensors of the Pixel 2. Notice how the new approach not only labels the skateboarder accurately as foreground but also the half-pipe that is at roughly the same depth.For motion photos on Pixel 2 we improved this classification by using the motion metadata derived from the gyroscope and the OIS. This accurately captures the camera motion with respect to the scene at infinity, which one can think of as the background in the distance. However, for pictures taken at closer range, parallax is introduced for scene elements at different depth layers, which is not accounted for by the gyroscope and OIS. Specifically, we mark motion vectors that deviate too much from the motion metadata as foreground. This results in a significantly more accurate classification of foreground and background, which also enables us to use a more complex motion model known as mixture homographies that can account for rolling shutter and undo the distortions it causes.Background motion estimation in [...]

Semantic Image Segmentation with DeepLab in TensorFlow


Posted by Liang-Chieh Chen and Yukun Zhu, Software Engineers, Google ResearchSemantic image segmentation, the task of assigning a semantic label, such as “road”, “sky”, “person”, “dog”, to every pixel in an image enables numerous new applications, such as the synthetic shallow depth-of-field effect shipped in the portrait mode of the Pixel 2 and Pixel 2 XL smartphones and mobile real-time video segmentation. Assigning these semantic labels requires pinpointing the outline of objects, and thus imposes much stricter localization accuracy requirements than other visual entity recognition tasks such as image-level classification or bounding box-level detection.Today, we are excited to announce the open source release of our latest and best performing semantic image segmentation model, DeepLab-v3+ [1]*, implemented in TensorFlow. This release includes DeepLab-v3+ models built on top of a powerful convolutional neural network (CNN) backbone architecture [2, 3] for the most accurate results, intended for server-side deployment. As part of this release, we are additionally sharing our TensorFlow model training and evaluation code, as well as models already pre-trained on the Pascal VOC 2012 and Cityscapes benchmark semantic segmentation tasks.Since the first incarnation of our DeepLab model [4] three years ago, improved CNN feature extractors, better object scale modeling, careful assimilation of contextual information, improved training procedures, and increasingly powerful hardware and software have led to improvements with DeepLab-v2 [5] and DeepLab-v3 [6]. With DeepLab-v3+, we extend DeepLab-v3 by adding a simple yet effective decoder module to refine the segmentation results especially along object boundaries. We further apply the depthwise separable convolution to both atrous spatial pyramid pooling [5, 6] and decoder modules, resulting in a faster and stronger encoder-decoder network for semantic segmentation.Modern semantic image segmentation systems built on top of convolutional neural networks (CNNs) have reached accuracy levels that were hard to imagine even five years ago, thanks to advances in methods, hardware, and datasets. We hope that publicly sharing our system with the community will make it easier for other groups in academia and industry to reproduce and further improve upon state-of-art systems, train models on new datasets, and envision new applications for this technology.AcknowledgementsWe would like to thank the support and valuable discussions with Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille (co-authors of DeepLab-v1 and -v2), as well as Mark Sandler, Andrew Howard, Menglong Zhu, Chen Sun, Derek Chow, Andre Araujo, Haozhi Qi, Jifeng Dai, and the Google Mobile Vision team. ReferencesEncoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, arXiv: 1802.02611, 2018.Xception: Deep Learning with Depthwise Separable Convolutions, François Chollet, Proc. of CVPR, 2017.Deformable Convolutional Networks — COCO Detection and Segmentation Challenge 2017 Entry, Haozhi Qi, Zheng Zhang, Bin Xiao, Han Hu, Bowen Cheng, Yichen Wei, and Jifeng Dai, ICCV COCO Challenge Workshop, 2017.Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs, Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille, Proc. of ICLR, 2015.Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, [...]

Introducing the iNaturalist 2018 Challenge


Posted by Yang Song, Staff Software Engineer and Serge Belongie, Visiting Faculty, Google Research Thanks to recent advances in deep learning, the visual recognition abilities of machines have improved dramatically, permitting the practical application of computer vision to tasks ranging from pedestrian detection for self-driving cars to expression recognition in virtual reality. One area that remains challenging for computers, however, is fine-grained and instance-level recognition. Earlier this month, we posted an instance-level landmark recognition challenge for identifying individual landmarks. Here we focus on fine-grained visual recognition, which is to distinguish species of animals and plants, car and motorcycle models, architectural styles, etc. For computers, discriminating fine-grained categories is challenging because many categories have relatively few training examples (i.e., the long tail problem), the examples that do exist often lack authoritative training labels, and there is variability in illumination, viewing angle and object occlusion.To help confront these hurdles, we are excited to announce the 2018 iNaturalist Challenge (iNat-2018), a species classification competition offered in partnership with iNaturalist and Visipedia (short for Visual Encyclopedia), a project for which Caltech and Cornell Tech received a Google Focused Research Award. This is a flagship challenge for the 5th International Workshop on Fine Grained Visual Categorization (FGVC5) at CVPR 2018. Building upon the first iNaturalist challenge, iNat-2017, iNat-2018 spans over 8000 categories of plants, animals, and fungi, with a total of more than 450,000 training images. We invite participants to enter the competition on Kaggle, with final submissions due in early June. Training data, annotations, and links to pretrained models can be found on our GitHub repo.iNaturalist has emerged as a world leader for citizen scientists to share observations of species and connect with nature since its founding in 2008. It hosts research-grade photos and annotations submitted by a thriving, engaged community of users. Consider the following photo from iNaturalist:The map on the right shows where the photo was taken. Image credit: Serge Belongie.You may notice that the photo on the left contains a turtle. But did you also know this is a Trachemys scripta, common name “Pond Slider?” If you knew the latter, you possess knowledge of fine-grained or subordinate categories.In contrast to other image classification datasets such as ImageNet, the dataset in the iNaturalist challenge exhibits a long-tailed distribution, with many species having relatively few images. It is important to enable machine learning models to handle categories in the long-tail, as the natural world is heavily imbalanced – some species are more abundant and easier to photograph than others. The iNaturalist challenge will encourage progress because the training distribution of iNat-2018 has an even longer tail than iNat-2017.Distribution of training images per species for iNat-2017 and iNat-2018, plotted on a log-linear scale, illustrating the long-tail behavior typical of fine-grained classification problems. Image Credit: Grant Van Horn and Oisin Mac Aodha.Along with iNat-2018, FGVC5 will also host the iMaterialist 2018 challenge (including a furniture categorization challenge and a fashion attributes challenge for product images) and a set of “FGVCx” challenges representing smaller scale – but still significant – challenges, featuring content such as food and modern art.FGV[...]

Open Sourcing the Hunt for Exoplanets


Posted by Chris Shallue, Senior Software Engineer, Google Brain Team(Crossposted on the Google Open Source Blog)Recently, we discovered two exoplanets by training a neural network to analyze data from NASA’s Kepler space telescope and accurately identify the most promising planet signals. And while this was only an initial analysis of ~700 stars, we consider this a successful proof-of-concept for using machine learning to discover exoplanets, and more generally another example of using machine learning to make meaningful gains in a variety of scientific disciplines (e.g. healthcare, quantum chemistry, and fusion research).Today, we’re excited to release our code for processing the Kepler data, training our neural network model, and making predictions about new candidate signals. We hope this release will prove a useful starting point for developing similar models for other NASA missions, like K2 (Kepler’s second mission) and the upcoming Transiting Exoplanet Survey Satellite mission. As well as announcing the release of our code, we’d also like take this opportunity to dig a bit deeper into how our model works. A Planet Hunting PrimerFirst, let’s consider how data collected by the Kepler telescope is used to detect the presence of a planet. The plot below is called a light curve, and it shows the brightness of the star (as measured by Kepler’s photometer) over time. When a planet passes in front of the star, it temporarily blocks some of the light, which causes the measured brightness to decrease and then increase again shortly thereafter, causing a “U-shaped” dip in the light curve.A light curve from the Kepler space telescope with a “U-shaped” dip that indicates a transiting exoplanet.However, other astronomical and instrumental phenomena can also cause the measured brightness of a star to decrease, including binary star systems, starspots, cosmic ray hits on Kepler’s photometer, and instrumental noise.The first light curve has a “V-shaped” pattern that tells us that a very large object (i.e. another star) passed in front of the star that Kepler was observing. The second light curve contains two places where the brightness decreases, which indicates a binary system with one bright and one dim star: the larger dip is caused by the dimmer star passing in front of the brighter star, and vice versa. The third light curve is one example of the many other non-planet signals where the measured brightness of a star appears to decrease.To search for planets in Kepler data, scientists use automated software (e.g. the Kepler data processing pipeline) to detect signals that might be caused by planets, and then manually follow up to decide whether each signal is a planet or a false positive. To avoid being overwhelmed with more signals than they can manage, the scientists apply a cutoff to the automated detections: those with signal-to-noise ratios above a fixed threshold are deemed worthy of follow-up analysis, while all detections below the threshold are discarded. Even with this cutoff, the number of detections is still formidable: to date, over 30,000 detected Kepler signals have been manually examined, and about 2,500 of those have been validated as actual planets!Perhaps you’re wondering: does the signal-to-noise cutoff cause some real planet signals to be missed? The answer is, yes! However, if astronomers need to manually follow up on every detection, it’s not really worthwhile to lower the threshold, because as the threshold decreases the rate of false positive detections increases ra[...]

The Building Blocks of Interpretability


Posted by Chris Olah, Research Scientist and Arvind Satyanarayan, Visiting Researcher, Google Brain Team(Crossposted on the Google Open Source Blog)In 2015, our early attempts to visualize how neural networks understand images led to psychedelic images. Soon after, we open sourced our code as DeepDream and it grew into a small art movement producing all sorts of amazing things. But we also continued the original line of research behind DeepDream, trying to address one of the most exciting questions in Deep Learning: how do neural networks do what they do? Last year in the online journal Distill, we demonstrated how those same techniques could show what individual neurons in a network do, rather than just what is “interesting to the network” as in DeepDream. This allowed us to see how neurons in the middle of the network are detectors for all sorts of things — buttons, patches of cloth, buildings — and see how those build up to be more and more sophisticated over the networks layers.Visualizations of neurons in GoogLeNet. Neurons in higher layers represent higher level ideas.While visualizing neurons is exciting, our work last year was missing something important: how do these neurons actually connect to what the network does in practice?Today, we’re excited to publish “The Building Blocks of Interpretability,” a new Distill article exploring how feature visualization can combine together with other interpretability techniques to understand aspects of how networks make decisions. We show that these combinations can allow us to sort of “stand in the middle of a neural network” and see some of the decisions being made at that point, and how they influence the final output. For example, we can see things like how a network detects a floppy ear, and then that increases the probability it gives to the image being a “Labrador retriever” or “beagle”.We explore techniques for understanding which neurons fire in the network. Normally, if we ask which neurons fire, we get something meaningless like “neuron 538 fired a little bit,” which isn’t very helpful even to experts. Our techniques make things more meaningful to humans by attaching visualizations to each neuron, so we can see things like “the floppy ear detector fired”. It’s almost a kind of MRI for neural networks.We can also zoom out and show how the entire image was “perceived” at different layers. This allows us to really see the transition from the network detecting very simple combinations of edges, to rich textures and 3d structure, to high-level structures like ears, snouts, heads and legs.These insights are exciting by themselves, but they become even more exciting when we can relate them to the final decision the network makes. So not only can we see that the network detected a floppy ear, but we can also see how that increases the probability of the image being a labrador retriever.In addition to our paper, we’re also releasing Lucid, a neural network visualization library building off our work on DeepDream. It allows you to make the sort of lucid feature visualizations we see above, in addition to more artistic DeepDream images.We’re also releasing colab notebooks. These notebooks make it extremely easy to use Lucid to reproduce visualizations in our article! Just open the notebook, click a button to run code — no setup required!In colab notebooks you can click a button to run code, and see the result below.This work only scratches the surface of the kind of interfaces that we thi[...]

A Preview of Bristlecone, Google’s New Quantum Processor


Posted by Julian Kelly, Research Scientist, Quantum AI LabThe goal of the Google Quantum AI lab is to build a quantum computer that can be used to solve real-world problems. Our strategy is to explore near-term applications using systems that are forward compatible to a large-scale universal error-corrected quantum computer. In order for a quantum processor to be able to run algorithms beyond the scope of classical simulations, it requires not only a large number of qubits. Crucially, the processor must also have low error rates on readout and logical operations, such as single and two-qubit gates. Today we presented Bristlecone, our new quantum processor, at the annual American Physical Society meeting in Los Angeles. The purpose of this gate-based superconducting system is to provide a testbed for research into system error rates and scalability of our qubit technology, as well as applications in quantum simulation, optimization, and machine learning.Bristlecone is Google’s newest quantum processor (left). On the right is a cartoon of the device: each “X” represents a qubit, with nearest neighbor connectivity.The guiding design principle for this device is to preserve the underlying physics of our previous 9-qubit linear array technology1, 2, which demonstrated low error rates for readout (1%), single-qubit gates (0.1%) and most importantly two-qubit gates (0.6%) as our best result. This device uses the same scheme for coupling, control, and readout, but is scaled to a square array of 72 qubits. We chose a device of this size to be able to demonstrate quantum supremacy in the future, investigate first and second order error-correction using the surface code, and to facilitate quantum algorithm development on actual hardware. 2D conceptual chart showing the relationship between error rate and number of qubits. The intended research direction of the Quantum AI Lab is shown in red, where we hope to access near-term applications on the road to building an error corrected quantum computer. Before investigating specific applications, it is important to quantify a quantum processor’s capabilities. Our theory team has developed a benchmarking tool for exactly this task. We can assign a single system error by applying random quantum circuits to the device and checking the sampled output distribution against a classical simulation. If a quantum processor can be operated with low enough error, it would be able to outperform a classical supercomputer on a well-defined computer science problem, an achievement known as quantum supremacy. These random circuits must be large in both number of qubits as well as computational length (depth). Although no one has achieved this goal yet, we calculate quantum supremacy can be comfortably demonstrated with 49 qubits, a circuit depth exceeding 40, and a two-qubit error below 0.5%. We believe the experimental demonstration of a quantum processor outperforming a supercomputer would be a watershed moment for our field, and remains one of our key objectives.A Bristlecone chip being installed by Research Scientist Marissa Giustina at the Quantum AI Lab in Santa BarbaraWe are looking to achieve similar performance to the best error rates of the 9-qubit device, but now across all 72 qubits of Bristlecone. We believe Bristlecone would then be a compelling proof-of-principle for building larger scale quantum computers. Operating a device such as Bristlecone at low system error requires harmony between a full stack of technology r[...]

Making Healthcare Data Work Better with Machine Learning


Posted by Patrik Sundberg, Software Engineer and Eyal Oren, Product Manager, Google Brain TeamOver the past 10 years, healthcare data has moved from being largely on paper to being almost completely digitized in electronic health records. But making sense of this data involves a few key challenges. First, there is no common data representation across vendors; each uses a different way to structure their data. Second, even sites that use the same vendor may differ significantly, for example, they typically use different codes for the same medication. Third, data can be spread over many tables, some containing encounters, some containing lab results, and yet others containing vital signs. The Fast Healthcare Interoperability Resources (FHIR) standard addresses most of these challenges: it has a solid yet extensible data-model, is built on established Web standards, and is rapidly becoming the de-facto standard for both individual records and bulk-data access. But to enable large-scale machine learning, we needed a few additions: implementations in various programming languages, an efficient way to serialize large amounts of data to disk, and a representation that allows analyses of large datasets. Today, we are happy to open source a protocol buffer implementation of the FHIR standard, which addresses these issues. The current version supports Java, and support for C++, Go, and Python will follow soon. Support for profiles will follow shortly as well, plus tools to help convert legacy data into FHIR.FHIR as the core data modelOver the past few years, as we’ve been partnering with academic medical centers to apply machine learning to de-identified medical records, it became clear that we needed to address the complexity of healthcare data head-on. Indeed, for machine learning to be effective on medical data, we need a holistic view of what happened to each patient over time. And as a bonus, we want a data representation that is directly applicable in a clinical setting. While the FHIR standard addresses most of our needs, making healthcare data substantially easier to manage than “legacy” data structures and enabling large-scale machine-learning independent of vendors, we believe the introduction of protocol buffers can help both application developers and (machine-learning) researchers use FHIR.Current release of protocol buffersWe’ve taken care to make our protocol buffer representation suitable for both programmatic access and database queries. One of the provided examples shows how to upload FHIR data into Google Cloud BigQuery and have it available for querying, and we are adding other examples that upload directly from bulk data export. Our protocol buffers adhere to the FHIR standard (they are in fact auto-generated from it) but make for more elegant queries.The current release does not yet include support for training TensorFlow models, but keep an eye out for future updates. We aim to open-source as much as possible of our recent work, to help make our research more reproducible and applicable to real-world scenarios. Furthermore, we are working closely with our colleagues in Google Cloud on more tools for managing healthcare data at scale.Acknowledgements We enjoyed great discussions and helpful feedback from the FHIR community, including Grahame Grieve, Ewout Kramer, Josh Mandel and others. Thanks to our colleagues at DeepMind, the Google Brain team and our academic collaborators. [...]

Mobile Real-time Video Segmentation


Valentin Bazarevsky and Andrei Tkachenka, Software Engineers, Google Research Video segmentation is a widely used technique that enables movie directors and video content creators to separate the foreground of a scene from the background, and treat them as two different visual layers. By modifying or replacing the background, creators can convey a particular mood, transport themselves to a fun location or enhance the impact of the message. However, this operation has traditionally been performed as a time-consuming manual process (e.g. an artist rotoscoping every frame) or requires a studio environment with a green screen for real-time background removal (a technique referred to as chroma keying). In order to enable users to create this effect live in the viewfinder, we designed a new technique that is suitable for mobile phones.Today, we are excited to bring precise, real-time, on-device mobile video segmentation to the YouTube app by integrating this technology into stories. Currently in limited beta, stories is YouTube’s new lightweight video format, designed specifically for YouTube creators. Our new segmentation technology allows creators to replace and modify the background, effortlessly increasing videos’ production value without specialized equipment.Neural network video segmentation in YouTube stories.To achieve this, we leverage machine learning to solve a semantic segmentation task using convolutional neural networks. In particular, we designed a network architecture and training procedure suitable for mobile phones focusing on the following requirements and constraints:A mobile solution should be lightweight and run at least 10-30 times faster than existing state-of-the-art photo segmentation models. For real time inference, such a model needs to provide results at 30 frames per second.A video model should leverage temporal redundancy (neighboring frames look similar) and exhibit temporal consistency (neighboring results should be similar)High quality segmentation results require high quality annotations.The DatasetTo provide high quality data for our machine learning pipeline, we annotated tens of thousands of images that captured a wide spectrum of foreground poses and background settings. Annotations consisted of pixel-accurate locations of foreground elements such as hair, glasses, neck, skin, lips, etc. and a general background label achieving a cross-validation result of 98% Intersection-Over-Union (IOU) of human annotator quality.An example image from our dataset carefully annotated with nine labels - foreground elements are overlaid over the image.Network InputOur specific segmentation task is to compute a binary mask separating foreground from background for every input frame (three channels, RGB) of the video. Achieving temporal consistency of the computed masks across frames is key. Current methods that utilize LSTMs or GRUs to realize this are too computationally expensive for real-time applications on mobile phones. Instead we first pass the computed mask from the previous frame as a prior by concatenating it as a fourth channel to the current RGB input frame to achieve temporal consistency, as shown below:The original frame (left) is separated in its three color channels and concatenated with the previous mask (middle). This is used as input to our neural network to predict the mask for the current frame (right).Training ProcedureIn video segmentation we need to achieve f[...]

Google-Landmarks: A New Dataset and Challenge for Landmark Recognition


Posted by André Araujo and Tobias Weyand, Software Engineers, Google ResearchImage classification technology has shown remarkable improvement over the past few years, exemplified in part by the Imagenet classification challenge, where error rates continue to drop substantially every year. In order to continue advancing the state of the art in computer vision, many researchers are now putting more focus on fine-grained and instance-level recognition problems – instead of recognizing general entities such as buildings, mountains and (of course) cats, many are designing machine learning algorithms capable of identifying the Eiffel Tower, Mount Fuji or Persian cats. However, a significant obstacle for research in this area has been the lack of large annotated datasets.Today, we are excited to advance instance-level recognition by releasing Google-Landmarks, the largest worldwide dataset for recognition of human-made and natural landmarks. Google-Landmarks is being released as part of the Landmark Recognition and Landmark Retrieval Kaggle challenges, which will be the focus of the CVPR’18 Landmarks workshop. The dataset contains more than 2 million images depicting 30 thousand unique landmarks from across the world (their geographic distribution is presented below), a number of classes that is ~30x larger than what is available in commonly used datasets. Additionally, to spur research in this field, we are open-sourcing Deep Local Features (DELF), an attentive local feature descriptor that we believe is especially suited for this kind of task.Geographic distribution of landmarks in our dataset.Landmark recognition presents some noteworthy differences from other problems. For example, even within a large annotated dataset, there might not be much training data available for some of the less popular landmarks. Additionally, since landmarks are generally rigid objects which do not move, the intra-class variation is very small (in other words, a landmark’s appearance does not change that much across different images of it). As a result, variations only arise due to image capture conditions, such as occlusions, different viewpoints, weather and illumination, making this distinct from other image recognition datasets where images of a particular class (such as a dog) can vary much more. These characteristics are also shared with other instance-level recognition problems, such as artwork recognition — so we hope the new dataset can benefit research for other image recognition problems as well.The two Kaggle challenges provide access to annotated data to help researchers address these problems. The recognition track challenge is to build models that recognize the correct landmark in a dataset of challenging test images, while the retrieval track challenges participants to retrieve images containing the same landmark.A few examples of images from the Google-Landmarks dataset, including landmarks such as Big Ben, Sacre Coeur Basilica, the rock sculpture of Decebalus and the Megyeri Bridge, among others.If you plan to be at CVPR this year, we hope you’ll attend the CVPR’18 Landmarks workshop. However, everyone is able to participate in the challenge, and access to the new dataset is available via the Kaggle website. We hope this resource is valuable to your research and we can’t wait to see the ideas you will come up with for recognizing landmarks!Acknowledgments Jack Sim, Will[...]