Subscribe: Google Research Blog
Added By: Feedage Forager Feedage Grade A rated
Language: English
data  deep learning  google  images  learning  machine learning  machine  model  models  neural  research  training  vision 
Rate this Feed
Rate this feedRate this feedRate this feedRate this feedRate this feed
Rate this feed 1 starRate this feed 2 starRate this feed 3 starRate this feed 4 starRate this feed 5 star

Comments (0)

Feed Details and Statistics Feed Statistics
Preview: Google Research Blog

Google Research Blog

The latest news on Google Research.

Updated: 2017-08-17T11:00:09.435-07:00


Making Visible Watermarks More Effective


Posted by Tali Dekel and Michael Rubinstein, Research ScientistsWhether you are a photographer, a marketing manager, or a regular Internet user, chances are you have encountered visible watermarks many times. Visible watermarks are those logos and patterns that are often overlaid on digital images provided by stock photography websites, marking the image owners while allowing viewers to perceive the underlying content so that they could license the images that fit their needs. It is the most common mechanism for protecting the copyrights of hundreds of millions of photographs and stock images that are offered online daily.It’s standard practice to use watermarks on the assumption that they prevent consumers from accessing the clean images, ensuring there will be no unauthorized or unlicensed use. However, in “On The Effectiveness Of Visible Watermarks” recently presented at the 2017 Computer Vision and Pattern Recognition Conference (CVPR 2017), we show that a computer algorithm can get past this protection and remove watermarks automatically, giving users unobstructed access to the clean images the watermarks are intended to protect. Left: example watermarked images from popular stock photography websites. Right: watermark-free version of the images on the left, produced automatically by a computer algorithm. More results are available below and on our project page. Image sources: Adobe Stock, 123RF.As often done with vulnerabilities discovered in operating systems, applications or protocols, we want to disclose this vulnerability and propose solutions in order to help the photography and stock image communities adapt and better protect its copyrighted content and creations. From our experiments much of the world’s stock imagery is currently susceptible to this circumvention. As such, in our paper we also propose ways to make visible watermarks more robust to such manipulations. allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="" frameborder="0" height="360" src="" width="640">The Vulnerability of Visible WatermarksVisible watermarks are often designed to contain complex structures such as thin lines and shadows in order to make them harder to remove. Indeed, given a single image, for a computer to detect automatically which visual structures belong to the watermark and which structures belong to the underlying image is extremely difficult. Manually, the task of removing a watermark from an image is tedious, and even with state-of-the-art editing tools it may take a Photoshop expert several minutes to remove a watermark from one image.However, a fact that has been overlooked so far is that watermarks are typically added in a consistent manner to many images. We show that this consistency can be used to invert the watermarking process — that is, estimate the watermark image and its opacity, and recover the original, watermark-free image underneath. This can be all be done automatically, without any user intervention, and by only observing watermarked image collections publicly available online.The consistency of a watermark over many images allows to automatically remove it in mass scale. Left: input collection marked by the same watermark, middle: computed watermark and its opacity, right: recovered, watermark-free images. Image sources: COCO dataset, Copyright logo.The first step of this process is identifying which image structures are repeating in the collection. If a similar watermark is embedded in many images, the watermark becomes the signal in the collection and the images become the noise, and simple image operations can be used to pull out a rough estimation of the watermark pattern.Watermark extraction with increasing number of images. Left: watermarked input images, Middle: median intensities over the input images (up to the input image shown), Right: the corresponding estimated (matted) watermark. All images licensed from 123RF.This provides a rough (noisy) estimate [...]

Harness the Power of Machine Learning in Your Browser with Deeplearn


Posted by Nikhil Thorat and Daniel Smilkov, Software Engineers, Google Big Picture TeamMachine learning (ML) has become an increasingly powerful tool, one that can be applied to a wide variety of areas spanning object recognition, language translation, health and more. However, the development of ML systems is often restricted to those with computational resources and the technical expertise to work with commonly available ML libraries. With PAIR — an initiative to study and redesign human interactions with ML — we want to open machine learning up to as many people as possible. In pursuit of that goal, we are excited to announce deeplearn 0.1.0, an open source WebGL-accelerated JavaScript library for machine learning that runs entirely in your browser, with no installations and no backend. There are many reasons to bring machine learning into the browser. A client-side ML library can be a platform for interactive explanations, for rapid prototyping and visualization, and even for offline computation. And if nothing else, the browser is one of the world's most popular programming platforms.While web machine learning libraries have existed for years (e.g., Andrej Karpathy's convnetjs) they have been limited by the speed of Javascript, or have been restricted to inference rather than training (e.g., TensorFire). By contrast, deeplearn offers a significant speedup by exploiting WebGL to perform computations on the GPU, along with the ability to do full backpropagation.The API mimics the structure of TensorFlow and NumPy, with a delayed execution model for training (like TensorFlow), and an immediate execution model for inference (like NumPy). We have also implemented versions of some of the most commonly-used TensorFlow operations. With the release of deeplearn, we will be providing tools to export weights from TensorFlow checkpoints, which will allow authors to import them into web pages for deeplearn inference.You can explore the potential of this library by training a convolutional neural network to recognize photos and handwritten digits — all in your browser without writing a single line of code.We're releasing a series of demos that show deeplearn in action. Play with an image classifier that uses your webcam in real-time and watch the network’s internal representations of what it sees. Or generate abstract art videos at a smooth 60 frames per second. The deeplearn homepage contains these and other demos.Our vision is that this library will significantly increase visibility and engagement with machine learning, giving developers access to powerful tools while simultaneously providing the everyday user with a way to interact with them. We’re looking forward to collaborating with the open source community to drive this vision forward. [...]

Google at ICML 2017


Posted by Christian Howard, Editor-in-Chief, Research CommunicationsMachine learning (ML) is a key strategic focus at Google, with highly active groups pursuing research in virtually all aspects of the field, including deep learning and more classical algorithms, exploring theory as well as application. We utilize scalable tools and architectures to build machine learning systems that enable us to solve deep scientific and engineering challenges in areas of language, speech, translation, music, visual processing and more.As a leader in ML research, Google is proud to be a Platinum Sponsor of the thirty-fourth International Conference on Machine Learning (ICML 2017), a premier annual event supported by the International Machine Learning Society taking place this week in Sydney, Australia. With over 130 Googlers attending the conference to present publications and host workshops, we look forward to our continued colalboration with the larger ML research community.If you're attending ICML 2017, we hope you'll visit the Google booth and talk with our researchers to learn more about the exciting work, creativity and fun that goes into solving some of the field's most interesting challenges. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind Facets, neural audio synthesis with Nsynth, a Q&A session on the Google Brain Residency program and much more. You can also learn more about our research being presented at ICML 2017 in the list below (Googlers highlighted in blue).ICML 2017 CommitteesSenior Program Committee includes: Alex Kulesza, Amr Ahmed, Andrew Dai, Corinna Cortes, George Dahl, Hugo Larochelle, Matthew Hoffman, Maya Gupta, Moritz Hardt, Quoc LeSponsorship Co-Chair: Ryan AdamsPublicationsRobust Adversarial Reinforcement LearningLerrel Pinto, James Davidson, Rahul Sukthankar, Abhinav GuptaTight Bounds for Approximate Carathéodory and BeyondVahab Mirrokni, Renato Leme, Adrian Vladu, Sam WongSharp Minima Can Generalize For Deep NetsLaurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua BengioGeometry of Neural Network Loss Surfaces via Random Matrix TheoryJeffrey Pennington, Yasaman BahriConditional Image Synthesis with Auxiliary Classifier GANsAugustus Odena, Christopher Olah, Jon ShlensLearning Deep Latent Gaussian Models with Markov Chain Monte CarloMatthew D. HoffmanOn the Expressive Power of Deep Neural NetworksMaithra Raghu, Ben Poole, Surya Ganguli, Jon Kleinberg, Jascha Sohl-DicksteinAdaNet: Adaptive Structural Learning of Artificial Neural NetworksCorinna Cortes, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, Scott YangLearned Optimizers that Scale and GeneralizeOlga Wichrowska, Niru Maheswaranathan, Matthew Hoffman, Sergio Gomez, Misha Denil, Nando de Freitas, Jascha Sohl-DicksteinAdaptive Feature Selection: Computationally Efficient Online Sparse Linear Regression under RIPSatyen Kale, Zohar Karnin, Tengyuan Liang, David PalAlgorithms for ℓp Low-Rank ApproximationFlavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Panigrahy, David WoodruffConsistent k-ClusteringSilvio Lattanzi, Sergei VassilvitskiiInput Switched Affine Networks: An RNN Architecture Designed for InterpretabilityJakob Foerster, Justin Gilmer, Jan Chorowski, Jascha Sohl-Dickstein, David SussilloOnline and Linear-Time Attention by Enforcing Monotonic AlignmentsColin Raffel, Thang Luong, Peter Liu, Ron Weiss, Douglas EckGradient Boosted Decision Trees for High Dimensional Sparse OutputSi Si, Huan Zhang, Sathiya Keerthi, Dhruv Mahajan, Inderjit Dhillon, Cho-Jui HsiehSequence Tutor: Conservative fine-tuning of sequence generation models with KL-controlNatasha Jaques, Shixiang Gu, Dzmitry Bahdanau, Jose Hernandez-Lobato, Richard E Turner, Douglas EckUniform Convergence Rates for Kernel Density EstimationHeinrich JiangDensity Level Set Estimation on[...]

Google at ACL 2017


Posted by Christian Howard, Editor-in-Chief, Research CommunicationsThis week, Vancouver, Canada hosts the 2017 Annual Meeting of the Association for Computational Linguistics (ACL 2017), the premier conference in the field of natural language understanding, covering a broad spectrum of diverse research areas that are concerned with computational approaches to natural language. As a leader in natural language processing & understanding and a Platinum sponsor of ACL 2017, Google will be on hand to showcase research interests that include syntax, semantics, discourse, conversation, multilingual modeling, sentiment analysis, question answering, summarization, and generally building better systems using labeled and unlabeled data, state-of-the-art modeling and learning from indirect supervision. If you’re attending ACL 2017, we hope that you’ll stop by the Google booth to check out some demos, meet our researchers and discuss projects and opportunities at Google that go into solving interesting problems for billions of people. Learn more about the Google research being presented at ACL 2017 below (Googlers highlighted in blue).Organizing CommitteeArea Chairs include: Sujith Ravi (Machine Learning), Thang Luong (Machine Translation) Publication Chairs include: Margaret Mitchell (Advisory)Accepted PapersA Polynomial-Time Dynamic Programming Algorithm for Phrase-Based Decoding with a Fixed Distortion Limit Yin-Wen Chang, Michael Collins Cross-Sentence N-ary Relation Extraction with Graph LSTMs Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-Tau YihNeural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak SupervisionChen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, Ni LaoCoarse-to-Fine Question Answering for Long DocumentsEunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, Jonathan BerantAutomatic Compositor Attribution in the First Folio of ShakespeareMaria Ryskina, Hannah Alpert-Abrams, Dan Garrette, Taylor Berg-KirkpatrickA Nested Attention Neural Hybrid Model for Grammatical Error CorrectionJianshu Ji, Qinlong Wang, Kristina Toutanova, Yongen Gong, Steven Truong, Jianfeng GaoGet To The Point: Summarization with Pointer-Generator NetworksAbigail See, Peter J. Liu, Christopher D. ManningIdentifying 1950s American Jazz Composers: Fine-Grained IsA Extraction via Modifier CompositionEllie Pavlick*, Marius PascaLearning to Skim TextAdams Wei Yu, Hongrae Lee, Quoc LeWorkshops2017 ACL Student Research WorkshopProgram Committee includes: Emily Pitler, Brian Roark, Richard SproatWiNLP: Women and Underrepresented Minorities in Natural Language ProcessingOrganizers include: Margaret MitchellGold SponsorBUCC: 10th Workshop on Building and Using Comparable CorporaScientific Committee includes: Richard SproatCLPsych: Computational Linguistics and Clinical Psychology – From Linguistic Signal to ClinicalRealityProgram Committee includes: Brian Roark, Richard SproatRepl4NLP: 2nd Workshop on Representation Learning for NLPProgram Committee includes: Ankur Parikh, John PlattRoboNLP: Language Grounding for RoboticsProgram Committee includes: Ankur Parikh, Tom KwiatkowskiCoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal DependenciesManagement Group includes: Slav PetrovCoNLL-SIGMORPHON-2017 Shared Task: Universal Morphological ReinflectionOrganizing Committee includes: Manaal FaruquiInvited Speaker: Chris DyerSemEval: 11th International Workshop on Semantic EvaluationOrganizers include: Daniel CerALW1: 1st Workshop on Abusive Language OnlinePanelists include: Margaret MitchellEventStory: Events and Stories in the NewsProgram Committee includes: Silvia ParetiNMT: 1st Workshop on Neural Machine TranslationOrganizing Committee includes: Thang LuongProgram Committee includes: Hieu Pham, Taro WatanabeInvited Speaker: Quoc LeTutorialsNatural Language Processing for Precision Medicine Hoifung Poon, Chris Quirk, Kristina Toutanova, Wen-tau YihDeep Learning for Dialogue Sy[...]

Expressions in Virtual Reality


Posted by Avneesh Sud, Software Engineer, Machine PerceptionRecently Google Machine Perception researchers, in collaboration with Daydream Labs and YouTube Spaces, presented a solution for virtual headset ‘removal’ for mixed reality in order to create a more rich and engaging VR experience. While that work could infer eye-gaze directions and blinks, enabled by a headset modified with eye-tracking technology, a richer set of facial expressions — which are key to understanding a person's experience in VR, as well as conveying important social engagement cues — were missing.Today we present an approach to infer select facial action units and expressions entirely by analyzing a small part of the face while the user is engaged in a virtual reality experience. Specifically, we show that images of the user’s eyes captured from an infrared (IR) gaze-tracking camera within a VR headset are sufficient to infer at least a subset of facial expressions without the use of any external cameras or additional sensors.Left: A user wearing a VR HMD modified with eye-tracking used for expression classification (Note that no external camera is used in our method; this is just for visualization). Right: inferred expression from eye images using our model. A video demonstrating the work can be seen here.We use supervised deep learning to classify facial expressions from images of the eyes and surrounding areas, which typically contain the iris, sclera, eyelids and may include parts of the eyebrows and top of cheeks. Obtaining large scale annotated data from such novel sensors is a challenging task, hence we collected training data by capturing 46 subjects while performing a set of facial expressions.To perform expression classification, we fine-tuned a variant of the widespread Inception architecture with TensorFlow using weights from a model trained to convergence on Imagenet. We attempted to partially remove variance due to differences in participant appearance (i.e., individual differences that do not depend on expression), inspired by the standard practice of mean image subtraction. Since this variance removal occurs within-subject, it is effectively personalization. Further details, along with examples of eye-images, and results are presented in our accompanying paper. Results and ExtensionsWe demonstrate that the information required to classify a variety of facial expressions is reliably present in IR eye images captured by a commercial HMD sensor, and that this information can be decoded using a CNN-based method, even though classifying facial expressions from eye-images alone is non-trivial even for humans. Our model inference can be performed in real-time, and we show this can be used to generate expressive avatars in real-time, which can function as an expressive surrogate for users engaged in VR. This interaction mechanism also yields a more intuitive interface for sharing expression in VR as opposed to gestures or keyboard inputs.The ability to capture a user’s facial expressions using existing eye-tracking cameras enables a fully mobile solution to facial performance capture in VR, without additional external cameras. This technology extends beyond animating cartoon avatars; it could be used to provide a richer headset removal experience, enhancing communication and social interaction in VR by transmitting far more authentic and emotionally textured information. AcknowledgementsThe research described in this post was performed by Steven Hickson (as an intern), Nick Dufour, Avneesh Sud, Vivek Kwatra and Irfan Essa. We also thank Hayes Raffle and Alex Wong from Daydream, and Chris Bregler, Sergey Ioffe and authors of TF-Slim from Google Research for their guidance and suggestions. This technology, along with headset removal, will be demonstrated at Siggraph 2017 Emerging Technologies. [...]

So there I was, firing a megawatt plasma collider at work...


Posted by Ted Baltz, Senior Staff Software Engineer, Google Accelerated Science TeamWait, what? Why is Google interested in plasma physics?Google is always interested in solving complex engineering problems, and few are more complex than fusion. Physicists have been trying since the 1950s to control the fusion of hydrogen atoms into helium, which is the same process that powers the Sun. The key to harnessing this power is to confine hydrogen plasmas for long enough to get more energy out from fusion reactions than was put in. This point is called “breakeven.” If it works, it would represent a technological breakthrough, and could provide an abundant source of zero-carbon energy.There are currently several large academic and government research efforts in fusion. Just to rattle off a few, in plasma fusion there are tokamak machines like ITER and stellarator machines like Wendelstein 7-X. The stellarator design actually goes back to 1951, so physicists have been working on this for a while. Oh yeah, and if you like giant lasers, there’s the National Ignition Facility which users lasers to generate X-rays to generate fusion reactions. So far, none of these has gotten to the economic breakeven point.All of these efforts involve complex experiments with many variables, providing an opportunity for Google to help, with our strength in computing and machine learning. Today, we’re publishing “Achievement of Sustained Net Plasma Heating in a Fusion Experiment with the Optometrist Algorithm” in Scientific Reports. This paper describes the first results of Google’s collaboration with the physicists and engineers at Tri Alpha Energy, taking a step towards the breakeven goal.Did you really just say that you got to fire a plasma collider?Yeah. Tri Alpha Energy has a unique scheme for plasma confinement called a field-reversed configuration that’s predicted to get more stable as the energy goes up, in contrast to other methods where plasmas get harder to control as you heat them. Tri Alpha built a giant ionized plasma machine, C-2U, that fills an entire warehouse in an otherwise unassuming office park. The plasma that this machine generates and confines exhibits all kinds of highly nonlinear behavior. The machine itself pushes the envelope of how much electrical power can be applied to generate and confine the plasma in such a small space over such a short time. It’s a complex machine with more than 1000 knobs and switches, an investment (not ours!) in exploring clean energy north of $100 million. This is a high-stakes optimization problem, dealing with both plasma performance and equipment constraints. This is where Google comes in.End-on view of C-2UWait, why not just simulate what will happen? Isn’t this simple physics?The “simple” simulations using magnetohydrodynamics don’t really apply. Even if these machines operated in that limit, which they very much don’t, the simulations make fluid dynamics simulations look easy! The reality is much more complicated, as the ion temperature is three times larger than the electron temperature, so the plasma is far out of thermal equilibrium, also, the fluid approximation is totally invalid, so you have to track at least some of the trillion+ individual particles, so the whole thing is beyond what we know how to do even with Google-scale compute resources.So why are we doing this? Real experiments! With atoms not bits! At Google we love to run experiments and optimize things. We thought it would be a great challenge to see if we could help Tri Alpha. They run a plasma “shot” on the C-2U machine every 8 minutes. Each shot consists of creating two spinning blobs of plasma in the vacuum sealed innards of C-2U, smashing them together at over 600,000 miles per hour, creating a bigger, hotter, spinning football of plasma. Then they blast it continuously with particle beams (actually neutral hydrogen atoms) to keep it spinning. They[...]

Teaching Robots to Understand Semantic Concepts


Posted by Sergey Levine, Faculty Advisor and Pierre Sermanet, Research Scientist, Google Brain TeamMachine learning can allow robots to acquire complex skills, such as grasping and opening doors. However, learning these skills requires us to manually program reward functions that the robots then attempt to optimize. In contrast, people can understand the goal of a task just from watching someone else do it, or simply by being told what the goal is. We can do this because we draw on our own prior knowledge about the world: when we see someone cut an apple, we understand that the goal is to produce two slices, regardless of what type of apple it is, or what kind of tool is used to cut it. Similarly, if we are told to pick up the apple, we understand which object we are to grab because we can ground the word “apple” in the environment: we know what it means. These are semantic concepts: salient events like producing two slices, and object categories denoted by words such as “apple.” Can we teach robots to understand semantic concepts, to get them to follow simple commands specified through categorical labels or user-provided examples? In this post, we discuss some of our recent work on robotic learning that combines experience that is autonomously gathered by the robot, which is plentiful but lacks human-provided labels, with human-labeled data that allows a robot to understand semantics. We will describe how robots can use their experience to understand the salient events in a human-provided demonstration, mimic human movements despite the differences between human robot bodies, and understand semantic categories, like “toy” and “pen”, to pick up objects based on user commands.Understanding human demonstrations with deep visual featuresIn the first set of experiments, which appear in our paper Unsupervised Perceptual Rewards for Imitation Learning, our is aim is to enable a robot to understand a task, such as opening a door, from seeing only a small number of unlabeled human demonstrations. By analyzing these demonstrations, the robot must understand what is the semantically salient event that constitutes task success, and then use reinforcement learning to perform it.Examples of human demonstrations (left) and the corresponding robotic imitation (right).Unsupervised learning on very small datasets is one of the most challenging scenarios in machine learning. To make this feasible, we use deep visual features from a large network trained for image recognition on ImageNet. Such features are known to be sensitive to semantic concepts, while maintaining invariance to nuisance variables such as appearance and lighting. We use these features to interpret user-provided demonstrations, and show that it is indeed possible to learn reward functions in an unsupervised fashion from a few demonstrations and without retraining.Example of reward functions learned solely from observation for the door opening tasks. Rewards progressively increase from zero to the maximum reward as a task is completed.After learning a reward function from observation only, we use it to guide a robot to learn a door opening task, using only the images to evaluate the reward function. With the help of an initial kinesthetic demonstration that succeeds about 10% of the time, the robot learns to improve to 100% accuracy using the learned reward function.Learning progression.Emulating human movements with self-supervision and imitation.In Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation, we propose a novel approach to learn about the world from observation and demonstrate it through self-supervised pose imitation. Our approach relies primarily on co-occurrence in time and space for supervision: by training to distinguish frames from different times of a video, it learns to disentangle and organize reality into useful abstract representations. In a pose i[...]

Google at CVPR 2017


Posted by Christian Howard, Editor-in-Chief, Research CommunicationsFrom July 21-26, Honolulu, Hawaii hosts the 2017 Conference on Computer Vision and Pattern Recognition (CVPR 2017), the premier annual computer vision event comprising the main conference and several co-located workshops and tutorials. As a leader in computer vision research and a Platinum Sponsor, Google will have a strong presence at CVPR 2017 — over 250 Googlers will be in attendance to present papers and invited talks at the conference, and to organize and participate in multiple workshops.If you are attending CVPR this year, please stop by our booth and chat with our researchers who are actively pursuing the next generation of intelligent systems that utilize the latest machine learning techniques applied to various areas of machine perception. Our researchers will also be available to talk about and demo several recent efforts, including the technology behind Headset Removal for Virtual and Mixed Reality, Image Compression with Neural Networks, Jump, TensorFlow Object Detection API and much more.You can learn more about our research being presented at CVPR 2017 in the list below (Googlers highlighted in blue).Organizing CommitteeCorporate Relations Chair - Mei HanArea Chairs include - Alexander Toshev, Ce Liu, Vittorio Ferrari, David LowePapersTraining object class detectors with click supervisionDim Papadopoulos, Jasper Uijlings, Frank Keller, Vittorio FerrariUnsupervised Pixel-Level Domain Adaptation With Generative Adversarial NetworksKonstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, Dilip KrishnanBranchOut: Regularization for Online Ensemble Tracking With Convolutional Neural Networks Bohyung Han, Jack Sim, Hartwig Adam Enhancing Video Summarization via Vision-Language EmbeddingBryan A. Plummer, Matthew Brown, Svetlana LazebnikLearning by Association — A Versatile Semi-Supervised Training Method for Neural Networks Philip Haeusser, Alexander Mordvintsev, Daniel CremersContext-Aware Captions From Context-Agnostic SupervisionRamakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, Gal Chechik Spatially Adaptive Computation Time for Residual NetworksMichael Figurnov, Maxwell D. Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, Ruslan SalakhutdinovXception: Deep Learning With Depthwise Separable ConvolutionsFrançois Chollet Deep Metric Learning via Facility LocationHyun Oh Song, Stefanie Jegelka, Vivek Rathod, Kevin MurphySpeed/Accuracy Trade-Offs for Modern Convolutional Object DetectorsJonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin MurphySynthesizing Normalized Faces From Facial Identity FeaturesForrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, William T. FreemanTowards Accurate Multi-Person Pose Estimation in the WildGeorge Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin MurphyGuessWhat?! Visual Object Discovery Through Multi-Modal DialogueHarm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron CourvilleLearning discriminative and transformation covariant local feature detectorsXu Zhang, Felix X. Yu, Svebor Karaman, Shih-Fu ChangFull Resolution Image Compression With Recurrent Neural NetworksGeorge Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, Michele CovellLearning From Noisy Large-Scale Datasets With Minimal SupervisionAndreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, Serge BelongieUnsupervised Learning of Depth and Ego-Motion From VideoTinghui Zhou, Matthew Brown,&nbs[...]

An Update to Open Images - Now with Bounding-Boxes


Posted by Vittorio Ferrari, Research Scientist, Machine PerceptionLast year we introduced Open Images, a collaborative release of ~9 million images annotated with labels spanning over 6000 object categories, designed to be a useful dataset for machine learning research. The initial release featured image-level labels automatically produced by a computer vision model similar to Google Cloud Vision API, for all 9M images in the training set, and a validation set of 167K images with 1.2M human-verified image-level labels.Today, we introduce an update to Open Images, which contains the addition of a total of ~2M bounding-boxes to the existing dataset, along with several million additional image-level labels. Details include:1.2M bounding-boxes around objects for 600 categories on the training set. These have been produced semi-automatically by an enhanced version of the technique outlined in [1], and are all human-verified.Complete bounding-box annotation for all object instances of the 600 categories on the validation set, all manually drawn (830K boxes). The bounding-box annotations in the training and validations sets will enable research on object detection on this dataset. The 600 categories offer a broader range than those in the ILSVRC and COCO detection challenges, and include new objects such as fedora hat and snowman.4.3M human-verified image-level labels on the training set (over all categories). This will enable large-scale experiments on object classification, based on a clean training set with reliable labels.Annotated images from the Open Images dataset. Left: FAMILY MAKING A SNOWMAN by mwvchamber. Right: STANZA STUDENTI.S.S. ANNUNZIATA by ersupalermo. Both images used under CC BY 2.0 license. See more examples here.We hope that this update to Open Images will stimulate the broader research community to experiment with object classification and detection models, and facilitate the development and evaluation of new techniques.References[1] We don't need no bounding-boxes: Training object class detectors using only human verification, Papadopoulos, Uijlings, Keller, and Ferrari, CVPR 2016 [...]

Motion Stills — Now on Android


Posted by Karthik Raveendran and Suril Shah, Software Engineers, Google ResearchLast year, we launched Motion Stills, an iOS app that stabilizes your Live Photos and lets you view and share them as looping GIFs and videos. Since then, Motion Stills has been well received, being listed as one of the top apps of 2016 by The Verge and Mashable. However, from its initial release, the community has been asking us to also make Motion Stills available for Android. We listened to your feedback and today, we're excited to announce that we’re bringing this technology, and more, to devices running Android 5.1 and later!Motion Stills on Android: Instant stabilization on your device.With Motion Stills on Android we built a new recording experience where everything you capture is instantly transformed into delightful short clips that are easy to watch and share. You can capture a short Motion Still with a single tap like a photo, or condense a longer recording into a new feature we call Fast Forward. In addition to stabilizing your recordings, Motion Stills on Android comes with an improved trimming algorithm that guards against pocket shots and accidental camera shakes. All of this is done during capture on your Android device, no internet connection required!New streaming pipelineFor this release, we redesigned our existing iOS video processing pipeline to use a streaming approach that processes each frame of a video as it is being recorded. By computing intermediate motion metadata, we are able to immediately stabilize the recording while still performing loop optimization over the full sequence. All this leads to instant results after recording — no waiting required to share your new GIF.Capture using our streaming pipeline gives you instant results.In order to display your Motion Stills stream immediately, our algorithm computes and stores the necessary stabilizing transformation as a low resolution texture map. We leverage this texture to apply the stabilization transform using the GPU in real-time during playback, instead of writing a new, stabilized video that would tax your mobile hardware and battery.Fast ForwardFast Forward allows you to speed up and condense a longer recording into a short, easy to share clip. The same pipeline described above allows Fast Forward to process up to a full minute of video, right on your phone. You can even change the speed of playback (from 1x to 8x) after recording. To make this possible, we encode videos with a denser I-frame spacing to enable efficient seeking and playback. We also employ additional optimizations in the Fast Forward mode. For instance, we apply adaptive temporal downsampling in the linear solver and long-range stabilization for smooth results over the whole sequence. Fast Forward condenses your recordings into easy to share clips.Try out Motion StillsMotion Stills is an app for us to experiment and iterate quickly with short-form video technology, gathering valuable feedback along the way. The tools our users find most fun and useful may be integrated later on into existing products like Google Photos. Download Motion Stills for Android from the Google Play store—available for mobile phones running Android 5.1 and later—and share your favorite clips on social media with hashtag #motionstills. AcknowledgementsMotion Stills would not have been possible without the help of many Googlers. We want to especially acknowledge the work of Matthias Grundmann in advancing our stabilization technology, as well as our UX and interaction designers Jacob Zukerman, Ashley Ma and Mark Bowers. [...]

Facets: An Open Source Visualization Tool for Machine Learning Training Data


Posted by James Wexler, Senior Software Engineer, Google Big Picture Team(Cross-posted on the Google Open Source Blog)Getting the best results out of a machine learning (ML) model requires that you truly understand your data. However, ML datasets can contain hundreds of millions of data points, each consisting of hundreds (or even thousands) of features, making it nearly impossible to understand an entire dataset in an intuitive fashion. Visualization can help unlock nuances and insights in large datasets. A picture may be worth a thousand words, but an interactive visualization can be worth even more.Working with the PAIR initiative, we’ve released Facets, an open source visualization tool to aid in understanding and analyzing ML datasets. Facets consists of two visualizations that allow users to see a holistic picture of their data at different granularities. Get a sense of the shape of each feature of the data using Facets Overview, or explore a set of individual observations using Facets Dive. These visualizations allow you to debug your data which, in machine learning, is as important as debugging your model. They can easily be used inside of Jupyter notebooks or embedded into webpages. In addition to the open source code, we've also created a Facets demo website. This website allows anyone to visualize their own datasets directly in the browser without the need for any software installation or setup, without the data ever leaving your computer. Facets OverviewFacets Overview automatically gives users a quick understanding of the distribution of values across the features of their datasets. Multiple datasets, such as a training set and a test set, can be compared on the same visualization. Common data issues that can hamper machine learning are pushed to the forefront, such as: unexpected feature values, features with high percentages of missing values, features with unbalanced distributions, and feature distribution skew between datasets.Facets Overview visualization of the six numeric features of the UCI Census datasets[1]. The features are sorted by non-uniformity, with the feature with the most non-uniform distribution at the top. Numbers in red indicate possible trouble spots, in this case numeric features with a high percentage of values set to 0. The histograms at right allow you to compare the distributions between the training data (blue) and test data (orange).Facets Overview visualization showing two of the nine categorical features of the UCI Census datasets[1]. The features are sorted by distribution distance, with the feature with the biggest skew between the training (blue) and test (orange) datasets at the top. Notice in the “Target” feature that the label values differ between the training and test datasets, due to a trailing period in the test set (“<=50K” vs “<=50K.”). This can be seen in the chart for the feature and also in the entries in the “top” column of the table. This label mismatch would cause a model trained and tested on this data to not be evaluated correctly.Facets DiveFacets Dive provides an easy-to-customize, intuitive interface for exploring the relationship between the data points across the different features of a dataset. With Facets Dive, you control the position, color and visual representation of each data point based on its feature values. If the data points have images associated with them, the images can be used as the visual representations.Facets Dive visualization showing all 16281 data points in the UCI Census test dataset[1]. The animation shows a user coloring the data points by one feature (“Relationship”), faceting in one dimension by a continuous feature (“Age”) and then faceting in another dimension by a discrete feature (“Marital Status”).Facets Dive visuali[...]

Using Deep Learning to Create Professional-Level Photographs


Posted by Hui Fang, Software Engineer, Machine PerceptionMachine learning (ML) excels in many areas with well defined goals. Tasks where there exists a right or wrong answer help with the training process and allow the algorithm to achieve its desired goal, whether it be correctly identifying objects in images or providing a suitable translation from one language to another. However, there are areas where objective evaluations are not available. For example, whether a photograph is beautiful is measured by its aesthetic value, which is a highly subjective concept. A professional(?) photograph of Jasper National Park, Canada.To explore how ML can learn subjective concepts, we introduce an experimental deep-learning system for artistic content creation. It mimics the workflow of a professional photographer, roaming landscape panoramas from Google Street View and searching for the best composition, then carrying out various postprocessing operations to create an aesthetically pleasing image. Our virtual photographer “travelled” ~40,000 panoramas in areas like the Alps, Banff and Jasper National Parks in Canada, Big Sur in California and Yellowstone National Park, and returned with creations that are quite impressive, some even approaching professional quality — as judged by professional photographers.Training the ModelWhile aesthetics can be modelled using datasets like AVA, using it naively to enhance photos may miss some aspect in aesthetics, such as making a photo over-saturated. Using supervised learning to learn multiple aspects in aesthetics properly, however, may require a labelled dataset that is intractable to collect. Our approach relies only on a collection of professional quality photos, without before/after image pairs, or any additional labels. It breaks down aesthetics into multiple aspects automatically, each of which is learned individually with negative examples generated by a coupled image operation. By keeping these image operations semi-”orthogonal”, we can enhance a photo on its composition, saturation/HDR level and dramatic lighting with fast and separable optimizations:A panorama (a) is cropped into (b), with saturation and HDR strength enhanced in (c), and with dramatic mask applied in (d). Each step is guided by one learned aspect of aesthetics.A traditional image filter was used to generate negative training examples for saturation, HDR detail and composition. We also introduce a special operation named dramatic mask, which was created jointly while learning the concept of dramatic lighting. The negative examples were generated by applying a combination of image filters that modify brightness randomly on professional photos, degrading their appearance. For the training we use a generative adversarial network (GAN), where a generative model creates a mask to fix lighting for negative examples, while a discriminative model tries to distinguish enhanced results from the real professional ones. Unlike shape-fixed filters such as vignette, dramatic mask adds content-aware brightness adjustment to a photo. The competitive nature of GAN training leads to good variations of such suggestions. You can read more about the training details in our paper.ResultsSome creations of our system from Google Street View are shown below. As you can see, the application of the trained aesthetic filters creates some dramatic results (including the image we started this post with!):Jasper National Park, Canada.Interlaken, Switzerland.Park Parco delle Orobie Bergamasche, Italy.Jasper National Park, Canada.Professional EvaluationTo judge how successful our algorithm was, we designed a “Turing-test”-like experiment: we mix our creations with other photos at different quality, and show them to several professional p[...]

Building Your Own Neural Machine Translation System in TensorFlow


Posted by Thang Luong, Research Scientist, and Eugene Brevdo, Staff Software Engineer, Google Brain TeamMachine translation – the task of automatically translating between languages – is one of the most active research areas in the machine learning community. Among the many approaches to machine translation, sequence-to-sequence ("seq2seq") models [1, 2] have recently enjoyed great success and have become the de facto standard in most commercial translation systems, such as Google Translate, thanks to its ability to use deep neural networks to capture sentence meanings. However, while there is an abundance of material on seq2seq models such as OpenNMT or tf-seq2seq, there is a lack of material that teaches people both the knowledge and the skills to easily build high-quality translation systems.Today we are happy to announce a new Neural Machine Translation (NMT) tutorial for TensorFlow that gives readers a full understanding of seq2seq models and shows how to build a competitive translation model from scratch. The tutorial is aimed at making the process as simple as possible, starting with some background knowledge on NMT and walking through code details to build a vanilla system. It then dives into the attention mechanism [3, 4], a key ingredient that allows NMT systems to handle long sentences. Finally, the tutorial provides details on how to replicate key features in the Google’s NMT (GNMT) system [5] to train on multiple GPUs. The tutorial also contains detailed benchmark results, which users can replicate on their own. Our models provide a strong open-source baseline with performance on par with GNMT results [5]. We achieve 24.4 BLEU points on the popular WMT’14 English-German translation task.Other benchmark results (English-Vietnamese, German-English) can be found in the tutorial.In addition, this tutorial showcases the fully dynamic seq2seq API (released with TensorFlow 1.2) aimed at making building seq2seq models clean and easy:Easily read and preprocess dynamically sized input sequences using the new input pipeline in padded batching and sequence length bucketing to improve training and inference speeds.Train seq2seq models using popular architectures and training schedules, including several types of attention and scheduled sampling.Perform inference in seq2seq models using in-graph beam search.Optimize seq2seq models for multi-GPU settings.We hope this will help spur the creation of, and experimentation with, many new NMT models by the research community. To get started on your own research, check out the tutorial on GitHub!Core contributorsThang Luong, Eugene Brevdo, and Rui Zhao. AcknowledgementsWe would like to especially thank our collaborator on the NMT project, Rui Zhao. Without his tireless effort, this tutorial would not have been possible. Additional thanks go to Denny Britz, Anna Goldie, Derek Murray, and Cinjon Resnick for their work bringing new features to TensorFlow and the seq2seq library. Lastly, we thank Lukasz Kaiser for the initial help on the seq2seq codebase; Quoc Le for the suggestion to replicate GNMT; Yonghui Wu and Zhifeng Chen for details on the GNMT systems; as well as the Google Brain team for their support and feedback!References[1] Sequence to sequence learning with neural networks, Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. NIPS, 2014.[2] Learning phrase representations using RNN encoder-decoder for statistical machine translation, Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. EMNLP 2014.[3] Neural machine translation by jointly learning to align and translate, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. ICLR, 2015. [4] Effective approache[...]

Revisiting the Unreasonable Effectiveness of Data


Posted by Abhinav Gupta, Faculty Advisor, Machine PerceptionThere has been remarkable success in the field of computer vision over the past decade, much of which can be directly attributed to the application of deep learning models to this machine perception task. Furthermore, since 2012 there have been significant advances in representation capabilities of these systems due to (a) deeper models with high complexity, (b) increased computational power and (c) availability of large-scale labeled data, much of which is publicly available. And while every year we get further increases in computational power and the model complexity (from 7-layer AlexNet to 101-layer ResNet), available datasets have not scaled accordingly. A 101-layer ResNet with significantly more capacity than AlexNet is still trained with the same 1M images from ImageNet circa 2011. As researchers, we have always wondered: if we scale up the amount of training data 10x, will the accuracy double? How about 100x or maybe even 300x? Will the accuracy plateau or will we continue to see increasing gains with more and more data?While GPU computation power and model sizes have continued to increase over the last five years, the size of the largest training dataset has surprisingly remained constant.In our paper, “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era”, we take the first steps towards clearing the clouds of mystery surrounding the relationship between `enormous visual data' and deep learning for computer vision. Our goal was to explore: (a) if visual representations can be still improved by feeding more and more images with noisy labels to currently existing algorithms; (b) the nature of the relationship between data and performance on standard vision tasks such as classification, object detection and image segmentation; (c) state-of-the-art models for all the tasks in computer vision using large-scale learning.Of course, the elephant in the room is where can we obtain a dataset that is 300x larger than ImageNet? At Google, we have been continuously working on building such datasets automatically to improve computer vision algorithms. Specifically, we have built an internal dataset of 300M images that are labeled with 18291 categories, which we call JFT-300M. The images are labeled using an algorithm that uses complex mixture of raw web signals, connections between web-pages and user feedback. This results in over one billion labels for the 300M images (a single image can have multiple labels). Of the billion image labels, approximately 375M are selected via an algorithm that aims to maximize label precision of selected images. However, there is still considerable noise in the labels: approximately 20% of the labels for selected images are noisy. Since there is no exhaustive annotation, we have no way to estimate the recall of the labels.Our experimental results validate some of the hypotheses but also generate some unexpected surprises:Better Representation Learning Helps. Our first observation is that large-scale data helps in representation learning which in-turn improves the performance on each vision task we study. Our findings suggest that a collective effort to build a large-scale dataset for visual pretraining is important. It also suggests a bright future for unsupervised and semi-supervised representation learning approaches. It seems the scale of data continues to overpower noise in the label space.Performance increases logarithmically based on volume of training data.  We find there is a logarithmic relationship between performance on vision tasks and the amount of training data used for representation learning.Object detection performance when pre-trained on di[...]

The Google Brain Residency Program — One Year Later


Posted by Luke Metz, Research Associate and Yun Liu, Software Engineer, 2016 Google Brain Resident Alumni“Coming from a background in statistics, physics, and chemistry, the Google Brain Residency was my first exposure to both deep learning and serious programming. I enjoyed the autonomy that I was given to research diverse topics of my choosing: deep learning for computer vision and language, reinforcement learning, and theory. I originally intended to pursue a statistics PhD but my experience here spurred me to enroll in the Stanford CS program starting this fall!”- Melody Guan, 2016 Google Brain Residency AlumnaThis month marks the end of an incredibly successful year for our first class of the Google Brain Residency Program. This one-year program was created as an opportunity for individuals from diverse educational backgrounds and experiences to dive into research in machine learning and deep learning. Over the past year, the Residents familiarized themselves with the literature, designed and implemented experiments at Google scale, and engaged in cutting edge research in a wide variety of subjects ranging from theory to robotics to music generation.To date, the inaugural class of Residents have published over 30 papers at leading machine learning publication venues such as ICLR (15), ICML (11), CVPR (3), EMNLP (2), RSS, GECCO, ISMIR, ISMB and Cosyne. An additional 18 papers are currently under review at NIPS, ICCV, BMVC and Nature Methods. Two of the above papers were published in Distill, exploring how deconvolution causes checkerboard artifacts and presenting ways of visualizing a generative model of handwriting. A Distill article by residents interactively explores how a neural network generates handwriting.A system that explores how robots can learn to imitate human motion from observation. For more details, see “Time-Contrastive Networks: Self-Supervised Learning from Multi-View Observation” (Co-authored by Resident Corey Lynch, along with P. Sermanet, , J. Hsu, S. Levine, accepted to CVPR Workshop 2017)A model that uses reinforcement learning to train distributed deep learning networks at large scale by optimizing computations to hardware devices assignment. For more details, see “Device Placement Optimization with Reinforcement Learning” (Co-authored by Residents Azalia Mirhoseini and Hieu Pham, along with Q. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, J. Dean, submitted to ICML 2017).An approach to automate the process of discovering optimization methods, with a focus on deep learning architectures. Final version of the paper “Neural Optimizer Search with Reinforcement Learning” (Co-authored by Residents Irwan Bello and Barret Zoph, along with V. Vasudevan, Q. Le, submitted to ICML 2017) coming soon.Residents have also made significant contributions to the open source community with general-purpose sequence-to-sequence models (used for example in translation), music synthesis, mimicking human sketching, subsampling a sequence for model training, an efficient “attention” mechanism for models, and time series analysis (particularly for neuroscience).The end of the program year marks our Residents embarking on the next stages in their careers. Many are continuing their research careers on the Google Brain team as full time employees. Others have chosen to enter top machine learning Ph.D. programs at schools such as Stanford University, UC Berkeley, Cornell University, Oxford University and NYU, University of Toronto and CMU. We could not be more proud to see where their hard work and experiences will take them next!As we “graduate” our first class, this week we welcome our next class o[...]

MultiModel: Multi-Task Machine Learning Across Domains


Posted by Łukasz Kaiser, Senior Research Scientist, Google Brain Team and Aidan N. Gomez, Researcher, Department of Computer Science Machine Learning Group, University of TorontoOver the last decade, the application and performance of Deep Learning has progressed at an astonishing rate. However, the current state of the field is that the neural network architectures are highly specialized to specific domains of application. An important question remains unanswered: Will a convergence between these domains facilitate a unified model capable of performing well across multiple domains?Today, we present MultiModel, a neural network architecture that draws from the success of vision, language and audio networks to simultaneously solve a number of problems spanning multiple domains, including image recognition, translation and speech recognition. While strides have been made in this direction before, namely in Google’s Multilingual Neural Machine Translation System used in Google Translate, MultiModel is a first step towards the convergence of vision, audio and language understanding into a single network.The inspiration for how MultiModel handles multiple domains comes from how the brain transforms sensory input from different modalities (such as sound, vision or taste), into a single shared representation and back out in the form of language or actions. As an analog to these modalities and the transformations they perform, MultiModel has a number of small modality-specific sub-networks for audio, images, or text, and a shared model consisting of an encoder, input/output mixer and decoder, as illustrated below.MultiModel architecture: small modality-specific sub-networks work with a shared encoder, I/O mixer and decoder. Each petal represents a modality, transforming to and from the internal representation.We demonstrate that MultiModel is capable of learning eight different tasks simultaneously: it can detect objects in images, provide captions, recognize speech, translate between four pairs of languages, and do grammatical constituency parsing at the same time. The input is given to the model together with a very simple signal that determines which output we are requesting. Below we illustrate a few examples taken from a MultiModel trained jointly on these eight tasks1:When designing MultiModel it became clear that certain elements from each domain of research (vision, language and audio) were integral to the model’s success in related tasks. We demonstrate that these computational primitives (such as convolutions, attention, or mixture-of-experts layers) clearly improve performance on their originally intended domain of application, while not hindering MultiModel’s performance on other tasks. It is not only possible to achieve good performance while training jointly on multiple tasks, but on tasks with limited quantities of data, the performance actually improves. To our surprise, this happens even if the tasks come from different domains that would appear to have little in common, e.g., an image recognition task can improve performance on a language task.It is important to note that while MultiModel does not establish new performance records, it does provide insight into the dynamics of multi-domain multi-task learning in neural networks, and the potential for improved learning on data-limited tasks by the introduction of auxiliary tasks. There is a longstanding saying in machine learning: “the best regularizer is more data”; in MultiModel, this data can be sourced across domains, and consequently can be obtained more easily than previously thought. MultiModel provides evidence that training in concert with o[...]

Accelerating Deep Learning Research with the Tensor2Tensor Library


Posted by Łukasz Kaiser, Senior Research Scientist, Google Brain TeamDeep Learning (DL) has enabled the rapid advancement of many useful technologies, such as machine translation, speech recognition and object detection. In the research community, one can find code open-sourced by the authors to help in replicating their results and further advancing deep learning. However, most of these DL systems use unique setups that require significant engineering effort and may only work for a specific problem or architecture, making it hard to run new experiments and compare the results.Today, we are happy to release Tensor2Tensor (T2T), an open-source system for training deep learning models in TensorFlow. T2T facilitates the creation of state-of-the art models for a wide variety of ML applications, such as translation, parsing, image captioning and more, enabling the exploration of various ideas much faster than previously possible. This release also includes a library of datasets and models, including the best models from a few recent papers (Attention Is All You Need, Depthwise Separable Convolutions for Neural Machine Translation and One Model to Learn Them All) to help kick-start your own DL research. Translation Model Training time BLEU (difference from baseline) Transformer (T2T) 3 days on 8 GPU 28.4 (+7.8) SliceNet (T2T) 6 days on 32 GPUs 26.1 (+5.5) GNMT + Mixture of Experts 1 day on 64 GPUs 26.0 (+5.4) ConvS2S 18 days on 1 GPU 25.1 (+4.5) GNMT 1 day on 96 GPUs 24.6 (+4.0) ByteNet 8 days on 32 GPUs 23.8 (+3.2) MOSES (phrase-based baseline) N/A 20.6 (+0.0) BLEU scores (higher is better) on the standard WMT English-German translation task.As an example of the kind of improvements T2T can offer, we applied the library to machine translation. As you can see in the table above, two different T2T models, SliceNet and Transformer, outperform the previous state-of-the-art, GNMT+MoE. Our best T2T model, Transformer, is 3.8 points better than the standard GNMT model, which itself was 4 points above the baseline phrase-based translation system, MOSES. Notably, with T2T you can approach previous state-of-the-art results with a single GPU in one day: a small Transformer model (not shown above) gets 24.9 BLEU after 1 day of training on a single GPU. Now everyone with a GPU can tinker with great translation models on their own: our github repo has instructions on how to do that. Modular Multi-Task TrainingThe T2T library is built with familiar TensorFlow tools and defines multiple pieces needed in a deep learning system: data-sets, model architectures, optimizers, learning rate decay schemes, hyperparameters, and so on. Crucially, it enforces a standard interface between all these parts and implements current ML best practices. So you can pick any data-set, model, optimizer and set of hyperparameters, and run the training to check how it performs. We made the architecture modular, so every piece between the input data and the predicted output is a tensor-to-tensor function. If you have a new idea for the model architecture, you don’t need to replace the whole setup. You can keep the embedding part and the loss and everything else, just replace the model body by your own function that takes a tensor as input and returns a tensor. This means that T2T is flexible, with training no longer pinned to a specific model or dataset. It is so easy that even architectures like the famous LSTM sequence-to-sequence model can be defined in a few dozen lines of code. One can also train a single model on multiple tasks from different domains. Taken to the limit, you can even train a single model on all da[...]

Supercharge your Computer Vision models with the TensorFlow Object Detection API


Posted by Jonathan Huang, Research Scientist and Vivek Rathod, Software Engineer(Cross-posted on the Google Open Source Blog)At Google, we develop flexible state-of-the-art machine learning (ML) systems for computer vision that not only can be used to improve our products and services, but also spur progress in the research community. Creating accurate ML models capable of localizing and identifying multiple objects in a single image remains a core challenge in the field, and we invest a significant amount of time training and experimenting with these systems. Detected objects in a sample image (from the COCO dataset) made by one of our models. Image credit: Michael Miley, original image.Last October, our in-house object detection system achieved new state-of-the-art results, and placed first in the COCO detection challenge. Since then, this system has generated results for a number of research publications1,2,3,4,5,6,7 and has been put to work in Google products such as NestCam, the similar items and style ideas feature in Image Search and street number and name detection in Street View.Today we are happy to make this system available to the broader research community via the TensorFlow Object Detection API. This codebase is an open-source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models. Our goals in designing this system was to support state-of-the-art models while allowing for rapid exploration and research. Our first release contains the following:A selection of trainable detection models, including:Single Shot Multibox Detector (SSD) with MobileNetsSSD with Inception V2Region-Based Fully Convolutional Networks (R-FCN) with Resnet 101Faster RCNN with Resnet 101Faster RCNN with Inception Resnet v2Frozen weights (trained on the COCO dataset) for each of the above models to be used for out-of-the-box inference purposes.A Jupyter notebook for performing out-of-the-box inference with one of our released modelsConvenient local training scripts as well as distributed training and evaluation pipelines via Google CloudThe SSD models that use MobileNet are lightweight, so that they can be comfortably run in real time on mobile devices. Our winning COCO submission in 2016 used an ensemble of the Faster RCNN models, which are more computationally intensive but significantly more accurate. For more details on the performance of these models, see our CVPR 2017 paper.Are you ready to get started?We’ve certainly found this code to be useful for our computer vision needs, and we hope that you will as well. Contributions to the codebase are welcome and please stay tuned for our own further updates to the framework. To get started, download the code here and try detecting objects in some of your own images using the Jupyter notebook, or training your own pet detector on Cloud ML engine! AcknowledgementsThe release of the Tensorflow Object Detection API and the pre-trained model zoo has been the result of widespread collaboration among Google researchers with feedback and testing from product groups. In particular we want to highlight the contributions of the following individuals:Core Contributors: Derek Chow, Chen Sun, Menglong Zhu, Matthew Tang, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Jasper Uijlings, Viacheslav Kovalevskyi, Kevin MurphyAlso special thanks to: Andrew Howard, Rahul Sukthankar, Vittorio Ferrari, Tom Duerig, Chuck Rosenberg, Hartwig Adam, Jing Jing Long, Victor Gomes, George Papandreou, Tyler ZhuReferencesSpeed/accuracy trade-offs for moder[...]

MobileNets: Open-Source Models for Efficient On-Device Vision


Posted by Andrew G. Howard, Senior Software Engineer and Menglong Zhu, Software Engineer(Cross-posted on the Google Open Source Blog)Deep learning has fueled tremendous progress in the field of computer vision in recent years, with neural networks repeatedly pushing the frontier of visual recognition technology. While many of those technologies such as object, landmark, logo and text recognition are provided for internet-connected devices through the Cloud Vision API, we believe that the ever-increasing computational power of mobile devices can enable the delivery of these technologies into the hands of our users, anytime, anywhere, regardless of internet connection. However, visual recognition for on device and embedded applications poses many challenges — models must run quickly with high accuracy in a resource-constrained environment making use of limited computation, power and space. Today we are pleased to announce the release of MobileNets, a family of mobile-first computer vision models for TensorFlow, designed to effectively maximize accuracy while being mindful of the restricted resources for an on-device or embedded application. MobileNets are small, low-latency, low-power models parameterized to meet the resource constraints of a variety of use cases. They can be built upon for classification, detection, embeddings and segmentation similar to how other popular large scale models, such as Inception, are used. Example use cases include detection, fine-grain classification, attributes and geo-localization.This release contains the model definition for MobileNets in TensorFlow using TF-Slim, as well as 16 pre-trained ImageNet classification checkpoints for use in mobile projects of all sizes. The models can be run efficiently on mobile devices with TensorFlow Mobile. Model Checkpoint Million MACs Million Parameters Top-1 Accuracy Top-5 Accuracy MobileNet_v1_1.0_224 569 4.24 70.7 89.5 MobileNet_v1_1.0_192 418 4.24 69.3 88.9 MobileNet_v1_1.0_160 291 4.24 67.2 87.5 MobileNet_v1_1.0_128 186 4.24 64.1 85.3 MobileNet_v1_0.75_224 317 2.59 68.4 88.2 MobileNet_v1_0.75_192 233 2.59 67.4 87.3 MobileNet_v1_0.75_160 162 2.59 65.2 86.1 MobileNet_v1_0.75_128 104 2.59 61.8 83.6 MobileNet_v1_0.50_224 150 1.34 64.0 85.4 MobileNet_v1_0.50_192 110 1.34 62.1 84.0 MobileNet_v1_0.50_160 77 1.34 59.9 82.5 MobileNet_v1_0.50_128 49 1.34 56.2 79.6 MobileNet_v1_0.25_224 41 0.47 50.6 75.0 MobileNet_v1_0.25_192 34 0.47 49.0 73.6 MobileNet_v1_0.25_160 21 0.47 46.0 70.7 MobileNet_v1_0.25_128 14 0.47 41.3 66.2Choose the right MobileNet model to fit your latency and size budget. The size of the network in memory and on disk is proportional to the number of parameters. The latency and power usage of the network scales with the number of Multiply-Accumulates (MACs) which measures the number of fused Multiplication and Addition operations. Top-1 and Top-5 accuracies are measured on the ILSVRC dataset.We are excited to share MobileNets with the open-source community. Information for getting started can be found at the TensorFlow-Slim Image Classification Library. To learn how to run models on-device please go to TensorFlow Mobile. You can read more about the technical details of MobileNets in our paper, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.AcknowledgementsMobileNets were made possible with the hard work of many engineers and researchers throughout Google. Specifically we would like to thank:Core Contributors: Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco An[...]

The Machine Intelligence Behind Gboard


Posted by Françoise Beaufays, Principal Scientist, Speech and Keyboard Team and Michael Riley, Principal Scientist, Speech and Languages Algorithms TeamMost people spend a significant amount of time each day using mobile-device keyboards: composing emails, texting, engaging in social media, and more. Yet, mobile keyboards are still cumbersome to handle. The average user is roughly 35% slower typing on a mobile device than on a physical keyboard. To change that, we recently provided many exciting improvements to Gboard for Android, working towards our vision of creating an intelligent mechanism that enables faster input while offering suggestions and correcting mistakes, in any language you choose. With the realization that the way a mobile keyboard translates touch inputs into text is similar to how a speech recognition system translates voice inputs into text, we leveraged our experience in Speech Recognition to pursue our vision. First, we created robust spatial models that map fuzzy sequences of raw touch points to keys on the keyboard, just like acoustic models map sequences of sound bites to phonetic units. Second, we built a powerful core decoding engine based on finite state transducers (FST) to determine the likeliest word sequence given an input touch sequence. With its mathematical formalism and broad success in speech applications, we knew that an FST decoder would offer the flexibility needed to support a variety of complex keyboard input behaviors as well as language features. In this post, we will detail what went into the development of both of these systems.Neural Spatial ModelsMobile keyboard input is subject to errors that are generally attributed to “fat finger typing” (or tracing spatially similar words in glide typing, as illustrated below) along with cognitive and motor errors (manifesting in misspellings, character insertions, deletions or swaps, etc). An intelligent keyboard needs to be able to account for these errors and predict the intended words rapidly and accurately. As such, we built a spatial model for Gboard that addresses these errors at the character level, mapping the touch points on the screen to actual keys.Average glide trails for two spatially-similar words: “Vampire” and “Value”.Up to recently, Gboard used a Gaussian model to quantify the probability of tapping neighboring keys and a rule-based model to represent cognitive and motor errors. These models were simple and intuitive, but they didn’t allow us to directly optimize metrics that correlate with better typing quality. Drawing on our experience with Voice Search acoustic models we replaced both the Gaussian and rule-based models with a single, highly efficient long short-term memory (LSTM) model trained with a connectionist temporal classification (CTC) criterion.However, training this model turned out to be a lot more complicated than we had anticipated. While acoustic models are trained from human-transcribed audio data, one cannot easily transcribe millions of touch point sequences and glide traces. So the team exploited user-interaction signals, e.g. reverted auto-corrections and suggestion picks as negative and positive semi-supervised learning signals, to form rich training and test sets. Raw data points corresponding to the word “could” (left), and normalized sampled trajectory with per-sample variances (right).A plethora of techniques from the speech recognition literature was used to iterate on the NSM models to make them small and fast enough to be run on any device. The TensorFlow infra[...]

Introducing the TensorFlow Research Cloud


Posted by Zak Stone, Product Manager for TensorFlowResearchers require enormous computational resources to train the machine learning (ML) models that have delivered recent breakthroughs in medical imaging, neural machine translation, game playing, and many other domains. We believe that significantly larger amounts of computation will make it possible for researchers to invent new types of ML models that will be even more accurate and useful. To accelerate the pace of open machine-learning research, we are introducing the TensorFlow Research Cloud (TFRC), a cluster of 1,000 Cloud TPUs that will be made available free of charge to support a broad range of computationally-intensive research projects that might not be possible otherwise.The TensorFlow Research Cloud offers researchers the following benefits:Access to Google’s all-new Cloud TPUs that accelerate both training and inferenceUp to 180 teraflops of floating-point performance per Cloud TPU64 GB of ultra-high-bandwidth memory per Cloud TPUFamiliar TensorFlow programming interfacesYou can sign up here to request to be notified when the TensorFlow Research Cloud application process opens, and you can optionally share more information about your computational needs. We plan to evaluate applications on a rolling basis in search of the most creative and ambitious proposals.The TensorFlow Research Cloud program is not limited to academia — we recognize that people with a wide range of affiliations, roles, and expertise are making major machine learning research contributions, and we especially encourage those with non-traditional backgrounds to apply. Access will be granted to selected individuals for limited amounts of compute time, and researchers are welcome to apply multiple times with multiple projects.Since the main goal of the TensorFlow Research Cloud is to benefit the open machine learning research community as a whole, successful applicants will be expected to do the following:Share their TFRC-supported research with the world through peer-reviewed publications, open-source code, blog posts, or other open mediaShare concrete, constructive feedback with Google to help us improve the TFRC program and the underlying Cloud TPU platform over timeImagine a future in which ML acceleration is abundant and develop new kinds of machine learning models in anticipation of that futureFor businesses interested in using Cloud TPUs for proprietary research and development, we will offer a parallel Cloud TPU Alpha program. You can sign up here to learn more about this program. We recommend participating in the Cloud TPU Alpha program if you are interested in any of the following:Accelerating training of proprietary ML models; models that take weeks to train on other hardware can be trained in days or even hours on Cloud TPUsAccelerating batch processing of industrial-scale datasets: images, videos, audio, unstructured text, structured data, etc.Processing live requests in production using larger and more complex ML models than ever beforeWe hope the TensorFlow Research Cloud will allow as many researchers as possible to explore the frontier of machine learning research and extend it with new discoveries! We encourage you to sign up today to be among the first to know as more information becomes available. [...]

Using Machine Learning to Explore Neural Network Architecture


Posted by Quoc Le & Barret Zoph, Research Scientists, Google Brain teamAt Google, we have successfully applied deep learning models to many applications, from image recognition to speech recognition to machine translation. Typically, our machine learning models are painstakingly designed by a team of engineers and scientists. This process of manually designing machine learning models is difficult because the search space of all possible models can be combinatorially large — a typical 10-layer network can have ~1010 candidate networks! For this reason, the process of designing networks often takes a significant amount of time and experimentation by those with significant machine learning expertise. Our GoogleNet architecture. Design of this network required many years of careful experimentation and refinement from initial versions of convolutional architectures.To make this process of designing machine learning models much more accessible, we’ve been exploring ways to automate the design of machine learning models. Among many algorithms we’ve studied, evolutionary algorithms [1] and reinforcement learning algorithms [2] have shown great promise. But in this blog post, we’ll focus on our reinforcement learning approach and the early results we’ve gotten so far.In our approach (which we call "AutoML"), a controller neural net can propose a “child” model architecture, which can then be trained and evaluated for quality on a particular task. That feedback is then used to inform the controller how to improve its proposals for the next round. We repeat this process thousands of times — generating new architectures, testing them, and giving that feedback to the controller to learn from. Eventually the controller learns to assign high probability to areas of architecture space that achieve better accuracy on a held-out validation dataset, and low probability to areas of architecture space that score poorly. Here’s what the process looks like:We’ve applied this approach to two heavily benchmarked datasets in deep learning: image recognition with CIFAR-10 and language modeling with Penn Treebank. On both datasets, our approach can design models that achieve accuracies on par with state-of-art models designed by machine learning experts (including some on our own team!).So, what kind of neural nets does it produce? Let’s take one example: a recurrent architecture that’s trained to predict the next word on the Penn Treebank dataset. On the left here is a neural net designed by human experts. On the right is a recurrent architecture created by our method: The machine-chosen architecture does share some common features with the human design, such as using addition to combine input and previous hidden states. However, there are some notable new elements — for example, the machine-chosen architecture incorporates a multiplicative combination (the left-most blue node on the right diagram labeled “elem_mult”). This type of combination is not common for recurrent networks, perhaps because researchers see no obvious benefit for having it. Interestingly, a simpler form of this approach was recently suggested by human designers, who also argued that this multiplicative combination can actually alleviate gradient vanishing/exploding issues, suggesting that the machine-chosen architecture was able to discover a useful new neural net architecture.This approach may also teach us something about why certain types of neural nets work so well. The archi[...]

Efficient Smart Reply, now for Gmail


Posted by Brian Strope, Research Scientist, and Ray Kurzweil, Engineering Director, Google ResearchLast year we launched Smart Reply, a feature for Inbox by Gmail that uses machine learning to suggest replies to email. Since the initial release, usage of Smart Reply has grown significantly, making up about 12% of replies in Inbox on mobile. Based on our examination of the use of Smart Reply in Inbox and our ideas about how humans learn and use language, we have created a new version of Smart Reply for Gmail. This version increases the percentage of usable suggestions and is more algorithmically efficient. Novel thinking: hierarchyInspired by how humans understand languages and concepts, we turned to hierarchical models of language, an approach that uses hierarchies of modules, each of which can learn, remember, and recognize a sequential pattern. The content of language is deeply hierarchical, reflected in the structure of language itself, going from letters to words to phrases to sentences to paragraphs to sections to chapters to books to authors to libraries, etc. Consider the message, "That interesting person at the cafe we like gave me a glance." The hierarchical chunks in this sentence are highly variable. The subject of the sentence is "That interesting person at the cafe we like." The modifier "interesting" tells us something about the writer's past experiences with the person. We are told that the location of an incident involving both the writer and the person is "at the cafe." We are also told that "we," meaning the writer and the person being written to, like the cafe. Additionally, each word is itself part of a hierarchy, sometimes more than one. A cafe is a type of restaurant which is a type of store which is a type of establishment, and so on. In proposing an appropriate response to this message we might consider the meaning of the word "glance," which is potentially ambiguous. Was it a positive gesture? In that case, we might respond, "Cool!" Or was it a negative gesture? If so, does the subject say anything about how the writer felt about the negative exchange? A lot of information about the world, and an ability to make reasoned judgments, are needed to make subtle distinctions.Given enough examples of language, a machine learning approach can discover many of these subtle distinctions. Moreover, a hierarchical approach to learning is well suited to the hierarchical nature of language. We have found that this approach works well for suggesting possible responses to emails. We use a hierarchy of modules, each of which considers features that correspond to sequences at different temporal scales, similar to how we understand speech and language. Each module processes inputs and provides transformed representations of those inputs on its outputs (which are, in turn, available for the next level). In the Smart Reply system, and the figure above, the repeated structure has two layers of hierarchy. The first makes each feature useful as a predictor of the final result, and the second combines these features. By definition, the second works at a more abstract representation and considers a wider timescale.By comparison, the initial release of Smart Reply encoded input emails word-by-word with a long-short-term-memory (LSTM) recurrent neural network, and then decoded potential replies with yet another word-level LSTM. While this type of modeling is very effective in many contexts, even with Google infrastructure, it’s a[...]

Coarse Discourse: A Dataset for Understanding Online Discussions


Posted by Praveen Paritosh, Senior Research Scientist, Ka Wong, Senior Data ScientistEvery day, participants of online communities form and share their opinions, experiences, advice and social support, most of which is expressed freely and without much constraint. These online discussions are often a key resource of information for many important topics, such as parenting, fitness, travel and more. However, these discussions also are intermixed with a clutter of disagreements, humor, flame wars and trolling, requiring readers to filter the content before getting the information they are looking for. And while the field of Information Retrieval actively explores ways to allow users to more efficiently find, navigate and consume this content, there is a lack of shared datasets on forum discussions to aid in understanding these discussions a bit better. To aid researchers in this space, we are releasing the Coarse Discourse dataset, the largest dataset of annotated online discussions to date. The Coarse Discourse contains over half a million human annotations of publicly available online discussions on a random sample of over 9,000 threads from 130 communities from To create this dataset, we developed the Coarse Discourse taxonomy of forum comments by going through a small set of forum threads, reading every comment, and deciding what role the comments played in the discussion. We then repeated and revised this exercise with crowdsourced human editors to validate the reproducibility of the taxonomy's discourse types, which include: announcement, question, answer, agreement, disagreement, appreciation, negative reaction, elaboration, and humor. From this data, over 100,000 comments were independently annotated by the crowdsourced editors for discourse type and relation. Along with the raw annotations from crowdsourced editors, we also provide the Coarse Discourse annotation task guidelines used by the editors to help with collecting data for other forums and refining the task further. An example thread annotated with discourse types and relations. Early findings suggest that question answering is a prominent use case in most communities, while some communities are more converationally focused, with back-and-forth interactions. For machine learning and natural language processing researchers trying to characterize the nature of online discussions, we hope that this dataset is a useful resource. Visit our GitHub repository to download the data. For more details, check out our ICWSM paper, “Characterizing Online Discussion Using Coarse Discourse Sequences.”AcknowledgmentsThis work was done by Amy Zhang during her internship at Google. We would also like to thank Bryan Culbertson, Olivia Rhinehart, Eric Altendorf, David Huynh, Nancy Chang, Chris Welty and our crowdsourced editors. [...]

Neural Network-Generated Illustrations in Allo


Posted by Jennifer Daniel, Expressions Creative Director, Allo Taking, sharing, and viewing selfies has become a daily habit for many — the car selfie, the cute-outfit selfie, the travel selfie, the I-woke-up-like-this selfie. Apart from a social capacity, self-portraiture has long served as a means for self and identity exploration. For some, it’s about figuring out who they are. For others it’s about projecting how they want to be perceived. Sometimes it’s both.Photography in the form of a selfie is a very direct form of expression. It comes with a set of rules bounded by reality. Illustration, on the other hand, empowers people to define themselves - it’s warmer and less fraught than reality. Today, Google is introducing a feature in Allo that uses a combination of neural networks and the work of artists to turn your selfie into a personalized sticker pack. Simply snap a selfie, and it’ll return an automatically generated illustrated version of you, on the fly, with customization options to help you personalize the stickers even further.What makes you, you?The traditional computer vision approach to mapping selfies to art would be to analyze the pixels of an image and algorithmically determine attribute values by looking at pixel values to measure color, shape, or texture. However, people today take selfies in all types of lighting conditions and poses. And while people can easily pick out and recognize qualitative features, like eye color, regardless of the lighting condition, this is a very complex task for computers. When people look at eye color, they don’t just interpret the pixel values of blue or green, but take into account the surrounding visual context.In order to account for this, we explored how we could enable an algorithm to pick out qualitative features in a manner similar to the way people do, rather than the traditional approach of hand coding how to interpret every permutation of lighting condition, eye color, etc. While we could have trained a large convolutional neural network from scratch to attempt to accomplish this, we wondered if there was a more efficient way to get results, since we expected that learning to interpret a face into an illustration would be a very iterative process. That led us to run some experiments, similar to DeepDream, on some of Google's existing more general-purpose computer vision neural networks. We discovered that a few neurons among the millions in these networks were good at focusing on things they weren’t explicitly trained to look at that seemed useful for creating personalized stickers. Additionally, by virtue of being large general-purpose neural networks they had already figured out how to abstract away things they didn’t need. All that was left to do was to provide a much smaller number of human labeled examples to teach the classifiers to isolate out the qualities that the neural network already knew about the image.To create an illustration of you that captures the qualities that would make it recognizable to your friends, we worked alongside an artistic team to create illustrations that represented a wide variety of features. Artists initially designed a set of hairstyles, for example, that they thought would be representative, and with the help of human raters we used these hairstyles to train the network to match the right illustration to the right selfie. We then asked human raters to judge th[...]