Subscribe: O'Reilly Radar - Insight, analysis, and research about emerging technologies
Added By: Feedage Forager Feedage Grade B rated
Language: English
Preview: O'Reilly Radar - Insight, analysis, and research about emerging technologies

All - O'Reilly Media

All of our Ideas and Learning material from all of our topics.

Updated: 2018-02-22T23:52:25Z


Brent Laster on Jenkins 2 and Git



The O’Reilly Programming Podcast: Creating and implementing continuous delivery pipelines.

In this episode of the O’Reilly Programming Podcast, I talk about Jenkins 2 and Git with Brent Laster, who presents a number of live online training courses on these topics (including Building a deployment pipeline with Jenkins 2, and Next level Git). Laster will also present the workshop Power Git at the O’Reilly Open Source Convention, July 16-19, 2018, in Portland, Oregon, and he is the author of the forthcoming O’Reilly book Jenkins 2: Up and Running.

Continue reading Brent Laster on Jenkins 2 and Git.


Four short links: 22 February 2018


Fast Style Transfer, Categorizing Social Media Messages, Finding Secrets, and Misleading Images

  1. A Closed-form Solution to Photorealistic Image Stylization -- Experimental results show that the stylized photos generated by our algorithm are twice more preferred by human subjects in average. Moreover, our method runs 60 times faster than the state-of-the-art approach. Code available. (via Ming-Yu Liu)
  2. Characterizing Social Media Messages by How They Propagate -- Since content information is sparse and noisy on social media, adopting TraceMiner allows you to provide a high degree of classification accuracy even in the absence of content information. Experimental results on real-world data sets show the superiority over state-of-the-art approaches on the task of fake news detection and news categorization. (via Paper a Day)
  3. GitLeaks -- searches full repo history for secrets and keys.
  4. Obfuscated Gradients -- In our recent paper, we evaluate the robustness of eight papers accepted to ICLR 2018 as non-certified white-box-secure defenses to adversarial examples. We find that seven of the eight defenses provide a limited increase in robustness and can be broken by improved attack techniques we develop. It's very easy to make an image that looks to a human like one thing, but which a deep learning classifier will identify as something else. (via Dan Kaminsky)

Continue reading Four short links: 22 February 2018.


5 best practices for effective storyboarding



Master the elements of graphic storytelling to better communicate your visions to stakeholders.

Continue reading 5 best practices for effective storyboarding.


Build a recurrent neural network using Apache MXNet


A step-by-step tutorial to develop an RNN that predicts the probability of a word or character given the previous word or character.

In our previous notebooks, we used a deep learning technique called a convolutional neural network (CNN) to classify text and images. Even though a CNN is a powerful technique, it cannot learn temporal features from an input sequence such as audio or text. Moreover, a CNN is designed to learn spatial features with a fixed-length convolution kernel. These types of neural networks are called feedforward neural networks. On the other hand, a recurrent neural network (RNN) is a type of neural network that can learn temporal features and has a wider range of applications than a feedforward neural network.

In this notebook, we will develop a recurrent neural network that predicts the probability of a word or character given the previous word or character. Almost all of us have a predictive keyboard on our smartphone that suggests upcoming words for faster typing. A recurrent neural network allows us to build an advanced predictive system similar to SwiftKey.

We will first cover the limitations of a feedforward neural network. Next, we will implement a basic RNN using a feedforward neural network, which gives good insight into how an RNN works. After that, we will design a powerful RNN with LSTM and GRU layers using MXNet’s Gluon API. We will use this RNN to generate text.

We will also talk about the following topics:

  1. The limitations of a feedforward neural network
  2. The idea behind RNN and LSTM
  3. Installing MXNet with the Gluon API
  4. Preparing data sets to train the neural network
  5. Implementing a basic RNN using a feedforward neural network
  6. Implementing an RNN model to auto-generate text using the Gluon API

For this tutorial, you need a basic understanding of recurrent neural networks (RNNs), activation functions, gradient descent, and backpropagation. You should also know something about Python’s NumPy library.
Feedforward neural network versus recurrent neural network

Although feedforward neural networks, including convolutional neural networks, have shown great accuracy in classifying sentences and text, they cannot store long-term dependencies in memory (hidden state). Memory can be viewed as temporal state that can be updated over time. A feedforward neural network can't interpret the context since it does not store temporal state (“memory”). A CNN can only learn spatial context from a local group of neighbors (image/sequence) within the size of its convolution kernels. Figure 1 shows the convolutional neural network's spatial context versus the RNN's temporal context for a sample data set. In a CNN, the relationship between "O" and "V" is lost since they fall into the spatial contexts of different convolutions. In an RNN, the temporal relationship between the characters "L", "O", "V", "E" is captured.

Figure 1. CNN spatial context vs. RNN temporal context. Image by Manu Jeevan.

A feedforward neural network cannot understand learned context since there is no “memory” state. So it cannot model sequential or temporal data (data with a definitive ordering, like the structure of a language). An abstract view of a feedforward neural network is shown in Figure 2.

Figure 2. Feedforward neural network. Image by Manu Jeevan.

An RNN is more versatile. Its cells accept weighted input and produce both weighted output (WO) and weighted hidden state (WH). The hidden state acts as the memory that stores context. If an RNN represents a person talking on the phone, the weighted output is the words spoken, and the weighted hidden state is the context in which the person utters the word.

The intuition behind RNNs

In this section, we will explain the similarity between a feedforward neural network and an RNN by building an unrolled version of a vanilla RNN using a feedforward neural network. A vanilla RNN has a simple hidden state matrix (memory) and is easy to understand and implement.
Suppose we have to predict the 4th character in a stream of text, g[...]
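
The hidden-state idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the tutorial's MXNet code: the scalar hidden state, the weight names `w_xh`, `w_hh`, and `w_hy`, the toy weight values, and the numeric encoding of the letters are all hypothetical choices made for readability.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, w_hy):
    """One step of a scalar vanilla RNN cell (illustrative only)."""
    # The new hidden state mixes the current input with the previous memory.
    h = math.tanh(w_xh * x + w_hh * h_prev)
    # The output is read from the updated hidden state.
    y = w_hy * h
    return y, h


def run_sequence(xs, w_xh=0.5, w_hh=0.9, w_hy=1.0):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = 0.0
    outputs = []
    for x in xs:
        y, h = rnn_step(x, h, w_xh, w_hh, w_hy)
        outputs.append(y)
    return outputs, h


# Hypothetical numeric encoding of the characters L, O, V, E.
love = [1.0, 2.0, 3.0, 4.0]
evol = list(reversed(love))

outputs_love, h_love = run_sequence(love)
outputs_evol, h_evol = run_sequence(evol)

# Same characters, different order: the final hidden states differ.
print(h_love, h_evol)
```

Because the hidden state is carried from one step to the next, the same four inputs presented in a different order leave the cell in a different final state. A feedforward network that sees each character in isolation has no equivalent of `h_prev`.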

Four short links: 21 February 2018


Fonts for Viz, Map AWS Resources, Technological Unemployment, and Design for Humans

  1. Fonts for Complex Data -- good advice. They even have a section for legal small print! (A non-designer’s first impulse is often to reach for a condensed typeface, on the principle that narrower letters take up less space. Yet, it’s almost always a better option to make the counter-intuitive choice of a wider typeface, and to set the type in a smaller size with tighter leading. Wider letters have more comfortable proportions, they’re more generously spaced, and they have more ample counters, collectively making them the more legible choice.)
  2. CloudMapper -- generates network diagrams of Amazon Web Services (AWS) environments and displays them via your browser. It helps you understand visually what exists in your accounts and identify possible network misconfigurations.
  3. Technological Unemployment -- This is my attempt to figure out what economists and experts think so I can understand the issue, and I’m writing it down to speed your going through the same process. An excellent starting point.
  4. How Technology Is Designed to Bring Out the Worst in Us -- Technology feels disempowering because we haven’t built it around an honest view of human nature.

Continue reading Four short links: 21 February 2018.


How to tune your WAF installation to reduce false positives


Optimizing your NGINX setup with a tuned ModSecurity / Core Rule Set installation.

Site administrators put in a web application firewall (WAF) to block malicious or dangerous web traffic, but at the risk of blocking some valid traffic as well. A false positive is an instance of your WAF blocking a valid request.

False positives are the natural enemy of every WAF installation. Each false positive means two bad things: your WAF is working too hard, consuming compute resources in order to do something it shouldn't, and legitimate traffic is not being allowed to go through. The damage from a WAF that generates too many false positives could be as bad as the damage from a successful attack, and can lead you to abandon the use of your WAF in frustration.

Tuning your WAF installation to reduce false positives is a tedious process. This article will help you reduce false positives on NGINX, leaving you with a clean installation that allows legitimate requests to pass and blocks attacks immediately.

ModSecurity, the WAF engine, is most often used in coordination with the OWASP ModSecurity Core Rule Set (CRS). This creates a first line of defense against web application attacks, such as those described by the OWASP Top Ten project. The CRS is a rule set for scoring anomalies among incoming requests. It uses generic blacklisting techniques to detect attacks before they hit the application. The CRS also allows you to adjust the aggressiveness of the rule set, simply by changing its Paranoia Level in the configuration file, crs-setup.conf.

False positives intermixing with real attacks

The fear of blocking legitimate users due to false positives resulting from use of the CRS is real. If you have a substantial number of users, or a web application with suspicious looking traffic, then the number of alerts can be intimidating. The out-of-the-box CRS configuration has been tuned to aggressively reduce the number of false positives.
However, if you are not satisfied with the detection capabilities of the default installation, you will need to change the Paranoia Level to improve the coverage. Raising the Paranoia Level in the configuration file activates rules that are off by default. They are not part of the default installation at Paranoia Level 1 because they have a tendency to produce false positives. The higher the Paranoia Level setting, the more rules are enforced. Thus, the more aggressive the rule set becomes, the more false positives are produced.

Considering this, you need a strategy to mitigate false positives. If you allow them to intermix with traces of true attacks, they undermine the value of the rule set. So, you need to get rid of the false positives in order to end up with a clean installation that will let the legitimate requests pass and block attackers. The problem is threefold:

  1. How to identify a false positive
  2. How to deal with individual false positives
  3. What a practical approach looks like (or: how do you scale this?)

When false positives come by the dozen, it is surprisingly difficult to identify them. A deep knowledge of the application helps to tell benign, but suspicious, requests from malicious ones. But if you do not want to look at them one by one, you will need to filter the alerts and make sure you end up with a data set that consists of false positives only. Otherwise, you might end up tuning away real alerts that point to attacks taking place.

You can use the IP addresses to identify known users, known local networks, etc. Alternatively, you can assume that users who passed the authentication successfully are not attackers (which might be naive, depending on the size of your business). Or you can employ some other means of identification. The exact approach really depends on your setup and the testing process you have in place. When you have identified an individu[...]
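
The IP-based filtering idea above can be sketched as follows. This is a hedged illustration rather than part of the article's tooling: the trusted networks, the alert dictionaries, and the rule IDs are made-up sample data, and the sketch assumes the alerts have already been parsed out of the WAF log into Python dictionaries.

```python
import ipaddress
from collections import Counter

# Hypothetical trusted ranges (an assumption); substitute your own networks.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.1.0/24"),
]


def is_trusted(client_ip):
    """Return True if the address falls inside any trusted network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in TRUSTED_NETWORKS)


def false_positive_candidates(alerts):
    """Count alerts per rule ID, using only alerts from trusted clients."""
    return Counter(
        alert["rule_id"] for alert in alerts if is_trusted(alert["client_ip"])
    )


# Made-up alerts with illustrative rule IDs, assumed to be already parsed.
alerts = [
    {"client_ip": "10.1.2.3", "rule_id": "942100"},
    {"client_ip": "10.1.2.3", "rule_id": "942100"},
    {"client_ip": "192.168.1.20", "rule_id": "920273"},
    {"client_ip": "203.0.113.9", "rule_id": "942100"},
]

candidates = false_positive_candidates(alerts)
for rule_id, count in candidates.most_common():
    print(rule_id, count)
```

Alerts from the unknown address are left out of the count, so any tuning decision is based only on traffic you have reason to believe is benign. Whether an IP allowlist is a sound proxy for "known user" depends on your setup, as the text notes.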

Rapid data production with a multi-model database



Ingest the data you need in an agile manner.

The demand for data consumption greatly outweighs the data production capabilities of most organizations. Is this because there is a shortage of data? Absolutely not! Data is everywhere, and is generated at a rate faster than most IT systems can handle. One reason data-centric IT systems fail is that they must deal with not only a large amount of data, but also a variety of data formats and models.

Traditionally, in order to handle this variety of data, different types of databases get “bolted” together into a single complex architecture. Then, processes for data integration and synchronization are added so that each database is kept up to date with the data it requires to do its specific job. On top of this, we need to add another layer of complexity to accommodate data security, data provenance, failover, redundancy, backup, etc. This polyglot infrastructure also requires a layer for exposing data “views” so that data can be consumed. Because of all this complexity, it becomes difficult to be agile in producing the data that consumers demand.

To deal with the problems inherent in the above architecture, you can use a multi-model database. The multi-model database enables system implementers to rapidly ingest data as is, and expose data in an agile manner. The multi-model database achieves this using a single, unified back end. This means that functions like security, data indexing, and synchronization are all handled the same way, even if the data models and structures are completely different. For example, we can ingest JSON data from an external data source, store the data in JSON documents, and expose the data as relational data via SQL queries. This is just one example of the agility that can be achieved with a multi-model database.
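
The JSON-in, SQL-out example can be imitated in miniature with Python's standard library. This is a sketch under loud assumptions: SQLite is not a multi-model database, the `docs` table and the helper `json_field` are hypothetical names invented for the illustration, and the records are made up. It only shows the shape of the idea, which is to store documents unchanged and project them into rows at query time.

```python
import json
import sqlite3


def json_field(doc, key):
    """Return one top-level scalar field from a JSON document stored as text."""
    return json.loads(doc).get(key)


conn = sqlite3.connect(":memory:")
# Register the Python helper so SQL queries can call it (hypothetical name).
conn.create_function("json_field", 2, json_field)
conn.execute("CREATE TABLE docs (body TEXT)")

# Ingest the documents as-is; note that the second one has no "orders" field.
incoming = [
    {"name": "Ada", "city": "London", "orders": 3},
    {"name": "Grace", "city": "New York"},
    {"name": "Linus", "city": "Helsinki", "orders": 7},
]
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [(json.dumps(doc),) for doc in incoming],
)

# Expose the same documents as relational rows through a SQL query.
rows = conn.execute(
    "SELECT json_field(body, 'name') AS name, "
    "       json_field(body, 'orders') AS orders "
    "FROM docs "
    "WHERE json_field(body, 'orders') IS NOT NULL "
    "ORDER BY orders DESC"
).fetchall()
print(rows)
```

No schema was declared for `name` or `orders` before loading; the relational shape exists only in the query.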

Using a multi-model database platform results in a less complex architecture that can take in many types of data models, quickly adapt, and produce the data required by end users. This data-centric approach means that as existing requirements change or new requirements are discovered, data can quickly be brought in and exposed in many different formats with little or no upfront data modeling effort required.

Since the data is already “there,” IT organizations that use a multi-model database platform are seeing rapid time to market, which is really what agility is all about.

This post is a collaboration between O'Reilly and MarkLogic. See our statement of editorial independence.

Continue reading Rapid data production with a multi-model database.


Choosing a tool to track and mitigate open source security vulnerabilities



How to find the best Software Composition Analysis (SCA) for your organization

Choosing a Software Composition Analysis Solution

Continuously tracking your application’s dependencies for vulnerabilities and efficiently addressing them is no simple feat. In addition, this is a problem shared by all, and is not an area most companies would consider their core competency. Therefore, it is a great opportunity for the right set of tools to help tackle this concern.

As mentioned before, the category of tools addressing this concern is currently known as Software Composition Analysis (SCA). Throughout this book, I’ve referred to different capabilities an SCA solution may or may not have, and the implications therein. Those questions were meant to assist you in designing the right process and selecting the right tools to help.

Continue reading Choosing a tool to track and mitigate open source security vulnerabilities.


Four short links: 20 February 2018


House Simulations, XLS diff, Apache FaaS, and Opening Closed Code

  1. House Simulator -- for AI to learn how houses work. Realistic physics, and 120 scenes based on four room categories: kitchens, living rooms, bedrooms, and bathrooms. Written in Unity.
  2. Git xltrail -- meaningful diffs of XLS files in Git repos.
  3. OpenWhisk -- Apache incubating a function-as-a-service (Lambda) package.
  4. How to Open Up Closed Code (GDS) -- Your team may have old closed code that it needs to open. If there is a lot of closed code, this can be challenging. Here are three ways to open it up.

Continue reading Four short links: 20 February 2018.


Four short links: 19 February 2018


Disambiguation, Learning to Code, Open Source BI, and API Hierarchy

  1. Discovering Types for Disambiguation -- clever! Clump Wikipedia entries into categories, then use the categories to see which meaning of a word (e.g., Jaguar the car, the animal, or the aircraft) best fits the other words in the sentence.
  2. Learning to Program is Getting Harder (Allen Downey) -- The problem is that GUIs hide a lot of information programmers need to know. So, when a user decides to become a programmer, they are suddenly confronted with all the information that's been hidden from them. If someone just wants to learn to program, they shouldn't have to learn operating system concepts first. (via Slashdot)
  3. Apache Superset -- incubating a modern, enterprise-ready business intelligence web application.
  4. Exploring API Security -- an API ecosphere that is open by default, but actively identifies and minimizes harm, rather than over-complicating security requirements or simply performing a compliance activity. The pyramid diagram will be useful if you ever have to communicate the requirements for an API...

Continue reading Four short links: 19 February 2018.


Build generative models using Apache MXNet


A step-by-step tutorial to build generative models through generative adversarial networks (GANs) to generate a new image from existing images.

In our previous notebooks, we used a deep learning technique called a convolutional neural network (CNN) to classify text and images. A CNN is an example of a discriminative model, which creates a decision boundary to classify a given input signal (data) as either being in or out of a classification, such as email spam.

Deep learning models in recent times have been used to create even more powerful and useful models called generative models. A generative model doesn't just create a decision boundary, but understands the underlying distribution of values. Using this insight, a generative model can also generate new data or classify given input data.

Here are some examples of what generative models can do:

  1. Producing a new song or combining two genres of songs to create an entirely different song
  2. Synthesizing new images from existing images
  3. Upgrading images to a higher resolution in order to remove fuzziness, improve image quality, and much more

In general, generative models can be used on any form of data to learn the underlying distribution, generate new data, and augment existing data.

In this tutorial, we are going to build generative models through generative adversarial networks (GANs) to generate a new image from existing images. Our code will use Apache MXNet’s Gluon API.

By the end of the notebook, you will be able to:

  1. Understand generative models
  2. Place generative models into the context of deep neural networks
  3. Implement a generative adversarial network (GAN)

How generative models go further than discriminative models

Let's see the power of generative models using a trivial example. The following depicts the heights of 10 humans and Martians.
Martian (height in centimeters): 250, 260, 270, 300, 220, 260, 280, 290, 300, 310

Human (height in centimeters): 160, 170, 180, 190, 175, 140, 180, 210, 140, 200

The heights of human beings follow a normal distribution, showing up as a bell-shaped curve on a graph (see Figure 1). Martians tend to be much taller than humans, but also have a normal distribution. So, let's input the heights of humans and Martians into both discriminative and generative models.

Figure 1. Graph of sample human and Martian heights. Image by Manu Jeevan.

If we train a discriminative model, it will just plot a decision boundary (see Figure 2). The model misclassifies just one human, so the accuracy is quite good overall. But the model doesn't learn about the underlying distribution of data, so it is not suitable for building the powerful applications listed in the beginning of this article.

Figure 2. Boundary between humans and Martians, as found by a discriminative model. Image by Manu Jeevan.

In contrast, a generative model will learn the underlying distribution (lower-dimensional representation) for Martians (mean=274, std≈27.57) and humans (mean=174.5, std≈23.15); these are the sample statistics of the ten heights listed above. If we know the normal distribution for Martians, we can produce new data by generating a random number between 0 and 1 (uniform distribution) and then querying the normal distribution of Martians to get a value: say, 275 cm. Using the underlying distribution, we can generate new Martians and humans, or a new interbreed species (humars). We have infinite ways to generate data because we can manipulate the underlying distribution of data. We can also use this model for classifying Martians and humans, just like the discriminative model. For a concrete understanding of generative versus discriminative models, please check the article "Generative and Discriminative Text Classification with Recurrent Neural Networks" by Yogatama, et al.

Examples of discriminative models include logistic regression and suppor[...]
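
The sampling step described above (draw a uniform number between 0 and 1, then query the normal distribution) is inverse transform sampling, and it can be sketched with Python's standard library. This is a minimal illustration and not the tutorial's GAN code: it assumes Python 3.8 or later for `statistics.NormalDist`, fits each distribution from the ten listed heights using the sample standard deviation, and the helper names `sample_height` and `classify` are hypothetical.

```python
import random
from statistics import NormalDist

martian_heights = [250, 260, 270, 300, 220, 260, 280, 290, 300, 310]
human_heights = [160, 170, 180, 190, 175, 140, 180, 210, 140, 200]

# "Learn" each distribution: estimate mean and sample std from the data.
martians = NormalDist.from_samples(martian_heights)
humans = NormalDist.from_samples(human_heights)


def sample_height(dist, rng):
    """Inverse transform sampling: uniform draw -> inverse CDF lookup."""
    u = rng.random()
    while u == 0.0:  # inv_cdf is only defined strictly between 0 and 1
        u = rng.random()
    return dist.inv_cdf(u)


def classify(height):
    """Pick the species whose fitted density is higher at this height."""
    if martians.pdf(height) > humans.pdf(height):
        return "Martian"
    return "Human"


rng = random.Random(42)
new_martians = [sample_height(martians, rng) for _ in range(5)]
print(martians.mean, martians.stdev)
print(new_martians)
print(classify(275), classify(170))
```

The same fitted distributions serve both purposes the text mentions: `sample_height` generates new data, and `classify` reuses the densities to label a height. Comparing densities directly assumes both species are equally likely a priori, which is an added simplification.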

Four short links: 16 February 2018


Machine Design, Metrics, Layered Learning, and Automatically Mergeable Data Structure

  1. Towards Designing Machines -- survey of theory and approaches to building machines that can design things.
  2. Review of the Tyranny of Metrics (Tim Harford) -- Rather than rely on the informed judgment of people familiar with the situation, we gather meaningless numbers at great cost. We then use them to guide our actions, predictably causing unintended damage.
  3. Physics Travel Guide -- a tool that makes learning physics easier. Each page here contains three layers which contain explanations with increasing level of sophistication. We call these layers: layman, student and researcher. These layers make sure that readers can always find an explanation they understand. One of these for security or coding would be interesting.
  4. Automerge -- A JSON-like data structure that can be modified concurrently by different users, and merged again automatically.

Continue reading Four short links: 16 February 2018.


Graphs as the front end for machine learning



The O’Reilly Data Show Podcast: Leo Meyerovich on building large-scale, interactive applications that enable visual investigations.

In this episode of the Data Show, I spoke with Leo Meyerovich, co-founder and CEO of Graphistry. Graphs have always been part of the big data revolution (think of the large graphs generated by the early social media startups). In recent months, I’ve come across companies releasing and using new tools for creating, storing, and (most importantly) analyzing large graphs. There are many problems and use cases that lend themselves naturally to graphs, and recent advances in hardware and software building blocks have made large-scale analytics possible.

Starting with his work as a graduate student at UC Berkeley, Meyerovich has pioneered the combination of hardware and software acceleration to create truly interactive environments for visualizing large amounts of data. Graphistry has built a suite of tools that enables analysts to wade through large data sets and investigate business and security incidents. The company is currently focused on the security domain—where it turns out that graph representations of data are things security analysts are quite familiar with.

Continue reading Graphs as the front end for machine learning.


Four short links: 15 February 2018


Donut Drones, Consensus Algorithms, 2FA Spam, and Replacing Founders

  1. Donut Drone (IEEE) -- clever drone that is collision-safe. Nice!
  2. Hitchhiker's Guide to Consensus Algorithms -- In the world of crypto, consensus algorithms exist to prevent double spending. Here’s a quick rundown on some of the most popular consensus algorithms to date, from blockchains to DAGs and everything in-between.
  3. Facebook Spamming Users via Their 2FA Numbers (Mashable) -- when your profits are proportional to engagement, your business model turns your business into a junkie. It will cajole, stalk, berate, and trap users to feed its engagement addiction.
  4. What Happens When Startups Replace The Founder? (HBR) -- about 20% are replaced; noncompete laws help/hinder recruitment; it's overall beneficial; startups perform better when the founder leaves the company; raising external funding raises the probability that the founder will be replaced.

Continue reading Four short links: 15 February 2018.


Working with data in the financial industry: Legal considerations



Alysa Hutnik discusses the Fair Credit Reporting Act, the Equal Credit Opportunity Act, the Gramm-Leach-Bliley Act, and the FTC’s focus on FinTech.

Continue reading Working with data in the financial industry: Legal considerations.


Building deep learning neural networks using TensorFlow layers


A step-by-step tutorial on how to use TensorFlow to build a multi-layered convolutional network.

Deep learning has proven its effectiveness in many fields, such as computer vision, natural language processing (NLP), text translation, or speech to text. It takes its name from the high number of layers used to build the neural network performing machine learning tasks. There are several types of layers as well as overall network architectures, but the general rule holds that the deeper the network is, the more complexity it can grasp. This article will explain fundamental concepts of neural network layers and walk through the process of creating several types using TensorFlow.

TensorFlow is the platform that contributed to making artificial intelligence (AI) available to the broader public. It’s an open source library with a vast community and great support. TensorFlow provides a set of tools for building neural network architectures, and then training and serving the models. It offers different levels of abstraction, so you can use it for cut-and-dried machine learning processes at a high level or go more in-depth and write the low-level calculations yourself.

TensorFlow offers many kinds of layers in its tf.layers package. The module makes it easy to create a layer in the deep learning model without going into many details. At the moment, it supports types of layers used mostly in convolutional networks. For other types of networks, like RNNs, you may need to look at tf.contrib.rnn or tf.nn. The most basic type of layer is the fully connected one. To implement it, you only need to set up the input and the size in the Dense class. Other kinds of layers might require more parameters, but they are implemented in a way to cover the default behaviour and spare the developers’ time.

There is some disagreement on what a layer is and what it is not. One opinion states that a layer must store trained parameters (like weights and biases). This means, for instance, that applying the activation function is not another layer. Indeed, tf.layers implements such a function by using the activation parameter. Layers introduced in the module don’t always strictly follow this rule, though. You can find a large range of types there: fully connected, convolution, pooling, flatten, batch normalization, dropout, and convolution transpose. It may seem that, for example, layer flattening and max pooling don’t store any parameters trained in the learning process. Nonetheless, they perform more complex operations than an activation function, so the authors of the module decided to set them up as separate classes. Later in the article, we’ll discuss how to use some of them to build a deep convolutional network.

A typical convolutional network is a sequence of convolution and pooling pairs, followed by a few fully connected layers. A convolution is like a small neural network that is applied repeatedly, once at each location on its input. As a result, the network layers become much smaller but increase in depth. Pooling is the operation that usually decreases the size of the input image. Max pooling is the most common pooling algorithm, and has proven to be effective in many computer vision tasks.

In this article, I’ll show the use of TensorFlow in applying a convolutional network to image processing, using the MNIST data set for our example. The task is to recognize a digit ranging from 0 to 9 from its handwritten representation.

First, TensorFlow has the capabilities to load the data. All you need to do is to use the input_data module:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_[...]
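
To make the convolution and max pooling descriptions above concrete, here is a small framework-free sketch in plain Python. It deliberately does not use TensorFlow, so none of it reflects the tf.layers API; the function names `conv2d_valid` and `max_pool_2x2`, the 5x5 input, and the 2x2 kernel are all hypothetical choices for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over every position where it fits entirely.

    Like most deep learning libraries, this computes cross-correlation
    (the kernel is not flipped).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [
            max(
                feature_map[r][c],
                feature_map[r][c + 1],
                feature_map[r + 1][c],
                feature_map[r + 1][c + 1],
            )
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


image = [
    [1, 0, 2, 1, 0],
    [0, 1, 3, 0, 1],
    [2, 1, 0, 1, 2],
    [1, 0, 1, 3, 0],
    [0, 2, 1, 0, 1],
]
kernel = [[1, 0], [0, 1]]  # adds each pixel to its lower-right neighbor

features = conv2d_valid(image, kernel)  # 5x5 input -> 4x4 feature map
pooled = max_pool_2x2(features)         # 4x4 feature map -> 2x2 output
print(features)
print(pooled)
```

The same small kernel is applied at every location, which is the "small neural network applied repeatedly" picture from the text, and pooling then halves each spatial dimension. Neither function has trainable state beyond the kernel values, which matches the point that pooling stores no learned parameters.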

Modern data is continuous, diverse, and ever accelerating



How companies such as athenahealth can transform legacy data into insights.

Application development in the big data space has changed rapidly over the past decade, driven by three factors: the need for continuous availability, the richness of diverse data processing pipelines, and the pressure for accelerated development. Here we look at each of these factors and how they combine to require new environments for data processing. We also look at how one company, athenahealth, is adjusting its legacy systems, for billing, scheduling, and treatment for health care providers, to accommodate these trends, using the Mesosphere DC/OS platform.

Continuous availability: Data is always on

This is a world where people could be working or visiting your site at three in the morning, and if it's unavailable or slow you'll be hearing complaints. Failure recovery and scaling are both critical capabilities. These used to be handled separately: failure recovery revolved around heartbeats and redundant resources, whereas scaling was an element of long-term planning by IT management. But the two capabilities are now handled through the same kinds of operations. This makes sense, because they both require monitoring and resource management.

Continue reading Modern data is continuous, diverse, and ever accelerating.


Four short links: 14 February 2018


CS Ethics, Experience the Retail Struggle, Front-End Interview Handbook, and Label Shift

  1. New CS Ethics Courses (NYT) -- Harvard, MIT, Stanford, and UT Austin all offering ethics classes around the challenges that computer scientists and programmers face as they research and develop the future.
  2. American Mall -- Bloomberg's mock-retro game to illustrate the difficulties of keeping American retail malls open. I'm a huge fan of using games to let people experience/simulate a situation.
  3. Front End Interview Handbook -- Answers to front-end interview questions. I can’t begin to imagine the rate of change in this repository.
  4. Detecting and Correcting for Label Shift with Black Box Predictors -- Faced with distribution shift between training and test set, we wish to detect and quantify the shift, and to correct our classifiers without test set labels. Nice. For when you discover that your training set underrepresented one of the variables. (Their example is: trained on a data set with 0.2% pneumonia occurrence but now you learn that pneumonia has 5% prevalence in the population.)
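
The pneumonia example in the label shift item can be made concrete with a little arithmetic. The sketch below applies the textbook prior-shift correction to a single predicted probability, assuming the new prevalence is already known. It does not reproduce the paper's estimator, which has to work out the shift without test set labels; the function name and the 0.10 model output are hypothetical.

```python
def adjust_for_prior_shift(p_positive, train_prior, target_prior):
    """Re-weight a predicted probability when class prevalence changes.

    Assumes only the class priors differ between training and deployment,
    and that the deployment prior is known.
    """
    weight_pos = target_prior / train_prior
    weight_neg = (1.0 - target_prior) / (1.0 - train_prior)
    numerator = p_positive * weight_pos
    return numerator / (numerator + (1.0 - p_positive) * weight_neg)


# A model trained at 0.2% prevalence outputs 0.10 for some patient
# (made-up value); the population prevalence turns out to be 5%.
adjusted = adjust_for_prior_shift(0.10, train_prior=0.002, target_prior=0.05)
print(round(adjusted, 3))
```

With a 25-fold increase in prevalence, a prediction that looked unlikely under the training prior becomes more likely than not, which is why an uncorrected classifier would badly under-call the condition.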

Continue reading Four short links: 14 February 2018.


Just released: 30+ new live online trainings on O'Reilly's learning platform


Get instructor-led training in Python, React, PMP, security, and more.

We just opened up more than 30 new live online trainings for February, March, and April on our learning platform. These trainings give you hands-on instruction from expert practitioners in critical topics. Space is limited and these trainings often fill up.

  1. Reactive Spring and Spring Boot, February 27
  2. Python: Beyond the Basics, March 1-2
  3. Having Difficult Conversations, March 2
  4. Python: The Next Level, March 5-6
  5. Python for Applications: Beyond Scripts, March 7-8
  6. Medium R Programming: Beyond the Basics, March 8-9
  7. Building Effective and Adaptive Teams, March 12
  8. Getting Started with React and Redux, March 12
  9. Pythonic Object-Oriented Programming, March 12
  10. Test-Driven Development in Python, March 13
  11. Building a Deployment Pipeline with Jenkins 2, March 14-15
  12. Python: Beyond the Basics, March 14-15
  13. Intro to Deep Learning Part 1: Theory and Practice Featuring Keras, March 19
  14. Networking in AWS, March 19-20
  15. Python: The Next Level, March 19-20
  16. Migrating Jenkins Environments to Jenkins 2, March 19 and 21
  17. Java 8 Generics in 3 Hours, March 20
  18. Python for Applications: Beyond Scripts, March 21-22
  19. Apache Hadoop, Spark and Big Data Foundations, March 23
  20. PMP Crash Course, March 26-27
  21. Pythonic Object-Oriented Programming, March 28
  22. Mastering Relational SQL Querying, March 28-29
  23. Test-Driven Development in Python, March 29
  24. Cyber Security Fundamentals, March 29-30
  25. Hands-on Introduction to Apache Hadoop and Spark Programming, March 29-30
  26. Design Thinking: Practice and Measurement Essentials, April 2
  27. From Monolith to Microservices, April 4-5
  28. Design Thinking: 90-Minute Introduction, April 5
  29. Building Chatbots with AWS, April 6
  30. Deep Learning for NLP, April 6
  31. Analyzing Container Performance, April 9
  32. Architecture Without an End State, April 9-10
  33. Design Patterns Boot Camp, April 9-10
  34. Working with Web Push Notifications, April 10-11
  35. High Performance Machine Learning and Data Analysis with Julia, April 12
  36. From Developer to Software Architect, April 16-17
  37. Microservices Architecture and Design, April 16-17
  38. Fundamental PostgreSQL, April 17-18

Visit our learning platform for more information on these and our other live online trainings.

Continue reading Just released: 30+ new live online trainings on O'Reilly's learning platform.[...]

The fundamentals of voice design by way of voice enabling



Learn how to integrate voice with your product.

Continue reading The fundamentals of voice design by way of voice enabling.


Iterative data modeling to avoid dreaded ETL


Gain agility by loading first and transforming later.

In today’s world full of big data, every large enterprise faces a similar problem: how do I leverage my data more effectively when it’s spread across dozens or even hundreds of systems? Businesses build mission-critical business applications on relational databases filled with structured data, and they also have unstructured data to worry about (think patient notes, photos, reports, etc.). They want to get a better grasp of all this data, and build new applications that leverage it to innovate and better serve their customers.

The ETL problem

Integrating data from various silos into a relational database requires significant investment in the extract, transform, load (ETL) phase of any data project. Before building an application that leverages integrated data, data architects must first reconcile all of the data in their source systems, finalizing the schema before the data can be ingested. This data modeling effort may take years. And additional effort will be necessary with each change in an input system's data schema or in an application requirement. This approach is not agile, and in today’s world, it means that a business constantly plays catch-up. Not to mention, ETL tools and the work that goes into using them can eat up 60% of a project’s budget, despite providing little additional value (see the TDWI report Evaluating ETL and Data Integration Platforms). The meaningful work of building an application and delivering actual value only begins after all the ETL work is complete.

The ELT solution

No, “ELT” is not a typo. Instead of ETL, the flexibility of a document database makes it possible to extract, load...and then transform (hence, "ELT"). This process, known as “schema-on-read” (instead of the traditional schema-on-write), lets you apply your own lens to the data when you read it back out. So instead of requiring a schema first, before doing anything with your data, you can use the latent schema already present in the data and update this existing schema later as desired or needed. That means taking all of your data, from all of your systems—structured and unstructured, however it comes—and ingesting it as is. Developers can start using it immediately to build applications. By loading it into a database that can support different schemas and data types (document, RDF, geospatial, binary, SQL, and text), data architects don’t have to worry about defining the schema, type, or format up front and can focus instead on how to use that data down the line.

Once loaded, it is possible to iteratively make adjustments to the data as needed to address current requirements. Now you can transform that data, harmonize it, and make it usable for your business needs, as you need it. Over time, as requirements and downstream systems change, so might your data transformations. In part three of the recently released MarkLogic Cookbook, Dave Cassel illustrates a variety of ways to transform and harmonize data in MarkLogic after data has been loaded. In fact, he shows how you can transform around a given field as you load. Of course, from a governance standpoint, you don’t want to actually change the data. You can use the MarkLogic Envelope Pattern to wrap newly harmonized data around the original data to preserve its original form. You can also transform the data that gets stored in indexes without physically changing the data stored in documents. And, finally, you can use the pl[...]
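
The load-first, transform-later idea and the envelope approach described above can be sketched in a few lines of Python. This is only an illustration that uses plain dictionaries as a stand-in for a document store. The sample records, the `FIELD_MAPS` table, the `harmonize` function, and the `envelope`/`canonical`/`instance` key names are all hypothetical; none of this is MarkLogic's API.

```python
import copy

# Hypothetical raw records from two source systems with different field names.
raw_records = [
    {"source": "crm", "doc": {"FirstName": "Ada", "LastName": "Lovelace"}},
    {"source": "billing", "doc": {"first_name": "Alan", "surname": "Turing"}},
]


def load(records):
    """Load step: ingest every record as is, with no up-front schema."""
    return [copy.deepcopy(r) for r in records]


# Hypothetical per-source field mappings. These can be added to or revised
# later, after the data is already loaded and in use.
FIELD_MAPS = {
    "crm": {"FirstName": "given_name", "LastName": "family_name"},
    "billing": {"first_name": "given_name", "surname": "family_name"},
}


def harmonize(record):
    """Transform step: wrap the untouched original in an envelope
    next to a harmonized (canonical) view of the same data."""
    mapping = FIELD_MAPS.get(record["source"], {})
    canonical = {mapping[k]: v for k, v in record["doc"].items() if k in mapping}
    return {"envelope": {"canonical": canonical, "instance": record["doc"]}}


store = load(raw_records)
enveloped = [harmonize(r) for r in store]
for e in enveloped:
    print(e["envelope"]["canonical"])
```

Because the original document is kept under `instance`, a later change to the mappings only requires re-running `harmonize`; nothing has to be re-extracted from the source systems.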

How neural networks learn distributed representations


Deep learning’s effectiveness is often attributed to the ability of neural networks to learn rich representations of data.

The concept of distributed representations is often central to deep learning, particularly as it applies to natural language tasks. Those beginning in the field may quickly understand this as simply a vector that represents some piece of data. While this is true, understanding distributed representations at a more conceptual level increases our appreciation of the role they play in making deep learning so effective.

To examine different types of representation, we can do a simple thought exercise. Let’s say we have a bunch of “memory units” to store information about shapes. We can choose to represent each individual shape with a single memory unit, as demonstrated in Figure 1.

Figure 1. Sparse or local, non-distributed representation of shapes. Image by Garrett Hoffman.

This non-distributed representation, referred to as "sparse" or "local," is inefficient in multiple ways. First, the dimensionality of our representation will grow as the number of shapes we observe grows. More importantly, it doesn’t provide any information about how these shapes relate to each other. This is the true value of a distributed representation: its ability to capture meaningful “semantic similarity” between data through concepts.

Figure 2. Distributed representation of shapes. Image by Garrett Hoffman.

Figure 2 shows a distributed representation of this same set of shapes where information about the shape is represented with multiple “memory units” for concepts related to orientation and shape. Now the “memory units” contain information both about an individual shape and how each shape relates to each other. When we come across a new shape with our distributed representation, such as the circle in Figure 3, we don’t increase the dimensionality and we also know some information about the circle, as it relates to the other shapes, even though we haven’t seen it before.

Figure 3. Distributed representation of a circle; this representation is more useful as it provides us with information about how this new shape is related to our other shapes. Image by Garrett Hoffman.

While this shape example is oversimplified, it serves as a great high-level, abstract introduction to distributed representations. Notice, in the case of our distributed representation for shapes, that we selected four concepts or features (vertical, horizontal, rectangle, ellipse) for our representation. In this case, we were required to know what these important and distinguishing features were beforehand, and in many cases, this is a difficult or impossible thing to know. It is for this reason that feature engineering is such a crucial task in classical machine learning techniques. Finding a good representation of our data is critical to the success of downstream tasks like classification or clustering. One of the reasons that deep learning has seen tremendous success is a neural network's ability to learn rich distributed representations of data.

To examine this, we will revisit the problem we tackled in our LSTM tutorial—predicting stock market sentiment from social media posts from StockTwits. In this tutorial, we built a multi-layered LSTM to predict the sentiment of a message from the raw body of text. When processing our message data, we created a mapping of our vocabulary to an integer index. This mapping of[...]
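
The shape thought exercise above can be written out as a small Python sketch. The four feature names come from the text; the set of four shapes and the way the circle is encoded (as an ellipse that is both vertical and horizontal) are assumptions made for illustration, since the figures are not reproduced here.

```python
# Feature order for the distributed representation, taken from the text.
FEATURES = ["vertical", "horizontal", "rectangle", "ellipse"]

# Assumed shape set: a guess at what the figures show.
DISTRIBUTED = {
    "vertical rectangle":   [1, 0, 1, 0],
    "horizontal rectangle": [0, 1, 1, 0],
    "vertical ellipse":     [1, 0, 0, 1],
    "horizontal ellipse":   [0, 1, 0, 1],
}


def one_hot(names):
    """Local representation: one memory unit per shape."""
    return {n: [1 if i == j else 0 for j in range(len(names))]
            for i, n in enumerate(names)}


def overlap(a, b):
    """Count how many active units two representations share."""
    return sum(x * y for x, y in zip(a, b))


LOCAL = one_hot(list(DISTRIBUTED))

# A new shape needs no new dimension in the distributed scheme.
# Assumption: a circle is an ellipse that is equally vertical and horizontal.
circle = [1, 1, 0, 1]

# Local units never overlap, so they say nothing about how shapes relate.
print(overlap(LOCAL["vertical rectangle"], LOCAL["vertical ellipse"]))
# Distributed units overlap on the shared "vertical" concept.
print(overlap(DISTRIBUTED["vertical rectangle"], DISTRIBUTED["vertical ellipse"]))
# The unseen circle is already comparable to every known shape.
print({name: overlap(circle, vec) for name, vec in DISTRIBUTED.items()})
```

In the local scheme, adding the circle would mean adding a fifth unit to every vector. In the distributed scheme, the vector length stays at four and the circle comes out closer to the ellipses than to the rectangles.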

Four short links: 13 February 2018


Machine Learning, CSP Reporting, Remembering Learning, and Viz for Human Rights

  1. Prodigy -- Radically efficient machine teaching. An annotation tool powered by active learning.
  2. Report URI JS -- content security policies are awesome, but they are enforced on the browser before your server sees any requests. Use this script to find out what is being blocked by your CSP. (via BoingBoing)
  3. I Wrote Down Everything I Learned While Programming for a Month -- I do this and find it hugely valuable. It's one thing to say "I'm learning all the time" but another to actually be able to point to what you're learning.
  4. Visualizing Data for Human Rights Advocacy -- A guidebook and workshop activity.

Continue reading Four short links: 13 February 2018.


4 trends in security data science for 2018


A glimpse into what lies ahead for response automation, model compliance, and repeatable experiments.

This is the third consecutive year I’ve tried to read the tea leaves for security analytics. Last year’s trends post manifested well: from a rise in adversarial machine learning (ML) to the deep learning craze (such that entire conferences are now dedicated to this subject). This year, Hyrum Anderson, technical director of data science from Endgame, joins me in calling out the trends in security data science for 2018. We present a 360-degree view of the security data science landscape—from unicorn startups to established enterprises.

The format mostly remains the same: four trends to map to each quarter of the year. For each trend, we provide a rationale about why the time is right to capitalize on the trend, offer practical tips on what you can do now to join the conversation, and include links to papers, GitHub repositories, tools, and tutorials. We also added a new section “What won’t happen in 2018” to help readers look beyond the marketing material and stay clear of hype.

1. Machine learning for response (semi-)automation

In 2016, we predicted a shift from detection to intelligent investigation. In 2018, we’re predicting a shift from rich investigative information toward distilled recommended actions, backed by information-rich incident reports. Infosec analysts have long stopped clamoring for “more alerts!” from security providers. In the coming year, we’ll see increased customer appetite for products to recommend actions based on solid evidence. Machine learning has, in large part, proven itself a valuable tool for detecting evidence of threats used to compile an incident report. Security professionals subconsciously train themselves to respond to (or ignore) the evidence of an incident in a certain way. The linchpin to scale in information security rests still on the information security analyst, and many of the knee-jerk responses can be automated. In some cases, the response might be ML-automated, but in many others it will be at least ML-recommended.

Why now?

The information overload pain point is as old as IDS technology—not a new problem for machine learning to tackle—and some in the industry have invested in ML-based (semi-)automated remediation. However, there are a few pressures driving more widespread application of ML to simplify response through ML distillation rather than complicate with additional evidence: (1) market pressure to optimize workflows instead of alerts—to scale human response, (2) diminishing returns on reducing time-to-detect compared to time-to-remediate.

What can you do?

Assess remediation workflows of security analysts in your organization: (1) What pieces of evidence related to the incident provide high enough confidence to respond? (2) What evidence determines how to respond? (3) For a typical incident, how many decisions must be made during remediation? (4) How long does remediation take for a typical incident? (5) What is currently being automated reliably? (6) What tasks could still be automated? Don’t force a solution on security analysts—chances are, they are creating custom remediation scripts in PowerShell or bash. You may already be using a mixed bag of commercial and open source tools for remediation (e.g., Ansible to task commands to different groups, or open source @davehull’s Kan[...]

Responding to new open source vulnerability disclosures



Best practices for quick remediation and response

Responding to New Vulnerability Disclosures

The techniques to find, fix, and prevent vulnerable dependencies are very similar to other quality controls. They revolve around issues in our application, and maintaining quality as the application changes. The last piece in the vulnerable library puzzle is a bit different.

In addition to their known vulnerabilities, the libraries you use also contain unknown vulnerabilities. Every now and then, somebody (typically a library’s authors, its users, or security researchers) will discover and report such a vulnerability. Once a vulnerability is discovered and publicly disclosed, you need to be ready to test your applications for it and fix the findings quickly—before attackers exploit it.

Continue reading Responding to new open source vulnerability disclosures.


10 software architecture resources on O'Reilly's online learning platform



Learn about new architecture patterns, event-driven microservices, fast data, and more.

Get a fresh start on building a new skill or augment what you currently know with one of these new and popular titles on O'Reilly's online learning platform.


Continue reading 10 software architecture resources on O'Reilly's online learning platform.


Four short links: 12 February 2018


Tech vs. Culture, Fairness and Accountability, People Typeface, and Reproducibility Suite

  1. Containers Will Not Fix Your Broken Culture (Bridget Kromhout) -- words of truth in the tech industry, but "{some tech thing} will not fix your broken culture" is true everywhere (e.g., iPads in schools, chatbots in customer-hating organizations, etc.)
  2. FAT -- proceedings from Conference on Fairness, Accountability, and Transparency in machine learning research.
  3. Wee People -- A typeface of people silhouettes, to make it easy to build web graphics featuring little people instead of dots. (via Flowing Data)
  4. Stencila -- The office suite for reproducible research. Like a cross between a word processor and a spreadsheet. Almost a Jupyter-style notebook, but WYSIWYG and with a different underlying structure. One to watch!

Continue reading Four short links: 12 February 2018.


Four short links: 9 February 2018


Small GUI, Dangerous URLs, Face-Recognition Glasses, and The Future is Hard

  1. Nuklear -- a single-header ANSI C GUI library, with a lot of bindings (Python, Golang, C#, etc.). (via Hacker News)
  2. unfurl -- a tool that analyzes large collections of URLs and estimates their entropies to sift out URLs that might be vulnerable to attack. (via this blog)
  3. Chinese Police Using Face Recognition Glasses -- In China, people must use identity documents for train travel. This rule works to prevent people with excessive debt from using high-speed trains, and limit the movement of religious minorities who have had identity documents confiscated and can wait years to get a valid passport. We asked for glasses that would help us remember people's names, we got Robocop 0.5a/BETA2FINAL. {Obligatory "Black Mirror" reference goes here} (via BoingBoing)
  4. Why I Barely Read SF These Days (Charlie Stross) -- SF should—in my view—be draining the ocean and trying to see at a glance which of the gasping, flopping creatures on the sea bed might be lungfish. But too much SF shrugs at the state of our seas and settles for draining the local aquarium, or even just the bathtub, instead. In pathological cases, it settles for gazing into the depths of a brightly coloured computer-generated fishtank screensaver. Earlier in the essay he talks about how the first to a field defines the tropes and borders that others play in, and it's remarkably hard to find authors who can and will break out of them. (via Matt Jones)
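
The unfurl entry above mentions estimating the entropy of URLs. As a rough sketch of that general idea (not unfurl's actual implementation, which is not described here), the following Python computes Shannon entropy over the characters of each query-parameter value and flags values that look easy to guess. The `low_entropy_params` helper, the example URLs, and the threshold of 2.0 bits per character are all hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlparse, parse_qsl


def shannon_entropy(text):
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_entropy_params(url, threshold=2.0):
    """Return (name, value) query parameters whose value entropy falls
    below a hypothetical threshold, i.e. values that look guessable."""
    query = urlparse(url).query
    return [(k, v) for k, v in parse_qsl(query) if shannon_entropy(v) < threshold]


urls = [
    "https://example.com/doc?token=1111",
    "https://example.com/doc?token=9f2c7a1be84d",
]
for u in urls:
    print(u, low_entropy_params(u))
```

A real tool would likely need to account for value length and character set as well, since a short value can score low on per-character entropy without being a secret at all.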

Continue reading Four short links: 9 February 2018.


Richard Warburton and Raoul-Gabriel Urma on Java 8 and Reactive Programming



The O’Reilly Programming Podcast: Building reactive applications.

In this episode of the O’Reilly Programming Podcast, I talk with Richard Warburton and Raoul-Gabriel Urma of Iteratr Learning. They are the presenters of a series of O’Reilly Learning Paths, including Getting Started with Reactive Programming and Build Reactive Applications in Java 8. Warburton is the author of Java 8 Lambdas, and Urma is the author of Java 8 in Action.

Continue reading Richard Warburton and Raoul-Gabriel Urma on Java 8 and Reactive Programming.


5 best practices when requesting visuals for your content



What a design request should look like when you're talking to an external entity.

Continue reading 5 best practices when requesting visuals for your content.


Four short links: 8 February 2018


Data for Problems, Quantum Algorithms, Network Transparency, and AI + Humans

  1. Solving Public Problems With Data -- an introduction to data science and data analytical thinking in the public interest. Online lecture series. Beth Noveck gives one of them. (via The Gov Lab)
  2. Quantum Algorithms: An Overview -- Here we briefly survey some known quantum algorithms, with an emphasis on a broad overview of their applications rather than their technical details. We include a discussion of recent developments and near-term applications of quantum algorithms. (via A Paper A Day)
  3. X11's Network Transparency is Largely a Failure -- Basic X clients that use X properties for everything may be genuinely network transparent, but there are very few of those left these days.
  4. How to Become a Centaur -- When you create a Human+AI team, the hard part isn’t the "AI". It isn’t even the “Human”. It’s the “+”. Interesting history and current state of human and AI systems. (via Tom Stafford)

Continue reading Four short links: 8 February 2018.


HVMN’s better-body biohacking



Learn how biohacking is unlocking human potential.

Technology is unique in the fact that its improvement provides an intuitive next step. Products are refined, updated, and necessarily upgraded at any given time. Optimization is never viewed as a bonus in the tech industry; it’s the name of the game.

But what happens when we attempt to expand optimization goals to include the very facilitators of progress: our minds? Crossing the boundary between hard science and pseudoscience, biohacking companies are exploring the principle of “upgrading” the human body in the hopes that our inherited genetics are more malleable than we think. One such company is HVMN (pronounced “human”). The company’s main product is NOOTROBOX, a line of nootropics or colloquially-dubbed “smart drugs” meant to enhance neural performance in areas such as memory, learning, and focus.

Continue reading HVMN’s better-body biohacking.


Delivering effective communication in software teams



Optimize for business value with clear feedback loops and quality standards.

We’ve had the privilege to work with many clients from different business sectors. Each client has granted us the opportunity to see how their teams perceive the value of software within their organizations. We’ve also witnessed how the same types of systems (e.g. ERPs) in competing organizations raise completely different problems and challenges. As a result of these experiences, we’ve come to understand that the key to building high-quality software architecture is effective communication between every team member involved in the project who expects to gain value from a software system.

So, if you’re a software architect or developer and you want to improve your architectures or codebases, you’ll have to address the organizational parts as well. Research conducted by Graziotin et al. [1] states that software development is dominated by these often-neglected organizational elements, and that the key to high-quality software and productive developers is the happiness and satisfaction of those developers. In turn, the key to happy and productive developers is empowerment - both on an organizational and technical level.

Continue reading Delivering effective communication in software teams.


Four short links: 7 February 2018


Identity Advice, Customer Feedback, Fun Toy, and Reproducibility Resources

  1. 12 Best Practices for User Account, Authorization, and Password Management (Google) -- Your users are not an email address. They're not a phone number. They're not the unique ID provided by an OAUTH response. Your users are the culmination of their unique, personalized data and experience within your service. A well-designed user management system has low coupling and high cohesion between different parts of a user's profile.
  2. Customer Satisfaction at the Push of a Button (New Yorker) -- simply getting binary good/bad feedback is better than no feedback, even if it's not as good as using NPS with something like Thematic. Also an interesting story about the value of physical interactions over purely digital.
  3. XXY Oscilloscope -- try this or this to get started. (via Hacker News)
  4. Reproducibility Workshop -- slides and handouts from a workshop to highlight some of the resources available to help share code, data, reagents, and methods. (via Lenny Teytelman)

Continue reading Four short links: 7 February 2018.


Re-thinking marketing: Generating attention you can turn into profitable demand


The media and ad tech sessions at the Strata Data Conference in San Jose will dig deep into how media businesses are changing.

First-year business students are taught that marketing consists of four Ps: product, place (or channel), price, and promotion. But this thinking is dated. In an era of information saturation, simply creating another piece of information in the form of a spec sheet, white paper, or press release compounds the problem. I’ve been using a newer definition of marketing in recent years: generating attention you can turn into profitable demand. This underscores the “long funnel” of conversion from initial consumer awareness and engagement, to desirable outcomes like sales, word-of-mouth referral, and the retention of loyal customers.

At the start of the long funnel is media. Traditionally, media was a one-to-many model, in which a few organizations—armed with printing presses and broadcast studios—sent a single message out to the masses. They made money through purchases, subscriptions, and in many cases, advertising. Much has changed. Today’s communication is bidirectional, flowing from the audience back to the publisher. It’s individualized, with each of us experiencing a tailored feed of information. The cost of publishing is vanishingly small, with anyone able to share a video with the world for practically nothing. And most importantly, we expect media to be free.

This expectation stems from two simple facts: there’s too much content out there, and users create most of it. The abundance of content is a consequence of how easy it is to publish. Anyone can become an expert; we consume tailored news. I might read 10 publications’ technology sections, but ignore all sports news. Gone are the days of reading a single publication cover to cover. I choose podcasts to suit my interests, seldom exploring. And the world of user-generated content has birthed a second kind of media. Facebook, Medium, Twitter, Reddit, and their ilk don’t employ writers, but we consume most of our words there. Traditional media outlets with paid reporting and editorial calendars are being squeezed out.

Jeff Jarvis has said that advertising is failure. It means you haven’t sold an issue, or a subscription. It’s a bad outcome. And yet, it’s the basis for most of what we consume today. Craigslist decimated newspapers partly because the classified ads were the only thing keeping them alive. The nature of media has shifted, too. It’s gaming, and betting, and theme parks, and blogs, and YouTube channels, and streaming subscriptions.

Omnichannel analytics means tracking a customer’s engagement with a brand or some content across many platforms and devices. With a sprawl of media, and an increased reliance on advertising despite razor-thin margins, media creators of all stripes take analytics very seriously. Data is the difference between dominance and obsolescence, whether you’re keeping a player engaged, trying to get a subscriber to stick around, recommending the next best song, serving a tailored ad, or satisfying a die-hard spor[...]

Introducing capsule networks


How CapsNets can overcome some shortcomings of CNNs, including requiring less training data, preserving image details, and handling ambiguity.

Capsule networks (CapsNets) are a hot new neural net architecture that may well have a profound impact on deep learning, in particular for computer vision. Wait, isn't computer vision pretty much solved already? Haven't we all seen fabulous examples of convolutional neural networks (CNNs) reaching super-human level in various computer vision tasks, such as classification, localization, object detection, semantic segmentation or instance segmentation (see Figure 1)?

Figure 1. Some of the main computer vision tasks. Today, each of these tasks requires a very different CNN architecture, for example ResNet for classification, YOLO for object detection, Mask R-CNN for instance segmentation, and so on. Image by Aurélien Géron.

Well, yes, we’ve seen fabulous CNNs, but:

They were trained on huge numbers of images (or they reused parts of neural nets that had). CapsNets can generalize well using much less training data.

CNNs don’t handle ambiguity very well. CapsNets do, so they can perform well even on crowded scenes (although they still struggle with backgrounds right now).

CNNs lose plenty of information in the pooling layers. These layers reduce the spatial resolution (see Figure 2), so their outputs are invariant to small changes in the inputs. This is a problem when detailed information must be preserved throughout the network, such as in semantic segmentation. Today, this issue is addressed by building complex architectures around CNNs to recover some of the lost information. With CapsNets, detailed pose information (such as precise object position, rotation, thickness, skew, size, and so on) is preserved throughout the network, rather than lost and later recovered. Small changes to the inputs result in small changes to the outputs—information is preserved. This is called "equivariance." As a result, CapsNets can use the same simple and consistent architecture across different vision tasks.

Finally, CNNs require extra components to automatically identify which object a part belongs to (e.g., this leg belongs to this sheep). CapsNets give you the hierarchy of parts for free.

Figure 2. The DeepLab2 pipeline for image segmentation, by Liang-Chieh Chen, et al.: notice that the output of the CNN (top right) is very coarse, making it necessary to add extra steps to recover some of the lost details. From the paper DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, figure reproduced with the kind permission of the authors. See this great post by S. Chilamkurthy to see how diverse and complex the architectures for semantic segmentation can get.

CapsNets were first introduced in 2011 by Geoffrey Hinton, et al., in a paper called Transforming Autoencoders, but it was only a few months ago, in November 2017, that Sara Sabour, Nicholas Frosst, and Geoffrey Hinton published a paper called Dynamic Routing between Capsules, where they introduced a CapsNet architect[...]
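
The difference between the invariance of pooling and the equivariance described above can be seen on a toy one-dimensional signal. This Python sketch is only an illustration of the two terms: `peak_with_position` is a hypothetical helper that keeps a feature's location next to its strength, and it is not a capsule layer or the routing algorithm from the paper.

```python
def max_pool(signal, size=2):
    """Non-overlapping 1-D max pooling: keeps the strongest activation
    in each window and discards where in the window it occurred."""
    return [max(signal[i:i + size]) for i in range(0, len(signal), size)]


def peak_with_position(signal):
    """Toy pose-preserving output: activation strength plus its index.
    Illustrates equivariance only; this is not a capsule."""
    value = max(signal)
    return value, signal.index(value)


original = [0, 9, 0, 0, 0, 0]
shifted = [9, 0, 0, 0, 0, 0]  # the same feature moved one step to the left

# Invariance: the pooled outputs are identical, so the shift is lost.
print(max_pool(original), max_pool(shifted))
# Equivariance: the output changes along with the input, so the shift survives.
print(peak_with_position(original), peak_with_position(shifted))
```

Once pooling has discarded the position, later layers cannot get it back without extra machinery, which is the cost the segmentation pipelines mentioned above pay.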

Four short links: 6 February 2018


Mine Research, Fight for Attention, AI Metaphors, and Research Browser Extensions

  1. metaDigitise -- Digitising functions in R for extracting data and summary statistics from figures in primary research papers.
  2. Center for Humane Technology -- Silicon Valley tech insiders fighting against attention-vacuuming tech design. (via New York Times)
  3. Tools, Substitutes, or Companions -- three metaphors for how we think about digital and robotic technologies. (via Tom Stafford)
  4. Unpaywall -- browser extension. Click the green tab and skip the paywall on millions of peer-reviewed journal articles. It's fast, free, and legal. Pair with the open access button. (via Swarthmore Libraries)

Continue reading Four short links: 6 February 2018.


Integrating continuous testing for improved open source security



Testing to prevent vulnerable open source libraries.

Integrating Testing to Prevent Vulnerable Libraries

Once you’ve found and fixed (or at least acknowledged) the security flaws in the libraries you use, it’s time to look into tackling this problem continuously.

There are two ways for additional vulnerabilities to show up in your dependencies:

Continue reading Integrating continuous testing for improved open source security.


Why I won't whitelist your site


Publishers need to take responsibility for code they run on my systems.

Many internet users—perhaps most—use an ad blocker. I’m one of them. All of us are familiar with the sites that won’t let us in without whitelisting them, or (only somewhat better) that repeatedly nag us to whitelist. I’m not whitelisting anyone. I don’t have any fundamental problem with advertising; I wish ads weren’t as intrusive, and I believe advertisers would be better served by advertisements that had more respect for their viewers. But that’s not really why I use an ad blocker. The real problem with ads is that they’re a vector for malware. It’s relatively easy to fold malware into otherwise-innocent advertisements, and that malware executes even if you don’t click on the ads. I’ve received malware from sites as otherwise legitimate as the BBC, and there are reports of malware from virtually every major online publisher—including sites like Forbes that won’t let you in if you don’t whitelist them. The New York Times, Reuters, MSN, and many others have all spread malware. And no one takes responsibility for the advertisements or the damage they cause. The publishers just say “hey, we don’t control the ads; that’s the ad placement company.” The advertisers similarly say “hey, our ads come from a marketing firm, and they use some kind of web contractor to do the coding.” And the ad placement companies and marketing firms? All you get from them is the sound of silence. Here’s the deal. I’m willing to whitelist any online publisher that will agree to a license in which they take responsibility for any code they run on my systems. Call it a EULA for using my browser on my computer. If you deliver malware, you will pay for the damages: my lost time, my lost data. If the idea catches on, managing all the contracts sounds like a problem, but I think it’s a business opportunity. Something would be needed to track all the licenses in an authoritative ledger. 
This sounds like an application for a blockchain. Maybe even a blockchain startup. If I really need to read something on your site, and you won’t let me in because I am running an ad blocker, I might read your site anyway. That’s trivial—I have four or five browsers on all of my machines, and not all of them have ad blockers installed. But I won’t link to you, quote you, or tweet you. You’re dead to me. I’ve been asked whether I have any proposals for a business model other than advertising. Not really. Though my employer, O’Reilly Media, does a bit of online publishing, and we don’t take advertising. But advising publishers on their business model isn’t my job—and they’ve yet to ask me for advice, anyway. My job is keeping my systems safe, and that requires keeping malware out. Again, I have nothing against advertising as a business model. However, that model (and the businesses relying on it) deserve to fail if publishers won’t take responsibility for the ads they deliver. While I understand that publish[...]

Four short links: 5 February 2018


Company Principles, DeepFake, AGI, and Missing Devices

  1. Principles of Technology Leadership (Bryan Cantrill) -- (slides) what cultural values and principles do you want to guide *your* company? (via Bryan Cantrill)
  2. Fun With DeepFakes; or How I Got My Wife on The Tonight Show -- this is going to further erode trust. How can you know what happened if all evidence can be convincingly faked? (via Simon Willison)
  3. MIT 6.S099: Artificial General Intelligence -- The lectures will introduce our current understanding of computational intelligence and ways in which strong AI could possibly be achieved, with insights from deep learning, reinforcement learning, computational neuroscience, robotics, cognitive modeling, psychology, and more. Additional topics will include AI safety and ethics. Worth noting that we can't build an artificial general intelligence right now, and may never be able to. Don't freak out because of the course headline.
  4. Catalog of Missing Devices (EFF) -- Things we’d pay money for—things you could earn money with—don’t exist thanks to the chilling effects of an obscure copyright law: Section 1201 of the Digital Millennium Copyright Act (DMCA 1201). From "third-party consumables for 3D printers" to an "ads-free YouTube for Kids," they're good ideas.

Continue reading Four short links: 5 February 2018.


Why product managers should master the art of user story writing



A well-written user story allows product managers to clearly communicate to their Agile development teams.

Continue reading Why product managers should master the art of user story writing.


Four short links: 2 February 2018


Digitize and Automate, Video Editor, AI + Humans, and Modest JavaScript

  1. Port Automation (Fortune) -- By digitizing and automating activities once handled by human crane operators and cargo haulers, seaports can reduce the amount of time ships sit in port and otherwise boost port productivity by up to 30%. "Digitize and automate" will be the mantra of the next decade.
  2. Shotcut -- a free, open source, cross-platform video editor.
  3. The Working Relationship Between Humans and AI (Mike Loukides) -- Whether we're talking about doctors, lawyers, engineers, Go players, or taxi drivers, we shouldn't expect AI systems to give us unchallengeable answers ex silico. We shouldn't be told that we need to "trust AI." What's important is the conversation.
  4. Stimulus -- modest JavaScript framework for the HTML you already have.

Continue reading Four short links: 2 February 2018.


Logo detection using Apache MXNet


Image recognition and machine learning for mar tech and ad tech.

Digital marketing is the marketing of products, services, and offerings on digital platforms. Advertising technology, commonly known as "ad tech," is the use of digital technologies by vendors, brands, and their agencies to target potential clients, deliver personalized messages and offerings, and analyze the impact of online spending: sponsored stories on Facebook newsfeeds; Instagram stories; ads that play on YouTube before the video content begins; the recommended links at the end of a CNN article, powered by Outbrain—these are all examples of ad tech at work.

In the past year, there has been significant use of deep learning for digital marketing and ad tech. In this article, we will delve into one part of a popular use case: mining the Web for celebrity endorsements. Along the way, we’ll see the relative value of deep learning architectures, run actual experiments, learn the effects of data sizes, and see how to augment the data when we don’t have enough.

Use case overview

In this article, we will see how to build a deep learning classifier that will predict the company, given an image with a logo. This section provides an overview of where this model could be used.

Celebrities endorse a number of products. Quite often, they post pictures on social media showing off a brand they endorse. A typical post of that type contains an image, with the celebrity and some text they have written. The brand, in turn, is eager to learn about the appearance of such postings, and to show them to potential customers who might be influenced by them.

The ad tech application, therefore, works as follows: large numbers of postings are fed to a processor that figures out the celebrity, the brand, and the message. Then, for each potential customer, the machine learning model generates a very specific advertisement based on the time, location, message, brand, customers' preferred brands, and other things. Another model identifies the target customer base. And the targeted ad is now sent. Figure 1 shows the workflow:

Figure 1. Celebrity brand-endorsement bot workflow. Image by Tuhin Sharma.

As you can see, the system is composed of a number of machine learning models. Consider the image. The picture could have been taken in any setting. The first goal is to identify the objects and the celebrity in the picture. This is done by object detection models. Then, the next step is to identify the brand, if one appears. The easiest way to identify the brand is by its logo.

In this article, we will look into building a deep learning model to identify a brand by its logo in an image. Subsequent articles will talk about building some of the other pieces of the bot (object detection, text generation, etc.).

Problem definition

The problem addressed in this article is: given an image, predict the company (brand) in the image by identifying the l[...]

Machine learning needs machine teaching



The O’Reilly Data Show Podcast: Mark Hammond on applications of reinforcement learning to manufacturing and industrial automation.

In this episode of the Data Show, I spoke with Mark Hammond, founder and CEO of Bonsai, a startup at the forefront of developing AI systems in industrial settings. While many articles have been written about developments in computer vision, speech recognition, and autonomous vehicles, I’m particularly excited about near-term applications of AI to manufacturing, robotics, and industrial automation. In a recent post, I outlined practical applications of reinforcement learning (RL)—a type of machine learning now being used in AI systems. In particular, I described how companies like Bonsai are applying RL to manufacturing and industrial automation. As researchers explore new approaches for solving RL problems, I expect many of the first applications to be in industrial automation.

Continue reading Machine learning needs machine teaching.


Different continents, different data science


Regardless of country or culture, any solid data science plan needs to address veracity, storage, analysis, and use.

Over the last four years, I’ve had conversations about data science, machine learning, ethics, and the law on several continents. These have included startups, big companies, governments, academics, and nonprofits. And over that time, some patterns are starting to emerge.

Figure 1. This was last year, and I didn’t have location services turned on all the time. Screenshot by Alistair Croll.

I’m going to be making some sweeping generalizations in this post. Everyone is different; every circumstance is somehow unique. But in digging into these patterns with colleagues, friends, and audiences both at home and abroad, I’ve found that they reflect many of the concerns of those cultures.

Briefly: in China, they worry about the veracity of the data. In Europe, they worry about the storage and analysis. And in North America, they worry about unintended consequences of acting on it.

Let me dig into those a bit more, and explain how I think external factors influence each.

Data veracity

If you don’t trust your data, everything you build atop it is a house of cards. When I’ve spoken about Lean Analytics or data science and critical thinking in China, many of the questions are about knowing whether the data is real or genuine.

China is a country in transition. A recent talk by Xi Jinping outlined a plan in which the country creates things first, rather than copying. They want to produce the best students, rather than send them abroad. They’re transitioning from a culture of mimicry and cheap copies to one of leadership and innovation. Just look at their policies on electric cars, or their planned cities, or the dominance of WeChat as a ubiquitous payment system.

When I was in Paris a few years ago, I visited Galeries Lafayette, an over-the-top mall whose gold decor and outlandish ornamentation are a paean to all things commercial. Outside one of the high-end retail outlets was a long queue of Chinese tourists, being let in to buy a purse a few at a time. As each person completed their purchase, they’d pause at the exit and take a picture of themselves with their new-found luxury item, in front of the store logo.

I asked the bus driver what was going on. “They want proof it’s the real,” he replied.

Proof it’s the real. In a country with a history of copying, where data is conflated with propaganda and competition is relatively unregulated, it’s no wonder veracity is in question.

There are many things a data analyst can do to test whether data is real. One of the most interesting is Benford’s Law, which states that in natural data of many kinds, leading digits follow a logarithmic distribution. In a random sample of that data, there will be more numbers beginning with a one than a two, more with a two than a three, and so on. It seems like a mag[...]
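
The Benford’s Law check described above can be sketched in a few lines of Python. This is a minimal illustration, not the author’s method: it assumes the values are nonzero integers, compares the observed share of each leading digit with the expected share log10(1 + 1/d), and uses hypothetical function names and made-up sample inputs.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d (1-9) under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """Leading decimal digit of a nonzero integer (assumption: integer input)."""
    return int(str(abs(int(n)))[0])


def observed_shares(values):
    """Share of values starting with each digit 1-9; zeros are skipped."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


def max_deviation(values):
    """Largest gap between observed and expected leading-digit shares."""
    expected = benford_expected()
    observed = observed_shares(values)
    return max(abs(observed[d] - expected[d]) for d in range(1, 10))


# Illustrative inputs only: powers of two tend to track Benford's Law closely,
# while evenly spread three-digit numbers do not.
powers_of_two = [2 ** n for n in range(1, 200)]
evenly_spread = list(range(100, 1000))

print(f"powers of two: {max_deviation(powers_of_two):.3f}")
print(f"evenly spread: {max_deviation(evenly_spread):.3f}")
```

A large deviation is a prompt to look more closely, not proof of fabrication; many legitimate data sets (for example, values confined to a narrow range) do not follow Benford’s Law at all.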

Four short links: 1 February 2018


Tor + Bitcoin = De-anonymization, Classic Papers, 3D Holograms, and Big Data Privacy

  1. Deanonymizing Tor Hidden Service Users Through Bitcoin Transactions Analysis -- This, for example, allows an adversary to link a user with @alice Twitter address to a Tor hidden service with private.onion address by finding at least one past transaction in the blockchain that involves their publicly declared Bitcoin addresses.
  2. Great Moments in Computing -- the reading list for this Princeton course is fascinating! (via Paper a Day)
  3. Volumetric 3D Images that Float in the Air (Kurzweil AI) -- the video is impressive! Trap a particle with a laser, move it around really fast while illuminating it with red, green, and blue lights. Result, thanks to persistence of vision: illusion of 3D object. Brilliant!
  4. A Precautionary Approach to Big Data Privacy -- In Section 3, we discuss the levers that policymakers can use to influence data releases: research funding choices that incentivize collaboration between privacy theorists and practitioners, mandated transparency of re-identification risks, and innovation procurement. Meanwhile, practitioners and policymakers have numerous pragmatic options for narrower releases of data. In Section 4, we present advice for six of the most common use cases for sharing data. Our thesis is that the problem of “what to do about re-identification” unravels once we stop looking for a one-size-fits-all solution, and in each of the six cases we propose a solution that is tailored, yet principled.

Continue reading Four short links: 1 February 2018.


New releases from O'Reilly for February 2018



Find out what's new in machine learning, network automation, security, and more.

Get a fresh start on building a new skill or augment what you currently know with one of these five newly released titles from O'Reilly.

Machine Learning and Security


Continue reading New releases from O'Reilly for February 2018.


Be fast, be secure, be accessible



Learn why performance, security, and accessibility are the pillars of web development and the O’Reilly Fluent Conference.

When my fellow program chairs of Fluent, Kyle Simpson and Tammy Everts, and I started thinking about how we'd describe a theme for the event back in 2016, we came up with "Building a Better Web." While we recognized it can sound a little hand-wavey, a big part of this theme is a crucial layer of the developer experience: the bigger-picture, more goal-oriented perspective that comes along with skill development. When we thought about what it takes to build a better web, we kept coming back to the idea of a fast, secure, accessible web: one that works for users of all backgrounds and abilities, one that reaches users of varied connection speeds and devices, and one that keeps its users safe. If there are three main pillars of the modern web, they are performance, security, and accessibility.

It may seem obvious that these pillars should be key areas of focus and investment for engineering and product teams, and yet so often they're treated as an afterthought. And while the practice of calling out these domains can feel like a hackneyed reminder to "eat your vegetables," it's worth it to think about the ways these areas intersect in our organizations and impact customers.

Continue reading Be fast, be secure, be accessible.


How to solve 90% of NLP problems: A step-by-step guide


Using machine learning to understand and leverage text.

Whether you are an established company or working to launch a new service, you can always leverage text data to validate, improve, and expand the functionalities of your product. The science of extracting meaning and learning from text data is an active topic of research called natural language processing (NLP).

NLP produces new and exciting results on a daily basis, and is a very large field. However, having worked with hundreds of companies, the Insight team has seen a few key practical applications come up much more frequently than any other:

  - Identifying different cohorts of users/customers (e.g., predicting churn, lifetime value, product preferences)
  - Accurately detecting and extracting different categories of feedback (positive and negative reviews/opinions, mentions of particular attributes such as clothing size/fit...)
  - Classifying text according to intent (e.g., request for basic help, urgent problem)

While many NLP papers and tutorials exist online, we have found it hard to find guidelines and tips on how to approach these problems efficiently from the ground up.

How to build machine learning solutions to solve problems

After leading hundreds of projects a year and gaining advice from top teams all over the United States, we wrote this post to explain how to build machine learning solutions to solve problems like the ones mentioned above. We’ll begin with the simplest method that could work, and then move on to more nuanced solutions, such as feature engineering, word vectors, and deep learning.

After reading this article, you’ll know how to:

  - Gather, prepare, and inspect data
  - Build simple models to start, and transition to deep learning if necessary
  - Interpret and understand your models to make sure you are actually capturing information and not noise

We wrote this post as a step-by-step guide; it can also serve as a high-level overview of highly effective standard approaches.

This post is accompanied by an interactive notebook demonstrating and applying all these techniques. Feel free to run the code and follow along.

Step 1: Gather your data

Example data sources

Every Machine Learning problem starts with data, such as a list of emails, posts, or tweets. Common sources of textual information include:

  - Product reviews (on Amazon, Yelp, and various App Stores)
  - User-generated content (tweets, Facebook posts, StackOverflow questions)
  - Troubleshooting (customer requests, support tickets, chat logs)

For this post, we will use a data set generously provided by CrowdFlower, called “Disasters on Social Media,” where: Contributors looked at over 10,000 tweets culled with a variety of searches like “ablaze,” “quarantine,” and “p[...]
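
The excerpt does not say which "simplest method that could work" the guide starts from. One common starting point, assumed here purely for illustration, is a bag-of-words representation: each text becomes a vector of word counts over a fixed vocabulary. The sketch below uses only the standard library; the function names and example tweets are hypothetical and are not taken from the accompanying notebook.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(texts):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}


def vectorize(text, vocab):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    vector = [0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vector[vocab[tok]] = count
    return vector


# Made-up examples standing in for real tweets.
tweets = ["Forest fire near the lake", "The fire is out"]
vocab = build_vocabulary(tweets)
print(vocab)
print(vectorize("fire fire the unknown", vocab))
```

Vectors like these can be fed to any standard classifier; the representation ignores word order, which is the price paid for its simplicity.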

Four short links: 31 January 2018


Fairness, Typesetting, Anomalies, and Faking Out Speech Recognition

  1. The Problem with Building a Fair System (Mike Loukides) -- We're ultimately after justice, not fairness. And by stopping with fairness, we are shortchanging the people most at risk. If justice is the real issue, what are we missing?
  2. Bookish -- open source tool that translates augmented markdown into HTML or LaTeX.
  3. -- An open source framework for real-time anomaly detection using Python, ElasticSearch, and Kibana. See also the announcement.
  4. Audio Adversarial Examples -- Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (at a rate of up to 50 characters per second). You say "potato," I say "single quote semicolon drop table users semicolon dash dash."

Continue reading Four short links: 31 January 2018.


“Data as a feature” is coming. Are product managers ready?


By packaging and delivering actionable data in applications, product managers can help users achieve their goals.

More and more, apps are capable of delivering more than a service, such as access to a bank account or the ability to order a pizza. Apps can offer users data—and not just a dump of data that has little value, but data specifically designed to be of value to the user. This is called “data as a feature”—it is the act and process of treating data as a core feature of a software product in a way that delivers value to the user. Taking this definition a step further, a product with data as a feature delivers that data in a way that helps the user meet a goal.

The trend of consumerizing applications and making data easier for users to consume and make decisions with is affecting apps and services across all industries. As consumers more and more come to expect, and even demand, value and deep insights, software product managers are increasingly tasked with the responsibility for making sure that “data as a feature” is successfully implemented in the apps they are bringing to market. By packaging and delivering data effectively in a product, they can help users become more informed and better able to take action.

Implications for product managers

So, why build data as a feature? Because everyone is being overwhelmed with data. Business users as well as consumers have data bombarding them from all angles. The challenge for people has shifted. They used to ask for more data all the time. Now they are getting too much of it. They want to get value out of it, but are overwhelmed.

Software developers are responding to this need by translating and presenting data to users in a way they can immediately grasp, and on which they can take action. They’re building data as a feature into their products and displaying that data in visually appealing, intuitive, and easily consumable ways.

In turn, product managers today face a significant addition to their duties—they now need to begin treating data as a feature in the products they’re building. In other words, not viewing data as only a byproduct of the apps they’re charged with developing, but as a prominent feature.

To do this successfully, it’s critical to understand who is going to be using the products, what their data needs are, and how a specific data-driven “slice of business functionality” could help users meet those needs. To achieve this, “design thinking” is important, even critical. But even more so is “goal thinking”: what are users’ goals, and how do you present data in a way that helps them achieve those goals? These are chall[...]

7 on-the-rise technology trends to track and learn



AI, Python, Java, blockchain, and cloud technologies are active topics on O’Reilly’s online learning platform.

When developers and tech leaders are figuring out what to do next, how to advance their careers, or offer more value to their companies, they need to know what’s hot and what’s not living up to the hype.

Toward that end, we dove into the last two years’ worth of search data from our online learning platform to identify the topics you should consider exploring in the months ahead. We find search activity particularly effective for spotting shifts in technology usage: what’s gaining traction, what’s falling out of favor, and what topics are maturing.

Continue reading 7 on-the-rise technology trends to track and learn.


The problem with building a “fair” system


The ability to appeal may be the most important part of a fair system, and it's one that isn't often discussed in data circles.

Fairness is a slippery concept. We haven't yet gotten past the first-grade playground: what's "fair" is what's fair to me, not necessarily to everyone else. That's one reason we need to talk about ethics in the first place: to move away from the playground's "that's not fair" (someone has my favorite toy) to a statement about justice.

There have been several important discussions of fairness recently. Cody Marie Wild's “Fair and Balanced? Thoughts on Bias in Probabilistic Modeling” and Kate Crawford's NIPS 2017 keynote, “The Trouble with Bias,” do an excellent job of discussing how and why bias keeps reappearing in our data-driven systems. Neither of these pretends to have any final answer to the problem of fairness. Nor do I. I would like to expose some of the problems, and suggest some directions for making progress toward the elusive goal of "fairness."

The nature of data itself presents a fundamental problem. "Fairness" is aspirational: we want to be fair, we hope to be fair. Fairness has much more to do with breaking away from our past and transcending it than with replicating it. But data is inevitably historical, and it reflects all the prejudices and biases of the past. If our systems are driven by data, can they possibly be "fair"? Or do they just legitimize historical biases under the guise of science and mathematics? Is it possible to make fair systems out of data that reflects historical biases? I'm uncomfortable with the idea that we can tweak the outputs of a data-driven system to compensate for biases; my instincts tell me that approach will lead to pain and regret. Some research suggests that de-biasing the input data may be a better approach, but it's still early.

It is easier to think about fairness when there's only one dimension. Does using a system like COMPAS lead to harsher punishments for blacks than non-blacks? Was Amazon's same-day delivery service initially offered only in predominantly white neighborhoods? (Amazon has addressed this problem.) Those questions are relatively easy to evaluate. But in reality, these problems have many dimensions. A machine learning system that is unfair to people of color might also be unfair to the elderly or the young; it might be unfair to people without college degrees, women, and the handicapped. We actually don't know; for the most part, we haven't asked those questions. We do know that AI is good at finding groups with similar characteristics (such as "doesn't have a college degree"), even when that chara[...]
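
Asking the question along more than one dimension can be sketched in code. The example below is only an illustration under stated assumptions: the records, attribute names, and the "approved" outcome are made up, and a gap in favorable-outcome rates is just one crude signal, not a definition of fairness.

```python
from collections import defaultdict


def rates_by_group(records, attribute, outcome="approved"):
    """Share of favorable outcomes for each value of one attribute."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for record in records:
        group = record[attribute]
        totals[group] += 1
        if record[outcome]:
            favorable[group] += 1
    return {group: favorable[group] / totals[group] for group in totals}


def disparity(records, attribute, outcome="approved"):
    """Gap between the best-treated and worst-treated group."""
    rates = rates_by_group(records, attribute, outcome)
    return max(rates.values()) - min(rates.values())


# Hypothetical decisions from some automated system.
decisions = [
    {"age_band": "under 30", "degree": True, "approved": True},
    {"age_band": "under 30", "degree": False, "approved": False},
    {"age_band": "over 60", "degree": True, "approved": True},
    {"age_band": "over 60", "degree": False, "approved": False},
]

# Check every dimension we have data for, not just the obvious one.
for attribute in ("age_band", "degree"):
    print(attribute, rates_by_group(decisions, attribute), disparity(decisions, attribute))
```

In this made-up data the system looks even-handed by age and badly skewed by degree, which is the point: a check along a single dimension can miss the group that is actually being treated differently.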

Mitigating known security risks in open source libraries



Fixing vulnerable open source packages.

Fixing Vulnerable Packages

Finding out if you’re using vulnerable packages is an important step, but it’s not the real goal. The real goal is to fix those issues!

This chapter focuses on all you should know about fixing vulnerable packages, including remediation options, tooling, and various nuances. Note that SCA tools traditionally focused on finding or preventing vulnerabilities, and most put little emphasis on fixes beyond providing advisory information or logging an issue. Therefore, you may need to implement some of these remediations yourself, at least until more SCA solutions expand to include them.

Continue reading Mitigating known security risks in open source libraries.


60+ new live online trainings just launched on O'Reilly's learning platform


Get hands-on training in machine learning, AI, Python, security, usability, and many more topics.

We just opened up more than 60 live online trainings on our learning platform. These trainings give you hands-on experience in critical technology, design, and business topics. You'll learn from instructors in O’Reilly’s network of tech innovators and expert practitioners and from our trusted partners.

Space is limited and these trainings often fill up.

  - Hands-on Machine Learning with Python: Clustering, Dimension Reduction, and Time Series Analysis on February 14
  - Getting Started with Python’s Pytest on February 14
  - Hands-on Machine Learning with Python: Classification and Regression on February 15
  - Mastering Python’s Pytest on February 15
  - AWS Security Fundamentals on March 1
  - Customer Research for Product Managers on March 1
  - Advanced Agile: Scaling in the Enterprise on March 2
  - Foundational Data Science with R on March 5-6
  - Linux Filesystem Administration on March 5-6
  - Introduction to Lean on March 6
  - Introduction to Kubernetes on March 6-7
  - Learn the Basics of Scala in 3 Hours on March 7
  - Effective Design Workshops on March 7
  - Practical AI on iOS on March 7
  - Docker: Up and Running on March 7-8
  - Cloud Native Architecture Patterns on March 7-8
  - Getting Started With Vue on March 8
  - CISSP Stumbling blocks: Security Architecture, Engineering, and Cryptography on March 9
  - Get Started with Natural Language Processing in Python on March 12
  - Testing and Validating Product Ideas with Lean on March 12
  - Explore, Visualize, and Predict Using Pandas and Jupyter on March 12-13
  - Getting started with Python 3 on March 12-13
  - Building and Managing Kubernetes Applications on March 13
  - IPv4 Subnetting on March 13-14
  - Go Programming for Distributed Computing on March 13-14
  - How to do Great Customer Interviews on March 14
  - Beginner’s Guide to Creating Prototypes in Sketch on March 14
  - SQL Fundamentals for Data on March 14-15
  - Negotiation Fundamentals on March 15
  - Big Data and Hadoop for Beginners on March 15
  - Managing Enterprise Data Strategies with Hadoop, Spark, and Kafka on March 15
  - Managing Enterprise Data Strategies with Hadoop, Spark, and Kafka on March 16
  - CSS Layout Fundamentals: From Floats to Flexbox and CSS Grid on March 16
  - Porting from Python 2 to Python 3 on March 16
  - CISSP Stumbling blocks: Software Development Security and Identity on March 16
  - Introduction to Critical Thinking on March 19
  - Troubleshooting Agile on March 19
  - Scala Beyond the Basics on March 19-20
  - Data Science [...]

Four short links: 30 January 2018


Podcast Data, Data Stories, Distributed Systems, and Tech Future Scenarios

  1. Podcast Data -- Apple’s Podcast Analytics feature finally became available last month[...]. Though it’s still early days, the numbers podcasters are seeing are highly encouraging. [...] Listeners are typically getting through 80-90% of content. [...] According to Panoply, the few listeners who do skip ads continue to remain engaged with the episode, rather than dropping off at the first sign of an interruption.
  2. The Anatomy of a Data Story -- Great data stories: connect with people; try to convey one idea; keep it simple; explore a topic you know well.
  3. Designing Distributed Systems (Microsoft) -- 160 pages from Microsoft with repeatable, generic patterns, and reusable components to make developing reliable systems easier and more efficient.
  4. Scenario -- How will society change over the next 50 years? Will we still have jobs as we do today, perhaps with slightly shorter working weeks, or will the so-called "technological singularity" lead us to totally restructure our society? Perhaps reality lies somewhere in the middle. We look at three scenarios for the next few decades of technological development. From Scenario Magazine.

Continue reading Four short links: 30 January 2018.


From big data to fast data



Designing application architectures for real-time decisions.

Enterprise data needs change constantly but at inconsistent rates, and in recent years change has come at an increasing clip. Tools once considered useful for big data applications are no longer sufficient. When batch operations predominated, Hadoop could handle most of an organization’s needs. Developments in other IT areas (think IoT, geolocation, etc.) have changed the way data is collected, stored, distributed, processed, and analyzed. Real-time decision needs complicate this scenario, and new tools and architectures are needed to handle these challenges efficiently.

Think of the 3 V's of data: volume, velocity, and variety. For a while big data emphasized data volume; now fast data applications mean velocity and variety are key. Two tendencies have emerged from this evolution: first, the variety and velocity of data that enterprises need for decision making continue to grow. This data includes not only transactional information, but also business data, IoT metrics, operational information, and application logs. Second, modern enterprises need to make those decisions in real time, based on all that collected data. This need is best clarified by looking at how modern shopping websites work.

Continue reading From big data to fast data.


Four short links: 29 January 2018


Dangerous Data, Data Linter, Participatory Budgeting, and Security Wargames

  1. Aggregated Data is Dangerous Even When Aggregated -- jogging app releases visualization of all its customers' data, inadvertently exposing military bases. It is dangerous to use data for purposes other than that for which it was collected.
  2. Data Linter -- identifies potential issues (lints) in your ML training data.
  3. Participatory Budgeting -- This research identified significant challenges in the participatory budgeting sphere, from a very common lack of goals to be achieved through participatory budgeting exercises, to very weak network links and peer support for implementers, to the frustrations of the exercises as a result of political corruption or subversion. The migration to managing participatory budgeting digitally presents the very real risk of the process becoming gentrified, and is just one example of the consequences of scale in participatory budgeting only being achieved at the expense of disenfranchising the most under-represented. There are recommendations as well.
  4. Over The Wire -- wargames to help you learn and practice security concepts. (via Hacker News)

Continue reading Four short links: 29 January 2018.


Four short links: 26 January 2018


Bitcoin, Ted Nelson, Constraint Modeling, and NLP

  1. Beyond the Bitcoin Bubble (Steven Johnson) -- a fine exegesis of the thesis that blockchain tech is a return to the open-to-innovation protocols of the early days of the internet. Right now, the only real hope for a revival of the open-protocol ethos lies in the blockchain.
  2. Ted Nelson on What Modern Programmers Can Learn From The Past -- We thought computing would be artisanal. We did not imagine large monopolies. We thought the Citizen Programmer would be the leader.
  3. MiniZinc -- a free and open source constraint modeling language. See also Hakan Kjellerstrand's page on it. (via Hacker News)
  4. How to Solve 90% of NLP Problems: A Step-by-Step Guide -- with an interactive notebook!

Continue reading Four short links: 26 January 2018.


Stream all the things



Streaming architectures for data sets that never end.

Continue reading Stream all the things.