Subscribe: O'Reilly Radar - Insight, analysis, and research about emerging technologies

All - O'Reilly Media

All of our Ideas and Learning material from all of our topics.

Updated: 2017-11-23T05:22:18Z


The current state of Apache Kafka



The O’Reilly Data Show Podcast: Neha Narkhede on data integration, microservices, and Kafka’s roadmap.

In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “the age of machine learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics—including machine learning—remains an area of focus for most companies. It turns out, “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. By making it easier to create and productionize data refinement pipelines over both batch and streaming data sources, we free analysts and data scientists to focus on analytics that can unlock value from data.

On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where this popular framework is headed.

Continue reading The current state of Apache Kafka.


Christie Terrill on building a high-caliber security program in 90 days



The O’Reilly Security Podcast: Aligning security objectives with business objectives, and how to approach evaluation and development of a security program.

In this episode of the Security Podcast, I talk with Christie Terrill, partner at Bishop Fox. We discuss the importance of educating businesses on the complexities of “being secure,” how to approach building a strong security program, and aligning security goals with the larger processes and goals of the business.

Continue reading Christie Terrill on building a high-caliber security program in 90 days.


Clinical trial? There’s an app for that


Several new apps are making it easier for doctors and patients to conduct clinical trials. When the Cleveland Clinic was looking for participants in its more than 130 cancer trials, its outreach teams didn’t just cold-call doctors and hospitals. They tried something new: they launched an app. The Cancer Trial App is designed for two distinct populations: patients looking for clinical trials and doctors who are treating those patients. Users who download the app on either iOS or Android receive information on trials by disease, phase, physician, and hospital location. In addition, the app details each trial’s objective, eligibility rules, and progress.

The Cleveland Clinic’s app is a simple solution to a complex problem: how to make clinical trials easier for individuals to participate in. Because of a lack of vendors, complex regulatory tools, and institutional inertia, many of the digital workflow and recordkeeping tools that are commonplace in other parts of biology never made it to clinical trials. According to Premier Research, a life sciences consulting firm, only several hundred of the more than 150,000 mobile health applications published as of December 2016 focus on clinical trials. And of those, most are directory apps like the Cleveland Clinic’s, rather than more complicated apps that enhance the patient experience.

However, innovation is happening in the clinical trial mobile app space—even if it’s taking longer than expected. In the United States, the Food & Drug Administration is working with stakeholders in clinical trials to examine ways smartphone apps can enable speedier, more cost-effective clinical trials that accelerate the new drug approval pipeline. The FDA has a public docket out on “Using Technologies and Innovative Methods to Conduct FDA-Regulated Clinical Investigations of Investigational Drugs” that seeks to create consensus.
The Challenges

Designing apps for cancer studies and automating patient data is far more complicated developmentally and legally than creating a new smartphone game, which is why development has been slow. Stakeholders need to manage the following factors:

  • The cost of app development, which is a challenge when many pharmaceutical companies don’t have robust, in-house Android and iOS teams and may not know what questions to ask outside contractors to keep costs down
  • Privacy and quality expectations in randomized clinical trial research
  • Regulatory issues
  • Inside the clinical trial world, a culture that prioritizes written documentation and paperwork over digital recordkeeping, at least when interacting directly with patients

There are several ways of dealing with these challenges. Additionally, new innovations such as Apple’s ResearchKit and CareKit open source frameworks break down the barriers that prevent researchers, pharma companies, and others from building mobile apps that enhance the patient experience in clinical trials. In promotional materials aimed at researchers, Apple emphasizes how easy these apps are to use: “Perform(ing) activities using the advanced sensors in iPhone to generate incredibly precise data wherever you are, providing a source of information that’s more objective than ever before.”

Helping Researchers with Trials

The potential of smartphone apps for easing the patient experience during the research and trial process has fascinated clinicians. No 483 For Me! (Figure 1-1) is an iOS app developed by William Tobia, a lead clinical research instructor at GlaxoSmithKline (GSK). It’s designed for clinical research site staff and aims to sharply reduce FDA inspection findings by training staff on what to look for, thus enhancing the patient experience.

Figure 1-1. No 483 For Me! (Screenshot by Neal Ungerleider.)
In an email interview, Tobia said that “compliance with the protocol and FDA regulations is a priority in clinical research but, for some staff, it is difficult to locate the applicable[...]

Four short links: 22 November 2017


Decision-making, Code Duplication, Container Security, and Information vs Attention

  1. Decision-making Under Stress -- Under acute (short-lived, high-intensity) stress, we focus on short-term rapid responses at the expense of complex thinking. Application to startup culture left as exercise to the reader.
  2. DéjàVu: A Map of Code Duplicates on GitHub -- This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub, representing over 482 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. (via Paper a Day)
  3. NIST Guidance on Application Container Security -- This bulletin offers an overview of application container technology and its most notable security challenges. It starts by explaining basic application container concepts and the typical application container technology architecture, including how that architecture relates to the container life cycle. Next, the article examines how the immutable nature of containers further affects security. The last portion of the article discusses potential countermeasures that may help to improve the security of application container implementations and usage.
  4. Modern Media Is a DoS Attack on Your Free Will -- What’s happened is, really rapidly, we’ve undergone this tectonic shift, this inversion between information and attention. Most of the systems that we have in society—whether it’s news, advertising, even our legal systems—still assume an environment of information scarcity. THIS.

Continue reading Four short links: 22 November 2017.


Implementing request/response using context in Go network clients



Learn how NATS requests work and how to use the context package for cancellation.

The first two parts of this series created a general-purpose client that can subscribe to channels on a NATS server, send messages, and wait for responses. But one of the most common communication models is request/response, where two clients engage in one-to-one bidirectional communication. NATS is a pure PubSub system, meaning that everything is built on top of publish and subscribe operations. The NATS Go client supports the Request/Response model of communication by building it on top of the PubSub methods we have already developed.

Because making a request involves waiting for a response, the context package is a natural fit; this Go package, which is gaining increasing adoption, was designed for request-scoped APIs and provides support for deadlines, cancellation signals, and request-scoped data. Cancellation propagation is an important topic in Go because it allows us to quickly reclaim any resources that may have been in use as soon as the in-flight request or a parent context is cancelled. If there is a blocking call in your library, users of your library can benefit from a context-aware API that lets them manage cancellation in an idiomatic way. Contexts also allow you to set a hard deadline that times out the request, and to include request-scoped data, such as a request ID, which can be useful for tracing the request.

Continue reading Implementing request/response using context in Go network clients.


Four short links: 21 November 2017


Storytelling, Decompilation, Face Detection, and Dependency Alerts

  1. Scrollama -- a modern and lightweight JavaScript library for scrollytelling. (via Nathan Yau)
  2. Dangers of the Decompiler -- a sampling of anti-decompilation techniques.
  3. An On-device Deep Neural Network for Face Detection (Apple) -- how the face unlock works, roughly at "technical blog post" levels of complexity.
  4. GitHub Security Alerts -- With your dependency graph enabled, we’ll now notify you when we detect a vulnerability in one of your dependencies and suggest known fixes from the GitHub community.

Continue reading Four short links: 21 November 2017.


Four short links: 20 November 2017


Ancient Data, Tech Ethics, Session Replay, and Cache Filesystem​

  1. Trade, Merchants, and the Lost Cities of the Bronze Age -- We analyze a large data set of commercial records produced by Assyrian merchants in the 19th Century BCE. Using the information collected from these records, we estimate a structural gravity model of long-distance trade in the Bronze Age. We use our structural gravity model to locate lost ancient cities. (via WaPo)
  2. Tech Ethics Curriculum -- a Google sheet of tech ethics courses, with pointers to syllabi.
  3. Session Replay Scripts (Ed Felten) -- lately, more and more sites use “session replay” scripts. These scripts record your keystrokes, mouse movements, and scrolling behavior, along with the entire contents of the pages you visit, and send them to third-party servers. Unlike typical analytics services that provide aggregate statistics, these scripts are intended for the recording and playback of individual browsing sessions, as if someone is looking over your shoulder. (via BoingBoing)
  4. RubiX -- Cache File System optimized for columnar formats and object stores.

Continue reading Four short links: 20 November 2017.


Four short links: 17 November 2017


Interactive Marginalia, In-Person Interactions, Welcoming Groups, and Systems Challenges

  1. Interactive Marginalia (Liza Daly) -- wonderfully thoughtful piece about web annotations.
  2. In-Person Interactions -- Casual human interaction gives you lots of serendipitous opportunities to figure out that the problem you thought you were solving is not the most important problem, and that you should be thinking about something else. Computers aren't so good at that. So true! (via Daniel Bachhuber)
  3. Pacman Rule -- When standing as a group of people, always leave room for 1 person to join your group. (via Simon Willison)
  4. Berkeley View of Systems Challenges for AI -- In this paper, we propose several open research directions in systems, architectures, and security that can address these challenges and help unlock AI’s potential to improve lives and society.

Continue reading Four short links: 17 November 2017.


Four short links: 16 November 2017


Regulate IoT, Visualize CRISPR, Distract Strategically, and Code Together

  1. It's Time to Regulate IoT To Improve Security -- Bruce Schneier puts it nicely: internet security is now becoming "everything" security.
  2. Real-Space and Real-Time Dynamics of CRISPR-Cas9 (Nature) -- great visuals, written up for laypeople in The Atlantic. (via Hacker News)
  3. How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, not Engaged Argument -- research paper. Application to American media left as exercise to the reader.
  4. Coding Together in Real Time with Teletype for Atom -- what it says on the box.

Continue reading Four short links: 16 November 2017.


The tools that make TensorFlow productive


Analytical frameworks come with an entire ecosystem.

Deployment is a big chunk of using any technology, and tools to make deployment easier have always been an area of innovation in computing. For instance, the difficulties and uncertainties of installing software and keeping it up to date were one factor driving companies to offer software as a service over the web. Likewise, big data projects present their own set of issues: how do you prepare and ingest the data? How do you view the choices made by algorithms that are complex and dynamic? Can you use hardware acceleration (such as GPUs) to speed analytics that may need to operate on streaming, real-time data? Those are just a few deployment questions associated with deep learning.

In the report Considering TensorFlow for the Enterprise, authors Sean Murphy and Allen Leis cover the landscape of tools for working with TensorFlow, currently one of the most popular frameworks for big data analysis. They explain the importance of seeing deep learning as an integral part of a business environment—even while acknowledging that many of the techniques are still experimental—and review some useful auxiliary utilities. These exist for all of the major stages of data processing: preparation, model building, and inference (submitting requests to the model), as well as debugging.

Given that the decisions made by deep learning algorithms are notoriously opaque (it's hard to determine exactly what combination of features led to a particular classification), one intriguing part of the report addresses the possibility of using TensorBoard to visualize what's going on in the middle of a neural network. The UI offers a visualization of the stages in the neural network, and you can see what each stage sends to the next. Thus, some of the mystery in deep learning gets stripped away, and you can explain to your clients some of the reasons a particular result was reached.
Another common bottleneck for many companies stems from the size of modern data sets, which often need help getting ingested and moved through the system. One study found that about 20% of businesses handle data sets in the terabyte range, with smaller sizes (gigabytes) being most common and larger ones (petabytes) quite rare. For the 20% or more working with unwieldy data sets, Murphy and Leis’s report is particularly valuable, because special tools can help tie TensorFlow to the systems that feed it data, such as Apache Spark. The authors also cover options for hardware acceleration: a lot of research has been done on specialized hardware that can accelerate deep learning even more than GPUs do.

The essential reason for using artificial intelligence in business is to speed up predictions. To reap the most benefit from AI, therefore, one should find the most appropriate hardware and software combination to run the AI analytics. Furthermore, you want to reduce the time it takes to develop the analytics, which lets you react to changes in fast-moving businesses and reduces the burden on your data scientists. For many reasons, understanding the tools associated with TensorFlow makes its use more practical.

This post is part of a collaboration between O'Reilly and TensorFlow. See our statement of editorial independence.

Continue reading The tools that make TensorFlow productive.[...]

Implementing the pipes and filters pattern using actors in Akka for Java



How messages help you decouple, test, and re-use your software’s code.

We would like to introduce a couple of interesting concepts from Akka by giving an overview of how to implement the pipes and filters enterprise integration pattern. This commonly used pattern helps us flexibly compose sequences of alterations to a message. To implement the pattern, we use Akka, a popular library that provides new approaches for writing modern reactive software in Java and Scala.

The Business problem

Recently we came across an author publishing application made available as a service. It was responsible for processing markdown text. It would execute a series of operations back to back:

Continue reading Implementing the pipes and filters pattern using actors in Akka for Java.


Nathaniel Schutta on succeeding as a software architect



The O’Reilly Programming Podcast: The skills needed to make the move from developer to architect.

In this episode of the O’Reilly Programming Podcast, I talk with Nathaniel Schutta, a solutions architect at Pivotal, and presenter of the video I’m a Software Architect, Now What?. He will be giving a presentation titled Thinking Architecturally at the 2018 O’Reilly Software Architecture Conference, February 25-28, 2018, in New York City.

Continue reading Nathaniel Schutta on succeeding as a software architect.


Modern HTTP service virtualization with Hoverfly



Service virtualization brings a lightweight, automatable means of simulating external dependencies.

In modern software systems, it’s very common for applications to depend on third party or internal services. For example, an ecommerce site might depend on a third party payment service to process card payments, or a social network to provide authentication. These sorts of applications can be challenging to test in isolation, as their dependencies can introduce problems like:

  • Non-determinism
  • Slow and costly builds
  • Unmockable client libraries
  • Rate-limiting
  • Expensive licensing costs
  • Incompleteness
  • Slow provisioning

To get around this, service virtualization replaces these dependencies with a process that simulates them. Unlike mocking, which replaces your application code, service virtualization lives externally, typically operating at the network level. It is non-invasive, and from the consumer's perspective it is essentially just like the real thing.
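To illustrate the idea (this is a hand-rolled sketch using Go's standard-library httptest, not Hoverfly's own API), the consumer talks plain HTTP to a process that stands in for the third-party payment service:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newFakePaymentService returns a server that simulates a third-party
// payment API: deterministic, fast, and free, unlike the real thing.
func newFakePaymentService() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/charge", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, `{"status":"approved"}`)
	})
	return httptest.NewServer(mux)
}

// charge is the application code under test; it only needs a base URL,
// so pointing it at the simulation requires no code changes.
func charge(baseURL string) (string, error) {
	resp, err := http.Get(baseURL + "/charge")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	srv := newFakePaymentService()
	defer srv.Close()
	out, _ := charge(srv.URL)
	fmt.Println(out) // {"status":"approved"}
}
```

A tool like Hoverfly does the same thing at the network level, and can record real traffic and replay it, so the simulation does not have to be written by hand.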

Continue reading Modern HTTP service virtualization with Hoverfly.


Four short links: 15 November 2017


Paywalled Research, Reproducing AI Research, Spy Teardown, and Peer-to-Peer Misinformation

  1. 65 of the 100 Most-Cited Papers Are Paywalled -- The weighted average of all the paywalls is: $32.33 [...] [T]he open access articles in this list are, on average, cited more than the paywalled ones.
  2. AI Reproducibility -- Participants have been tasked with reproducing papers submitted to the 2018 International Conference on Learning Representations, one of AI’s biggest gatherings. The papers are anonymously published months in advance of the conference. The publishing system allows for comments to be made on those submitted papers, so students and others can add their findings below each paper. [...] Proprietary data and information used by large technology companies in their research, but withheld from papers, is holding the field back.
  3. Inside a Low-Budget Consumer Hardware Espionage Implant -- The S8 data line locator is a GSM listening and location device hidden inside the plug of a standard USB data/charging cable. Has a microphone but no GPS, remotely triggered via SMS messages, uses data to report cell tower location to a dodgy server...and is hidden in a USB cable.
  4. She Warned of ‘Peer-to-Peer Misinformation.’ Congress Listened (NY Times) -- Renee's work on anti-vaccine groups (and her college thesis on propaganda in the 2004 Russian elections) led naturally to her becoming an expert on Russian propaganda in the 2016 elections.

Continue reading Four short links: 15 November 2017.


Scaling messaging in Go network clients



Learn how the NATS client implements fast publishing and messages processing schemes viable for production use.

The previous article in this series created a client that communicated with a server in a simple fashion. This article shows how to add features that make the client more viable for production use. Problems we’ll solve include:

  1. Each message received from the server blocks the read loop while the callback that handles the message executes, because the loop and callback run in the same goroutine. This also means that we cannot implement the Request() and Flush() methods that the NATS Go client offers.
  2. Every publish command triggers a flush to the server and blocks while doing so, hurting performance.
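A common way to address the first problem (sketched here with hypothetical names, not the actual NATS client internals) is to hand each message from the read loop to a dedicated goroutine over a buffered channel, so a slow callback never stalls reads:

```go
package main

import (
	"fmt"
	"sync"
)

// Subscription dispatches messages to its callback on its own
// goroutine, so the connection's read loop only performs a cheap,
// usually non-blocking channel send.
type Subscription struct {
	msgCh chan string
	wg    sync.WaitGroup
}

func NewSubscription(cb func(string)) *Subscription {
	s := &Subscription{msgCh: make(chan string, 64)}
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		for msg := range s.msgCh {
			cb(msg) // a slow callback delays only this goroutine
		}
	}()
	return s
}

// Dispatch is what the read loop calls for each decoded message.
func (s *Subscription) Dispatch(msg string) { s.msgCh <- msg }

// Close drains pending messages and stops the worker.
func (s *Subscription) Close() {
	close(s.msgCh)
	s.wg.Wait()
}

func main() {
	var got []string
	sub := NewSubscription(func(m string) { got = append(got, m) })
	sub.Dispatch("a")
	sub.Dispatch("b")
	sub.Close() // waits until both callbacks have run
	fmt.Println(got) // [a b]
}
```

With callbacks off the read loop, the loop is free to keep parsing server output, which is also what makes a Flush()-style round trip possible.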

We’ll fix these problems in this article. The third and final article in the series will build on the client we create here to add Request/Response functionality for one-to-one communication. Other useful functionality that is not covered in this series, but that a production client should have, includes:

Continue reading Scaling messaging in Go network clients.


5 tips for driving design thinking in a large organization



How user-centered design focused on user needs and delivery can bring about real change and still be respected in the boardroom.

Continue reading 5 tips for driving design thinking in a large organization.


C++17 upgrades you should be using in your code



Structured bindings, new library types, and containers add efficiency and readability to your code.

C++17 is a major release, with over 100 new features or significant changes. In terms of big new features, there's nothing as significant as the rvalue references we saw in C++11, but there are a lot of improvements and additions, such as structured bindings and new container types. What’s more, a lot has been done to make C++ more consistent and remove unhelpful and unnecessary behavior, such as support for trigraphs and std::auto_ptr.

This article discusses two significant C++17 upgrades that developers need to adopt when writing their own C++ code. I’ll explore structured bindings, which is a useful new way to work with structured types, and then some of the new types and containers that have been added to the Standard Library.

Continue reading C++17 upgrades you should be using in your code.


Four short links: 14 November 2017


AI Microscope, Android Geriatrics, Doxing Research, and Anti-Goals

  1. AI-Powered Microscope Counts Malaria Parasites in Blood Samples (IEEE Spectrum) -- The EasyScan GO microscope under development would combine bright-field microscope technology with a laptop computer running deep learning software that can automatically identify parasites that cause malaria. Human lab workers would mostly focus on preparing the slides of blood samples to view under the microscope and verifying the results. Currently 20 minutes/slide (same as a human), but they want to cut it to 10 minutes/slide.
  2. A Billion Outdated Android Devices in Use -- never ask why security researchers drink more than the rest of society.
  3. Datasette (Simon Willison) -- instantly create and publish an API for your SQLite databases.
  4. Fifteen Minutes of Unwanted Fame: Detecting and Characterizing Doxing -- This work analyzes over 1.7 million text files posted to sites frequently used to share doxes online, over a combined period of approximately 13 weeks. Notable findings in this work include that approximately 0.3% of shared files are doxes, that online social networking accounts mentioned in these dox files are more likely to close than typical accounts, that justice and revenge are the most often cited motivations for doxing, and that dox files target males more frequently than females.
  5. The Power of Anti-Goals (Andrew Wilkinson) -- instead of exhausting aspirations, focus on avoiding the things that deplete your life. (via Daniel Bachhuber)

Continue reading Four short links: 14 November 2017.


Four short links: 13 November 2017


Software 2.0, Watson Walkback, Robot Fish, and Smartphone Data

  1. Software 2.0 (Andrej Karpathy) -- A large number of programmers of tomorrow do not maintain complex software repositories, write intricate programs, or analyze their running times. They collect, clean, manipulate, label, analyze, and visualize data that feeds neural networks. Supported by Pete Warden: I know this will all sound like more deep learning hype, and if I wasn’t in the position of seeing the process happening every day, I’d find it hard to swallow too, but this is real. Bill Gates is supposed to have said "Most people overestimate what they can do in one year and underestimate what they can do in 10 years," and this is how I feel about the replacement of traditional software with deep learning. There will be a long ramp-up as knowledge diffuses through the developer community, but in 10 years, I predict most software jobs won’t involve programming. As Andrej memorably puts it, “[deep learning] is better than you”!
  2. IBM Watson Not Even Close -- The interviews suggest that IBM, in its rush to bolster flagging revenue, unleashed a product without fully assessing the challenges of deploying it in hospitals globally. While it has emphatically marketed Watson for cancer care, IBM hasn’t published any scientific papers demonstrating how the technology affects physicians and patients. As a result, its flaws are getting exposed on the front lines of care by doctors and researchers who say that the system, while promising in some respects, remains undeveloped. AI has been drastically overhyped, and there will be more disappointments to come.
  3. Robot Spy Fish -- “The fish accepted the robot into their schools without any problem,” says Bonnet. “And the robot was also able to mimic the fish’s behavior, prompting them to change direction or swim from one room to another.”
  4. Politics Gets Personal: Effects of Political Partisanship and Advertising on Family Ties -- Using smartphone-tracking data and precinct-level voting, we show that politically divided families shortened Thanksgiving dinners by 20-30 minutes following the divisive 2016 election. [...] we estimate 27 million person-hours of cross-partisan Thanksgiving discourse were lost in 2016 to ad-fueled partisan effects. Smartphone data is useful data. (via Marginal Revolution)

Continue reading Four short links: 13 November 2017.


“Not hotdog” vs. mission-critical AI applications for the enterprise


Drawing parallels and distinctions around neural networks, data sets, and hardware.

Artificial intelligence has come a long way since the concept was introduced in the 1950s. Until recently, the technology had an aura of intrigue, and many believed its place was strictly inside research labs and science fiction novels. Today, however, the technology has become very approachable. The popular TV show Silicon Valley recently featured an app called “Not Hotdog,” based on cutting-edge machine learning frameworks, showcasing how easy it is to create a deep learning application.

Gartner has named applied AI and machine-learning-powered intelligent applications as the top strategic technology trend for 2017, and reports that by 2020, 20% of companies will dedicate resources to AI. CIOs are under serious pressure to commit resources to AI and machine learning. It is becoming easier to build an AI app like Not Hotdog for fun and experimentation, but what does it take to build a mission-critical AI application that a CIO can trust to help run a business? Let’s take a look.

For the purpose of this discussion, we will limit our focus to applications similar to Not Hotdog (i.e., applications based on image recognition and classification), although the concepts can be applied to a wide variety of deep learning applications. We will also limit the discussion to systems and frameworks, because personnel requirements can vary significantly based on the application. For example, for an image classification application built for retinal image classification, Google required the assistance of 54 ophthalmologists, whereas an application built for recognizing dangerous driving requires significantly less expertise and fewer people.

Image classification: A widely applicable deep learning use case

At its core, Not Hotdog is an image classification application. It classifies images into two categories: “hotdogs” and “not hotdogs.”

Figure 1. Screenshot from the “Not Hotdog” app, courtesy of Ankur Desai.

Image classification has many applications across industries. In health care, it can be used for medical imaging and diagnosing diseases. In retail, it can be used to spot malicious activities in stores. In agriculture, it can be used to determine the health of crops. In consumer electronics, it can provide face recognition and autofocus to camera-enabled devices. In the public sector, it can be used to identify dangerous driving with traffic cameras. The list goes on.

The fundamental difference between these applications and Not Hotdog is the core purpose of the application. Not Hotdog is intentionally meant to be farcical; as a result, it is an experimental app. The applications listed above, however, are meant to be critical to core business processes. Let’s take a look at how Not Hotdog is built, and then we will discuss additional requirements for mission-critical deep learning applications.

Not Hotdog: How is it built?

This blog post takes us through the wonderful journey of Not Hotdog’s development process. Here is a summary of how it is built. Not Hotdog uses the following key software components:

  • React Native: An app development framework that makes it easy to build mobile apps.
  • TensorFlow: An open source software library for machine learning. It makes building deep learning neural networks easy with pre-built libraries.
  • Keras: An open source neural network library written in Python. It is capable of running on top of Ten[...]

Four short links: 10 November 2017


Syntactic Sugar, Surprise Camera, AI Models, and Git Recovery

  1. Ten Features From Modern Programming Languages -- interesting collection of different flavors of syntactic sugar.
  2. Access Both iPhone Cameras Any Time Your App is Running -- Once you grant an app access to your camera, it can: access both the front and the back camera; record you at any time the app is in the foreground; take pictures and videos without telling you; upload the pictures/videos it takes immediately; run real-time face recognition to detect facial features or expressions.
  3. Deep Learning Models with Demos -- portable and searchable compilation of pre-trained deep learning models. With demos and code. Pre-trained models are deep learning model weights that you can download and use without training. Note that computation is not done in the browser.
  4. Git flight rules -- Flight rules are the hard-earned body of knowledge recorded in manuals that list, step-by-step, what to do if X occurs, and why. Essentially, they are extremely detailed, scenario-specific standard operating procedures. [...]

Continue reading Four short links: 10 November 2017.


Building a natural language processing library for Apache Spark



The O’Reilly Data Show Podcast: David Talby on a new NLP library for Spark, and why model development starts after a model gets deployed to production.

When I first discovered and started using Apache Spark, a majority of the use cases I used it for involved unstructured text. The absence of libraries meant rolling my own NLP utilities, and, in many cases, implementing a machine learning library (this was before the deep learning era, and MLlib was much smaller). I’d always wondered why no one bothered to create an NLP library for Spark when many people were using Spark to process large amounts of text. The recent, early success of BigDL confirms that users like the option of having native libraries.

In this episode of the Data Show, I spoke with David Talby of Pacific.AI, a consulting company that specializes in data science, analytics, and big data. A couple of years ago I mentioned the need for an NLP library within Spark to Talby; he not only agreed, he rounded up collaborators to build such a library. They eventually carved out time to build the newly released Spark NLP library. Judging by the reception received by BigDL and the number of Spark users faced with large-scale text processing tasks, I suspect Spark NLP will be a standard tool among Spark users.
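Libraries like Spark NLP follow the pipeline pattern familiar from Spark ML: a document flows through a chain of annotation stages, each adding structure on top of the last. The stdlib-only sketch below illustrates that pipeline pattern itself, not Spark NLP’s actual API; the `Stage`, `Tokenizer`, `Normalizer`, and `Pipeline` names here are hypothetical stand-ins.

```python
import re

class Stage:
    """One step in an annotation pipeline: reads a document dict, adds annotations."""
    def transform(self, doc):
        raise NotImplementedError

class Tokenizer(Stage):
    def transform(self, doc):
        # Split raw text into word tokens.
        doc["tokens"] = re.findall(r"\w+", doc["text"])
        return doc

class Normalizer(Stage):
    def transform(self, doc):
        # Lowercase the tokens produced by the previous stage.
        doc["normalized"] = [t.lower() for t in doc["tokens"]]
        return doc

class Pipeline:
    """Chains stages; each stage consumes the annotations of its predecessors."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, docs):
        out = []
        for doc in docs:
            for stage in self.stages:
                doc = stage.transform(doc)
            out.append(doc)
        return out

pipeline = Pipeline([Tokenizer(), Normalizer()])
result = pipeline.transform([{"text": "Spark NLP annotates Text"}])
print(result[0]["normalized"])  # ['spark', 'nlp', 'annotates', 'text']
```

The payoff of this design—in the real library as in the sketch—is that stages are composable and reusable: swapping in a different normalizer or adding a part-of-speech tagger doesn’t disturb the rest of the chain.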

Talby and I also discussed his work helping companies build, deploy, and monitor machine learning models. Tools and best practices for model development and deployment are just beginning to emerge—I summarized some of them in a recent post, and, in this episode, I discussed these topics with a leading practitioner.

Continue reading Building a natural language processing library for Apache Spark.


Four short links: 9 November 2017


Culture, Identifying Bots, Attention Economy, and Machine Bias

  1. Culture is the Behaviour You Reward and Punish -- When all the “successful” people behave in the same way, culture is made.
  2. Identifying Viral Bots and Cyborgs in Social Media -- it is readily possible to identify social bots and cyborgs on both Twitter and Facebook using information entropy and then to find groups of successful bots using network analysis and community detection.
  3. An Economy Based on Attention is Easily Gamed (The Economist) -- Americans touch their smartphones on average more than 2,600 times a day (the heaviest users easily double that). The population of America farts about 3m times a minute. It likes things on Facebook about 4m times a minute.
  4. Frankenstein's Legacy: Four conversations about Artificial Intelligence, Machine Learning, and the Modern World (CMU) -- A machine isn’t a human. It’s not going to necessarily incorporate bias even from biased training data in the same way that a human would. Machine learning isn’t necessarily going to adopt—for lack of a better word—a clearly racist bias. It’s likely to have some kind of much more nuanced bias that is far more difficult to predict. It may, say, come up with very specific instances of people it doesn’t want to hire that may not even be related to human bias.

Continue reading Four short links: 9 November 2017.


The phone book is on fire



Lessons from the Dyn DNS DDoS.

Continue reading The phone book is on fire.


Guidelines for how to design for emotions



Learn what makes for a rich emotional experience and why, even if we make our technology invisible, the connection will still be emotional.

Continue reading Guidelines for how to design for emotions.


Identifying viral bots and cyborgs in social media


Analyzing tweets and posts around Trump, Russia, and the NFL using information entropy, network analysis, and community detection algorithms.

Particularly over the last several years, researchers across a spectrum of scientific disciplines have studied the dynamics of social media networks to understand how information propagates as the networks evolve. Social media platforms like Twitter and Facebook include not only actual human users but also bots, or automated programs, that can significantly alter how certain messages are spread. While some information-gathering bots are beneficial or at least benign, the 2016 U.S. presidential election and the 2017 elections in France made clear that bots and sock puppet accounts (that is, numerous social accounts controlled by a single person) were effective in influencing political messaging and propagating misinformation on Twitter and Facebook. It is thus crucial to identify and classify social bots to combat the spread of misinformation, and especially the propaganda of enemy states and violent extremist groups.

This article is a brief summary of my recent bot detection research. It describes the techniques I applied and the results of identifying battling groups of viral bots and cyborgs that seek to sway opinions online. For this research, I applied techniques from complexity theory, especially information entropy, as well as network graph analysis and community detection algorithms, to identify clusters of viral bots and cyborgs (human users who use software to automate and amplify their social posts) that differ from typical human users on Twitter and Facebook. I briefly explain these approaches below, so deep prior knowledge of these areas is not necessary.

In addition to commercial bots focused on promoting click traffic, I discovered competing armies of pro-Trump and anti-Trump political bots and cyborgs. During August 2017, I found that anti-Trump bots were more successful than pro-Trump bots in spreading their messages. In contrast, during the NFL protest debates in September 2017, anti-NFL (and pro-Trump) bots and cyborgs achieved greater success and virality than pro-NFL bots.

Obtaining Twitter source data

The data sets for my Twitter bot detection research consisted of ~60M tweets that mentioned the terms “Trump,” “Russia,” “FBI,” or “Comey”; the tweets were collected via the free Twitter public API in separate periods between May 2017 and September 2017. I have made the source tweet IDs, as well as many of our analysis results files, available in a published data project. Researchers who wish to collaborate on this project should send a request email.

Detecting bots using information entropy

Information entropy is defined as “the average amount of information produced by a stochastic source of data.” As such, it is one effective way to quantify the amount of randomness within a data set. Because one can reasonably conjecture that actual humans are more complicated than automated programs, entropy can be a useful signal when one is attempting to identify bots, as has been done by a number of previous researchers. Of the recent research in social bot detection, particularly notable is the excellent work by groups of researchers from the University of California and Indiana Univ[...]
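As a concrete illustration of the entropy signal—a minimal sketch, not the author’s actual method—the snippet below computes the Shannon entropy of an account’s inter-post timing. The interval data and the 60-second bucketing are invented for the example; a metronomic posting schedule concentrates probability mass in few buckets and yields low entropy, while human timing spreads out and yields high entropy.

```python
import math
from collections import Counter

def shannon_entropy(values, bucket=60):
    """Shannon entropy (bits) of the distribution of values, bucketed into bins."""
    buckets = Counter(v // bucket for v in values)
    total = sum(buckets.values())
    return -sum((n / total) * math.log2(n / total) for n in buckets.values())

# Seconds between consecutive posts for two hypothetical accounts.
bot_gaps = [300, 300, 300, 301, 300, 299, 300, 300]    # metronomic: every ~5 min
human_gaps = [45, 3600, 120, 86000, 15, 900, 7200, 60]  # bursty and irregular

print(shannon_entropy(bot_gaps))    # low: posting times are highly predictable
print(shannon_entropy(human_gaps))  # higher: posting times are unpredictable
```

Real detectors combine timing entropy with other features (content entropy, follower graphs, posting volume), but this captures the core conjecture: automation is more regular than people.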

Building messaging in Go network clients



Learn useful Go communication techniques from the internals of the NATS client.

In today's world of data-intensive applications and frameworks tailored to tasks as varied as service discovery and stream processing, distributed systems are becoming ubiquitous. A reliable and high-performing networking client is essential to accessing and scaling such systems. Often, implementing a network client for a custom protocol can seem like a daunting task. If a protocol is too complex (as custom protocols have a tendency to become), maintaining and implementing the client can be a burden. Moreover, it is ideal to have good language support for doing asynchronous programming when implementing anything at a higher level than working with sockets. Handling all this customization and multi-language support can be greatly simplified by picking the right approach from the outset.

Fortunately, the Go programming language facilitates the task of developing networking clients by offering a robust standard library, excellent concurrency built-ins, and great tooling for doing benchmarks and static analysis, as well as having great performance and overall being a very flexible language for systems programming.

Continue reading Building messaging in Go network clients.


Consumer-driven innovation for continuous glucose monitoring in diabetes patients


CGMs are unique in the way consumers have taken it upon themselves to create modifications to medical devices.

Imagine if your life suddenly depended on monitoring your body’s reaction every time you had a snack, skipped a meal, or ate a piece of candy. This is a reality for approximately 1.25 million people in the U.S. who have been diagnosed with Type 1 Diabetes (T1D). People with T1D experience unhealthy fluctuations in blood glucose levels due to the destruction of beta cells in the pancreas by the person’s own immune system. Beta cells produce insulin, a hormone that allows your body to break down, use, or store glucose while maintaining a healthy blood sugar level throughout the day. Presently, there is no cure for T1D, so patients must be constantly vigilant about maintaining their blood glucose levels within a healthy range in order to avoid potentially deadly consequences.

Currently, continuous glucose monitors (CGMs) are the most effective way to manage T1D. However, consumers have already become frustrated with the limitations of commercially available CGMs, and are developing at-home modifications to overcome them. This, in turn, is influencing the direction of research and development in the biomedical devices industry, as multiple companies compete to create a CGM that appeals to the largest consumer population. Thus, consumer-driven innovation in CGM data access, CGM-insulin pump integration, and glucose sensor lifespan has led to rapid growth in the field of diabetes management devices.

Coping with the highs and lows

Patients with T1D need to monitor their blood glucose levels to ensure they don’t become hyperglycemic (high blood glucose levels) or hypoglycemic (low blood glucose levels), both of which can cause life-threatening complications. Throughout the late 1980s and 1990s, home blood glucose monitoring devices were the most accurate way to measure blood glucose levels. These devices use a lancet to prick the person’s finger to obtain real-time glucose levels from a drop of blood. Although still used today by some diabetics as a primary means of T1D management, finger-prick devices have considerable drawbacks, including the physical pain of frequent finger pricks, the static nature of the glucose reading, and the indiscretion and inconvenience of taking multiple readings throughout the day and night. It is no wonder, then, that the market potential for a device that conveniently and accurately measures blood glucose levels continues to soar.

The continuous glucose monitor (CGM)

At the turn of the 21st century, the integration of technology and medicine introduced a novel way for patients to gain control of T1D. In 1999, MiniMed obtained approval from the U.S. Food and Drug Administration (FDA) for the first continuous glucose monitor (CGM). The device was implanted by a physician and recorded the patient’s glucose levels for three days. The patient then returned to the clinic to have the sensor removed and discuss any trends revealed by the CGM. In 2001, MiniMed was acquired by Medtronic, a medical device company that specializes in making diabetes management devices. In 2003, Medtronic received FDA approval to launch the first real-time, patient-use CGM device. This kick-started an ongoing competition to create more accurate, user[...]

Susan Sons on building security from first principles



The O’Reilly Security Podcast: Recruiting and building future open source maintainers, how speed and security aren’t mutually exclusive, and identifying and defining first principles for security.

In this episode of the Security Podcast, O’Reilly’s Mac Slocum talks with Susan Sons, senior systems analyst for the Center for Applied Cybersecurity Research (CACR) at Indiana University. They discuss how she initially got involved with fixing the open source Network Time Protocol (NTP) project, recruiting and training new people to help maintain open source projects like NTP, and how security needn’t be an impediment to organizations moving quickly.

Continue reading Susan Sons on building security from first principles.


Four short links: 8 November 2017


Shadow Profiles, Theories of Learning, Feature Visualization, and Time to Reflect Reality

  1. How Facebook Figures Out Everyone You've Ever Met (Gizmodo) -- Behind the Facebook profile you’ve built for yourself is another one, a shadow profile, built from the inboxes and smartphones of other Facebook users. Contact information you’ve never given the network gets associated with your account, making it easier for Facebook to more completely map your social connections. (via Slashdot)
  2. Theories of Deep Learning (STATS 385) -- Stanford class. Lecture videos are posted after the lectures are given.
  3. Feature Visualization (Distill) -- How neural networks build up their understanding of images. Wonderfully visual.
  4. Mapping's Intelligent Agents -- Industry players are developing dynamic HD maps, accurate within inches, that would afford the car’s sensors some geographic foresight, allowing it to calculate its precise position relative to fixed landmarks. [...] Yet, achieving real-time “truth” throughout the network requires overcoming limitations in data infrastructure. The rate of data collection, processing, transmission, and actuation is limited by cellular bandwidth as well as on-board computing power. Mobileye is attempting to speed things up by compressing new map information into a “Road Segment Data” capsule that can be pushed between the master map in the Cloud and cars in the field. If nothing else, the system has given us a memorable new term, “Time to Reflect Reality,” which is the metric of lag time between the world as it is and the world as it is known to machines.

Continue reading Four short links: 8 November 2017.


Automated root cause analysis for Spark application failures


Reduce troubleshooting time from days to seconds.

Spark’s simple programming constructs and powerful execution engine have brought a diverse set of users to its platform. Many new big data applications are being built with Spark in fields like health care, genomics, financial services, self-driving technology, government, and media. Things are not so rosy, however, when a Spark application fails. Like applications in other distributed systems that have a large number of independent, interacting components, a failed Spark application throws up a large set of raw logs. These logs typically contain thousands of messages, including errors and stacktraces. Hunting for the root cause of an application failure in these messy, raw, and distributed logs is hard for Spark experts—and a nightmare for the thousands of new users coming to the Spark platform. We aim to radically simplify root cause detection of any Spark application failure by automatically providing insights to Spark users, like what is shown in Figure 1.

Figure 1. Insights from automatic root cause analysis improve Spark user productivity. Source: Adrian Popescu and Shivnath Babu.

Spark platform providers like the Amazon, Azure, Databricks, and Google clouds, as well as application performance management (APM) solution providers like Unravel, have access to a large and growing data set of logs from millions of Spark application failures. This data set is a gold mine for applying state-of-the-art artificial intelligence (AI) and machine learning (ML) techniques. In this blog, we look at how to automate the process of failure diagnosis by building predictive models that continuously learn from logs of past application failures for which the respective root causes have been identified. These models can then automatically predict the root cause when an application fails[1]. Such actionable root-cause identification significantly improves the productivity of Spark users.

Clues in the logs

A number of logs are available every time a Spark application fails. A distributed Spark application consists of a driver container and one or more executor containers. The logs generated by these containers have information about the application as well as how the application interacts with the rest of the Spark platform. These logs form the key data set that Spark users scan for clues to understand why an application failed.

However, the logs are extremely verbose and messy. They contain multiple types of messages, such as informational messages from every component of Spark, error messages in many different formats, stacktraces from code running on the Java Virtual Machine (JVM), and more. The complexity of Spark usage and internals makes things worse. Types of failures and error messages differ across Spark SQL, Spark Streaming, iterative machine learning and graph applications, and interactive applications from the Spark shell and notebooks (e.g., Jupyter, Zeppelin). Furthermore, failures in distributed systems routinely propagate from one component to another. Such propagation can cause a flood of error messages in the log and obscure the root cause. Figure 2 shows our overall solution to deal with these problems and to automate r[...]
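As a toy sketch of the predictive approach described above—not Unravel’s actual implementation—the snippet below matches a new failure log against labeled logs from past failures using bag-of-words cosine similarity. The log snippets and root-cause labels are invented for illustration; a production system would train a real classifier over far richer features than word counts.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Tokenize a log message into a word-count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical labeled logs from past failures whose root causes were identified.
labeled_failures = [
    ("java.lang.OutOfMemoryError: GC overhead limit exceeded", "executor-oom"),
    ("Container killed by YARN for exceeding memory limits", "yarn-memory-limit"),
    ("FileNotFoundException: hdfs input path does not exist", "missing-input"),
]

def predict_root_cause(log_text):
    """Predict a root cause by nearest labeled past failure."""
    vec = bag_of_words(log_text)
    best = max(labeled_failures, key=lambda item: cosine(vec, bag_of_words(item[0])))
    return best[1]

print(predict_root_cause("Task failed: java.lang.OutOfMemoryError in executor"))
# -> executor-oom
```

The same structure scales up in the real setting: the labeled corpus grows with every diagnosed failure, so the predictor continuously improves—which is exactly the continuous-learning loop the article describes.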

Implementing continuous delivery



The architectural design, automated quality assurance, and deployment skills needed for delivering continuous software.

There is an ever-increasing range of best practices emerging around microservices, DevOps, and the cloud, with some offering seemingly contradictory guidelines. There is one thing that developers can agree on: continuous delivery adds enormous value to the software delivery lifecycle through fast feedback and the automation of both quality assurance and deployment processes. However, the challenges for modern software developers are many, and attempting to introduce a methodology like continuous delivery—which touches all aspects of software design and delivery—means several new skills typically outside of a developer’s comfort zone must be mastered.

These are the key developer skills I believe are needed to harness the benefits of continuous delivery:

Continue reading Implementing continuous delivery.


Four short links: 7 November 2017


Disturbing YouTube, Sketchy Presentation Tool, Yammer UI, and Dance Your Ph.D. Winners

  1. Something is Wrong on the Internet (James Bridle) -- This is a deeply dark time, in which the structures we have built to sustain ourselves are being used against us — all of us — in systematic and automated ways. It is hard to keep faith with the network when it produces horrors such as these. While it is tempting to dismiss the wilder examples as trolling, of which a significant number certainly are, that fails to account for the sheer volume of content weighted in a particularly grotesque direction. This is another reason why propping your kids in front of YouTube is unsafe and unwise.
  2. ChalkTalk -- a digital presentation and communication language in development at New York University's Future Reality Lab. Using a blackboard-like interface, it allows a presenter to create and interact with animated digital sketches in order to demonstrate ideas and concepts in the context of a live presentation or conversation.
  3. YamUI -- Microsoft open-sourced the reusable component framework that they built for Yammer. [B]uilt with React on top of Office UI Fabric components.
  4. Dance Your Ph.D. Finalists -- look at the finalists on this site, read about the winners on Smithsonian.

Continue reading Four short links: 7 November 2017.


Developing successful AI apps for the enterprise


The IBM team encourages developers to ask tough questions, be patient, and be ready to fail gracefully.

In this episode of the O’Reilly Media Podcast, I sat down with Josh Zheng and Tom Markiewicz, developer advocates for IBM Watson. We discussed how natural language processing (NLP) APIs, and chatbots in particular, represent just one of the ways AI is augmenting humans and boosting productivity in enterprises today.

In order to apply AI to the enterprise, Zheng and Markiewicz explain, developers first need to understand the importance of sourcing and cleaning the organization’s data, much of which arrives in unstructured formats like email, customer support chats, and PDF documents. This can be “unglamorous” work, but it’s also critical to building a successful NLP app or chatbot. From there, Zheng and Markiewicz offer some practical tips for developers looking to build chatbots: have context awareness, fail gracefully, and have patience—building a successful chatbot can take time.

Below are some highlights from the discussion:

The hype behind chatbots

Josh Zheng: I think one of the biggest propellers for [chatbots] now is the increase in availability of the NLP capabilities. So, a chatbot uses a couple NLP techniques to make the whole thing work, but these things are actually not very new. They've been around for a while. I think what's different is that they've always been kind of locked up in research labs. There have been open source tools like Python's NLTK that made them more accessible, but it's not until recently, where companies like IBM and Google have put APIs on the cloud and made them very user-friendly and easily accessible, that large enterprises—which are usually more behind on the adoption curve—are able to access them and use them.

Use cases for chatbots in tech and travel

Josh Zheng: Autodesk built a customer support virtual agent on top of IBM Watson. This need came about when they first moved from a client-per-software model into more of a SaaS model. They really widened their customer reach, but with that came a lot more customers and the need for customer support. ... They were able to build a chatbot [Autodesk Virtual Agent] that is able to answer a lot of the questions. And it turns out, a lot of the questions people have are very similar. ... A lot of these are very simple questions that a machine can take over and let the humans focus on the complex questions or the complex requests. ... They were able to reduce the average time-to-resolution by a huge margin. After implementing the chatbot, we see that on average it takes 1.5 days to resolve questions involving humans and only 5.6 minutes to resolve chatbot-only questions.

Developer-first mentality: Prototyping your way to successful AI apps

Tom Markiewicz: You can try all of the APIs for free and just build little prototypes to see if that fits into what you're trying to do before planning a giant budget and going through the process. That's the beauty of the shift over the last couple of years, with more of a developer-first kind of mentality—the understanding is no longer, ‘Okay, we're g[...]

Four short links: 6 November 2017


IoT Standard, Probabilistic Programming, Go Scripting, and Front-End Checklist

  1. A Firmware Update Architecture for Internet of Things Devices -- draft submitted to IETF. It has a long way to go before it's a standard, but gosh it'd be nice to have this stuff without everyone reinventing it from scratch. (via Bleeping Computer)
  2. Pyro -- a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the back end. Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling.
  3. Neugram -- scripting language integrated with Go. Overview of the language.
  4. Front-End Checklist -- an exhaustive list of all elements you need to have / to test before launching your site / HTML page to production. (website)

Continue reading Four short links: 6 November 2017.


Four short links: 3 November 2017


End of Startups, Company Strategy, Complex Futures, and Bitcoin Energy

  1. Ask Not For Whom The Deadpool Tolls -- We live in a new world now, and it favors the big, not the small. The pendulum has already begun to swing back. Big businesses and executives, rather than startups and entrepreneurs, will own the next decade; today’s graduates are much more likely to work for Mark Zuckerberg than follow in his footsteps.
  2. Notes on Developing a Strategy and Designing a Company -- These notes provide a sequence of steps for creating or evaluating a strategy and associated company design, drawing clear lines to quantitative and evidence-based evaluation of enterprise performance and to financial valuation. The notes are intended for practical use by managers or instructors of MBAs and executive MBAs.
  3. Designing Our Complex Future with Machines (Joi Ito) -- We should learn from our history of applying over-reductionist science to society and try to, as Wiener says, “cease to kiss the whip that lashes us.” While it is one of the key drivers of science—to elegantly explain the complex and reduce confusion to understanding—we must also remember what Albert Einstein said: “Everything should be made as simple as possible, but no simpler.” We need to embrace the unknowability—the irreducibility—of the real world that artists, biologists, and those who work in the messy world of liberal arts and humanities are familiar with.
  4. Bitcoin Energy Consumption -- 7.51 U.S. households powered for a day by one transaction; $1B of energy used in a year to mine; Bitcoin has the same energy consumption as all of Nigeria. "Bitcoin" is how homo economicus pronounces "externality."

Continue reading Four short links: 3 November 2017.


Establishing the “why” of your product


How vision, mission, and values help you craft a winning product roadmap.

A product vision should be about having an impact on the lives of the people your product serves, as well as on your organization. It’s easy to get overwhelmed by the various concepts and terminology surrounding product development, and even more so when you start to consider the terminology involved with strategy. There are mission statements, company visions, values, goals, strategy, problem statements, purpose statements, and success criteria. Further, there are acronyms like KPI and OKR, which also seem potentially useful in guiding your efforts. How do you know which ideas apply to your situation, and where to start?

Whether your organization is mission-, vision-, or values-driven (or a combination thereof), these are all considered guiding principles to draw from and offer your team direction. For the purposes of this book, we’ll establish definitions for mission, vision, and values so we have a common language. Bear with us if you have different definitions yourself.

Mission defines your intent

A mission is not what you value, nor is it a vision for the future; it’s the intent you hold right now and the purpose driving you to realize your vision. A well-written mission statement will clarify your business’s intentions. Most often, we find mission statements contain a mix of realism and optimism, which are sometimes at odds with each other. There are four key elements to a well-crafted mission statement:

Value: What value does your mission bring to the world?
Inspiration: How does your mission inspire your team to make the vision a reality?
Plausibility: Is your mission realistic and achievable? If not, it’s disheartening, and people won’t be willing to work at it. If it seems achievable, however, people will work their tails off to make it happen.
Specificity: Is your mission specific to your business, industry, and/or sector? Make sure it’s relevant and resonates with the organization.

Here are two example missions. Can you guess the company for either?

Company A: To refresh the world... To inspire moments of optimism and happiness... To create value and make a difference.
Company B: To inspire and nurture the human spirit—one person, one cup, and one neighborhood at a time.

Company A is Coca-Cola, and Company B is Starbucks. While these missions may also be considered marketing slogans due to the size and popularity of each company, it’s important to note their aspirational context. Another aspect of mission that’s often overlooked is that it has to reflect what you do for someone else. That someone else is typically not your shareholders, but your customers.

Vision statements are very often conflated with mission. We’ve seen many company vision statements that are actually mission statements. The challenge with vision statements is to avoid the self-centered “be the best ___.”

Vision is the outcome you seek

A company vision should be about a longer-term outcome that has an impact on the lives of the pe[...]

Matt Stine on cloud-native architecture



The O’Reilly Programming Podcast: Applying architectural patterns and pattern languages to build systems for the cloud.

In this episode of the O’Reilly Programming Podcast, I talk with Matt Stine, global CTO of architecture at Pivotal. He is the presenter of the O’Reilly live online training course Cloud-Native Architecture Patterns, and he has spoken about cloud-native architecture at the recent O’Reilly Software Architecture Conference and O’Reilly Security Conference.

Continue reading Matt Stine on cloud-native architecture.


Deep convolutional generative adversarial networks with TensorFlow


How to build and train a DCGAN to generate images of faces, using a Jupyter Notebook and TensorFlow.

The concept of generative adversarial networks (GANs) was introduced less than four years ago by Ian Goodfellow. Goodfellow uses the metaphor of an art critic and an artist to describe the two models—discriminators and generators—that make up GANs. An art critic (the discriminator) looks at an image and tries to determine if it’s real or a forgery. An artist (the generator) who wants to fool the art critic tries to make a forged image that looks as realistic as possible. These two models “battle” each other; the discriminator uses the output of the generator as training data, and the generator gets feedback from the discriminator. Each model becomes stronger in the process. In this way, GANs are able to generate new complex data based on some amount of known input data—in this case, images. It may sound scary to implement GANs, but it doesn’t have to be. In this tutorial, we will use TensorFlow to build a GAN that is able to generate images of human faces.

Architecture of our DCGAN

In this tutorial, we are not trying to mimic simple numerical data—we are trying to mimic an image, which should even be able to fool a human. The generator takes a randomly generated noise vector as input data and then uses a technique called deconvolution to transform the data into an image. The discriminator is a classical convolutional neural network, which classifies real and fake images.

Figure 1. Simplified visualization of a GAN. Image source: “Generative Adversarial Networks for Beginners,” O’Reilly.

We are going to use the original DCGAN architecture from the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, which consists of four convolutional layers for the discriminator and four deconvolutional layers for the generator.

Setup

Please access the code and Jupyter Notebook for this tutorial on GitHub. All instructions are in the README file in the GitHub repository. A helper function will automatically download the CelebA data set to get you up and running quickly. Be sure to have matplotlib installed to actually see the images and requests to download the data set. If you don’t want to install them yourself, there is a Docker image included in the repository.

The CelebA data set

The CelebFaces Attributes data set contains more than 200,000 celebrity images, each with 40 attribute annotations. Since we just want to generate images of random faces, we are going to ignore the annotations. The data set includes more than 10,000 different identities, which is perfect for our purpose.

Figure 2. Some examples of the CelebA data set. Image courtesy of Dominic Monn.

At this point, we are also going to define a function for batch generation. This function will load our images and give us an array of images according to a batch size we are going to set later. To get better results, we will crop the images, s[...]
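A batch-generation helper of the kind just described can be sketched as follows. This is an illustrative stand-in, not the tutorial’s actual code: short strings substitute for loaded-and-cropped image arrays, and the function name is invented. The sketch drops a final partial batch, a common choice when training networks that expect a fixed batch dimension.

```python
def get_batches(dataset, batch_size):
    """Yield successive fixed-size batches from the dataset.

    A trailing partial batch is dropped so every yielded batch has
    exactly batch_size elements (fixed batch dimension for training).
    """
    for start in range(0, len(dataset) - batch_size + 1, batch_size):
        yield dataset[start:start + batch_size]

# Stand-in for 10 images; real code would load and crop CelebA files here.
images = [f"img_{i}" for i in range(10)]
batches = list(get_batches(images, batch_size=4))

print(len(batches))  # 2 full batches; the trailing 2 images are dropped
print(batches[0])    # ['img_0', 'img_1', 'img_2', 'img_3']
```

In the real tutorial, each element would be a cropped, normalized pixel array rather than a string, and the generator would read image files lazily to keep memory use bounded.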

Becoming an accidental architect



How software architects can balance technical proficiencies with an appropriate mastery of communication.

One of the demographics Brian and I noticed in the several O'Reilly Software Architecture Conferences we've hosted is the Accidental Architect: someone who makes architecture-level decisions on projects without a formal Architect title. Over time, we're building more material into the conference program to accommodate this common role.

But how does one transition from developer to Accidental Architect? It doesn't happen overnight.

Continue reading Becoming an accidental architect.


Four short links: 2 November 2017


Capsule Neural Networks, Adversarial Objects, Deep Learning Language, and Crowdsourced Pop Star

  1. Dynamic Routing Between Capsules -- new paper from one of the deep learning luminaries, Geoff Hinton. Hacker Noon explains: In this paper the authors project that human brains have modules called “capsules.” These capsules are particularly good at handling different types of visual stimulus and encoding things like pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. The brain must have a mechanism for “routing” low-level visual information to what it believes is the best capsule for handling it.
  2. Adversarial Objects -- Here is a 3D-printed turtle that is classified at every viewpoint as a “rifle” by Google’s InceptionV3 image classifier, whereas the unperturbed turtle is consistently classified as “turtle.”
  2. DeepNLP 2017 -- Oxford University applied course focusing on recent advances in analyzing and generating speech and text using recurrent neural networks.
  4. Virtual Singer Becomes Japanese Mega-Star (Bloomberg) -- CG-rendered pop star, singing crowdsourced songs. Crucial to Miku’s success is the ability for devotees to purchase the Yamaha-powered Vocaloid software and write their own songs for the star to sing right back at them. Fans then can upload songs to the web and vie for the honor of having her perform them at “live” gigs, in which the computer-animated Miku takes center stage, surrounded by human guitarists, drummers and pianists. This is fantastic. (via Slashdot)

Continue reading Four short links: 2 November 2017.


Building a culture of security at the New York Times



Runa Sandvik shares practical lessons on how to build and foster a culture of security across an organization.

Continue reading Building a culture of security at the New York Times.


An infinite set of security tools



Window Snyder says security basics are hard to implement consistently, but they're worth the effort.

Continue reading An infinite set of security tools.


2017 O'Reilly Defender Awards



The O’Reilly Defender Awards celebrate those who have demonstrated exceptional leadership, creativity, and collaboration in the defensive security field.

Continue reading 2017 O'Reilly Defender Awards.


Developing a successful data governance strategy


Multi-model database architectures provide a flexible data governance platform

Data governance has become increasingly critical as more organizations rely on data to make better decisions, optimize operations, create new products and services, and improve profitability. Upcoming data security regulations like the new EU GDPR will require organizations to take a forward-looking approach in order to comply. Additionally, regulated industries, such as health care and finance, spend a tremendous amount of money on compliance with regulations that are constantly changing.

Developing a successful data governance strategy requires careful planning, the right people, and the appropriate tools and technologies. It is necessary to implement the required policies and procedures across all of an organization's data in order to guarantee that everyone acts in accordance with the regulatory framework.

Implementing a modern data governance framework requires new technologies. Traditional relational database management systems (RDBMS) are based on the relational model, in which data are presented and stored in tabular form. RDBMS are not flexible enough to easily update this relational schema when data need to change frequently; because the model must be defined in advance, RDBMS are a poor fit for data governance. NoSQL technologies, such as document-oriented databases, provide a way to store and retrieve data modeled in non-tabular form, and they do not require a model in advance.

A flexible data governance framework prevents situations in which complex engineering systems hold several disconnected pieces of data alongside expensive hardware. It is able to ingest data without a long extract-transform-load (ETL) process, and it should support schema-free databases that store data from multiple disparate sources.
An important characteristic of flexible data governance frameworks is the ability to support semantic relationships using graphs, RDF triples, or ontologies, as shown in the following figure.

Figure 1. Example of a multi-model database system that stores all the entities as documents and the relationships as triples. Image source: MarkLogic.

A multi-model database is a general NoSQL approach that can store, index, and query data in one or more of the previous models. Due to this flexibility, the multi-model database is the best approach for addressing data governance. (For more information about multi-model databases, see the O'Reilly ebook Building on Multi-Model Databases.) With a multi-model database, all relevant data is stored as a document, and all information about source, date created, privileges, etc., is stored as metadata in an envelope around the document[...]
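The envelope pattern just described can be sketched roughly as follows (a hypothetical structure shown as a Python dict for illustration; a real multi-model database such as MarkLogic has its own document formats and security model):

```python
# Hypothetical "envelope" document: the business entity is stored
# schema-free as the content, while governance metadata (source,
# creation date, privileges) wraps it.

envelope = {
    "metadata": {
        "source": "crm-export",        # where the record came from
        "date_created": "2017-11-01",  # when it was ingested
        "privileges": ["analyst", "compliance-officer"],
    },
    "content": {                       # the entity itself, schema-free
        "type": "customer",
        "name": "Jane Doe",
        "country": "DE",
    },
}

def can_read(envelope, role):
    """Simple governance check against the envelope's privileges."""
    return role in envelope["metadata"]["privileges"]

print(can_read(envelope, "analyst"))  # True
print(can_read(envelope, "intern"))   # False
```

Because the metadata travels with the document rather than living in a separate system, governance rules can be enforced at query time across every data source the database has ingested.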

FDA regulation defines business strategy in direct-to-consumer genetic testing


The FDA is entering a new era of regulation as whole genome sequencing becomes more accessible to consumers.

Why do consumers seek direct-to-consumer (DTC) genetic testing? Consumers purchase services that sequence and analyze portions of their DNA to understand their risk for familial cancer, plan a safer pregnancy, optimize diet and fitness routines, and satisfy their curiosity about the secrets of their genome and ancestry. The diversifying reasons for consumer interest in DTC genetic testing are estimated to increase its global market value to $350 million by 2022. With such a valuable market at stake, regulation of DTC genetic testing by the U.S. Food and Drug Administration (FDA) has been under intense scrutiny by the biotech industry, health care providers, and consumers alike.

The FDA has been regulating medical devices since 1976, when Congress passed the Medical Device Amendments to the Federal Food, Drug, and Cosmetic Act. A medical device is defined as anything that can be used to diagnose, cure, treat, mitigate, or prevent disease, including an instrument, reagent, or "similar or related article." In vitro genetic tests are therefore considered medical devices.

The FDA regulates both genetic tests that are ordered and performed at home (DTC) and those that are ordered and performed in a health care setting or laboratory (laboratory-developed tests, or LDTs). These two types of tests require different levels of FDA regulation. LDTs are ordered by a physician, developed by and performed in a single laboratory, not sold to other laboratories, and not marketed to consumers. In theory, this reduces the risk of misunderstanding the results and the possibility of erroneous health-related decision-making by the consumer. DTC tests, on the other hand, must pass a higher regulatory bar and demonstrate that they clearly and safely relay information to consumers in the absence of a medical professional.
Unlike physician-delivered reports from LDTs, DTC tests do not provide an "informed intermediary" such as a physician or trained expert to explain results, reduce stress, and discuss follow-up options. Before the FDA began regulating DTC tests, consumers were purchasing them to learn about their risk for Parkinson's disease, how they might respond to certain types of drugs, whether they were likely to develop Alzheimer's disease, their ancestry, and more. Many of these results were diagnostic in nature, which prompted the FDA to intervene.

FDA crackdown on DTC testing

The FDA watchfully waited as DTC genetic testing companies developed products. The FDA assessed the potential risks and impacts of the products on the consumer, and did not regulate the conduct of DTC genetic testing companies until the companies brought products to market that could be classified as medical devices. [...]

Four short links: 1 November 2017


Crypto Docs, Ultrasound, Anti-Innovation Investors, and IoT Security

  1. Airborn OS -- attempt to do an open source Google Docs with crypto.
  2. ButterflyIQ -- ultrasound on a chip. IEEE covers it: announced FDA clearance for 13 clinical applications, including cardiac scans, fetal and obstetric exams, and musculoskeletal checks. Rather than using a dedicated piece of hardware for the controls and image display, the iQ works with the user’s iPhone. The company says it will start shipping units in 2018 at an initial price of about $2,000. See also adding orientation to ultrasound to turn 2D into 3D.
  3. Innovation vs. Activist Investors (Steve Blank) -- "activist investor" is all about financial games to transfer cash from banks to the investors, by loading the company with debt. The bad news is that, once they take control of a company, activist investors’ goal is not long-term investment. They often kill any long-term strategic initiatives. Often, the short-term cuts directly affect employee salaries, jobs, and long-term investment in R&D. The first things to go are R&D centers and innovation initiatives. They don't want genuine growth; they want fake growth that leaves the company weaker.
  4. Security, Privacy, and the Internet of Things (Matt Webb) -- if I meet a startup that has spent ages on its security, pre getting some real customer traction, I am going to be nervous that they have over-engineered the product and won't be able to iterate. The product will be too brittle or too rigid to wiggle and iterate and achieve fit. So, it's a balance.

Continue reading Four short links: 1 November 2017.


Empowering through security



Fredrick Lee shines a light on the ways security can be allowed into the world to do more.

Continue reading Empowering through security.


Great software is secure software



Chris Wysopal explains how defenders can help developers create secure software through coaching, shared code, and services.

Continue reading Great software is secure software.


Why cloud-native enterprise security matters



Matt Stine looks at three principles of cloud-native security and explains an approach that addresses the increasing volume and velocity of threats.

Continue reading Why cloud-native enterprise security matters.


The Dao of defense: Choosing battles based on the seven chakras of security



Katie Moussouris explains how to turn the forces that resist defense activities into the biggest supporters.

Continue reading The Dao of defense: Choosing battles based on the seven chakras of security.


Highlights from the O'Reilly Security Conference in New York 2017



Watch highlights covering security, defense, culture, and more. From the O'Reilly Security Conference in New York 2017.

Defenders from across the security world are coming together for the O'Reilly Security Conference in New York. Below you'll find links to highlights from the event.

Continue reading Highlights from the O'Reilly Security Conference in New York 2017.


Enterprise security: A new hope



Haroon Meer says a new type of security engineering is taking root, which suggests hope for effective corporate security at enterprise scale.

Continue reading Enterprise security: A new hope.


Amit Vij on GPU-accelerated analytics databases



The convergence of big data, artificial intelligence, and business intelligence

In this episode of the O’Reilly Podcast, I speak with Amit Vij, CEO and co-founder of Kinetica, a company that has developed an analytics database that uses graphics processing units (GPUs). We talk about how organizations are using GPU-accelerated databases to converge artificial intelligence (AI) and business intelligence (BI) on a single platform.


Discussion points:

  • The benefits of converging AI and BI in a single system: “You are orders of magnitude faster, and you have the ability to operate on real-time data, as opposed to operating on yesterday’s data,” Vij says.
  • The processing speed of GPUs: “The GPU really leverages parallel processing so you can maximize your throughput and take advantage of the advancements in hardware that have come about,” he says.
  • How GPU databases break down the walls between the data science and business domains: “Nowadays machine learning scientists and mathematicians can, in just three lines through SQL, execute their algorithms directly on data sets that are billions of objects,” Vij says.
  • How GPU databases integrate with machine learning tools such as TensorFlow, and how GPU applications can use the cloud

Continue reading Amit Vij on GPU-accelerated analytics databases.


Four short links: 31 October 2017


AI for Databases, One-Pixel Attacks, Adtech Uncanny Valley, and Mindreading Video

  1. Inference and Regeneration of Programs that Manipulate Relational Databases -- We present a new technique that infers models of programs that manipulate relational databases. This technique generates test databases and input commands, runs the program, then observes the resulting outputs and updated databases to infer the model. Because the technique works only with the externally observable inputs, outputs, and databases, it can infer the behavior of programs written in arbitrary languages using arbitrary coding styles and patterns.
  2. One-Pixel Attack for Fooling Deep Neural Networks -- The results show that 73.8% of the test images can be crafted to adversarial images with modification just on one pixel with 98.7% confidence on average.
  3. Facebook Is Not Listening To You -- but we are deep in the adtech uncanny valley.
  4. Recovering Video from fMRI -- the video and stills are impressive. (Still a blurry black-and-white picture and a set of guessed possible labels.)

Continue reading Four short links: 31 October 2017.


Four short links: 30 October 2017


README Maturity Model, Open Source Project Maturity Model, Walmart Robots, and Sparse Array Database

  1. README Maturity Model -- from bare minimum to purpose.
  2. Apache's Open Source Project Maturity Model -- It does not describe all the details of how our projects operate, but aims to capture the invariants of Apache projects and point to additional information where needed.
  3. Walmart is Getting Robots -- The retailer has been testing the robots in a small number of stores in Arkansas and California. It is now expanding the program and will have robots in 50 stores by the end of January.
  4. TileDB -- manages massive dense and sparse multi-dimensional array data that frequently arise in important scientific applications.

Continue reading Four short links: 30 October 2017.


Four short links: 27 October 2017


Gentle PR, Readable Arxiv, Sentiment Bias, and AI Coding from Sketches

  1. Tick Tock List (Matt Webb) -- simple and good advice for building working relationships with journalists.
  2. Arxiv Vanity -- renders papers from Arxiv as responsive web pages so you don't have to squint at a PDF.
  3. Sentiment Analysis Bias -- By classifying the sentiment of words using GloVe, the researchers "found every linguistic bias documented in psychology that we have looked for." Unsurprising, since the biases are present in the people who generate the text from which these systems are trained.
  4. AI Turns Sketched Interfaces into Prototype Code -- We built an initial prototype using about a dozen hand-drawn components as training data, open source machine learning algorithms, and a small amount of intermediary code to render components from our design system into the browser. We were pleasantly surprised with the result.

Continue reading Four short links: 27 October 2017.


5 steps to identify and validate a value proposition



Value propositions are a dime a dozen. Learn how to choose the ones that work.

Continue reading 5 steps to identify and validate a value proposition.


How to pick the right authoring tools for VR and AR



Identify the options available to develop an effective immersive experience.

The world of virtual reality (VR), augmented reality (AR), and mixed reality (MR) is growing at a seemingly exponential pace. Just a few key examples: Microsoft partnered with Asus and HP to release new MR headsets, Google Glass has made a comeback, Facebook Spaces launched, and a patent for AR glasses, filed by Apple in 2015, was just discovered during a patent search.

At the Apple Worldwide Developers Conference (WWDC) this past June, Apple announced ARKit, which will make augmented reality available to all 700 million iPhone and iPad users worldwide. The momentum and economic impact of these experiences continue to accelerate, so it's the perfect time to begin developing for them, and that means picking an authoring tool that's right for the reality you want to create.

Continue reading How to pick the right authoring tools for VR and AR.


Machine intelligence for content distribution, logistics, smarter cities, and more



The O’Reilly Data Show Podcast: Rhea Liu on technology trends in China.

In this episode of the Data Show, I spoke with Rhea Liu, analyst at China Tech Insights, a new research firm that is part of Tencent’s Online Media Group. If there’s one place where AI and machine learning are discussed even more than the San Francisco Bay Area, that would be China. Each time I go to China, there are new applications that weren’t widely available just the year before. This year, it was impossible to miss bike sharing, mobile payments seemed to be accepted everywhere, and people kept pointing out nascent applications of computer vision (facial recognition) to identity management and retail (unmanned stores).

Continue reading Machine intelligence for content distribution, logistics, smarter cities, and more.