O'Reilly Radar - Insight, analysis, and research about emerging technologies
All - O'Reilly Media

All of our Ideas and Learning material from all of our topics.

Updated: 2017-12-12T03:31:15Z


Four short links: 11 December 2017


Programming Falsehoods, Money Laundering, Vulnerability Markets, and Algorithmic Transparency

  1. Falsehoods Programmers Believe About Programming -- I feel like "understanding programming" is like learning about science in school: it's a progressive series of "well, actually it's more complicated than that" until you're left questioning your own existence. (Descartes would tell us computo ergo sum.)
  2. Kleptocrat -- You are a corrupt politician, and you just got paid. Can you hide your dirty money from The Investigator and cover your tracks well enough to enjoy it? The game is made by a global investigative firm that specializes in tracing assets. A+ for using games to Share What You Know. (via BoingBoing)
  3. Economic Factors of Vulnerability Trade and Exploitation -- In this paper, we provide an empirical investigation of the economics of vulnerability exploitation, and the effects of market factors on likelihood of exploit. Our data is collected first-hand from a prominent Russian cybercrime market where the most active attack tools reported by the security industry are traded. Our findings reveal that exploits in the underground are priced similarly or above vulnerabilities in legitimate bug-hunting programs, and that the refresh cycle of exploits is slower than currently often assumed. On the other hand, cybercriminals are becoming faster at introducing selected vulnerabilities, and the market is in clear expansion both in terms of players, traded exploits, and exploit pricing. We then evaluate the effects of these market variables on likelihood of attack realization, and find strong evidence of the correlation between market activity and exploit deployment. (via Paper a Day)
  4. Principles for Algorithmic Transparency (ACM) -- Awareness; Access and redress; Accountability; Explanation; Data provenance; Auditability; and Validation and Testing. (via Pia Waugh)

Continue reading Four short links: 11 December 2017.


Handling user-initiated actions in an asynchronous, message-based architecture


A simple framework for implementing message-based, user-initiated CRUD operations.

A message-based microservices architecture offers many advantages, making solutions easier to scale and expand with new services. The asynchronous nature of interservice interactions inherent to this architecture, however, poses challenges for user-initiated actions such as create-read-update-delete (CRUD) requests on an object. CRUD requests sent via messages may be lost or arrive out of order. Or, multiple users may publish request messages nearly simultaneously, requiring a repeatable method for resolving conflicting requests. To avoid these complexities, user-initiated actions are often treated synchronously, often via direct API calls to object management services. However, these direct interactions compromise some of the message-based architecture’s benefits by increasing the burden of managing and scaling the object-management service.

The ability to handle these scenarios via message-based interactions can prove particularly useful for managing objects that can be modified both by direct user interaction and by service-initiated requests, as well as for enabling simultaneous user updates to an object. In this blog post, we discuss an asynchronous pattern for implementing user-initiated requests via decoupled messages in a manner that can handle requests from multiple users and handle late-arriving messages.

A simple scenario: Team management

To illustrate our pattern, let us consider a Team Service that manages teams in a software-as-a-service solution. Teams have multiple users, or team members, associated with them. Teams also have two attributes that any team member can request to update: screen-name and display color-scheme. In this scenario, only one user, User A, has permission to make requests to create or delete any team. Any user can request to read the state of any team.
All permissions and security policies, including team membership, are managed by a separate authorization service that can be accessed synchronously. User interaction is facilitated by a user-interaction service, which allows users to publish CRUD requests to the message bus. Those events are then retrieved by the Team Service, which stores all allowed requests into an event registry and collapses that request history into the current state of a team upon receipt of a READ request.

While the addition of a user-interaction service increases the complexity of the user’s interaction with Team Service, it results in a more scalable and efficient architecture overall. The user-interaction service can be used to publish user requests for any other service as well, so that scaling up to accommodate bursts in user activity can be confined to the user-interaction service and applied to other services secondarily as needed. This approach also minimizes network traffic throughout the architecture, as all external calls are handled by this service. And while this post describes a solution that checks policies at the time of message receipt, a more sophisticated user-interaction service could implement policy checking at the time of publication.

Let us consider a use case where multiple users are attempting to update and read the state of a team. Figure 1 shows the request messages, the time they were published to the message bus, the time they were received by the Team Service, and the messages published to the message bus by the Team Service in response.

Figure 1. Incoming request messages and output messages published based on the message-based interprocess communication pattern described in this post. Figure courtesy of Arti Garg & Jon Powell.

The output messages published by the Team Service based on our pattern demonstrate how we achieve eventual consistency with the users’ desired team state.
Our Team Service denies all requests the requesting user does not have permission to make. It also denies any UPDATE or DELETE request associated with a team that does not already exist. Team s[...]
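The event-registry idea above — store every allowed request, then collapse the history into current state on READ — can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the event fields (`ts`, `op`, `attrs`) and the ordering-by-publish-time rule are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical event shape; field names are illustrative, not from the post.
@dataclass
class Request:
    ts: float    # publish timestamp, used to order late-arriving messages
    op: str      # "CREATE", "UPDATE", or "DELETE"
    attrs: dict  # e.g. {"screen_name": ..., "color_scheme": ...}

def collapse(events):
    """Fold an allowed-request history into the current team state (None = no team)."""
    state = None
    for e in sorted(events, key=lambda e: e.ts):  # replay in publish order
        if e.op == "CREATE":
            state = dict(e.attrs)
        elif e.op == "UPDATE" and state is not None:
            state.update(e.attrs)                  # last writer wins per attribute
        elif e.op == "DELETE":
            state = None
    return state

events = [
    Request(1.0, "CREATE", {"screen_name": "alpha", "color_scheme": "dark"}),
    Request(3.0, "UPDATE", {"color_scheme": "light"}),  # received late, still ordered
    Request(2.0, "UPDATE", {"screen_name": "beta"}),
]
print(collapse(events))  # {'screen_name': 'beta', 'color_scheme': 'light'}
```

Because the registry keeps the full request history and sorts by publish time at read time, a message that arrives out of order still lands in its correct position, which is how the pattern tolerates late-arriving messages.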

Four short links: 8 December 2017


Books for Young Engineers, Fake News, Digital Archaeology, and Bret Victor

  1. Books for Budding Engineers (UCL) -- a great list of books for kids who have a STEM bent.
  2. Data-Driven Analysis of "Fake News" -- In sheer numerical terms, the information to which voters were exposed during the election campaign was overwhelmingly produced not by fake news sites or even by alt-right media sources, but by household names like "The New York Times," "The Washington Post," and CNN. Without discounting the role played by malicious Russian hackers and naïve tech executives, we believe that fixing the information ecosystem is at least as much about improving the real news as it is about stopping the fake stuff. A lot of data to support this conclusion. (via Dean Eckles)
  3. Digital Archaeology -- papers from a conference, whose highlights were tweeted here. In case you thought for a second there was some corner of the world that software wasn't going to eat.
  4. Dynamicland -- rumours of Bret Victor's new AR project about computing with space. See also the Twitter account showing off goodies.

Continue reading Four short links: 8 December 2017.


The sixth wave: Automation of decisions



Amr Awadallah explains the historic importance of the next wave in automation.

Continue reading The sixth wave: Automation of decisions.


Impacting a nation



Ajey Gore looks at how the impossible can be made possible with technology and data insights.

Continue reading Impacting a nation.

Security intelligence and analytics: From big data to big impact



Tony Lee outlines the unique big data and AI challenges his company is tackling.

Continue reading Security intelligence and analytics: From big data to big impact.


From smart cities to intelligent societies



Carme Artigas asks: Are innovations like autonomous vehicles and flying drones making our societies more intelligent?

Continue reading From smart cities to intelligent societies.


Sentiment and emotion-aware natural language processing



Pascale Fung explains how emotional interaction is being integrated into machines.

Continue reading Sentiment and emotion-aware natural language processing.


Highlights from Strata Data Conference in Singapore 2017


Watch highlights covering machine learning, smart cities, automation, and more. From Strata Data Conference in Singapore 2017.

Experts from across the data world came together in Singapore for Strata Data Conference. Below you'll find links to highlights from the event.

Computational challenges and opportunities of astronomical big data
Melanie Johnston-Hollitt discusses a radio telescope project that will produce data on a scale that dwarfs most big data efforts. Watch "Computational challenges and opportunities of astronomical big data."

Siri: The journey to consolidation
Cesar Delgado joins Mick Hollison to look at how Apple is using its big data stack and expertise to solve non-data problems. Watch "Siri: The journey to consolidation."

Technology for humanity
Steve Leonard explores how Singapore is bringing together ambitious and capable people to build technology that can solve the world’s toughest challenges. Watch "Technology for humanity."

Responsible deployment of machine learning
Ben Lorica explains how to guard against flaws and failures in your machine learning deployments. Watch "Responsible deployment of machine learning."

Stop the fights. Embrace data
Felipe Hoffa says data-based conclusions are possible when stakeholders can easily analyze all relevant data. Watch "Stop the fights. Embrace data."

Industrial machine learning
Joshua Bloom explains why the real revolution will happen—in improved and saved lives—when machine learning automation is coupled with industrial data. Watch "Industrial machine learning."

Freedom or safety? Giving up rights to make our roads and cities safer and smarter
Bruno Fernandez-Ruiz outlines the tradeoffs we make to ensure safer transportation. Watch "Freedom or safety? Giving up rights to make our roads and cities safer and smarter."

From smart cities to intelligent societies
Carme Artigas asks: Are innovations like autonomous vehicles and flying drones making our societies more intelligent? Watch "From smart cities to intelligent societies."

The sixth wave: Automation of decisions
Amr Awadallah explains the historic importance of the next wave in automation. Watch "The sixth wave: Automation of decisions."

Impacting a nation
Ajey Gore looks at how the impossible can be made possible with technology and data insights. Watch "Impacting a nation."

Security intelligence and analytics: From big data to big impact
Tony Lee outlines the unique big data and AI challenges his company is tackling. Watch "Security intelligence and analytics: From big data to big impact."

Mining electronic health records and the web for drug repurposing
Kira Radinsky describes a system that mines medical records and Wikipedia to reduce spurious correlations and provide guidance about drug repurposing. Watch "Mining electronic health records and the web for drug repurposing."

Sentiment and emotion-aware natural language processing
Pascale Fung explains how emotional interaction is being integrated into machines. Watch "Sentiment and emotion-aware natural language processing."

Continue reading Highlights from Strata Data Conference in Singapore 2017.[...]

Mining electronic health records and the web for drug repurposing



Kira Radinsky describes a system that mines medical records and Wikipedia to reduce spurious correlations and provide guidance about drug repurposing.

Continue reading Mining electronic health records and the web for drug repurposing.


Machine learning at Spotify: You are what you stream



The O’Reilly Data Show Podcast: Christine Hung on using data to drive digital transformation and recommenders that increase user engagement.

In this episode of the Data Show, I spoke with Christine Hung, head of data solutions at Spotify. Prior to joining Spotify, she led data teams at the NY Times and at Apple (iTunes). Having led teams at three different companies, I wanted to hear her thoughts on digital transformation, and I wanted to know how she approaches the challenge of building, managing, and nurturing data teams.

I also wanted to learn more about what goes into building a recommender system for a popular consumer service like Spotify. Engagement should clearly be the most important metric, but there are other considerations, such as introducing users to new or “long tail” content.

Continue reading Machine learning at Spotify: You are what you stream.


Four short links: 7 December 2017


Measurement, Value, Privacy, and Openness

  1. Emerging Gov Tech: Measurement -- presenters from inside and outside of government to share how they were using measurement to inform decision-making. The hologram reminding people to dump biosecurity material was nifty, and the Whare Hauora project is much needed in a country with a lot of dank draughty houses.
  2. When Is a Dollar Not a Dollar -- a dollar of cost savings is worth one dollar to the customer, but a dollar of extra revenue is usually worth dimes or pennies (depending on the customer's profit margin).
  3. Learning with Privacy at Scale (Apple) -- about their differential privacy work. Their attention to detail is lovely. Whenever an event is generated on-device, the data is immediately privatized via local differential privacy and temporarily stored on-device using data protection, rather than being immediately transmitted to the server. After a delay based on device conditions, the system randomly samples from the differentially private records subject to the above limit and sends the sampled records to the server.
  4. When Open Data is a Trojan Horse: The Weaponization of Transparency in Science and Governance -- We suggest that legislative efforts that invoke the language of data transparency can sometimes function as ‘‘Trojan Horses’’ through which other political goals are pursued. Framing these maneuvers in the language of transparency can be strategic, because approaches that emphasize open access to data carry tremendous appeal, particularly in current political and technological contexts.
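The local differential privacy described in link 3 — privatize each event on-device before anything reaches the server — is easiest to see in its classic, simplest instance, randomized response. This is a stand-in for the idea, not Apple's actual mechanism:

```python
import random

def randomized_response(truth: bool, p_honest: float = 0.75) -> bool:
    # With probability p_honest report the true bit; otherwise flip a fair coin.
    # Every individual report is thus plausibly deniable.
    if random.random() < p_honest:
        return truth
    return random.random() < 0.5

def estimate_rate(reports, p_honest: float = 0.75) -> float:
    # Invert the noise: E[reported] = p_honest * rate + (1 - p_honest) * 0.5
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_honest) * 0.5) / p_honest

random.seed(0)
true_rate = 0.3
reports = [randomized_response(random.random() < true_rate) for _ in range(100_000)]
print(round(estimate_rate(reports), 2))  # close to 0.3
```

No single report reveals its sender's true bit, yet the server can still recover the population rate by inverting the known noise distribution, which is the trade local differential privacy makes.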

Continue reading Four short links: 7 December 2017.


When two trends fuse: PyTorch and recommender systems


A look at the rise of the deep learning library PyTorch and simultaneous advancements in recommender systems.

In the last few years, we have experienced the resurgence of neural networks owing to the availability of large data sets, increased computational power, innovation in model building via deep learning, and, most importantly, open source software libraries that ease use for non-researchers. In 2016, the rapid rise of the TensorFlow library for building deep learning models allowed application developers to take state-of-the-art models and put them into production. Deep learning-based neural network research and application development is currently a very fast moving field. As such, in 2017 we have seen the emergence of the deep learning library PyTorch. At the same time, researchers in the field of recommendation systems continue to pioneer new ways to increase performance as the number of users and items increases. In this post, we will discuss the rise of PyTorch, and how its flexibility and native Python integration make it an ideal tool for building recommender systems.

Differentiating PyTorch and TensorFlow

The commonalities between TensorFlow and PyTorch stop at both being general-purpose analytic problem-solving libraries and both using the Python language as their primary interface. PyTorch's roots are in dynamic libraries such as Chainer, where execution of operations in a computation graph takes place immediately. This is in contrast to TensorFlow-style design, where the computation graph is compiled and executed as a whole. (Note: Recently, TensorFlow has added Eager mode, which allows dynamic execution of computation graphs.)

The rise of PyTorch

PyTorch was created to address challenges in the adoption of its predecessor library, Torch. Due to the low popularity and general unwillingness among users to learn the programming language Lua, Torch—a mainstay in computer vision for several years—never saw the explosive growth of TensorFlow.

Both Torch and PyTorch are primarily developed, managed, and maintained by the team at Facebook AI Research (FAIR). PyTorch has seen a rise in adoption due to the native Python-style imperative programming already familiar to researchers, data scientists, and developers of popular Python libraries such as NumPy and SciPy. This imperative, flexible approach to building deep learning models allows for easier debugging compared to a compiled model. Whereas in a compiled model, errors will not be detected until the computation graph is submitted for execution, in a Define-by-Run-style PyTorch model, errors can be detected and debugging can be done as models are defined. This flexible approach is notably important for building models where the model architecture can change based on input. Researchers focusing on recurrent neural network (RNN)-based approaches for solving language understanding, translation, and other variable sequence-based problems have taken a particular liking to this Define-by-Run approach. Lastly, the built-in automatic differentiation feature in PyTorch gives model builders an easy way to perform the error-reducing backpropagation step.

Late in the summer of 2017, with release 0.2.0, PyTorch achieved a significant milestone by adding distributed training of deep learning models, a common necessity to reduce model training time when working with large data sets. Furthermore, the ability to translate PyTorch models to Caffe2 (another library from FAIR) was added via the Open Neural Network Exchange (ONNX). ONNX allows those struggling to put PyTorch into production to generate an intermediate representation of the model that can be transferred to the Caffe2 library for deployment from servers to mobile devices. Certai[...]
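The Define-by-Run contrast can be illustrated with a toy scalar autograd: the computation graph is recorded as ordinary Python expressions execute, and gradients flow back through it on demand. This is a sketch of the idea only, not PyTorch's actual implementation.

```python
class Var:
    """Scalar value that records the ops applied to it (define-by-run)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # (parent_var, local_gradient) pairs
        self.grad = 0.0

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Var(self.value * other.value,
                   parents=((self, other.value), (other, self.value)))

    def __add__(self, other):
        # d(a+b)/da = d(a+b)/db = 1
        return Var(self.value + other.value,
                   parents=((self, 1.0), (other, 1.0)))

    def backward(self, upstream=1.0):
        # Chain rule: push the upstream gradient to each recorded parent.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

x = Var(3.0)
y = Var(4.0)
z = x * y + x          # the graph is built as these expressions run
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

Because the graph exists only as a trace of what actually ran, a shape error or a typo in the model surfaces at the line that defines it, which is the debugging advantage the post describes; it also means the trace can differ per input, which is what makes variable-length RNN architectures natural in this style.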

Who, me? They warned you about me?


Thoughts on "We are the people they warned you about." Chris Anderson recently published "We are the people they warned you about," a two part article about the development of killer drones. Here's the problem he's wrestling with: "I’m an enabler ... but I have no idea what I should do differently." That's a good question to ask. It's a question everyone in technology needs to ask, not just people who work on drones. It's related to the problem of ethics at scale: almost everything we do has consequences. Some of those consequences are good, some are bad. Our ability to multiply our actions to internet scale means we have to think about ethics in a different way. The second part of Anderson's article gets personal. He talks about writing code for swarming behavior after reading Kill Decision, a science fiction novel about swarming robots running amok. And he struggles with three issues. First (I'm very loosely paraphrasing Anderson's words), "I have no idea how to write code that can't run amok; I don't even know what that means." Second, "If I don't write this code, someone else will—and indeed, others have." And third, "Fine, but my code (and the other open source code) doesn't exhibit bad behavior—which is what the narrator of Kill Decision would have said, right up to the point where the novel's drones became lethal." How do we protect ourselves, and others, from the technology we invent? Anderson tries to argue against regulatory solutions by saying that swarming behavior is basically math; regulation is essentially regulating math, and that makes no sense. As Anderson points out, Ben Hammer, CTO of Kaggle, tweeted that regulating artificial intelligence essentially means regulating matrix multiplication and derivatives. I like the feel of this reductio ad absurdum, but neither Anderson nor I buy it—if you push far enough, it can be applied to anything. The FCC regulates electromagnetic fields; the FAA regulates the Bernoulli effect. 
We can regulate the effects or applications of technology, though even that's problematic. We can require AI systems to be "fair" (if we can agree on what "fair" means); we can require that drones not attack people (though that might mean regulating emergent and unpredictable behavior). A bigger issue is that we can only regulate agents that are willing to be regulated. A law against weaponized drones doesn't stop the military from developing them. It doesn't even prevent me from building one in my basement; any punishment for violation comes after the fact. (For that matter, regulation rarely, if ever, happens before the technology has been abused.) Likewise, laws don't prevent governments or businesses from abusing data. As any speeder knows, it's only a violation if you get caught. A better point is that, whether or not we regulate, we can't prevent inventions from being invented, and once invented, they can't be put back into the box. The myth of Pandora's box is powerful and resonant in the 21st century. The box is always opened. It's always already opened; the desire to open the box, the desire to find the box and then open it, is what drives invention in the first place. Since our many Pandora's boxes are inevitably opened, and since we can't in advance predict (or even mitigate) the consequences of opening them, perhaps we should look at the conditions under which those boxes are opened. The application of any technology is determined by the context in which it was invented. Part of the reason we're so uncomfortable with nuclear energy is that it has been the domain of the military. A large part of the reason we don't have Thorium reactors, which[...]

How to bring fast data access to microservice architecture with in-memory data grids



For stack scalability, elasticity at the business logic layer should be matched with elasticity at the caching layer.

Ever-increasing numbers of people are using mobile and web apps, and the average time spent using them continues to grow. This use comes with expectations of real-time response, even during peak access times. Modern, cloud-native applications have to stand up to all this application demand. In addition to user-initiated requests, applications respond to requests from other applications or handle data streamed in from sensors and monitors. Lapses in service will add friction to customer touch points, cause big declines in operational efficiency, and make it difficult to take advantage of fleeting sales opportunities during periods of high activity, like Cyber Monday.

What are your options for dealing with demand?

Scaling large monoliths dynamically on demand becomes impractical as these systems grow. Meeting demands like Cyber Monday by scaling up a large, clunky deployment of a monolith, and then scaling it back down when the higher capacity is no longer needed, doesn't work well: as monoliths grow, they become increasingly fragile, severely limiting the scope of what can be changed.
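The caching layer mentioned in the deck is typically a read-through cache in front of the backing services. A minimal sketch of the idea (names and TTL policy are illustrative assumptions, not tied to any particular in-memory data grid product):

```python
import time

class ReadThroughCache:
    """Minimal read-through cache: the elastic layer in front of a backing store."""
    def __init__(self, loader, ttl=30.0):
        self.loader = loader  # fetches from the backing service on a miss
        self.ttl = ttl        # seconds before a cached entry goes stale
        self._store = {}      # key -> (value, time cached)

    def get(self, key):
        hit = self._store.get(key)
        now = time.monotonic()
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]                 # fresh hit: no backend call
        value = self.loader(key)          # miss or stale: load and repopulate
        self._store[key] = (value, now)
        return value

backend_calls = []
cache = ReadThroughCache(loader=lambda k: backend_calls.append(k) or k.upper())
print(cache.get("sku-1"), cache.get("sku-1"), len(backend_calls))  # SKU-1 SKU-1 1
```

The point for scalability is visible in the last line: repeated reads during a traffic burst hit the cache, so only the caching layer needs to be scaled elastically while the backing service sees a fraction of the load.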

Continue reading How to bring fast data access to microservice architecture with in-memory data grids.


How self-service data avoids the dangers of “shadow analytics”


Without the proper cataloging, curation, and security that self-service data platforms allow, companies are left vulnerable to cybersecurity threats and misinformation.

In our personal lives, data makes the world go round. You can answer almost any question in a second thanks to Google. Booking travel anywhere on the planet is just a few clicks away. And your smartphone has apps for pretty much anything you can think of, and more. It’s a great time to love coffee and craft beer. And it’s a great time to be a consumer of data. Life has never been better.

When we get to work, however, our relationship with data isn’t nearly as friendly. While everyone’s job depends on data, most of us struggle to use it as seamlessly as we do in our personal lives. At work, data is hard to find. It’s slow to access. Each data set has its own tools and “cheat sheet” to use successfully. Frequently, the data we need isn’t available to us in a shape we need, so we open a ticket with IT, then wait and hope for the best. Collaborating with other users to work with data is far from simple, and typically the solution is to copy the data into Excel and email it to someone. Alternately, we might set up a BI server or database we manage ourselves, even potentially on a server hiding in a closet or under someone’s desk.

We’ve seen this rodeo before

If this sounds familiar, it should. For the past decade, workers have found ways to work around limitations in the hardware and software provided by IT—a trend we call “shadow IT.” Workers started to bring their own laptops, iPads, and smartphones until IT either made these devices available or adopted “bring your own device” policies. Popular apps like Evernote, Dropbox, and Gmail, as well as cloud service providers like AWS and Google Cloud, quickly became everyday tools for millions of people in the workplace, opening up companies to massive security vulnerabilities that most are still trying to address today.

Companies learned that simply shutting down access to these systems and keeping the status quo was not an option. They learned to improve the quality of hardware and software in order to keep people using governed systems, removing the need to take matters into their own hands.

A new trend: Shadow analytics

What we saw with software and hardware over the past decade is now happening with data, a phenomenon we call “shadow analytics.” People want to do their jobs, and they’ll find a way to be more productive if IT organizations don’t provide the right tools. They are frustrated with their inability to access and use data, and they’re finding workarounds by moving data into ungoverned environments that sidestep the essential controls put in place by organizations. For example, users download data into spreadsheets, load data into cloud applications, and even run their own database and analytics software on their desktops.

Shadow analytics creates an environment where users can reach misleading conclusions. Because data is disconnected from the source, users can lose important updates in their copies, and the answers to questions they developed may no longer apply. In addition, with each user managing their own copy of the data, each copy can be wrong in different ways. As a result, IT organizations are frequently asked, “My colleague and I have different answers to an essential question—why?”

Why shadow analytics is a big deal

Data is the greatest asset—and the greatest liability—of most organizations. Cybersecurity threats are evolving rapidly. In the past few years, phishing attacks and intellectual property theft have g[...]

Stop the fights. Embrace data



Felipe Hoffa says data-based conclusions are possible when stakeholders can easily analyze all relevant data.

Continue reading Stop the fights. Embrace data.


Responsible deployment of machine learning



Ben Lorica explains how to guard against flaws and failures in your machine learning deployments.

Continue reading Responsible deployment of machine learning.


Technology for humanity



Steve Leonard explores how Singapore is bringing together ambitious and capable people to build technology that can solve the world’s toughest challenges.

Continue reading Technology for humanity.


Computational challenges and opportunities of astronomical big data



Melanie Johnston-Hollitt discusses a radio telescope project that will produce data on a scale that dwarfs most big data efforts.

Continue reading Computational challenges and opportunities of astronomical big data.


Siri: The journey to consolidation



Cesar Delgado joins Mick Hollison to discuss how Apple is using its big data stack and expertise to solve non-data problems.

Continue reading Siri: The journey to consolidation.


Industrial machine learning



Joshua Bloom explains why the real revolution will happen—in improved and saved lives—when machine learning automation is coupled with industrial data.

Continue reading Industrial machine learning.


Freedom or safety? Giving up rights to make our roads and cities safer and smarter



Bruno Fernandez-Ruiz discusses the tradeoffs we make to ensure safer transportation.

Continue reading Freedom or safety? Giving up rights to make our roads and cities safer and smarter.


DARPA and the future of synthetic biology



Learn how the Defense Advanced Research Projects Agency (DARPA) has spurred significant advances in the promising field of synthetic biology.

Google, Venture Capital, and BioPharma

Biotech is a different world from Silicon Valley’s tech scene, and one with its own ways and traditions. Where Silicon Valley has a reputation for being brash and anarchistic, biotech developments are perceived—even if it’s inaccurate—as being bound by regulation and a cat’s cradle of connections with academia, big pharma, and government. But they have one thing in common: an overlapping venture capital community ready to fund the next big thing, be it cloud computing or CRISPR.

And, for many biology startups, there’s something unexpected: Google is there.

Continue reading DARPA and the future of synthetic biology.


Rich Smith on redefining success for security teams and managing security culture



The O’Reilly Security Podcast: The objectives of agile application security and the vital need for organizations to build functional security culture.

In this episode of the Security Podcast, I talk with Rich Smith, director of labs at Duo Labs, the research arm of Duo Security. We discuss the goals of agile application security, how to reframe success for security teams, and the short- and long-term implications of your security culture.

Continue reading Rich Smith on redefining success for security teams and managing security culture.


Four short links: 6 December 2017


TouchID for SSH, Pen Testing Checklist, Generativity, and AI Data

  1. SeKey -- an SSH agent that allows users to authenticate to UNIX/Linux SSH servers using the Secure Enclave.
  2. Web Application Penetration Testing Checklist -- a useful checklist of things to poke at if you're doing a hygiene sweep.
  3. The Bullet Hole Misconception -- Computer technology has not yet come close to the printing press in its power to generate radical and substantive thoughts on a social, economical, political, or even philosophical level. I really like this metric of success.
  4. AI Index (Stanford) -- This report aggregates a diverse set of data, makes that data accessible, and includes discussion about what is provided and what is missing. Most importantly, the AI Index 2017 Report is a starting point for the conversation about rigorously measuring activity and progress in AI in the future.

Continue reading Four short links: 6 December 2017.


Transfer learning from multiple pre-trained computer vision models


Using the keras TensorFlow abstraction library, the method is simple, easy to implement, and often produces surprisingly good results.

The multitude of methods jointly referred to as “deep learning” have disrupted the fields of machine learning and data science, rendering decades of engineering know-how almost completely irrelevant—or so common opinion would have it. Of all these, one method that stands out for its overwhelming simplicity, robustness, and usefulness is the transfer of learned representations. For computer vision in particular, this approach has brought unparalleled ability within reach of practitioners at all levels, making previously insurmountable tasks as easy as from keras.applications import *.

Put simply, the method dictates that a large data set be used to learn to represent the object of interest (an image, a time series, a customer, even a network) as a feature vector, in a way that lends itself to downstream data science tasks such as classification or clustering. Once learned, the representation machinery can then be used by other researchers and on other data sets, almost regardless of the size of the new data or the computational resources available.

In this blog post, we demonstrate the use of transfer learning with pre-trained computer vision models, using the keras TensorFlow abstraction library. The models we will use have all been trained on the large ImageNet data set and have learned to produce a compact representation of an image in the form of a feature vector. We will use this mechanism to learn a classifier for species of birds.

There are many ways to use pre-trained models; the choice generally depends on the size of the data set and the extent of computational resources available. These include:

Fine tuning: The final classifier layer of a network is swapped out and replaced with a softmax layer of the right size for the current data set, while the learned parameters of all other layers are kept. This new structure is then further trained on the new task.

Freezing: The fine-tuning approach requires relatively large computational power and larger amounts of data. For smaller data sets, it is common to “freeze” the first few layers of the network, meaning the parameters of the pre-trained network are not modified in those layers, while the remaining layers are trained on the new task as before.

Feature extraction: This is the loosest use of pre-trained networks. Images are fed forward through the network, and the activations of a specific layer (often one just before the final classifier output) are used as the representation. No training at all is performed with respect to the new task. This image-to-vector mechanism produces an output that can be used in virtually any downstream task.

In this post, we will use the feature extraction approach. We will first use a single pre-trained deep learning model, and then combine four different ones using a stacking technique. We will classify the CUB-200 data set. This data set (brought to us by vision.caltech) contains 200 species of birds, and was chosen, well...for the beautiful bird images.

Figure 1. 100 random birds drawn from the CUB-200 data set. Image courtesy of Yehezkel Resheff.

First, we download and prepare the data set. On Mac/Linux this is done by: curl[...]
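The feature-extraction approach described above can be sketched end to end. The block below is a stand-in, not the article's keras pipeline: a fixed random projection plays the role of the frozen pre-trained network (in the real workflow you would use a keras.applications model with its classifier top removed), a nearest-centroid rule plays the downstream classifier, and the two-class "image" data is synthetic, invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained network: a fixed (never trained) projection
# that maps a flattened 32x32x3 image to a compact 64-dim feature vector.
W = rng.normal(size=(32 * 32 * 3, 64))

def extract_features(images):
    """Feed images forward through the frozen stand-in network."""
    flat = images.reshape(len(images), -1)
    return np.maximum(flat @ W, 0.0)  # ReLU-style nonlinearity

# Tiny synthetic "data set": two well-separated classes of images.
class_a = rng.normal(loc=2.0, size=(20, 32, 32, 3))
class_b = rng.normal(loc=-2.0, size=(20, 32, 32, 3))
X = np.concatenate([class_a, class_b])
y = np.array([0] * 20 + [1] * 20)

feats = extract_features(X)  # no training at all with respect to the task

# Downstream task: a nearest-centroid classifier on the feature vectors.
centroids = np.stack([feats[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(
    ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1
)
accuracy = (pred == y).mean()
print(feats.shape, accuracy)
```

The shape of the sketch mirrors the real pipeline: everything before `feats` is the reusable representation machinery; everything after is the cheap, task-specific part that changes per data set.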

Four short links: 5 December 2017


Analog Computing, Program Synthesis, Midwestern Investment, and Speed Email

  1. A New Analog Computer (IEEE) -- Digital programming made it possible to connect the input of a given analog block to the output of another one, creating a system governed by the equation that had to be solved. No clock was used: voltages and currents evolved continuously rather than in discrete time steps. This computer could solve complex differential equations of one independent variable with an accuracy that was within a few percent of the correct solution.
  2. Barliman -- Barliman is a prototype "smart editor" that performs real-time program synthesis to try to make the programmer's life a little easier. Barliman has several unusual features: given a set of tests for some function foo, Barliman tries to "guess" how to fill in a partially specified definition of foo to make all of the tests pass; given a set of tests for some function foo, Barliman tries to prove that a partially specified definition of foo is inconsistent with one or more of the tests; given a fully or mostly specified definition of some function foo, Barliman will attempt to prove that a partially specified test is consistent with, or inconsistent with, the definition of foo.
  3. Investing in the Midwest (NYT) -- Steve Case closes a fund backed by every tech billionaire you've heard of, for investing in midwestern businesses. Mr. Schmidt of Alphabet said he was sold on the idea from the moment he first heard about it. “I felt it was a no-brainer,” he said. “There is a large selection of relatively undervalued businesses in the heartland between the coasts, some of which can scale quickly.”
  4. Email Like a CEO -- see also How to Write Email with Military Precision. (via Hacker News)

Continue reading Four short links: 5 December 2017.


3 reasons why you should submit a proposal to speak at Fluent 2018



Find out how to get your voice heard and bring a positive impact to a receptive audience.

The 2018 Fluent CFP closes in a few days — on December 8, 2017. Don’t be alarmed! You still have plenty of time to send in your proposal. Here’s why I really, really hope you do.

1. Our industry always needs new voices

I can’t even begin to express how much we want to see fresh faces and hear new voices on our stages. Yes, we all love seeing talks by industry rockstars (and we’ll definitely have our share of those), but what makes Fluent an important event is that it’s also a launchpad for the next generation of industry leaders. (Hint: This means you.)

Continue reading 3 reasons why you should submit a proposal to speak at Fluent 2018.


Four short links: 4 December 2017


Campaign Cybersecurity, Generated Games, Copyright-Induced Style, and Tech Ethics

  1. Campaign Cybersecurity Playbook -- The information assembled here is for any campaign in any party. It was designed to give you simple, actionable information that will make your campaign’s information more secure from adversaries trying to attack your organization—and our democracy.
  2. Games By Angelina -- The aim is to develop an AI system that can intelligently design videogames, as part of an investigation into the ways in which software can design creatively. The creator's GitHub account has some interesting procedural generation projects, too. (via MIT Technology Review)
  3. Every Frame a Painting -- Nearly every stylistic decision you see about the channel—the length of the clips, the number of examples, which studios’ films we chose, the way narration and clip audio weave together, the reordering and flipping of shots, the remixing of 5.1 audio, the rhythm and pacing of the overall video—all of that was reverse engineered from YouTube’s Copyright ID. [...] So, something that was designed to restrict us ended up becoming our style. And yet, there were major problems with all of these decisions. We wouldn’t realize it until years later, but by creating such a simple, approachable style that skirted the edge of legality, we pretty much cut ourselves off from our most ambitious topics.
  4. Love the Sin, Hate the Sinner (Cory Doctorow) -- the best review of Tim's new book that I've seen. [T]he reason tech went toxic was because unethical people made unethical choices, but those choices weren't inevitable or irreversible.

Continue reading Four short links: 4 December 2017.


Four short links: 1 December 2017


Creepy Kid Videos, Cache Smearing, Single-Image Learning, and Connected Gift Guide

  1. /r/ElsaGate -- Reddit community devoted to understanding and tackling YouTube's creepy kid videos, from business models to software used to create them.
  2. Cache Smearing (Etsy) -- a technique for solving the problem where one key is so hot it overloads a single server: turn the single key into multiple keys so its load can be spread over several servers.
  3. Deep Image Prior -- Deep convolutional networks have become a popular tool for image generation and restoration. Generally, their excellent performance is imputed to their ability to learn realistic image priors from a large number of example images. In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, superresolution, and inpainting. Furthermore, the same prior can be used to invert deep neural representations to diagnose them, and to restore images based on flash/no flash input pairs.
  4. Privacy Not Included (Mozilla) -- shopping guide for connected gifts, to help you know whether they respect your privacy or not. (most: not so much)
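The cache-smearing idea in link 2 fits in a few lines. Everything below is hypothetical (the server names, the smear count, and the dict standing in for memcached); the point is only that writes populate N sub-keys and reads pick one at random, so a hot key's traffic lands on several hosts instead of one.

```python
import hashlib
import random

NUM_SMEARS = 4  # copies of each hot key (assumption: tuned per workload)
SERVERS = ["cache-a", "cache-b", "cache-c", "cache-d", "cache-e"]

def server_for(key):
    """Deterministic placement: hash the key onto one server."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

def smear_write(cache, key, value):
    """Writes populate every smeared sub-key of the hot key."""
    for i in range(NUM_SMEARS):
        sub = f"{key}#{i}"
        cache.setdefault(server_for(sub), {})[sub] = value

def smear_read(cache, key):
    """Reads pick one sub-key at random, spreading the load."""
    sub = f"{key}#{random.randrange(NUM_SMEARS)}"
    return cache[server_for(sub)][sub]

cache = {}
smear_write(cache, "hot:homepage", "<html>...</html>")
servers_used = {server_for(f"hot:homepage#{i}") for i in range(NUM_SMEARS)}
print(sorted(servers_used))  # the hot key's sub-keys land on these hosts
print(smear_read(cache, "hot:homepage"))
```

The trade-off is the usual one: N-fold write amplification and N separately expiring copies, in exchange for read traffic that no single server has to absorb alone.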

Continue reading Four short links: 1 December 2017.


10,000 messages a minute



Lessons learned from building engineering teams under pressure.

Continue reading 10,000 messages a minute.


Katharine Jarmul on using Python for data analysis



The O’Reilly Programming Podcast: Wrangling data with Python’s libraries and packages.

In this episode of the O’Reilly Programming Podcast, I talk with Katharine Jarmul, a Python developer and data analyst whose company, Kjamistan, provides consulting and training on topics surrounding machine learning, natural language processing, and data testing. Jarmul is the co-author (along with Jacqueline Kazil) of the O’Reilly book Data Wrangling with Python, and she has presented the live online training course Practical Data Cleaning with Python.

Continue reading Katharine Jarmul on using Python for data analysis.


Four short links: 30 November 2017


Object Models, Open Source Voice Recognition, IoT OS, and High-Speed Robot Wars

  1. Object Models -- a (very) brief run through the inner workings of objects in four very dynamic languages. Readable and informative. (via Simon Willison)
  2. Mozilla Releases Open Source Voice Recognition and Voice Data Set -- we have included pre-built packages for Python, NodeJS, and a command-line binary that developers can use right away to experiment with speech recognition. Data set features samples from more than 20,000 people, reflecting a diversity of voices globally.
  3. FreeRTOS -- Amazon adds sync and promises OTA updates. Very clever from Amazon: this is foundational software for IoT.
  4. Japanese Sumo Robots (YouTube) -- omg, the speed of these robots. (via BoingBoing)

Continue reading Four short links: 30 November 2017.


DUO: Connecting the home to the hospital



New technology is allowing doctors to diagnose and treat heart disease faster and more efficiently.

Detection of heart sounds in the early 1800s and before was limited to direct contact between the physician’s ear and the patient’s chest. Auscultation changed markedly in 1816 with René Laënnec’s invention of the stethoscope, though the wooden tube prototype was far from the binaural apparatus of modern medical practice. Over the years, the stethoscope has seen minor adjustments to improve material quality and ease of use; however, the fundamental design has remained largely the same. Biotech startup Eko addresses this stagnation with a digital take on a tool utilized by over 30 million clinicians around the world.

Like many pioneers, Berkeley-based company Eko began with a question: if there is a technical gap in cardiology, how can it be resolved? Its answer came in the form of a digital stethoscope known as CORE, which transmits heart sound data straight to a clinician’s compatible device. While CORE represented an unprecedented improvement in auscultation, according to cofounder and COO Jason Bellet, the company’s newest device provides an equally, if not more, powerful tool.

Continue reading DUO: Connecting the home to the hospital.


Ethics at scale


Scale changes the problems of privacy, security, and honesty in fundamental ways.

For the past decade or more, the biggest story in the technology world has been scale: building systems that are larger, that can handle more customers, and can deliver more results. How do we analyze the habits of tens, then thousands, then millions or even billions of users? How do we give hordes of readers individualized ads that make them want to click?

As technologists, we've never been good at talking about ethics, and we’ve rarely talked about the consequences of our systems as they’ve grown. Since the start of the 21st century, we’ve acquired the ability to gather and (more important) store data at global scale; we’ve developed the computational power and machine learning techniques to analyze that data; and we’ve spawned adversaries of all creeds who are successfully abusing the systems we have created. This “perfect storm” makes a conversation about ethics both necessary and unavoidable. While the ethical problems we face are superficially the same as always (privacy, security, honesty), scale changes these problems in fundamental ways. We need to understand how these problems change. Just as we’ve learned how to build systems that scale, we need to learn how to think about ethical issues at scale.

Let's start with the well-known and well-reported story of the pregnant teenager who was outed to her parents by Target's targeted marketing. Her data trail showed that she was buying products consistent with being pregnant, so Target sent her coupon circulars advertising the baby products she would eventually need. Her parents wondered why their daughter was suddenly receiving coupons for disposable diapers and stretch-mark cream, and drew some conclusions. Many of us find that chilling. Why? Nothing happened that couldn't have happened at any small town pharmacy.

Any neighborhood pharmacist could notice that a girl had added some weight, and was looking at a different selection of products. The pharmacist could then draw some conclusions, and possibly make a call to her parents. The decision to call would depend on community values: in some cultures and communities, informing the parents would be the pharmacist's responsibility, while others would value the girl’s privacy. But that's not the question that's important here, and it's not why we find Target's action disturbing.

The Target case is chilling because it isn't a case about a single girl and a single pregnancy. It's about privacy at scale. It's a case about everyone who shops at any store larger than a neighborhood grocery. The analysis that led Target to send coupons for diapers is the same analysis they do to send coupons to you and me. Most of the time, another piece of junk mail goes into the trash, but that’s not always the case. If a non-smoker buys a pack of cigarettes, do their insurance rates go up? If an alcoholic buys a six-pack, who finds out? What can be gathered from our purchase histories and conveyed to others, and what are the consequences? And who is making decisions about ho[...]

Associative memory AI aids in the battle against financial crime


The O’Reilly Media Podcast: Gayle Sheppard, Saffron AI Group at Intel, and David Thomas, Bank of New Zealand.

In this episode of the O’Reilly Media Podcast, I spoke with Gayle Sheppard, vice president and general manager of Saffron AI Group at Intel, and David Thomas, chief analytics officer for Bank of New Zealand (BNZ). Our conversations centered around the utility of artificial intelligence in the financial services industry.

Associative memory AI: Enabling machines to fight financial crime with human-like reasoning

According to Sheppard, associative memory AI technologies are best thought of as reasoning systems that combine the memory-based learning seen in humans—recognizing patterns, spotting anomalies, and detecting new features almost instantly—with data. Compared to traditional machine learning methods, Sheppard says, associative memory AI unifies multiple data sources—both structured and unstructured—without relying on pre-defined models. Associative memory AI reasons on that data to deliver insights quickly, accurately, and with less training data—in some cases, using as little as 20% of the available training set. Furthermore, transparency is built into the fabric of associative memory AI, so one can more easily explain the system’s path to insight.

Applications of associative memory AI in the enterprise are varied. “Our strategy is to build comprehensive decision systems for financial services, supply chain management, and manufacturing and defense. ... These systems combine what we think are the best of learning approaches, such as deep learning, traditional statistical machine learning, associative learning, and others. Our goal is to deliver a sum that is much greater than its individual parts.”

Intel has developed a sharp focus on the financial services industry, with its October launch of the Intel Saffron Anti-Money Laundering (AML) Advisor. Sheppard described four challenges and opportunities that Intel sees in the financial services industry:

  1. Financial institutions—specifically banks and insurers—collect data at massive scale, and this data is expected to double every two years. Human and machine-generated data is growing 10 times faster than traditional business data.
  2. Structured, transactional data has dominated their systems, but almost everything in banking is customer-pattern based, or unstructured. Excluding this data from analysis significantly reduces the relevance of its outcomes.
  3. Models become stale quickly, but criminals are constantly evolving. It can take up to nine months to update statistical models, from design to test, for banking customers. That’s before these new models can even be put into production. “The fastest time for the industry to deploy simple model changes is considered five to six months. That’s a lot of time for crime to run rampant for these institutions,” said Sheppard.
  4. Financial organizations can store data on several hundred systems of record. How can customers efficiently access and analyze this data that’s spread across multiple location[...]

Four short links: 29 November 2017


Avoiding State Surveillance, Parallel Algorithms, Smart Tactics, and Voting Security

  1. The Motherboard Guide to Avoiding State Surveillance -- a lot of good advice, even if you're not at risk from a nation state (e.g., don't run your own mail server).
  2. A Library of Parallel Algorithms (CMU) -- what it says on the box. See also CMU's "Algorithm Design: Parallel and Sequential" book.
  3. EFF's Clever Tactic (Cory Doctorow) -- when you argue about DRM, the pro-DRM side always says that all this stuff is an unfortunate side-effect of the law, and that they're really only trying to stop pirates, promise and cross my heart. So, here's what we did at the W3C: we proposed a membership rule that would allow members to use DRM law to sue anyone who infringed their copyrights—but took away their rights to sue people who were breaking DRM for some other reason, like adapting works for people with disabilities, or investigating critical security flaws, or creating legal, innovative new businesses. Needless to say, they didn't go for that proposal, which revealed their true motives.
  4. Cybersecurity of Voting Machines (Matt Blaze) -- his written testimony before Congress. I offer three specific recommendations: (1) Paperless DRE voting machines should be immediately phased out from U.S. elections in favor of systems, such as precinct-counted optical scan ballots, that leave a direct artifact of the voter’s choice. (2) Statistical “risk limiting audits” should be used after every election to detect software failures and attacks. (3) Additional resources, infrastructure, and training should be made available to state and local voting officials to help them more effectively defend their systems against increasingly sophisticated adversaries.

Continue reading Four short links: 29 November 2017.


Quickly learn about the common methods for analyzing architecture tradeoffs



Mark Richards explores two basic techniques for analyzing tradeoffs of architecture characteristics.

Continue reading Quickly learn about the common methods for analyzing architecture tradeoffs.


What it means to be data-driven in the media industry


O’Reilly Media Podcast: David Hsieh, of Qubole, in conversation with John Slocum, of MediaMath.

In a recent episode of the O’Reilly Media Podcast, David Hsieh, senior vice president of marketing at Qubole, sat down with John Slocum, vice president of MediaMath’s data management platform (DMP), to discuss DataOps in the media industry. “DataOps” refers to the promotion of communication between formerly siloed data, teams, and systems. As discussed in Creating a Data-Driven Enterprise with DataOps, a report published by Qubole and O’Reilly in 2016, DataOps leverages process change, organizational realignment, and technology to facilitate relationships between everyone who handles data: developers, data engineers, data scientists, analysts, and business users.

As a programmatic advertising platform, MediaMath has a unique lens into the shifting business models across the media industry, and how DataOps is playing a role in those shifts. During the podcast, Hsieh and Slocum discussed how data has transformed the culture and overall goals of organizations in the media industry in the past 10 years, and shared some best practices for companies that are just embarking on their journey toward becoming data driven.

Here are some highlights from the conversation:

Greater focus on outcomes, return on investment

What we're seeing specifically evolve over the past 10 years since the introduction of programmatic is that clients are more focused on outcomes than they previously were. Outcomes being return on marketing investment, return on spend, whereas previous goals might have just been to spend a particular budget. It might have been driving a particular number of clicks or visitors and driving reach, but with data, incorporating that into our analytics and optimization in our platform, we're able to get a sense of what clients should expect to achieve and help them achieve that. We find our most sophisticated clients are able to differentiate themselves from their competition. We see data providing that differentiation.

Rapid evolution of devices, tool sets leads to sophisticated usage of data

MediaMath has long offered simple aggregated reporting that will help advertisers understand the performance they're seeing in their campaigns. ... That was certainly good enough the first few years of MediaMath's operation, but what we started to see maybe four or five years ago or so, was a lot of demand for more granular insight. More custom insight. Some of our more sophisticated clients were asking for the ability to see performance by a specific sample of audience data. They may not want to see performance aggregated in a particular campaign or a strategy, but they might want to sample performance elsewhere and look for audiences, that ma[...]

Four short links: 28 November 2017


Code for One, Grid Component, Tinder Data, and Engineering Reorg

  1. Structure -- He wrote Structur. He wrote Alpha. He wrote mini-macros galore. Structur lacked an “e” because, in those days, in the Kedit directory eight letters was the maximum he could use in naming a file. In one form or another, some of these things have come along since, but this was 1984 and the future stopped there. Howard, who died in 2005, was the polar opposite of Bill Gates—in outlook as well as income. Howard thought the computer should be adapted to the individual and not the other way around. One size fits one. The programs he wrote for me were molded like clay to my requirements—an appealing approach to anything called an editor. Personalized software is a wonderful luxury. Programmers forget how rare it is. (via Clive Thompson)
  2. React Data Grid -- open source Excel-like grid component built with React.
  3. What Tinder Knows (Guardian) -- the UK laws that let you request this data are wonderful; without it, we'd have little idea how much of our lives we reveal.
  4. How We Reorganized Instagram’s Engineering Team While Quadrupling Its Size (HBR) -- Once we decided to reorg, the first thing we did was determine our desired outcomes as a team. We gathered our leadership in a room and came up with 20 different outcomes—from speed to cost efficiency—and prioritized them, No. 1 to No. 20. We picked our top five outcomes, which became our organizational principles: Minimize dependencies between teams and code; Have clear accountability with the fewest decision-makers; Groups have clear measures; Top-level organizations have roadmaps; Performance, stability, and code quality have owners.

Continue reading Four short links: 28 November 2017.


The state of data analytics and visualization adoption


A survey of usage, access methods, projects, and skills.

If you’re an IT professional, software engineer, or software product manager, over the past few years you’ve likely considered using modern data platforms such as Apache Hadoop; NoSQL databases like MongoDB, Cassandra, and Kudu; search databases like Solr and Elasticsearch; in-memory systems like Spark and MemSQL; and cloud data stores such as Amazon Redshift, Google BigQuery, and Snowflake. But are these modern data technologies here to stay, or are they a flash in the pan, with the traditional relational database still reigning supreme?

In the spring of 2017, Zoomdata commissioned O’Reilly Media to create and execute a survey assessing the state of the data and analytics industry. The focus was on understanding the penetration of modern big and streaming data technologies, how data analytics are being consumed by users, and what skills organizations are most interested in staffing. Nearly 900 people from a diverse set of industries, as well as government and academia, responded to the survey. Below is a preview of some of the insights provided by the survey.

Modern data platforms have eclipsed relational databases as a main data source

Of course, relational databases continue to be the core of online transactional processing (OLTP) systems. However, one of the most interesting findings was that when asked about their organization’s main data sources, less than one-third of survey respondents listed the relational database, with around two-thirds selecting non-relational sources. This is a clear indication that these non-relational data platforms have firmly crossed the chasm from early adopters into mainstream use. Of further interest is the fact that just over 40% of respondents indicated their organizations are using what could be categorized as “modern data sources”—such as Hadoop, in-memory, NoSQL, and search databases—as a main data source.

These modern data sources are optimized to handle what is often referred to as the “three V’s” of big data: very large data volumes, high-velocity streaming data, and a high variety of unstructured and semi-structured data, such as text and log files. Drilling further into the details, analytic databases (19%) and Hadoop (14%) were the two most popular non-relational sources. Analytic databases are a category of SQL-based data stores—such as Teradata, Vertica, and MemSQL—that typically make use of column stores and/or massively parallel processing (MPP) to greatly speed up the kinds of large aggregate queries used when analyzing data. Hadoop, as many readers know, is a software framework used for distributed storage and processing of very large structured and unstructured data sets on computer clusters built from commodity hardware.

Download the full report to learn about other findings we uncovered in this survey, including: The proportion of organizations with bi[...]

How to set up and run your own Architectural Katas



Neal Ford explains the ground rules for building software architectures.

Continue reading How to set up and run your own Architectural Katas.


5 things every product manager should know how to measure


Using analytics to improve your product doesn’t have to be complicated.

As a product manager, you need to understand who your users are in order to build a great product that meets your users’ needs. That is easy to say but much harder to do. One approach to understanding your users and finding new ways to improve your product is through analytics. However, analytics tools can seem complicated to configure, and even once configured, much of the data included in the reports can seem overwhelming or, worse yet, meaningless.

Using analytics to improve your product doesn’t have to be complicated. There are simple ways to start, and you don’t have to measure everything. Instead, you want to use analytics tools to measure just a few aspects of your product’s performance and your users’ interests. Even a little bit of data collected and analyzed can help you make better decisions. To help you get started, here are five questions every product manager should ask about their users and their product, along with some ideas on how to begin answering these questions.

1. What words do people use to describe your product?

One of the more critical things to measure is how the people who use your product talk about whatever it is your product does. At a basic level, these words offer clues to the problems users face and the types of solutions they are seeking. This helps you decide what features to build and what improvements to make. By knowing what users care about the most, you can also get ideas on how to prioritize the long list of tasks.

Knowing how your users speak can give you even more information. By knowing the words people use, you can get insight into the minds of your users. As you study how people speak, patterns will emerge about the user’s motivations and intentions for working with products like yours. How big is this problem they are facing? Is this a systemic problem or something new? Are your users optimistic or pessimistic about the problems your product solves? Do they view your product as an essential aspect of their lives or their company’s success?

When you get inside people’s minds, you can do a better job connecting with your users. As you adjust your product’s features to align with these thoughts, your users will feel like your product “gets” them because your product will talk and think the way they talk and think. There are numerous ways to learn how your users speak—through analyzing forums, support tickets, and more. But doing some type of word analysis of forum posts, tickets, social shares, or other similar written conversations with users can get tricky fast. A simpler way to begin is to research the keywords people type into Google related to your product and your industry. Keyword research tools aren’t just for search engine marketers—they are a vital first step for product managers to understand somet[...]
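The word analysis of tickets and forum posts mentioned above can start as a plain term-frequency count before it gets tricky. A minimal sketch; the ticket snippets and stopword list are invented for illustration:

```python
import re
from collections import Counter

# Hypothetical support-ticket snippets standing in for real user text.
tickets = [
    "Export to CSV keeps failing on large reports",
    "How do I export a report as CSV?",
    "Login fails after password reset",
    "CSV export is slow for big reports",
]

# A tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"to", "on", "do", "i", "a", "as", "after", "is", "for", "the", "how"}

def term_frequencies(texts):
    """Lowercase, tokenize, drop stopwords, and count remaining terms."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

freqs = term_frequencies(tickets)
print(freqs.most_common(3))  # the vocabulary users actually reach for
```

Even this crude count surfaces the pattern a product manager would act on: these hypothetical users say "export" and "CSV" far more than anything else, which hints at where the pain is.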

Four short links: 27 November 2017


PV Growth, Digital Rights, Unit Testing, and Open Source Innovation

  1. Photovoltaic Growth: Reality vs. Projections of the International Energy Agency -- that graph.
  2. Digital Rights in Australia -- three aims: to assess the evolving citizen uses of digital platforms, and associated digital rights and responsibilities in Australia and Asia, identifying key dynamics and issues of voice, participation, marginalization and exclusion; to develop a framework for establishing the rights and legitimate expectations that platform stakeholders—particularly everyday users—should enjoy and the responsibilities they may bear; to identify the best models for governance arrangements for digital platforms and for using these environments as social resources in political, social, and cultural change.
  3. Unit Testing Doesn’t Affect Codebases the Way You Would Think -- nice approach to checking hypotheses like "unit testing results in fewer lines of code per method," with results (it doesn't).
  4. Capabilities for Open Source Innovation (Allison Randal) -- Over the past decade, I’ve been researching open source and technology innovation, partly through employment at multiple different companies that engage in open source, and partly through academic work toward completing a Master’s degree and soon starting a Ph.D. The heart of this research is looking into what makes companies successful at open source and also at technology innovation. It turns out there are actually many things in common between the two.

Continue reading Four short links: 27 November 2017.


Four short links: 24 November 2017


Modern Spam, Communist Cybernetics, Computer Simulation, and Retail Big Data

  1. Spam is Back -- “The bulk of our lives online could be spammy,” Brunton said. “Our whole experience could be monetized. We could just get used—forgive my language—to really shitty content all the time.” Some say this has already happened.
  2. Communist Cybernetics -- "Cybernetics provides the theory of social control with precise quantitative methods for analyzing control processes and especially social information, which is a necessary attribute of control." So, the effect was to create a discourse in which cybernetics was emptied of its utopian promise and turned into a system for managing data about governance.
  3. SimH -- The Computer History Simulation Project. Source on GitHub.
  4. Big Data Systems, Labour, Control, and Modern Retail Stores -- It was found that retail work involves a continual movement between a governance regime of control reliant on big data systems that seek to regulate and harness formal labour and automation into enterprise planning, and a disciplinary regime that deals with the symbolic, interactive labour that workers perform and acts as a reserve mode of governmentality if control fails. This continual movement is caused by new systems of control being open to vertical and horizontal fissures. While retail functions as a coded assemblage of control, systems are too brittle to sustain the code/space and governmentality desired.

Continue reading Four short links: 24 November 2017.


Four short links: 23 November 2017


Fuzzing, Time Series, Unix 1ed, and Failing

  1. The Art of Fuzzing -- demos here.
  2. Clustering of Time Series is Meaningless -- clusters extracted from these time series are forced to obey a certain constraint that is pathologically unlikely to be satisfied by any data set, and because of this, the clusters extracted by any clustering algorithm are essentially random. While this constraint can be intuitively demonstrated with a simple illustration and is simple to prove, it has never appeared in the literature.
  3. Run the First Edition of Unix via Docker -- In this article, you'll see how to run a PDP-11 simulator through Docker to interact with Unix as it was back in 1972. (via Simon Willison)
  4. Failing Well -- “What we’re trying to teach is that failure is not a bug of learning; it’s the feature,” said Rachel Simmons, a leadership development specialist in Smith’s Wurtele Center for Work and Life.

Continue reading Four short links: 23 November 2017.


The current state of Apache Kafka



The O’Reilly Data Show Podcast: Neha Narkhede on data integration, microservices, and Kafka’s roadmap.

In this episode of the Data Show, I spoke with Neha Narkhede, co-founder and CTO of Confluent. As I noted in a recent post on “the age of machine learning,” data integration and data enrichment are non-trivial and ongoing challenges for most companies. Getting data ready for analytics—including machine learning—remains an area of focus for most companies. It turns out, “data lakes” have become staging grounds for data; more refinement usually needs to be done before data is ready for analytics. By making it easier to create and productionize data refinement pipelines on both batch and streaming data sources, analysts and data scientists can focus on analytics that can unlock value from data.

On the open source side, Apache Kafka continues to be a popular framework for data ingestion and integration. Narkhede was part of the team that created Kafka, and I wanted to get her thoughts on where this popular framework is headed.

Continue reading The current state of Apache Kafka.


Christie Terrill on building a high-caliber security program in 90 days



The O’Reilly Security Podcast: Aligning security objectives with business objectives, and how to approach evaluation and development of a security program.

In this episode of the Security Podcast, I talk with Christie Terrill, partner at Bishop Fox. We discuss the importance of educating businesses on the complexities of “being secure,” how to approach building a strong security program, and aligning security goals with the larger processes and goals of the business.

Continue reading Christie Terrill on building a high-caliber security program in 90 days.


Clinical trial? There’s an app for that


Several new apps are making it easier for doctors and patients to conduct clinical trials. When the Cleveland Clinic was looking for participants in its more than 130 cancer trials, its outreach teams didn’t just cold-call doctors and hospitals. They tried something new: they launched an app. The Cancer Trial App is designed for two distinct populations: patients looking for clinical trials and doctors who are treating those patients. Users who download the app on either iOS or Android receive information on trials by disease, phase, physician, and hospital location. In addition, the app details each trial’s objective, eligibility rules, and progress.

The Cleveland Clinic’s app is a simple solution to a complex problem: how to make clinical trials easier for individuals to participate in. Because of a lack of vendors, complex regulatory tools, and institutional inertia, many of the digital workflow and recordkeeping tools that are commonplace in other parts of biology never made it to clinical trials. According to Premier Research, a life sciences consulting firm, only several hundred of the more than 150,000 mobile health applications published as of December 2016 focus on clinical trials. And of those, most are directory apps like the Cleveland Clinic’s, rather than more complicated apps that enhance the patient experience.

However, innovation is happening in the clinical trial mobile app space—even if it’s taking longer than expected. In the United States, the Food & Drug Administration is working with stakeholders in clinical trials to examine ways smartphone apps can enable speedier, more cost-effective clinical trials that accelerate the new drug approval pipeline. The FDA has a public docket out on “Using Technologies and Innovative Methods to Conduct FDA-Regulated Clinical Investigations of Investigational Drugs” that seeks to create consensus.
The challenges

Designing apps for cancer studies and automating patient data is far more complicated, both developmentally and legally, than creating a new smartphone game, which is why development has been slow. Stakeholders need to manage the following factors:

  • The cost of app development, which is a challenge when many pharmaceutical companies don’t have robust in-house Android and iOS teams and may not know what questions to ask outside contractors to keep costs down
  • Privacy and quality expectations in randomized clinical trial research
  • Regulatory issues
  • A culture inside the clinical trial world that prioritizes written documentation and paperwork over digital recordkeeping, at least when interacting directly with patients

There are several ways of dealing with these challenges. Additiona[...]

Four short links: 22 November 2017


Decision-making, Code Duplication, Container Security, and Information vs Attention

  1. Decision-making Under Stress -- Under acute (short-lived, high-intensity) stress, we focus on short-term rapid responses at the expense of complex thinking. Application to startup culture left as exercise to the reader.
  2. DéjàVu: A Map of Code Duplicates on GitHub -- This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub, representing over 482 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. (via Paper a Day)
  3. NIST Guidance on Application Container Security -- This bulletin offers an overview of application container technology and its most notable security challenges. It starts by explaining basic application container concepts and the typical application container technology architecture, including how that architecture relates to the container life cycle. Next, the article examines how the immutable nature of containers further affects security. The last portion of the article discusses potential countermeasures that may help to improve the security of application container implementations and usage.
  4. Modern Media Is a DoS Attack on Your Free Will -- What’s happened is, really rapidly, we’ve undergone this tectonic shift, this inversion between information and attention. Most of the systems that we have in society—whether it’s news, advertising, even our legal systems—still assume an environment of information scarcity. THIS.

Continue reading Four short links: 22 November 2017.


Implementing request/response using context in Go network clients



Learn how NATS requests work and how to use the context package for cancellation.

The first two parts of this series created a general-purpose client that can subscribe to channels on a NATS server, send messages, and wait for responses. But one of the most common communication models is request/response, where two clients engage in one-to-one bidirectional communication. NATS is a pure PubSub system, meaning that everything is built on top of publish and subscribe operations. The NATS Go client supports the Request/Response model of communication by building it on top of the PubSub methods we have already developed.

Because making a request involves waiting for a response, the context package is a natural fit: it is gaining adoption precisely because it was designed for request-style APIs, providing support for deadlines, cancellation signals, and request-scoped data. Cancellation propagation is an important topic in Go because it allows us to quickly reclaim any resources that were in use as soon as the in-flight request or a parent context is canceled. If there is a blocking call in your library, users of your library can benefit from a context-aware API that lets them manage cancellation in an idiomatic way. Contexts also allow you to set a hard deadline to time out the request, and to carry request-scoped data such as a request ID, which can be useful for tracing.

Continue reading Implementing request/response using context in Go network clients.


Four short links: 21 November 2017


Storytelling, Decompilation, Face Detection, and Dependency Alerts

  1. Scrollama -- a modern and lightweight JavaScript library for scrollytelling. (via Nathan Yau)
  2. Dangers of the Decompiler -- a sampling of anti-decompilation techniques.
  3. An On-device Deep Neural Network for Face Detection (Apple) -- how the face unlock works, roughly at "technical blog post" levels of complexity.
  4. GitHub Security Alerts -- With your dependency graph enabled, we’ll now notify you when we detect a vulnerability in one of your dependencies and suggest known fixes from the GitHub community.

Continue reading Four short links: 21 November 2017.


Four short links: 20 November 2017


Ancient Data, Tech Ethics, Session Replay, and Cache Filesystem

  1. Trade, Merchants, and the Lost Cities of the Bronze Age -- We analyze a large data set of commercial records produced by Assyrian merchants in the 19th Century BCE. Using the information collected from these records, we estimate a structural gravity model of long-distance trade in the Bronze Age. We use our structural gravity model to locate lost ancient cities. (via WaPo)
  2. Tech Ethics Curriculum -- a Google sheet of tech ethics courses, with pointers to syllabi.
  3. Session Replay Scripts (Ed Felten) -- lately, more and more sites use “session replay” scripts. These scripts record your keystrokes, mouse movements, and scrolling behavior, along with the entire contents of the pages you visit, and send them to third-party servers. Unlike typical analytics services that provide aggregate statistics, these scripts are intended for the recording and playback of individual browsing sessions, as if someone is looking over your shoulder. (via BoingBoing)
  4. RubiX -- Cache File System optimized for columnar formats and object stores.

Continue reading Four short links: 20 November 2017.


Four short links: 17 November 2017


Interactive Marginalia, In-Person Interactions, Welcoming Groups, and Systems Challenges

  1. Interactive Marginalia (Liza Daly) -- wonderfully thoughtful piece about web annotations.
  2. In-Person Interactions -- Casual human interaction gives you lots of serendipitous opportunities to figure out that the problem you thought you were solving is not the most important problem, and that you should be thinking about something else. Computers aren't so good at that. So true! (via Daniel Bachhuber)
  3. Pacman Rule -- When standing as a group of people, always leave room for 1 person to join your group. (via Simon Willison)
  4. Berkeley View of Systems Challenges for AI -- In this paper, we propose several open research directions in systems, architectures, and security that can address these challenges and help unlock AI’s potential to improve lives and society.

Continue reading Four short links: 17 November 2017.


Four short links: 16 November 2017


Regulate IoT, Visualize CRISPR, Distract Strategically, and Code Together

  1. It's Time to Regulate IoT To Improve Security -- Bruce Schneier puts it nicely: internet security is now becoming "everything" security.
  2. Real-Space and Real-Time Dynamics of CRISPR-Cas9 (Nature) -- great visuals, written up for laypeople in The Atlantic. (via Hacker News)
  3. How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, not Engaged Argument -- research paper. Application to American media left as exercise to the reader.
  4. Coding Together in Real Time with Teletype for Atom -- what it says on the box.

Continue reading Four short links: 16 November 2017.


The tools that make TensorFlow productive


Analytical frameworks come with an entire ecosystem.

Deployment is a big chunk of using any technology, and tools to make deployment easier have always been an area of innovation in computing. For instance, the difficulties and uncertainties of installing software and keeping it up-to-date were one factor driving companies to offer software as a service over the Web. Likewise, big data projects present their own set of issues: how do you prepare and ingest the data? How do you view the choices made by algorithms that are complex and dynamic? Can you use hardware acceleration (such as GPUs) to speed analytics, which may need to operate on streaming, real-time data? Those are just a few deployment questions associated with deep learning.

In the report Considering TensorFlow for the Enterprise, authors Sean Murphy and Allen Leis cover the landscape of tools for working with TensorFlow, one of the most popular frameworks currently in big data analysis. They explain the importance of seeing deep learning as an integral part of a business environment—even while acknowledging that many of the techniques are still experimental—and review some useful auxiliary utilities. These exist for all of the major stages of data processing: preparation, model building, and inference (submitting requests to the model), as well as debugging.

Given that the decisions made by deep learning algorithms are notoriously opaque (it's hard to determine exactly what combinations of features led to a particular classification), one intriguing part of the report addresses the possibility of using TensorBoard to visualize what's going on in the middle of a neural network. The UI offers you a visualization of the stages in the neural network, and you can see what each stage sends to the next. Thus, some of the mystery in deep learning gets stripped away, and you can explain to your clients some of the reasons that a particular result was reached.
Another common bottleneck for many companies stems from the sizes of modern data sets, which often need help getting ingested and moved through the system. One study found that about 20% of businesses handle data sets in the range of terabytes, with smaller ranges (gigabytes) being most common and larger ones (petabytes) quite rare. For that 20% or more working with unwieldy data sets, Murphy and Leis’s report is particularly valuable, because special tools can help tie TensorFlow to the systems that feed data into its analytics, such as Apache Spark. The authors also cover options[...]

Implementing the pipes and filters pattern using actors in Akka for Java



How messages help you decouple, test, and re-use your software’s code.

We would like to introduce a couple of interesting concepts from Akka by giving an overview of how to implement the pipes and filters enterprise integration pattern. This commonly used pattern helps us flexibly compose sequences of alterations to a message. To implement it, we use Akka, a popular library that provides new approaches to writing modern reactive software in Java and Scala.

The Business problem

Recently we came across an author publishing application made available as a service. It was responsible for processing markdown text. It would execute a series of operations back to back:

Continue reading Implementing the pipes and filters pattern using actors in Akka for Java.


Nathaniel Schutta on succeeding as a software architect



The O’Reilly Programming Podcast: The skills needed to make the move from developer to architect.

In this episode of the O’Reilly Programming Podcast, I talk with Nathaniel Schutta, a solutions architect at Pivotal, and presenter of the video I’m a Software Architect, Now What?. He will be giving a presentation titled Thinking Architecturally at the 2018 O’Reilly Software Architecture Conference, February 25-28, 2018, in New York City.

Continue reading Nathaniel Schutta on succeeding as a software architect.


Modern HTTP service virtualization with Hoverfly



Service virtualization brings a lightweight, automatable means of simulating external dependencies.

In modern software systems, it’s very common for applications to depend on third party or internal services. For example, an ecommerce site might depend on a third party payment service to process card payments, or a social network to provide authentication. These sorts of applications can be challenging to test in isolation, as their dependencies can introduce problems like:

  • Non-determinism
  • Slow and costly builds
  • Unmockable client libraries
  • Rate-limiting
  • Expensive licensing costs
  • Incompleteness
  • Slow provisioning

To get around this, service virtualization replaces these components with a process that simulates them. Unlike mocking, which replaces your application code, service virtualization lives externally, typically operating at the network level. It is non-invasive and, from the perspective of its consumer, behaves essentially just like the real thing.

Continue reading Modern HTTP service virtualization with Hoverfly.