Subscribe: O'Reilly Radar - Insight, analysis, and research about emerging technologies
Added By: Feedage Forager Feedage Grade B rated
Language: English



All - O'Reilly Media

All of our Ideas and Learning material from all of our topics.

Updated: 2018-04-22T09:51:51Z


Traits you’ll find in good managers



Work with your manager to get what you need, when you need it.

Continue reading Traits you’ll find in good managers.


Four short links: 20 April 2018


Functional Programming, High-Dimensional Data, Games and Datavis, and Container Management

  1. Interview with Simon Peyton-Jones -- I had always assumed that the more bleeding-edge changes to the type system, things like type-level functions, generalized algebraic data types (GADTs), higher rank polymorphism, and existential data types, would be picked up and used enthusiastically by Ph.D. students in search of a topic, but not really used much in industry. But in fact, it turns out that people in companies are using some of these still-not-terribly-stable extensions. I think it's because people in companies are writing software that they want to still be able to maintain and modify in five years' time. SPJ is one of the principal designers of Haskell, and one of the leading thinkers in functional programming.
  2. HyperTools -- A Python toolbox for visualizing and manipulating high-dimensional data. Open source. High-dimensional = "a lot of columns in each row".
  3. What Videogames Have to Teach Us About Data Visualization -- super-interesting exploration of space, storytelling, structure, and annotations.
  4. Titus -- Netflix open-sourced their container management platform. There aren't many companies with the scale problems of Amazon, Netflix, Google, etc., so it's always interesting to see what comes out of them.

Continue reading Four short links: 20 April 2018.


Thinking beyond bots: How AI can drive social impact


A few ways to think differently and integrate innovation and AI into your company's altruistic pursuits.

What do artificial intelligence (AI), invention, and social good have in common? While on the surface they serve very different purposes, at their core, they all require you to do one thing in order to be successful at them: think differently. Take the act of inventing: in order to develop a great patent, trade secret, or other intellectual property, you need to think outside of the box. Similarly, at the heart of AI is the act of unlocking new capabilities, whether that’s making virtual personal assistants like Alexa more useful, or creating a chatbot that provides a personalized experience to customers. And because of the constantly changing economic and social landscapes, coming up with impactful social good initiatives requires you to constantly approach things through a new lens.

Individually, these fields have seen notable advancements over the past year, including new technologies that are bringing improvements to AI and large companies that are prioritizing giving back. But even more exciting is that we’re seeing more and more business leaders and nonprofits combining AI, innovation, and social good to reach communities in innovative ways, at a scale we’ve never before seen. There’s no better time than now to explore how your organization approaches its social good efforts. Here are a few ways you can think differently and integrate innovation and AI into your company’s altruistic pursuits.

Approach social good through the mind of an inventor

As a master inventor at IBM, I’m part of the team responsible for helping the company become the leading recipient of U.S. patents for the last quarter century. While developing patents and intellectual properties might not be what you’re setting out to do as part of your humanitarian efforts, the way we approach our jobs as inventors is something that can be applied across all aspects of giving back.
Consider the United Nations’ 17 Sustainable Development Goals, which aim to eradicate things like poverty, hunger, disease, and more. These are game-changing initiatives that definitely require new ideas. What’s more, the United Nations estimates that we’re $5 trillion short on resources needed to accomplish these goals. How do we bridge this gap? Well, we need to start thinking differently. Foundationally, coming up with a great invention is identifying a problem that needs to be solved and coming up with an out-of-the-box idea that’s smart, has the biggest impact, and the lowest risk. To do this, we look around us to see which relevant technologies we can use that are already at our disposal so we don’t have to completely reinvent the wheel if we don’t have to. We also identify which parts of the solution need a completely new idea to be created from scratch. Additionally, we look at the issue we’re trying to solve and the current landscape as a whole so we can predict any issues or future problems that may arise, and we try to address them ahead of time in our invention. The same approach should be applied to social good—identify the problem you want to solve, the tools that already exist that can help you solve this dilemma, and the resources that need to be created or brought in from outside properties in order to execute your plan. At the heart of social good, similar to most inventions, are the people you’re trying to help. You need to make sure you’re maximizing the reach of your project while also minimizing any risks that may unintentionally create additional problems for the people you’re trying to help. To do this, you need to be creative in your approach. As an example, this is exactly the approach InvestEd is taking (full disclosure: I am an advisor for InvestEd). They started off by realizing they could commercialize and create social good at the same time by enabling financial education and facilitating microloans for[...]

5 best practices for delivering design critiques



Real critique helps teams strengthen their designs, products, and services.

Continue reading 5 best practices for delivering design critiques.


How to run a custom version of Spark on hosted Kubernetes


Learn how Spark 2.3.0+ integrates with K8s clusters on Google Cloud and Azure.

Do you want to try out a new version of Apache Spark without waiting around for the entire release process? Does running alpha-quality software sound like fun? Does setting up a test cluster sound like work? This is the blog post for you, my friend! We will help you deploy code that hasn't even been reviewed yet (if that is the adventure you seek). If you’re a little cautious, reading this might sound like a bad idea, and often it is, but it can be a great way to ensure that a PR really fixes your bug, or the new proposed Spark release doesn’t break anything you depend on (and if it does, you can raise the alarm). This post will help you try out new (2.3.0+) and custom versions of Spark on Google/Azure with Kubernetes. Just don't run this in production without a backup and a very fancy support contract for when things go sideways.

Note: This is a cross-vendor post (Azure's Spark on AKS and Google Cloud's Custom Spark on GKE), each of which has its own vendor-specific post if that’s more your thing.

Warning: it’s important to make sure your tests don’t destroy your real data, so consider using a sub-account with lesser permissions.

Setting up your version of Spark to run

If there is an off-the-shelf version of Spark you want to run, you can go ahead and download it. If you want to try out a specific patch, you can check out the pull request to your local machine with

    git fetch origin pull/ID/head:BRANCHNAME

where ID is the PR number, and then follow the directions to build Spark (remember to include the -P components you want/need, including your cluster manager of choice).

Now that we’ve got Spark built, we will build a container image and upload it to the registry of your choice, like shipping a PXE boot image in the early 90s (bear with me, I miss the 90s).

Depending on which registry you want to use, you’ll need to point both the build tool and spark-submit to the correct location. We can do this with an environment variable: for Docker Hub, this is the name of the registry; for Azure Container Registry (ACR), this value is the ACR login server name; and for Google Container Registry, this is $PROJECTNAME.

    export REGISTRY=value

For Google Cloud users who want to use the Google-provided Docker registry, you will need to set up Docker to run through gcloud. In the bash shell, you can do this with an alias:

    shopt -s expand_aliases && alias docker="gcloud docker --"

For Azure users who want to use Azure Container Registry (ACR), you will need to grant the Azure Container Service (AKS) cluster read access to the ACR resource.

Non-Google users don’t need to wrap the Docker command; just skip that step and keep going:

    export DOCKER_REPO=$REGISTRY/spark
    export SPARK_VERSION=`git rev-parse HEAD`
    ./bin/ -r $DOCKER_REPO -t $SPARK_VERSION build
    ./bin/ -r $DOCKER_REPO -t $SPARK_VERSION push

Building your Spark project for deployment (or, optionally, starting a new one)

Spark on K8s does not automatically handle pushing JARs to a distributed file system, so we will need to upload whatever JARs our project requires to work. One of the easiest ways to do this is to turn our Spark project into an assembly JAR.

If you’re starting a new project and you have sbt installed, you can use the Spark template project:

    sbt new holdenk/sparkProjectTemplate.g8

If you have an existing SBT-based project, you can add the sbt-assembly plugin:

    touch project/assembly.sbt
    echo 'addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")' >> project/assembly.sbt

With SBT, once you have the SBT assembly plugin (either through creating a project with it included in the template or adding it to an existing one), you can produce an assembly JA[...]
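
The registry and version variables set above are what spark-submit later needs in order to find the image. As a rough Python sketch (not taken from the post), the values might be combined into a spark-submit argument list as follows. It assumes the build tool names the image "spark" under $DOCKER_REPO and tags it with $SPARK_VERSION, and it uses the Spark 2.3 configuration key spark.kubernetes.container.image. The API server address, main class, and JAR URL are hypothetical placeholders.

```python
from typing import List


def image_ref(registry: str, spark_version: str) -> str:
    """Compose the image reference.

    Assumption: the image is named "spark" under DOCKER_REPO and tagged
    with the Spark version, giving <registry>/spark/spark:<version>.
    """
    docker_repo = f"{registry}/spark"  # mirrors DOCKER_REPO=$REGISTRY/spark
    return f"{docker_repo}/spark:{spark_version}"


def spark_submit_args(api_server: str, image: str, main_class: str,
                      jar_url: str, executors: int = 2) -> List[str]:
    """Build an argument list for spark-submit in Kubernetes cluster mode."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        jar_url,  # the assembly JAR must already be reachable by the cluster
    ]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    image = image_ref("myregistry.example.com", "abc123")
    args = spark_submit_args("https://203.0.113.10:443", image,
                             "com.example.MyJob",
                             "https://storage.example.com/my-job-assembly.jar")
    print(" ".join(args))
```

Returning the arguments as a list keeps the sketch easy to inspect before anything is executed; it could be passed to subprocess.run once the values are confirmed for your cluster.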

Four short links: 19 April 2018


Free Multics, Community Relevance, Speech Synthesis, and Dandelion Data

  1. BAN.AI Multics -- free multiuser Multics (predecessor to Unix) emulation. This Multics guide will be useful.
  2. The Art of Relevance -- explores how mission-driven organizations can matter more to more people. The book is packed with inspiring examples, rags-to-relevance case studies, research-based frameworks, and practical advice on how your work can be more vital to your community. Should be read by startups (relevant to your customers?) and anyone who is trying to build a community around their software. Text available for free online, print versions still available for purchase.
  3. VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop -- We present a new neural text to speech (TTS) method that is able to transform text to speech in voices that are sampled in the wild. Unlike other systems, our solution is able to deal with unconstrained voice samples and without requiring aligned phonemes or linguistic features. The Presidential voices are impressive. Code and paper available.
  4. No Boundaries for Facebook Data -- Today we report yet another type of surreptitious data collection by third-party scripts that we discovered: the exfiltration of personal identifiers from websites through “login with Facebook” and other such social login APIs. Specifically, we found two types of vulnerabilities: seven third parties abuse websites’ access to Facebook user data; one third party uses its own Facebook “application” to track users around the web.

Continue reading Four short links: 19 April 2018.


Four short links: 18 April 2018


Open Source Slack-alike, Open Source MailChimp-alike, DeepFake PSA, and Secure Devices

  1. Zulip -- FOSS Slack-type chat.
  2. Mailtrain -- self-hosted GPLv3 MailChimp-style newsletter service that you can hook up to your favorite mail service (e.g., Mailgun).
  3. Fake News PSA -- DeepFake video of Barack Obama saying things that Obama never said (made for Buzzfeed).
  4. Seven Properties of Highly Secure Devices (Microsoft) -- Hardware-based root of trust; small trusted computing base; defense in depth; compartmentalization; certificate-based authentication; renewable security; failure reporting.

Continue reading Four short links: 18 April 2018.


From USENET to Facebook: The second time as farce


Demanding and building a social network that serves us and enables free speech, rather than serving a business metric that amplifies noise, is the way to end the farce.

Re-interpreting Hegel, Marx said that everything in history happens twice, the first time as tragedy, the second as farce. That’s a fitting summary of Facebook’s Very Bad Month. There’s nothing here we haven’t seen before, nothing about abuse, trolling, racism, spam, porn, and even bots that hasn’t already happened. This time as farce? Certainly Zuckerberg’s 14-year Apology Tour, as Zeynep Tufekci calls it, has the look and feel of a farce. He just can’t stop apologizing for Facebook’s messes.

Except that the farce isn’t over yet. We’re in the middle of it. As Tufekci points out, 2018 isn’t the first time Zuckerberg has said “we blew it, we’ll do better.” Apology has been a roughly biennial occurrence since Facebook’s earliest days. So, the question we face is simple: how do we bring this sad history to an endpoint that isn’t farce? The third time around, should there be one, it isn’t even farce; it’s just stupidity. We don’t have to accept future apologies, whether they come from Zuck or some other network magnate, as inevitable.

I want to think about what we can learn from the forerunners of modern social networks, specifically USENET, the proto-internet of the 1980s and 90s. (The same observations probably apply to BBSs, though I’m less familiar with them.) USENET was a decentralized and unmanaged system that allowed Unix users to exchange “posts” by sending them to hundreds of newsgroups. It started in the early 80s, peaked sometime around 1995, and arguably ended as tragedy (though it went out with a whimper, not a bang).

As a no-holds-barred Wild West sort of social network, USENET was filled with everything we rightly complain about today. It was easy to troll and be abusive; all too many participants did it for fun.
Most groups were eventually flooded by spam, long before spam became a problem for email. Much of that spam distributed pornography or pirated software (“warez”). You could certainly find newsgroups in which to express your inner neo-Nazi or white supremacist self. Fake news? We had that; we had malicious answers to technical questions that would get new users to trash their systems. And yes, there were bots; that technology isn’t as new as we’d like to think. But there was a big divide on USENET between moderated and unmoderated newsgroups. Posts to moderated newsgroups had to be approved by a human moderator before they were pushed to the rest of the network. Moderated groups were much less prone to abuse. They weren’t immune, certainly, but moderated groups remained virtual places where discussion was mostly civilized, and where you could get questions answered. Unmoderated newsgroups were always spam-filled and frequently abusive, and the alt.* newsgroups, which could be created by anyone, for any reason, matched anything we have now for bad behavior. So, the first thing we should learn from USENET is the importance of moderation. Fully human moderation at Facebook scale is impossible. With seven billion pieces of content shared per day, even a million moderators would have to scan seven thousand posts each: roughly 4 seconds per post. But we don’t need to rely on human moderation. After USENET’s decline, research showed that it was possible to classify users as newbies, helpers, leaders, trolls, or flamers, purely by their communications patterns—with only minimal help from the content. This could be the basis for automated moderation assistants that kick suspicious posts over to human moderators, who would then have the final word. Whether automated or human, moderators prevent many of the bad posts from being made in the first place. It’s no fun being a troll[...]
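
The moderation arithmetic above can be checked with a short Python sketch. The figures for daily content and number of moderators come from the text; the eight-hour shift is an assumption made here, since the text does not say how long a moderator works per day.

```python
def seconds_per_post(posts_per_day: int, moderators: int,
                     shift_hours: float = 8.0) -> float:
    """Average time a moderator can spend on each post.

    shift_hours is an assumption (an eight-hour working day);
    the article does not state it.
    """
    posts_per_moderator = posts_per_day / moderators
    return shift_hours * 3600 / posts_per_moderator


if __name__ == "__main__":
    per_moderator = 7_000_000_000 / 1_000_000
    print(f"posts per moderator per day: {per_moderator:.0f}")
    print(f"seconds per post: {seconds_per_post(7_000_000_000, 1_000_000):.1f}")
```

Under that assumption each moderator sees 7,000 posts in 28,800 seconds, which is consistent with the "roughly 4 seconds per post" figure.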

Four short links: 17 April 2018


Dubsteganography, Parsing History, Hackin' the Jack In, and Model Bias

  1. Hide Data in Dubstep Drops -- the blog post shows how to use it. Skrillex meets steganography!
  2. Parsing Timeline -- wonderfully detailed, yet it reads almost chatty. Interesting and informative.
  3. Securing Wireless Neurostimulators -- a hack and discussion of the risk of insecure implantable medical devices that interface with the brain. (via Paper a Day)
  4. Text Embedding Models Contain Bias (Google) -- great to see this making its way to research outputs, instead of being the province of damage control and bad PR. The Developers section of the Semantic Experiences microsite talks about "unwanted associations": In Semantris, the list of words we're showing are hand curated and reviewed. To the extent possible, we've excluded topics and entities that we think particularly invite unwanted associations, or can easily complement them as inputs. In Talk to Books, while we can't manually vet each sentence of 100,000 volumes, we use a popularity measure which increases the proportion of volumes that are published by professional publishing houses. There are additional measures that could be taken. For example, a toxicity classifier or sensitive topics classifier could determine when the input or the output is something that may be objectionable or party to an unwanted association. We recommend taking bias-impact mitigation steps when crafting end-user applications built with these models.

Continue reading Four short links: 17 April 2018.


Relato: Turking the business graph


A failed analytics startup post-mortem.

In order to conquer a market, you must first understand it. We often speak of markets in the abstract, as addressable segments of the economy, defining them by examples of companies and by comparisons to others engaged in similar activities. Sales and marketing leaders have richer internal models of markets they use to guide their organizations as they fight for their share of the markets they contest. In January 2015, I set out to build an external representation of a market every bit as rich as those in the minds of leading executives driving successful companies; I founded an analytics startup called Relato, a startup that, unfortunately, did not succeed. In this post, I’ll present the story of the company and the work I did there, the entrepreneurship and network science involved in my work, and some insight into how not to run a young analytics startup, and a little about how to do so as well.

My mission with Relato was to build a deeper understanding of the modern networked economy, a vast network in which companies are best defined according to their business relationships with other companies. When it comes to understanding companies, it’s “who you know.” These relationships translate to connections in the business graph, made up of connections between customer, partner, competitor, and investor.

Mission: Mapping markets

I started Relato with an experiment to see how much market intelligence I could gather from the business web. Having worked at LinkedIn, I missed their social graph. I wondered, “Could a copy of the business graph be collected from the open web?” The answer to that research question is what led me to found Relato. I started by surveying the state of the market for data on companies. I discovered that while basic firmographic data was available (things like address, industry code, website technologies), there was nothing that captured the actual business activity of companies.
By contrast, when I surveyed the websites of businesses, I found a treasure trove of information about their business relationships with other companies. Starting with a market I knew, big data, I manually transcribed the partnership pages of the major players: Hortonworks, Cloudera, MapR, and Pivotal. The combined list came to hundreds of companies, not a bad survey of the big data market.

Figure 1. Where it all started: Hortonworks’ partnership page. Screenshot by Russell Jurney.

I saw opportunity! I could transcribe the companies listed on partnership pages to learn how companies actually did business. Then I could use graph analytics on this data to extract next-generation profiles on companies. This data could be used in lead scoring and lead generation systems to provide a breakthrough in their level of performance. In short, I could provide leads for enterprise customers that would convert to sales at a rate never before seen! I got excited. What if sales calls only came to people who wanted your product, because Relato told you so? I could optimize the economy and change the world! I was inspired.

I figured out roughly how I could collect this data using natural language processing, an area that I know a little bit about but that is not one of my core skills. Building this model and making it good enough to be saleable would take at least a year. I did not have a year of cash to burn in the bank.

Fortunately, there was a faster alternative. There was a shift in “big data” to hybrid human/machine processing, where humans located in places where wages are low would perform many instances of simple tasks to cheaply create data sets for machine learning systems. Using this method, there would ultimately be a higher cost per record collected because I would be paying real humans wages, but t[...]
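
The idea of a business graph built from transcribed partnership pages can be illustrated with a small Python sketch. This is not Relato's implementation; the company names and edges below are hypothetical placeholders, and counting partners is only the simplest possible stand-in for the "graph analytics" the author mentions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (source company, relationship type, target company)
Edge = Tuple[str, str, str]
Graph = Dict[str, Set[Tuple[str, str]]]


def build_graph(edges: Iterable[Edge]) -> Graph:
    """Store each relationship on both companies (treated as undirected)."""
    graph: Graph = defaultdict(set)
    for source, relation, target in edges:
        graph[source].add((relation, target))
        graph[target].add((relation, source))
    return dict(graph)


def partners(graph: Graph, company: str) -> Set[str]:
    """Companies linked to `company` by a partner relationship."""
    return {other for rel, other in graph.get(company, set()) if rel == "partner"}


def partner_counts(graph: Graph) -> Dict[str, int]:
    """Number of partner links per company, a crude connectedness measure."""
    return {company: len(partners(graph, company)) for company in graph}


def shared_partners(graph: Graph, a: str, b: str) -> Set[str]:
    """Partners that two companies have in common."""
    return partners(graph, a) & partners(graph, b)


if __name__ == "__main__":
    # Hypothetical toy data, not real companies or relationships.
    edges = [
        ("VendorA", "partner", "IntegratorX"),
        ("VendorB", "partner", "IntegratorX"),
        ("VendorA", "partner", "IntegratorY"),
        ("InvestorZ", "investor", "VendorA"),
    ]
    graph = build_graph(edges)
    print(partner_counts(graph))
    print(shared_partners(graph, "VendorA", "VendorB"))
```

Keeping the relationship type on every edge lets the same structure hold customer, partner, competitor, and investor links, the four connection types the text lists.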

The eight rules of good documentation


Like good code, good documentation is difficult and time consuming to write.

Imagine for a moment two common scenarios in the life of a web developer.

In the first scenario, meet Harlow. Today is Harlow's first day on a new project. The team has a well-established codebase, a great working environment, and a robust test suite. As Harlow sits down at her desk, she's excited to get up to speed with the team. After the morning stand-up meeting she's pointed to the project's documentation for installation with a slight grimace from her colleague Riley. He mentions that the docs "might be a little out of date, but should hopefully be enough to get you going." Harlow then spends the rest of the day following the documentation until she gets stuck, at which point she is forced to dig through code or ask colleagues for guidance. What might have taken a few minutes becomes a day-long exercise in frustration, tempering Harlow's initial excitement.

In the second scenario, meet Harrison. He's working on a web app and finds a library that, at first glance, seems incredibly useful for his project. As he attempts to integrate it with his codebase he discovers that parts of the API seem to be glossed over in the documentation or even undocumented. In the end, he walks away from the project in favor of another solution.

Though these scenarios may be slightly exaggerated, I'm reasonably certain that many of us can relate. These problems were not primarily caused by low-quality code, but rather by poor documentation. If useful documentation is so important to the success of projects and developer well-being, why don't all projects have it? The answer, I believe, is that like good code, good documentation is difficult and time consuming to write.
In my eyes, there are eight rules that we can follow to produce good documentation:

  1. Write documentation that is inviting and clear
  2. Write documentation that is comprehensive, detailing all aspects of the project
  3. Write documentation that is skimmable
  4. Write documentation that offers examples of how to use the software
  5. Write documentation that has repetition, when useful
  6. Write documentation that is up-to-date
  7. Write documentation that is easy to contribute to
  8. Write documentation that is easy to find

The most important rule of good documentation is for it to be as inviting as possible. This means that we should aim to write it in the clearest terms possible without skipping over any steps. We should avoid making assumptions about what our users may know. Sometimes this can seem to be overkill, and we may be tempted to say something like "every X developer knows about Y," but we each bring our own background and set of experiences to a project. Though this may result in more verbose documentation, it is ultimately simpler, as there is less guesswork involved for developers with all levels of experience.

Documentation should aim to be comprehensive. This means that all aspects of the project are documented. Undocumented features or exceptions can lead to frustration and become a time suck as users and other developers are forced to read through code to find the answers they need. Fully documenting all features takes away this kind of ambiguity.

When we write documentation that is skimmable, we help users find the content they need quickly. Making documentation skimmable can be accomplished by using clear headings, bulleted lists, and links. For large project documentation, a table of contents or clear navigation will help users to skip straight to what they need, rather than scrolling through a single long document.

Documentation that features examples allows users to see how they might use the code themselves.
Aim to provide examples of the most common use cases for the pro[...]

Stephen Gates on the growing risks posed by malicious bots


The O’Reilly Podcast: Protecting your organization against current and future threats.

In this episode of the O’Reilly podcast, I spoke with Stephen Gates of Oracle Dyn. Gates joined the Oracle Dyn Global Business Unit from Zenedge, the web application security company recently acquired by Oracle. Gates and I discussed how growing malicious bot activity impacts organizations.

Here are some highlights:

The rise of malicious bots

One of the factors driving the current proliferation of malicious bots is the Mirai malware. It works by using a list of default usernames and passwords (from previous data breaches) to take control of IoT devices. One key differentiator with Mirai is that it’s self-propagating: each infected device has the ability to scan the internet to find similar devices and subsequently infect them. This has also spurred other self-propagating, copycat malware.

Another key factor driving malicious bot growth is the increase in malware that focuses on exploiting vulnerabilities (versus relying on usernames and passwords). This malware automates the process of scanning IoT devices for known software vulnerabilities, which are quite common in such devices, and then exploits those vulnerabilities to infect them.

The volume of attack traffic these devices can generate is a huge differentiator because many of these devices have access to pretty sizeable CPUs, and they've got access to a lot of bandwidth. As a result, we're seeing DDoS attacks being launched by these botnets in excess of 1.5 terabits per second. That's enough traffic to take a small country offline.

Mitigating malicious bot threats

To manage the risks malicious bots pose, organizations need to be aware that, realistically, bots represent a significant portion of site and application traffic. They must understand the threats against their specific business models and recognize the need to build systems that can differentiate between good bot traffic, bad bot traffic, and human activity. Sites and applications must allow good bots to continue performing critical activities (scraping data for Google search queries, for example), but also mitigate malicious bot activity. Done poorly, you could reduce the effectiveness of your SEO, or worse yet, block paying customers from your sites or applications.

An effective DDoS incident response plan includes detection and mitigation of these bot-driven attacks. Defenses should be layered, including cloud-based defenses for volumetric attacks, and most likely web application firewalls for the more measured attacks. Without having a DDoS response plan in place, you’re a sitting duck and effectively just waiting for one of these attacks to take your organization offline.

Preparing for the malicious bot attacks of the future

The malicious bot threat landscape will continue to evolve rapidly. To prepare, organizations must embrace advanced data capabilities, such as AI and supervised machine learning, to detect and defeat sophisticated malicious bot attacks. Additionally, businesses need to focus on hiring security intelligence analysts. Embracing AI and machine learning is only helpful if we have the capabilities to analyze the output. Accordingly, security analyst skills will be in very high demand.

This post is a collaboration between O'Reilly and Oracle Dyn. See our statement of editorial independence.

Continue reading Stephen Gates on the growing risks posed by malicious bots.

Simon Moss on using artificial intelligence to fight financial crimes



Innovations that increase detection of, and response to, criminal attacks of financial systems.

In this episode of the O’Reilly Podcast, I talk with Simon Moss, vice president of industry consulting and solutions, Americas, at Teradata. We discuss how machine learning and deep learning techniques are being used to fight financial crimes, such as credit card fraud, identity theft, health care fraud, and money laundering.


Discussion points:

  • Moss says AI techniques are “a new set of weapons” against perpetrators of financial crime. “If we use them right, we can finally at least slow down the constant tide of financial crime.”
  • AI can be more effective than traditional methods of combatting identity fraud, health care fraud, and money laundering, because for those issues, he explains, “you’re not looking for a needle in a haystack. You’re looking for a needle in a stack of needles. You are trying to find individuals whose nefarious activity is disguised and hidden in pure normality, by completely innocuous activity.”
  • Moss compares the use of machine learning techniques to a rules engine: “a rules engine looks for behaviors that have already happened, whereas machine learning is trying to connect different bread crumbs. It’s running multiple scenarios at the same time to try to look at the problem from multiple different angles.”
  • Machine learning can add efficiency to the detection process: “It can take multiple data sources, map the data to a case, analyze it, and then in seconds, make a decision on whether it’s a false positive, whether it’s normal business activity, whether it’s something that needs further investigation, or whether it is outright criminality,” Moss says.


This post is a collaboration between Teradata and O’Reilly. See our statement of editorial independence.

Continue reading Simon Moss on using artificial intelligence to fight financial crimes.


Four short links: 16 April 2018


Light-Powered Camera, Government Blogging, TensorFlow, and Metanotation

  1. Light-Powered Camera -- prototype gets 15 frames/second, no external power. The light is used for both image sensing and solar power.
  2. Government Blogs and Government Bloggers (Public Strategist) -- the blogging spectrum 2x2 is solid and explains why government blogs are often about prototypes, not operations.
  3. Introducing TensorFlow.js -- an open source library you can use to define, train, and run machine learning models entirely in the browser, using JavaScript and a high-level layers API.
  4. It's Time for a New Old Programming Language (YouTube) -- Guy L. Steele Jr.'s talk about the Computer Science Metanotation that CS papers use to indicate programs without having to use a specific programming language. This is one for your inner CS meta-nerd.

Continue reading Four short links: 16 April 2018.


Four short links: 13 April 2018


Compositing, Exfiltrating, Listening, and Munging

  1. Deep Painterly Harmonisation -- composite and preserve the style of the destination image. The examples are impressive.
  2. PowerHammer: Exfiltrate Data Over Power Lines -- In this case, a malicious code running on a compromised computer can control the power consumption of the system by intentionally regulating the CPU utilization. Data is modulated, encoded, and transmitted on top of the current flow fluctuations, and then it is conducted and propagated through the power lines.
  3. Learn To Listen At The Cocktail Party -- We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality.
  4. prototool -- a Swiss Army Knife for protocol buffers.

Continue reading Four short links: 13 April 2018.


Jupyter is where humans and data science intersect


Discover how data-driven organizations are using Jupyter to analyze data, share insights, and foster practices for dynamic, reproducible data science.

I'm grateful to join Fernando Pérez and Brian Granger as a program co-chair for JupyterCon 2018. Project Jupyter, NumFOCUS, and O'Reilly Media will present the second annual JupyterCon in New York City August 21–25, 2018.

Timing for this event couldn't be better. The human side of data science, machine learning/AI, and scientific computing is more important than ever. This is seen in the broad adoption of data-driven decision-making in human organizations of all kinds, the increasing importance of human-centered design in tools for working with data, the urgency for better data insights in the face of complex socioeconomic conditions worldwide, as well as dialogue about the social issues these technologies bring to the fore: collaboration, security, ethics, data privacy, transparency, propaganda, etc.

To paraphrase our co-chairs, Brian Granger: Jupyter is where humans and data science intersect. And Fernando Pérez: The better the technology, the more important that human judgement becomes.

Consequently, we'll explore three main themes at JupyterCon 2018:

  • Interactive computing with data at scale: the technical best practices and organizational challenges of supporting interactive computing in companies, universities, research collaborations, etc. (JupyterHub)
  • Extensible user interfaces for data science, machine learning/AI, and scientific computing (JupyterLab)
  • Computational communication: taking the artifacts of interactive computing and communicating them to different audiences

A meta-theme that ties these together is extensible software architecture for interactive computing with data. Jupyter is built on a set of flexible, extensible, and re-usable building blocks that can be combined and assembled to address a wide range of usage cases. These building blocks are expressed through the various open protocols, APIs, and standards of Jupyter.

The Jupyter community has much to discuss and share this year. For example, success stories such as the data science program at UC Berkeley illustrate the power of JupyterHub deployments at scale in education, research, and industry. As universities and enterprise firms learn to handle the technical challenges of rolling out hands-on, interactive computing at scale, a cohort of organizational challenges come to the fore: practices regarding collaboration, security, compliance, data privacy, ethics, etc. These points are especially poignant in verticals such as health care, finance, and education, where the handling of sensitive data is rightly constrained by ethical and legal requirements (HIPAA, FERPA, etc.). Overall, this dialogue is extremely relevant—it is happening at the intersection of contemporary political and social issues, industry concerns, new laws (GDPR), the evolution of computation, plus good storytelling and communication in general—as we'll explore with practitioners throughout the conference.

The recent beta release of JupyterLab embodies the meta-theme of extensible software architecture for interactive computing with data. While many people think of Jupyter as a "notebook," that's merely one building block needed for interactive computing with data. Other building blocks include terminals, file browsers, LaTeX, markdown, rich outputs, text editors, and renderers/viewers for different data formats. JupyterLab is the next-generation user interface for Project Jupyter, and provides these different building blocks in a flexible, configurable, customizable environment. This opens the door[...]

The importance of transparency and user control in machine learning



The O’Reilly Data Show Podcast: Guillaume Chaslot on bias and extremism in content recommendations.

In this episode of the Data Show, I spoke with Guillaume Chaslot, an ex-YouTube engineer and founder of AlgoTransparency, an organization dedicated to helping the public understand the profound impact algorithms have on our lives. We live in an age when many of our interactions with companies and services are governed by algorithms. At a time when their impact continues to grow, there are many settings where these algorithms are far from transparent. There is growing awareness about the vast amounts of data companies are collecting on their users and customers, and people are starting to demand control over their data. A similar conversation is starting to happen about algorithms—users are wanting more control over what these models optimize for and an understanding of how they work.

I first came across Chaslot through a series of articles about the power and impact of YouTube on politics and society. Many of the articles I read relied on data and analysis supplied by Chaslot. We talked about his work trying to decipher how YouTube’s recommendation system works, filter bubbles, transparency in machine learning, and data privacy.

Continue reading The importance of transparency and user control in machine learning.


Four short links: 12 April 2018


Probabilistic Programming, Bad Copyright, Technical Debt, and Video Data Set

  1. TensorFlow Probability -- a probabilistic programming toolbox for machine learning.
  2. European Copyright Law Isn't Great. It Could Soon Get a Lot Worse. (EFF) -- The practical effect of this could be to make it impossible for a news publisher to publish their stories for free use, for example by using a Creative Commons license. (via BoingBoing)
  3. A Taxonomy of Technical Debt -- you can argue about whether his categories are your categories, but it's useful to have words for the nuance.
  4. Moments in Time Data Set -- A large-scale data set for recognizing and understanding action in videos. (via MIT News)

Continue reading Four short links: 12 April 2018.


Using qualitative and quantitative data to design better user experiences



It's important to know the "why" in addition to the "what" in UX design.

Continue reading Using qualitative and quantitative data to design better user experiences.


Probing the pill box: Repurposing drugs for new treatments



As the cost of developing new drugs rises, researchers are investigating new ways to use existing medicines

With the development of new medicines often exceeding 10 years and topping $1 billion, faster and cheaper methods of drug development are sorely needed. Drug repurposing, whereby drugs approved for one condition are used to treat a completely different disease, is a fast-emerging strategy aimed at circumventing this daunting pipeline. Several biotechnology companies now specialize in identifying new drug-disease pairs by interrogating massive biomedical datasets with ever-evolving artificial intelligence algorithms. This systematic approach to repurposing, augmenting the chance application of drugs to new diseases, represents a powerful tool for producing new medicines. Indeed, major drug companies such as GlaxoSmithKline have committed to probing “dark data,” or failed trial data, to achieve this goal.

The Promise of Drug Repositioning

The time and money required to bring a new drug to market severely limit the number of new treatments, in part explaining the steady decline in the number of approved drugs entering use each year. The drug development process, overseen in the United States by the Food and Drug Administration (FDA), has three main stages. Candidate compounds, which are being generated in abundance by discovery science and have established mechanisms of action, safe dosages, and a pharmacokinetic profile, begin in Phase I trials. In this trial stage, the safety profile of the compound is established, usually in healthy volunteers. Promising candidates are selected for Phase II trials, which determine the effectiveness of the compound at treating a particular disease or indication in a small sample group. Positive candidates are admitted to a Phase III trial, which takes place in a larger patient population, where the efficacy of the drug and its side-effects are fully assessed.

Continue reading Probing the pill box: Repurposing drugs for new treatments.


Data engineers vs. data scientists


The two positions are not interchangeable—and misperceptions of their roles can hurt teams and compromise productivity.

It’s important to understand the differences between a data engineer and a data scientist. Misunderstanding or not knowing these differences is making teams fail or underperform with big data.

A key misunderstanding concerns the strengths and weaknesses of each position. I think some of these misconceptions come from the diagrams that are used to describe data scientists and data engineers.

Figure 1. Overly simplistic Venn diagram with data scientists and data engineers. Illustration by Jesse Anderson.

Venn diagrams like Figure 1 oversimplify the complex positions and how they’re different. They make the two positions seem interchangeable. Yes, both positions work on big data. However, what each position does to create value or data pipelines with big data is very different. This difference comes from the base skills of each position.

What are data scientists and data engineers?

When I work with organizations on their team structures, I don’t use a Venn diagram to illustrate the relationship between a data engineer and a data scientist. I draw the diagram as shown in Figure 2.

Figure 2. Diagram showing the core competencies of data scientists and data engineers and their overlapping skills. Illustration by Jesse Anderson and the Big Data Institute.

Data scientists’ skills

At their core, data scientists have a math and statistics background (sometimes physics). Out of this math background, they’re creating advanced analytics. On the extreme end of this applied math, they’re creating machine learning models and artificial intelligence.

Just like their software engineering counterparts, data scientists will have to interact with the business side. This includes understanding the domain well enough to draw insights. Data scientists are often tasked with analyzing data to help the business, and this requires a level of business acumen. Finally, their results need to be given to the business in an understandable fashion. This requires the ability to verbally and visually communicate complex results and observations in a way that the business can understand and act on.

My one-sentence definition of a data scientist is: a data scientist is someone who has augmented their math and statistics background with programming to analyze data and create applied mathematical models.

A common data scientist trait is that they’ve picked up programming out of necessity to accomplish what they couldn’t do otherwise. When I talk to data scientists, this is a common thing they tell me. In order to accomplish a more complicated analysis or because of an otherwise insurmountable problem, they learned how to program. Their programming and system creation skills aren’t at the levels that you’d see from a programmer or data engineer—nor should they be.

Data engineers’ skills

At their core, data engineers have a programming background. This background is generally in Java, Scala, or Python. They have an emphasis or specialization in distributed systems and big data. A data engineer has advanced programming and system creation skills.

My one-sentence definition of a data engineer is: a data engineer is someone who has specialized their skills in creating software solutions around big data.

Using these engineering skills, they create data pipelines. Creating a data pipeline may sound easy or trivial, but at big data scale, this means bringing together 10-30 different big [...]

Four short links: 11 April 2018


Assignment, Warranties, Data, and Public Goods

  1. Why Does "=" Mean Assignment -- marvellous history lesson.
  2. Warranty Void if Removed Stickers Are Bull -- Federal law says you can repair your own things, and manufacturers cannot force you to use their own repair services. (via BoingBoing)
  3. TXR -- a pattern language and a Lisp variant for data problems.
  4. Roman Roads and Persistence in Development -- In some ways, the emergence of the Roman road network is almost a natural experiment—in light of the military purpose of the roads, the preferred straightness of their construction, and their construction in newly conquered and often undeveloped regions. This type of public good seems to have had a persistent influence on subsequent public good allocations and comparative development. At the same time, the abandonment of the wheel shock in MENA appears to have been powerful enough to cause that degree of persistence to break down. Overall, our analysis suggests that a public good provision is a powerful channel through which persistence in comparative development comes about. I wonder whether this kind of analysis is even conceivable with internet public policy like broadband, coding classes, and laws. (via BoingBoing)

Continue reading Four short links: 11 April 2018.


4 things business leaders should know as they explore AI and deep learning


Our survey reveals how organizations are using tools, techniques, and training to apply AI through deep learning.

We’re at an exciting point with artificial intelligence (AI). Years of research are yielding tangible results, specifically in the area of deep learning. New projects and related technologies are blossoming. Enthusiasm is high.

Yet the path toward real and practical application of AI and deep learning remains unclear for many organizations. Business and technology leaders are searching for clarity. Where do I start? How can I train my teams to perform this work? How do I avoid the pitfalls?

We conducted a survey[1] to help leaders better understand how organizations are applying AI through deep learning and where they’re encountering the biggest obstacles. We identified four notable survey findings that apply to organizations.

1. There’s an AI skills gap

Of particular note is an AI skills gap revealed in the survey. 28% of respondents are using deep learning now and 54% say it will play a key role in their future projects. Who will do this work? AI talent is scarce, and the increase in AI projects means the talent pool will likely get smaller in the near future.

2. Companies are addressing the AI skills gap through training

Deep learning remains a relatively new technique, one that hasn’t been part of the typical suite of algorithms employed by industrial data scientists. So, it’s no surprise that the main factor holding companies back from trying deep learning is the skills gap. To overcome this gap, a majority (75%) of respondents said their company is using some form of in-house or external training program. Almost half (49%) of respondents said their company offered “in-house on-the-job training.” 35% indicated their company used either formal training from a third party or from individual training consultants or contractors.

3. Initial deep learning projects often focus on safe upgrades

The rise of deep learning can be traced to its success in computer vision, speech technologies, and game playing, but our survey shows developers and data scientists are more likely to use it to work with structured or semistructured data. Why? There are good reasons. Upgrading familiar applications with deep learning is a safer investment than starting something new, businesses have a lot of structured and semistructured data already, and the number of businesses that can currently make use of computer vision (to say nothing of gaming) is limited. That said, our respondents see value in vision technology, and new deep learning applications for vision will grow in tandem with text and semistructured data.

4. TensorFlow is the most popular deep learning tool

Most respondents (73%) said they’ve begun playing with deep learning software. TensorFlow is by far the most popular tool among our respondents, with Keras in second place, and PyTorch in third. Other frameworks like MXNet, CNTK, and BigDL have growing audiences as well. We expect all of these frameworks—including those that are less popular now—to continue to add users and use cases.

Looking for more insight? Download our free report, "How companies are putting AI to work through deep learning," for full findings from our AI and deep learning survey. We'll also explore these and related AI topics at Artificial Intelligence Conference in New York, April 29-May 2, 2018.

[1] In early 2018 we conducted a survey of subs[...]

Strong feedback loops make strong software teams


Enhance overall code quality through a blend of interpersonal communication and tool-based analysis.

Software quality takes time. And good quality products come from properly working feedback loops. Timely feedback can mean clarity over confusion; a validation of assumptions can mean shorter development cycles.

For example, let’s say you have a project that needs to be delivered next month, but you and your development team know it will take at least two more months to complete. How do you communicate this to key stakeholders?

First off, you need to establish a shared understanding of goals and quality amongst all involved participants. As a developer, you tend to base your behavior and build products and architectures around values and assumptions. If these values and assumptions are not aligned and validated, you will never end up with what you intended—let alone on time and within budget.

Assuming your assumptions are accurate, you get carried away and spend way too much time on something before gathering feedback. But honestly, when would you rather hear all of your effort was a waste: after you spent a day working on it, or after working on it for a week?

A feedback loop is straightforward: it uses its output as one of its inputs. In its simplest form, a developer changes a code base and then gets feedback from the system by unit testing. This feedback will now be input for the developer’s next steps to improve the code.

However, reality is not that simple. Plus, humans have an irrepressible tendency to include as many people as possible in one loop. If you follow such a course, you’ll end up with feedback chaos: massive “loops” including every potential player make it impossible to control, validate assumptions, and create a shared sense of reality. Quite simply, there’s too much going on.

But there’s a solution: reflection. Reflection helps you identify existing feedback loops and determine who needs to be included. The shorter the feedback loop, the better.

There are two forms of feedback: personal and tool based. Personal feedback is given on an interpersonal level—people discussing code, products, or processes and identifying where things can be improved. Tool-based feedback, such as static analysis, provides you with code-level feedback and tells you where to improve your code (or specific parts of your code) to increase quality.

Personal feedback is often specific to projects, more sensitive to context, and offers concrete suggestions to implement. Tool-based feedback enables faster feedback loops, allows for scalability by iteration, and is more objective.

But which form of feedback is better? There is a false dichotomy between full automation and human intervention. Successful quality control combines tool-based measurement with manual review and discussion. At the end of the day, the most effective feedback loops are a mixture of daily best practices, automation, tools, and human intervention.

In an upcoming follow-up post, I’ll discuss specific practices that integrate personal and tool-based feedback. These practices will help you bolster your code and architectural quality.

This post is a collaboration between O'Reilly and SIG. See our statement of editorial independence.

Continue reading Strong feedback loops make strong software teams.[...]
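
The simplest loop described in this post (change the code, run the unit tests, act on the result) might look something like the sketch below. It uses Python's standard `unittest` module; the `slugify` function and its tests are hypothetical stand-ins for real project code and do not come from the original post.

```python
import unittest


def slugify(title: str) -> str:
    """Hypothetical function under development: turn a title into a URL slug."""
    return "-".join(title.lower().split())


class SlugifyTest(unittest.TestCase):
    def test_words_are_joined_with_hyphens(self):
        self.assertEqual(slugify("Strong Feedback Loops"), "strong-feedback-loops")

    def test_extra_whitespace_is_collapsed(self):
        self.assertEqual(slugify("  fast   feedback "), "fast-feedback")


def run_feedback_loop() -> bool:
    """Run the tests once and report whether the latest change passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SlugifyTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


# The boolean is the loop's output; it becomes input for the next change.
passed = run_feedback_loop()
```

Keeping the test run this small is one way to keep the loop short: the developer learns within seconds whether the last change held up.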

Four short links: 10 April 2018


Deep Learning Learnings, Reverse Engineering WhatsApp, Database Client, and Social Science

  1. Lessons Learned Reproducing a Deep Reinforcement Learning Paper -- REALLY good retrospective on eight months reproducing a paper, with lots of lessons learned, like starting a reinforcement learning project, you should expect to get stuck like you get stuck on a math problem. It’s not like my experience of programming in general so far where you get stuck but there’s usually a clear trail to follow and you can get unstuck within a couple of days at most. It’s more like when you’re trying to solve a puzzle, there are no clear inroads into the problem, and the only way to proceed is to try things until you find the key piece of evidence or get the key spark that lets you figure it out.
  2. Reverse Engineering WhatsApp -- This project intends to provide a complete description and re-implementation of the WhatsApp Web API, which will eventually lead to a custom client. WhatsApp Web internally works using WebSockets; this project does as well.
  3. DatabaseFlow -- an open source self-hosted SQL client, GraphQL server, and charting application that works with your database. Visualize schemas, query plans, charts, and results. You can run Database Flow locally for your own use, or install to a shared server for the whole team.
  4. Code and Data for the Social Sciences -- This handbook is about translating insights from experts in code and data into practical terms for empirical social scientists.

Continue reading Four short links: 10 April 2018.


Four short links: 9 April 2018


Monads, GDPR, Blockchain, and Search

  1. What We Talk About When We Talk About Monads -- This paper is not a monad tutorial. It will not tell you what a monad is. Instead, it helps you understand how computer scientists and programmers talk about monads and why they do so.
  2. Publishers and GDPR -- a nice explanation of what GDPR is bringing to companies like Facebook and Google, how it's changing ad-serving, and what it means for content publishers.
  3. Blockchain is Not Only Crappy Technology But a Bad Vision for the Future -- There is no single person in existence who had a problem they wanted to solve, discovered that an available blockchain solution was the best way to solve it, and therefore became a blockchain enthusiast.
  4. Typesense -- open source typo tolerant search engine that delivers fast and relevant results out of the box.

Continue reading Four short links: 9 April 2018.


Four short links: 6 April 2018


Library Management, Flame Graphs, Silent Speech Interface, and Cloud Backup

  1. Thou Shalt Not Depend on Me (ACM) -- with 37% of websites using at least one known vulnerable library, and libraries often being included in quite unexpected ways, there clearly is room for improvement in library handling on the web.
  2. FlameScope -- Netflix's open source visualization tool for exploring different time ranges as Flame Graphs. (via Netflix Tech Blog)
  3. AlterEgo: A Personalized Wearable Silent Speech Interface -- The results from our preliminary experiments show that the accuracy of our silent speech system is at par with the reported word accuracies of state-of-the-art speech recognition systems, in terms of being robust enough to be deployed as voice interfaces, albeit on smaller vocabulary sets. (via MIT News)
  4. Duplicity -- Encrypted bandwidth-efficient backup using the rsync algorithm. Common use case is backing up a server to S3, but there's an impressive number of connective services, including Google Drive, Azure, and Dropbox.

Continue reading Four short links: 6 April 2018.


It's time to usher in a new era of UX curation



Empowering groups to understand design for themselves will lead to new user experiences for users and designers alike.

Continue reading It's time to usher in a new era of UX curation.


Kyle Simpson and Tammy Everts on the challenges of the modern web



The O’Reilly Programming Podcast: Rising barriers to entry, the complexity of the modern web, and a preview of upcoming Fluent sessions.

In this episode of the O’Reilly Programming Podcast, I talk with two of the program chairs for the upcoming O’Reilly Fluent Conference (July 11-14 in San Jose), Kyle Simpson and Tammy Everts. Simpson is co-author of the HTML5 Cookbook, and the author of the You Don’t Know JS series of books. Everts is the chief experience officer at SpeedCurve and the author of Time is Money: The Business Value of Web Performance.

Continue reading Kyle Simpson and Tammy Everts on the challenges of the modern web.


Four short links: 5 April 2018


Interactive Notebooks, Molecule-making AI, Interpersonal Dynamics, and JavaScript Motion Library

  1. MyBinder -- Turn a GitHub repo into a collection of interactive notebooks. (via Julia Evans)
  2. Molecule-Making AI (Nature) -- The new AI tool, developed by Marwin Segler, an organic chemist and artificial intelligence researcher at the University of Münster in Germany, and his colleagues, uses deep learning neural networks to imbibe essentially all known single-step organic-chemistry reactions—about 12.4 million of them. This enables it to predict the chemical reactions that can be used in any single step. The tool repeatedly applies these neural networks in planning a multi-step synthesis, deconstructing the desired molecule until it ends up with the available starting reagents. (via Slashdot)
  3. Interpersonal Dynamics -- The list of common corrosive dynamics rang true: bone-deep competition; fear of being found out; my reality is not the reality; it's no fun being the squeaky wheel; feedback stays at the surface; denial that work is personal.
  4. Popmotion -- A functional, flexible JavaScript motion library.

Continue reading Four short links: 5 April 2018.


5 tips for architecting fast data applications


Considerations for setting the architectural foundations for a fast data platform.

We live in the era of the connected experience, where our daily interactions with the world can be digitized, collected, processed, and analyzed to generate valuable insights.

Back in the days of Web 1.0, Google founders figured out smart ways to rank websites by analyzing their connection patterns and using that information to improve the relevance of search results. Google was among the pioneers that created “web scale” architectures to analyze the massive data sets that resulted from “crawling” the web; that work gave birth to Apache Hadoop, MapReduce, and NoSQL databases. Those were the days when “connected” meant having some web presence, “interactions” were measured in number of clicks, and the analysis happened in batch overnight processes.

Fast forward to the present day and we find ourselves in a world where the number of connected devices is constantly increasing. These devices not only respond to our commands, but are also able to autonomously interact with each other. Each of these interactions generates data that collectively amount to high-volume data streams. Accumulating all this data to process overnight is not an option anymore. First, we want to generate actionable insights as fast as possible, and second, one night might not be long enough to process all the data collected the previous day. At the same time, our expectations as users have also evolved to the point where we demand that applications deliver personalized user experiences in near real time.

To remain competitive in a market that demands real-time responses to these digital pulses, organizations are adopting fast data applications as key assets in their technology portfolio.

There are many challenges that need to be addressed to create the right architecture to support the range of fast data applications that your enterprise needs. Here are five considerations every software architect and developer needs to take into account when setting the architectural foundations for a fast data platform.

1. Determine requirements first

Although this seems the obvious starting point of every software architecture, there are specific considerations to observe when we define the set of requirements for a software platform to support fast data applications. Data in motion can be tricky to characterize, as there are usually probabilistic factors involved in the generation, transmission, collection, and processing of messages.

These are some of the questions we need answered in order to help us drive the architecture:

  • General data shape: How large is each message? How many messages per time unit do we expect? Do we expect large changes in the frequency of message delivery? Are there peak hours? Are there “Black Friday” events in our business?
  • Output expectations: How fast do we need a result? Do we need to process each record individually? Or can we process them in small collections (micro-batch)?
  • Process tolerance: How “dirty” is the data? What do we do with “dirty” data? Drop it? Report it? Clean and reprocess it? Do I need to preserve ordering? Are there inherent time relationsh[...]
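
The choice between processing each record individually and processing small collections can be made concrete with a short sketch. This is an illustration only, assuming the messages arrive as an ordinary Python iterable; `micro_batches` is a hypothetical helper name rather than an API from any streaming framework, and the readings are made-up sample values.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(messages: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of messages into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(messages)
    while True:
        # Pull up to batch_size messages; arrival order is preserved.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Made-up stream of seven readings, processed three at a time.
readings = [3, 1, 4, 1, 5, 9, 2]
batch_sums = [sum(batch) for batch in micro_batches(readings, 3)]
```

A batch size of 1 reduces this to per-record processing. Larger batches trade latency for throughput, which is why the "how fast do we need a result?" question comes first.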

What becomes of the broken hearted? Blueprint of a donor-free world using custom heart technologies



Advances in 3D-printing technology have the potential to lower the cost and increase the availability of organ transplants.

Imagine feeling like you ran a marathon when you’re actually just getting off the couch. Imagine the extreme anxiety you might experience from living with bouts of dizziness, chest pain, and accelerated heartbeat until a doctor explains to you that these symptoms are not “nothing”, and that in fact, you have cardiomyopathy. This condition could lead to heart failure and eventually a heart transplant, but this desperately needed organ may not be available in time.

Organ transplants are in high demand in the United States. The heart is the third most requested organ, with 4,000 candidates on the waitlist and over 2,000 heart transplant surgeries performed in 2017.

Continue reading What becomes of the broken hearted? Blueprint of a donor-free world using custom heart technologies.


Four short links: 4 April 2018


Forum Software, Data Analytics, Datalog Query, and Online != High-Tech

  1. Spectrum -- open source forum software. (via announcement)
  2. MacroBase -- a data analytics tool that prioritizes attention in large data sets using machine learning [...] specialized for one task: finding and explaining unusual or interesting trends in data.
  3. datahike -- a durable database with an efficient datalog query engine.
  4. Why So Many Online Mattress Brands -- trigger for a rant: software is eating everything, but that doesn't make everything an innovative company. If you're applying the online sales playbook to product X (kombucha, mattresses, yoga mats) it doesn't make you a Level 9 game-changing disruptive TechCo, it makes you a retail business keeping up with the times. I'm curious where the next interesting bits of tech are—@gnat me with your ideas.

Continue reading Four short links: 4 April 2018.


100+ new live online trainings just launched on O'Reilly's learning platform


Get hands-on training in AWS, Python, Java, blockchain, management, and many other topics.

Develop and refine your skills with 100+ new live online trainings we opened up for April and May on our learning platform. Space is limited and these trainings often fill up.

Creating Serverless APIs with AWS Lambda and API Gateway, April 6
Getting Started with Amazon Web Services (AWS), April 19-20
Python Data Handling: A Deeper Dive, April 20
How Product Management Leads Change in the Enterprise, April 23
Beyond Python Scripts: Logging, Modules, and Dependency Management, April 23
Beyond Python Scripts: Exceptions, Error Handling, and Command-Line Interfaces, April 24
Getting Started with Go, April 24-25
End-to-End Data Science Workflows in Jupyter Notebooks, April 27
Getting Started with Vue, April 30
Java Full Throttle with Paul Deitel: A One-Day, Code-Intensive Java Standard Edition Presentation, April 30
Building a Cloud Roadmap, May 1
Git Fundamentals, May 1-2
AWS Certified SysOps Administrator (Associate) Crash Course, May 1-2
OCA Java SE 8 Programmer Certification Crash Course, May 1-3
Getting Started with DevOps in 90 Minutes, May 2
Learn the Basics of Scala in 3 hours, May 2
IPv4 Subnetting, May 2-3
SQL Fundamentals for Data, May 2-3
SAFe 4.5 (Scaled Agile Framework) Foundations, May 3
Managing Team Conflict, May 3
Hands-On Machine Learning with Python: Clustering, Dimension Reduction, and Time Series Analysis, May 3
Google Cloud Platform Professional Cloud Architect Certification Crash Course, May 3-4
Cyber Security Fundamentals, May 3-4
Advanced Agile: Scaling in the Enterprise, May 4
Network Troubleshooting Using the Half Split and OODA, May 4
Software Architecture for Developers, May 4
Hands-On Machine Learning with Python: Classification and Regression, May 4
Building and Managing Kubernetes Applications, May 7
Introducing Blockchain, May 7
Get Started with NLP, May 7
Introduction to Digital Forensics and Incident Response (DFIR), May 7
Essential Machine Learning and Exploratory Data Analysis with Python and Jupyter Notebooks, May 7-8
Building Deployment Pipelines with Jenkins 2, May 7 and 9
Introduction to Apache Spark 2.x, May 7-9
Deep Learning Fundamentals, May 8
Acing the CCNA Exam, May 8
Emotional Intelligence for Managers, May 8
Scala Core Programming: Methods, Classes, and Traits, May 8
Design Patterns Boot Camp, May 8-9
Introduction to Lean, May 9
Beginner’s Guide to Creating Prototypes in Sketch, May 9
AWS Certified Solutions Architect Associate Crash Course, May 9-10
Cloud Native Architecture Patterns, May 9-10
Amazon Web Services: Architect Associate Certification - AWS Core Architecture Concepts, May 9-11
Blockchain Applications and Smart Contracts, May 10
Deep Reinforcement Learning, May 10
Getting Started with Machine Learning, May 10
Introduction to Ethical Hacking and Penetration Testing, May 10-11
Explore, Visualize, and Predict using pandas and Jupyter, May 10-11
Scalable Web Development with Angular, May 10-11
Apache Hadoop, Spark, and Big Data Foundations,[...]

It's time to rebuild the web


The web was never supposed to be a few walled gardens of concentrated content owned by a few major publishers; it was supposed to be a cacophony of different sites and voices.

Anil Dash's "The Missing Building Blocks of the Web" is an excellent article about the web as it was supposed to be, using technologies that exist but have been neglected or abandoned. It's not his first take on the technologies the web has lost, or on the possibility of rebuilding the web, and I hope it's not his last. And we have to ask ourselves what would happen if we brought back those technologies: would we have a web that's more humane and better suited to the future we want to build? I've written several times (and will no doubt write more) about rebuilding the internet, but I've generally assumed the rebuild will need peer-to-peer technologies. Those technologies are inherently much more complex than anything Dash proposes. While many of the technologies I'd use already exist, rebuilding the web around blockchains and onion routing would require a revolution in user interface design to have a chance; otherwise it will be a playground for the technology elite. In contrast, Dash's "missing building blocks" are fundamentally simple. They can easily be used by people who don't have a unicorn's worth of experience as web developers and security administrators. Dash writes about the demise of the View Source browser feature, which displays the HTML from which the web page is built. View Source isn't dead, but it's sick. He's right that the web succeeded, in part, because people with little background could look at the source for the pages they liked, copy the code they wanted, and end up with something that looks pretty good. Today, you can no longer learn by copying; while View Source still exists on most browsers, the complexity of modern web pages has made it next to useless. The bits you want are wrapped in megabytes (literally) of JavaScript and CSS.
But that doesn't have to be the end of the story. HTML can be functional without being complex. Most of what I write (including this piece) goes into a first draft as very simple HTML, using only a half-dozen tags. Simple editors for basic web content still exist. Dash points out that Netscape Gold (the paid version of Netscape) had one, back in the day, and that there are many free editors for basic HTML. We'd have to talk ourselves out of the very complex formatting and layout that, after all, just gets in the way. Ask (almost) any designer: simplicity wins, not a drop-dead gorgeous page. We may have made View Source useless, but we haven't lost simplicity. And if we make enough simple sites, sites from which viewers can effectively copy useful code, View Source will become useful again, too. You can't become a web developer by viewing Facebook's source; but you might by looking at a new site that isn't weighed down by all that CSS and JavaScript. The web was never supposed to be a few walled gardens of concentrated content owned by Facebook, YouTube, Twitter, and a few other major publishers. It was supposed to be a cacophony of different sites and voices. And it would be easy [...]

How companies around the world apply machine learning


Strata Data London will introduce technologies and techniques; showcase use cases; and highlight the importance of ethics, privacy, and security.

The growing role of data and machine learning cuts across domains and industries. Companies continue to use data to improve decision-making (business intelligence and analytics) and for automation (machine learning and AI). At the Strata Data Conference in London, we’ve assembled a program that introduces technologies and techniques, showcases use cases across many industries, and highlights the importance of ethics, privacy, and security. We are bringing back the Strata Business Summit, and this year, we have two days of executive briefings. Data Science and Machine Learning sessions will cover tools, techniques, and case studies. This year, we have many sessions on managing and deploying models to production, and applications of deep learning in enterprise applications. This year’s sessions on Data Engineering and Architecture showcase streaming and real-time applications, along with the data platforms used at several leading companies.

Privacy and security

The enforcement date for the General Data Protection Regulation (GDPR) is the day after the end of the conference (May 25, 2018) and for the past few months, companies have been scrambling to learn this new set of regulations. We have a tutorial and sessions to help companies learn how to comply with GDPR. Implementing data security and privacy remains foundational, but one of the key changes advanced by GDPR, “privacy-by-design,” will require companies to reassess how they design and architect products.

Security and Privacy sessions
Visualization, Design, and UX sessions

Unlocking popular data types: Text, temporal data, and graphs

The need for scale-out and streaming infrastructure can often be traced back to the importance of text, temporal data, and graphs. After one sets up infrastructure for collecting, storing, and querying these data types, the next step is to uncover interesting patterns or to use them to make predictions. Over the past year, companies have been turning to machine learning, in many cases to deep learning, when faced with large amounts of text, graphs, or temporal data. On the infrastructure side, we have sessions from members of some of the leading stream processing and storage communities.

Text and Natural Language sessions
Time-series and Graphs sessions
Stream Processing and Real-time Applications sessions

Data platforms

How do some of the best companies architect and develop data platforms that help accelerate innovation and digital transformation? In a series of sessions, companies will share their internal platforms for business intelligence and machine learning. These are battle-tested platforms used in production, some at extremely large scale. Many of these data platforms encourage collaboration and sharing of data, features, and models. In addition, since we’re very much in an empirical era for machine learning[...]

It’s time for data ethics conversations at your dinner table


In an era where fake news travels faster than the truth, our communities are at a critical juncture.

With 2.5 quintillion records of data created every day, people are being defined by how they travel, surf the internet, eat, and live their lives. We are in the midst of a “data revolution,” where individuals and organizations can store and analyze massive amounts of information. Leveraging data can allow for surprising discoveries and innovations with the power to fundamentally alter society: from applying machine learning to cancer research to harnessing data to create “smart” cities, data science efforts are increasingly surfacing new insights, and new questions. Working with large databases, new analytical tools, and data-enabled methods promises to bring many benefits to society. However, “data-driven technologies also challenge the fundamental assumptions upon which our societies are built,” says Margo Boenig-Liptsin, co-instructor of UC Berkeley’s “Human Contexts and Ethics of Data” course. Boenig-Liptsin notes, “In this time of rapid social and technological change, concepts like ‘privacy,’ ‘fairness,’ and ‘representation’ are reconstituted.” Indeed, bias in algorithms may favor some groups over others, as evidenced by notorious cases such as the finding by MIT researcher Joy Buolamwini that certain facial recognition software fails to work for those with dark skin tones. Moreover, lack of transparency and data misuse at ever-larger scales have prompted calls for greater scrutiny on behalf of more than 50 million Facebook users. In an era where fake news travels faster than the truth, our communities are at a critical juncture, and we need to be having difficult conversations about our individual and collective responsibility to handle data ethically. These conversations, and the principles and outcomes that emerge as a result, will benefit from being intentionally inclusive.
What does responsible data sharing and use look like for a data scientist, a parent, or a business? How are our socioeconomic structures and methods of interaction shaping behavior? How might we ensure that our technologies and practices are fair and unbiased? One idea that has gained traction is the need for a ‘Hippocratic Oath’ for data scientists. Just as medical professionals pledge to “do no harm,” individuals working with data should sign and abide by one or a set of pledges, manifestos, principles, or codes of conduct. At Bloomberg’s Data for Good Exchange (D4GX) in New York City in September 2017, the company announced a partnership with Data for Democracy and BrightHive to bring the data science community together to explore this very topic. More than 100 volunteers from universities, nonprofits, local and federal government agencies, and tech companies participated, drafting a set of guiding principles that could be adopted as a code of ethics. Notably, this is an ongoing and iterative process that must be community driven, respecting and recognizing the value of diverse thoughts and experiences. The group re-convened o[...]

Four short links: 3 April 2018


Internet of Battle Things, Program Fuzzing, Data Sheets for Data Sets, and Retro Port

  1. Challenges and Characteristics of Intelligent Autonomy for Internet of Battle Things in Highly Adversarial Environments -- Numerous artificially intelligent, networked things will populate the battlefield of the future, operating in close collaboration with human warfighters, and fighting as teams in highly adversarial environments. This paper explores the characteristics, capabilities, and intelligence required of such a network of intelligent things and humans—Internet of Battle Things (IOBT). It will experience unique challenges that are not yet well addressed by the current generation of AI and machine learning. (via Slashdot)
  2. T-Fuzz: Fuzzing by Program Transformation -- clever! To improve coverage, existing approaches rely on imprecise heuristics or complex input mutation techniques (e.g., symbolic execution or taint analysis) to bypass sanity checks. Our novel method tackles coverage from a different angle: by removing sanity checks in the target program. T-Fuzz leverages a coverage-guided fuzzer to generate inputs. Whenever the fuzzer can no longer trigger new code paths, a lightweight, dynamic tracing-based technique detects the input checks that the fuzzer-generated inputs fail. These checks are then removed from the target program. Fuzzing then continues on the transformed program, allowing the code protected by the removed checks to be triggered and potential bugs discovered.
  3. Data Sheets for Data Sets -- Currently there is no standard way to identify how a data set was created, and what characteristics, motivations, and potential skews it represents. To begin to address this issue, we propose the concept of a data sheet for data sets, a short document to accompany public data sets, commercial APIs, and pretrained models.
  4. Porting Prince of Persia to the BBC Master -- the author of the original 1980s game, Jordan Mechner, found and posted the source code to the Apple II version. These fine folks ported it to a different 1980s computer. I love the creativity of people who hack on small retro systems. I find big web stuff lacks that these days: it's all up-to-your-elbows in frameworks.
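
The check-removal idea described for T-Fuzz (item 2) can be illustrated with a toy sketch in Python. This is not T-Fuzz itself, which transforms real programs and locates checks through dynamic tracing. Here the `target` function, its `removed_checks` flag, the magic value, and the single-byte mutator are all invented for illustration, and "removing" a check is simulated by a flag rather than by rewriting a binary.

```python
from typing import FrozenSet, Iterator, Optional

MAGIC = b"FUZZ"  # hypothetical magic header guarding the interesting code


def target(data: bytes, removed_checks: FrozenSet[str] = frozenset()) -> str:
    """Toy target program: a sanity check sits in front of a buggy path."""
    if "magic" not in removed_checks and data[:4] != MAGIC:
        return "rejected"  # the check that simple mutation cannot get past
    if len(data) > 4 and data[4] == 0xFF:
        raise RuntimeError("bug reached")  # the bug hidden behind the check
    return "ok"


def mutations(seed: bytes) -> Iterator[bytes]:
    """Yield every single-byte mutation of the seed, in a fixed order."""
    for pos in range(len(seed)):
        for value in range(256):
            yield seed[:pos] + bytes([value]) + seed[pos + 1:]


def fuzz(seed: bytes, removed_checks: FrozenSet[str] = frozenset()) -> Optional[bytes]:
    """Return the first input that crashes the target, or None if none does."""
    for candidate in mutations(seed):
        try:
            target(candidate, removed_checks)
        except RuntimeError:
            return candidate
    return None


seed = bytes(6)  # six zero bytes
stuck = fuzz(seed)  # original program: every input is rejected by the check
crash = fuzz(seed, frozenset({"magic"}))  # "transformed" program: check removed
print(stuck, crash)
```

With the check in place the mutator never gets past the magic header; once the check is disabled, the same mutator reaches the guarded code. A crash found this way is only a candidate: it still has to be confirmed against the original program, for example by restoring the header bytes the removed check expected.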

Continue reading Four short links: 3 April 2018.


Four short links: 2 April 2018


Game Networking, Grep JSON, Voting Ideas, and UIs from Pictures

  1. Valve's Networking Code -- a basic transport layer for games. The features are: connection-oriented protocol (like TCP)...but message-oriented instead of stream-oriented; mix of reliable and unreliable messages; messages can be larger than underlying MTU, the protocol performs fragmentation and reassembly, and retransmission for reliable; bandwidth estimation based on TCP-friendly rate control (RFC 5348); encryption; AES per packet, Ed25519 crypto for key exchange and cert signatures; the details for shared key derivation and per-packet IV are based on Google QUIC; tools for simulating loss and detailed stats measurement.
  2. gron -- grep JSON from the command line.
  3. The Problem With Voting -- I don't agree with all of the analysis, but the proposed techniques are interesting. I did like the term "lazy consensus" where consensus is assumed to be the default state (i.e., “default to yes”). The underlying theory is that most proposals are not interesting enough to discuss. But if anyone does object, a consensus seeking process begins. (via Daniel Bachhuber)
  4. pix2code -- open source code that generates Android, iOS, and web source code for a UI from just a photo. It's not coming for your job any time soon (over 77% accuracy), but it's still a nifty idea. (via Two Minute Papers)
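
To see why a tool like gron (item 2) makes JSON greppable, here is a rough Python sketch of the flattening idea: turn a nested document into one assignment per line, so an ordinary text search can match on full paths. It only approximates that output style and is not the tool's implementation; the function name `gron_lines` and the sample document are made up for illustration.

```python
import json
from typing import Any, Iterator


def gron_lines(value: Any, path: str = "json") -> Iterator[str]:
    """Yield one 'path = value;' line per node of a parsed JSON value."""
    if isinstance(value, dict):
        yield f"{path} = {{}};"
        for key in sorted(value):
            # Identifier-like keys use dot notation, others use brackets.
            if key.isidentifier():
                child = f"{path}.{key}"
            else:
                child = f"{path}[{json.dumps(key)}]"
            yield from gron_lines(value[key], child)
    elif isinstance(value, list):
        yield f"{path} = [];"
        for index, item in enumerate(value):
            yield from gron_lines(item, f"{path}[{index}]")
    else:
        # Scalars (strings, numbers, booleans, null) are printed as JSON.
        yield f"{path} = {json.dumps(value)};"


document = json.loads('{"name": "gron", "tags": ["json", "grep"], "stars": 1}')
lines = list(gron_lines(document))

# A plain substring filter now behaves like grep over the flattened paths.
matches = [line for line in lines if "tags" in line]
print("\n".join(matches))
```

Because every line carries its full path, a match tells you where in the document the value lives, which is the part plain grep on raw JSON cannot give you.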

Continue reading Four short links: 2 April 2018.


6 creative ways to solve problems with Linux containers and Docker


An outside-the-box exploration of how containers can be used to provide novel solutions.

Most people are introduced to Docker and Linux containers as a way to approach solving a very specific problem they are experiencing in their organization. The problem they want to solve often revolves around either making the dev/test cycle faster and more reliable while simultaneously shortening the related feedback loops, or improving the packaging and deploying of applications into production in a very similar fashion. Today, there are a lot of tools in the ecosystem that can significantly decrease the time it takes to accomplish these tasks while also vastly improving the ability of individuals, teams, and organizations to reliably perform repetitive tasks successfully. That being said, tools have become such a big focus in the ecosystem that there are many people who haven’t really spent much time thinking about all the ways containers alone can provide interesting solutions to problems that can occur in the course of any technical task. To get the creative juices flowing and help folks start thinking outside the box, we’ll examine a few scenarios and explore how containers can be used to provide possible solutions. You'll notice that many of these examples utilize file mounts to access data stored on local machines. Note that all of these were tested on Mac OS X running a current stable release of Docker: Community Edition. Also, most of the examples assume you have a Unix-based operating system, but they can often be adjusted to work on Windows.
Preparation

If you are planning on running these examples, go ahead and download the following images ahead of time so you can see how the commands run without the additional time required to pull down the images the first time:

$ docker pull acleancoder/imagemagick-full:latest
$ docker pull jasperla/docker-go-cross:latest
$ docker pull spkane/dell-openmanage:latest
$ docker pull debian:latest
$ docker pull spkane/train-os:latest
$ docker pull alpine:latest
$ docker pull jess/firefox:latest

Scenario 1
Using containers for console commands

There are often applications that are very useful to have but don't run or are very difficult to compile on the platform we are using. Containers can provide a very easy way to run these applications, despite the apparent barriers (and even if we can run the application natively, containers can be a very compelling approach to packaging and distributing programs). In this example, we are using an ImageMagick container to resize an image. Although this particular example is easy to accomplish in other ways, it should give some insight into how a container can be used to take advantage of a wide variety of similar console-based tools.

$ curl -o docker.png \
$ ls
$ docker run -ti -v $(pwd):/data acleancoder/imagemagick-full:latest \
    convert /data/docker.png -resize 50% /data/half_docker.png
$ ls

Scenario 2 Using containers [...]

Four short links: 30 March 2018


Data Literacy, Data Science Readings, Bloated Data Architectures, and AI Ruins Everything

  1. Data Defenders -- game for grade 4-6 that teaches children and pre-teens the concept of personal information and its economic value, and introduces them to ways to manage and protect their personal information on the websites and apps they enjoy. (via BoingBoing)
  2. Readings in Applied Data Science -- pointers to interesting papers, via Hadley Wickham's Stanford class.
  3. COST: Configuration that Outperforms a Single Thread -- The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation. [...] We survey measurements of data-parallel systems recently reported in SOSP and OSDI, and find that many systems have either a surprisingly large COST, often hundreds of cores, or simply underperform one thread for all their reported configurations.
  4. Finding Alternative Musical Scales -- is there nothing that AI cannot improve/ruin? We search for alternative musical scales that share the main advantages of classical scales: pitch frequencies that bear simple ratios to each other, and multiple keys based on an underlying chromatic scale with tempered tuning. We conduct the search by formulating a constraint satisfaction problem that is well suited for solution by constraint programming. We find that certain 11-note scales on a 19-note chromatic stand out as superior to all others. These scales enjoy harmonic and structural possibilities that go significantly beyond what is available in classical scales and therefore provide a possible medium for innovative musical composition. (via Mark J. Nelson)

Continue reading Four short links: 30 March 2018.


What machine learning engineers need to know



The O’Reilly Data Show Podcast: Jesse Anderson and Paco Nathan on organizing data teams and next-generation messaging with Apache Pulsar.

In this episode of the Data Show, I spoke with Jesse Anderson, managing director of the Big Data Institute, and my colleague Paco Nathan, who recently became co-chair of Jupytercon. This conversation grew out of a recent email thread the three of us had on machine learning engineers, a new job role that LinkedIn recently pegged as the fastest growing job in the U.S. In our email discussion, there was some disagreement on whether such a specialized job role/title was needed in the first place. As Eric Colson pointed out in his beautiful keynote at Strata Data San Jose, when done too soon, creating specialized roles can slow down your data team.

Continue reading What machine learning engineers need to know.


UX challenges in the Internet of Things



A look at the various ways the IoT asks consumers to think like programmers, and the risks inherent in exponentially increasing educators.

Continue reading UX challenges in the Internet of Things.


Four short links: 29 March 2018


Facebook Container, Publishing Future, Social Media Ethics, and Online Virality

  1. Facebook Container -- Firefox add-on that isolates your Facebook identity from the rest of your web activity. When you install it, you will continue to be able to use Facebook normally. Facebook can continue to deliver their service to you and send you advertising. The difference is that it will be much harder for Facebook to use your activity collected off Facebook to send you ads and other targeted messages.
  2. What's Coming for Online Publishing (Doc Searls) -- What will happen when the Times, the New Yorker, and other pubs own up to the simple fact that they are just as guilty as Facebook of leaking its readers’ data to other parties, for—in many if not most cases—God knows what purposes besides “interest-based” advertising? (via Piers Harding)
  3. Affiliate Marketing Not Disclosed on Social Media (Freedom to Tinker) -- Of all the YouTube videos and Pinterest pins that contained affiliate links, only ~10% and ~7% respectively contained accompanying disclosures. (paper)
  4. The Structural Virality of Online Diffusion -- Indeed, the very label “viral hit” implies precisely the exponential spreading of the sort observed in contagion models in their supercritical regime. It is therefore notable that essentially everything we observe, including the very largest and rarest events, can be accounted for by a simple model operating entirely in the low infectiousness parameter regime.

Continue reading Four short links: 29 March 2018.


A graphical user interface to build apps on top of microservices


How to enable non-programmer business users to create their own data applications.

Companies are increasingly asking their IT staffs for rapid turn-around on tasks that require programming. The most likely path to attaining quick turn-around would be to let non-programmers create their own applications, an approach that can be achieved with a combination of microservices, APIs, and graphical user interfaces (GUIs). The goal of letting non-programmers manipulate data is an old one. When IBM released early versions of the SQL programming language, they intended it for business users. They really expected a mid-level manager to sit down in the morning and type UPDATE PRODUCT.SPEC SET WIDTH = 19, HEIGHT = 12, DEPTH = 4 WHERE PRODUCT_NUM LIKE 'GK145%' onto a green screen. More realistically, though, SQL was hidden behind various forms that eventually moved to the web. But these forms do not offer the full power of the language. Visual programming, using some technique such as moving boxes around a screen to create control flows, has also been researched for decades. It is best known in children's educational tools such as Alice and MIT's Scratch. The problem with trying to create large-scale applications using visual programming (and the reason I'm guessing that visual programming hasn't won a large user base) is that programming's complexity lies more in the mind than in the syntax. Exposing the complexity of control flow and math through boxes and lines doesn't make it easier than using words and punctuation. It may be quite a different matter, however, when powerful chunks of functionality, already hooked up to an organization's data sets and control structure, are offered in a visual interface, also known as a “low-code development platform.” For instance, an auto insurance assessor can do something really useful if she can automate part of her job by drawing a line between client information and a tool that calculates how much to pay for each part of a car.
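
For readers who want to see what that green-screen UPDATE statement actually does, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the three sample rows are assumptions made up for illustration; only the UPDATE statement itself comes from the text above. An in-memory database is attached under the name PRODUCT so that the schema-qualified name PRODUCT.SPEC resolves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Attach a second in-memory database named PRODUCT so that the
# schema-qualified table name PRODUCT.SPEC can be used as written.
conn.execute("ATTACH DATABASE ':memory:' AS PRODUCT")

# Hypothetical table layout; only the column names come from the statement.
conn.execute(
    "CREATE TABLE PRODUCT.SPEC ("
    "PRODUCT_NUM TEXT PRIMARY KEY, WIDTH INTEGER, HEIGHT INTEGER, DEPTH INTEGER)"
)
conn.executemany(
    "INSERT INTO PRODUCT.SPEC VALUES (?, ?, ?, ?)",
    [("GK1450", 10, 10, 2), ("GK1451", 11, 9, 3), ("ZX9000", 5, 5, 5)],
)

# The statement quoted in the article: resize every product whose
# number starts with GK145.
cursor = conn.execute(
    "UPDATE PRODUCT.SPEC SET WIDTH = 19, HEIGHT = 12, DEPTH = 4 "
    "WHERE PRODUCT_NUM LIKE 'GK145%'"
)
updated_count = cursor.rowcount

rows = conn.execute(
    "SELECT PRODUCT_NUM, WIDTH, HEIGHT, DEPTH FROM PRODUCT.SPEC "
    "ORDER BY PRODUCT_NUM"
).fetchall()
for row in rows:
    print(row)
```

The LIKE pattern 'GK145%' matches any product number beginning with GK145, so one short statement silently rewrites every matching row, which hints at why this was eventually hidden behind forms.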
Where do microservices enter the picture? They work at a higher level than function or library calls, offering precise access to specific data and services in the organization. I talked about microservices with Bruno Trimouille, a senior marketing executive at TIBCO Software. He spoke of an airline that has microservices for travel services, customer profile data, flight information, and other elements of their workflow. Visual programming exposes all these things to the average employee. If someone in the luggage department thinks up an application that can find a bag more quickly or send a voucher to a customer's mobile phone, he can hook up services to create the application. The low-code application environment can tap microservices to digitize a process that previously was done manually, such as submitting expenses or getting travel approval. I[...]

Four short links: 28 March 2018


Business Logic, Digital Forgery, Twitter Demetricator, and ML for Kids

  1. The Business Logic of Silicon Valley (Cory Doctorow) -- Selling hardware does not have exponential growth potential, because people only need so many sex toys, and really successful models are likely to be cloned or imitated, driving down the price. For a "smart" sex toy startup to cash out its investors, it will need to have a second, much more profitable business, and that, inevitably, is private data about sex toy users.
  2. Commoditization of AI, Digital Forgery, and the End of Trust -- Once tools for fabrication become a commodity, the effects will be more dramatic than the current phenomenon of fake news. In tech circles, the issues are discussed only at a philosophical level; no clear solution is known at the present time. This post discusses pros and cons of two classes of potential solutions: digital signatures and learning-based detection systems.
  3. Twitter Demetricator -- a Chrome interface that hides all the gamification numbers in Twitter.
  4. Machine Learning for Kids -- Each project is a stand-alone activity, written to last for a single lesson, and will guide children to create a game or interactive project that demonstrates a real-world use of artificial intelligence and machine learning.

Continue reading Four short links: 28 March 2018.


Four short links: 27 March 2018


Database Attack, Death of Android Predicted, Predicting Recidivism, and Cybersecurity Law and Policy

  1. Deep Dive: Database Attacks -- I enjoyed the description of how this attack worked: using Postgres to write and run executables, smuggling in an executable that sets up a cryptocurrency mining operation on the machine.
  2. Yegge on Android (Steve Yegge) -- Remember that I said it could take 20 minutes to see a 1-line code change in the regular Android stack? That can happen in the biggest apps like Nest or Facebook, but even for medium-size apps it can be two or three minutes. Whereas with React Native, it’s instantaneous. You make a change; you see the change. And that, folks, means you get to launch features 10x faster, which means faster time to market, which means first-mover advantage, which means you win, win, win. Abandoning native programming in favor of fast-cycle cross-platform frameworks like React Native is a winning strategy. Tim Bray disagrees with some of Yegge's points. (Also: Yegge's hiring, which is why he's blogging again)
  3. The Accuracy, Fairness, and Limits of Predicting Recidivism -- We show, however, that the widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice expertise. We further show that a simple linear predictor provided with only two features is nearly equivalent to COMPAS with its 137 features. (via Aravind Narayanan)
  4. Teaching Cybersecurity Law and Policy -- My syllabus is much more than a one- or two-pager just listing the topics and weekly readings. Though there are a lot of reading assignments, the syllabus itself functions a bit like a casebook in that there also is a ton of narrative text framing each week’s topic, and also extensive questions for consideration matched to each reading. The full syllabus is 58 pages long. (via Bobby Chesney)

Continue reading Four short links: 27 March 2018.[...]

Four short links: 26 March 2018


Honey Tokens, Digital Protection Agency, Intent, and Go on Pi

  1. Breach Detection at Scale With AWS Honey Tokens -- AWS keys make extremely good honey tokens because they're very interesting to attackers (because if you find someone's AWS keys, you may have just found several thousand dollars worth of cryptocurrency mining hardware in someone else's cloud); and because you, the defender, can really easily secure AWS keys, and alert if anyone tries to use them.
  2. It's Time for a Digital Protection Agency (Paul Ford) -- When you think of a Superfund site, you think of bad things, like piles of dead wildlife or stretches of fenced-off, chemical-infused land, or hospital wings filled with poisoned families. No one thinks about all the great chemicals that get produced, or the amazing consumer products we all enjoy. Nobody sets out to destroy the environment; they just want to make synthetic fibers or produce industrial chemicals. The same goes for our giant tech platforms.
  3. On Retirement (Jan Chipchase) -- Intent impacts everything downstream. True of so many things.
  4. gokrazy -- effort to build an all-Go userland for Raspberry Pi, removing memory errors as a source of exploits.

Continue reading Four short links: 26 March 2018.


Four short links: 23 March 2018


Tech Facts, History Lesson, Chrome DevTools, and Online Communities

  1. 12 Things Everyone Should Understand About Tech (Anil Dash) -- must read. Some "obvious" things in our world that aren't obvious to people outside it (or, indeed, to everyone in it). Tech history is poorly documented and poorly understood. [...] It’s often near impossible to know why certain technologies flourished, or what happened to the ones that didn’t.
  2. AI, Functional Programming, OOP (Alan Kay) -- fascinating reading for the historical perspective on time-tagged data, state, consistency, and scale. John started thinking about modal logics, but then realized that simply keeping histories of changes and indexing them with a “pseudo-time” when a “fact” was asserted to hold, could allow functional and logical reasoning and processing. He termed “situations” all the “facts” that held at a particular time—a kind of a “layer” that cuts through the world lines of the histories.
  3. Cool Chrome DevTools Tips and Tricks -- Screenshot a single element: select an element, press cmd-shift-p (or ctrl-shift-p in Windows) to open the Command Menu, and select Capture node screenshot to screenshot a single element. OH HO HO THIS CHANGES EVERYTHING.
  4. -- A modern platform for online communities. Powerful management tools. Mobile ready. Free and paid plans. No ads, no tracking. For those looking to take their groups off Facebook and Google.
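
The pseudo-time idea in item 2 (keep a history of assertions, then read off the "situation" that held at a given time) can be sketched as follows. This is one possible reading of the quoted passage, assuming integer pseudo-times; the class and method names are hypothetical.

```python
from collections import defaultdict


class FactHistory:
    """Append-only history of facts, each tagged with a pseudo-time."""

    def __init__(self):
        # fact name -> list of (pseudo_time, value), kept sorted by time
        self._history = defaultdict(list)

    def assert_fact(self, name, value, pseudo_time):
        """Record that `name` held `value` from `pseudo_time` onward."""
        self._history[name].append((pseudo_time, value))
        self._history[name].sort(key=lambda entry: entry[0])

    def situation(self, pseudo_time):
        """Return all facts holding at `pseudo_time`: for each name,
        the latest value asserted at or before that time."""
        result = {}
        for name, entries in self._history.items():
            held = [value for t, value in entries if t <= pseudo_time]
            if held:
                result[name] = held[-1]
        return result


history = FactHistory()
history.assert_fact("door", "closed", 1)
history.assert_fact("light", "off", 2)
history.assert_fact("door", "open", 3)

print(history.situation(2))
print(history.situation(3))
```

Nothing is ever overwritten, so a query for an earlier pseudo-time still returns the earlier state; that is what lets functional or logical reasoning treat each situation as a fixed value.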

Continue reading Four short links: 23 March 2018.


What are good resources for learning about JavaScript?



This collection of JavaScript resources will get you up to speed on the basics, best practices, and latest techniques.

Whether you’re just getting into JavaScript or you’re an experienced practitioner, you’ll find something useful on our list of JavaScript resources.

The items on this list were curated by O’Reilly’s editorial experts.

Continue reading What are good resources for learning about JavaScript?.


Rebecca Parsons on evolutionary architecture



The O’Reilly Programming Podcast: How to build evolvable systems.

In this episode of the O’Reilly Programming Podcast, I talk with Rebecca Parsons, chief technology officer at ThoughtWorks. She will be leading the workshop Building Evolutionary Architectures Hands-On at the O’Reilly Open Source Convention (OSCON), July 16-19, 2018, in Portland, Oregon. Parsons also is co-author (with Neal Ford and Patrick Kua) of the book Building Evolutionary Architectures.

Continue reading Rebecca Parsons on evolutionary architecture.


Four short links: 22 March 2018


Security Policy, Censored 3D Printers, Standup Tips, and Auto-Banning

  1. Protecting Security Researchers -- Dropbox issues, amongst other good steps toward public security researchers, a pledge to not initiate legal action for security research conducted pursuant to the policy, including good faith, accidental violations.
  2. Early-stage Malicious Activity Detection in 3D Printing -- teaching a 3D printer to recognize that it's being used to print a gun, so it won't. (via Miles Brundage)
  3. 5 Ways to Tune Up Your Standup -- Teams need to start thinking of impediments in terms of "what is slowing me down" rather than "what has stopped me." Testify!
  4. Fail2Ban -- scans log files (e.g., /var/log/apache/error_log) and bans IPs that show the malicious signs -- too many password failures, seeking for exploits, etc. Generally, Fail2Ban is then used to update firewall rules to reject the IP addresses for a specified amount of time, although any arbitrary other action (e.g., sending an email) could also be configured. Out-of-the-box Fail2Ban comes with filters for various services (apache, courier, SSH, etc).
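
The scan-and-ban technique described in item 4 can be sketched as below. This is not Fail2Ban's code or its log format: the log line layout (an integer timestamp followed by "auth failure from" and an IP), the function name, and the `max_retry` and `find_time` parameters are assumptions made for illustration, and lines are assumed to arrive in chronological order.

```python
import re
from collections import defaultdict

# Hypothetical log format: "<unix seconds> auth failure from <ipv4>"
FAILURE = re.compile(
    r"^(?P<ts>\d+) auth failure from (?P<ip>\d{1,3}(?:\.\d{1,3}){3})$"
)


def ips_to_ban(lines, max_retry=3, find_time=600):
    """Return IPs with at least `max_retry` failures inside any
    `find_time`-second window. Non-matching lines are ignored."""
    recent = defaultdict(list)  # ip -> failure timestamps still in the window
    banned = set()
    for line in lines:
        match = FAILURE.match(line.strip())
        if not match:
            continue
        ts, ip = int(match["ts"]), match["ip"]
        # Drop failures that have aged out of the window ending at ts.
        window = [t for t in recent[ip] if ts - t < find_time] + [ts]
        recent[ip] = window
        if len(window) >= max_retry:
            banned.add(ip)
    return banned


log = [
    "100 auth failure from 203.0.113.9",
    "160 auth failure from 203.0.113.9",
    "170 auth failure from 198.51.100.7",
    "220 auth failure from 203.0.113.9",
    "5000 auth failure from 198.51.100.7",
    "5001 login ok from 198.51.100.7",
]

banned = ips_to_ban(log)
print(sorted(banned))
```

A real deployment would then act on the result, for example by updating firewall rules for a fixed ban period; that step is left out of the sketch.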

Continue reading Four short links: 22 March 2018.


Interpreting predictive models with Skater: Unboxing model opacity


A deep dive into model interpretation as a theoretical concept and a high-level overview of Skater.

Over the years, machine learning (ML) has come a long way, from its existence as experimental research in a purely academic setting to wide industry adoption as a means for automating solutions to real-world problems. But oftentimes, these algorithms are still perceived as alchemy because of the lack of understanding of the inner workings of these models (see Ali Rahimi, NIPS '17). There is often a need to verify the reasoning of such ML systems to hold algorithms accountable for the decisions predicted. Researchers and practitioners are grappling with the ethics of relying on predictive models that might have unanticipated effects on human life, such as the algorithms evaluating eligibility for mortgage loans or powering self-driving cars (see Kate Crawford, NIPS '17, “The Trouble with Bias”). Data scientist Cathy O’Neil has recently written an entire book filled with examples of poor interpretability as a dire warning of the potential social carnage from misunderstood models: for example, modeling bias in criminal sentencing or using dummy features with human bias while building financial models.

Figure 1. Traditional methods for interpreting predictive models are not enough. Image courtesy of Pramit Choudhary.

There is also a trade-off in balancing a model’s interpretability and its performance. Practitioners often choose linear models over complex ones, compromising performance for interpretability, which might be fine for many use cases where the cost of an incorrect prediction is not high. But, in some scenarios, such as credit scoring or the judicial system, models have to be both highly accurate and understandable. In fact, the ability to account for the fairness and transparency of these predictive models has been mandated for legal compliance.

At, where I’m a lead data scientist, we feel passionately about the ability of practitioners to use models to ensure safety, non-discrimination, and transparency. We recognize the need for human interpretability, and we recently open sourced a Python framework called Skater as an initial step to enable interpretability for researchers and applied practitioners in the field of data science.

Model evaluation is a complex problem, so I will segment this discussion into two parts. In this first piece, I will dive into model interpretation as a theoretical concept and provide a high-level overview of Skater. In the second part, I will share a more detailed explanation on the algorithms Skater currently supports, as well as the library’s feature [...]

Data governance and the death of schema on read


Comcast’s system of storing schemas and metadata enables data scientists to find, understand, and join data of interest.

In the olden days of data science, one of the rallying cries was the democratization of data. No longer were data owners at the mercy of enterprise data warehouses (EDWs) and extract, transform, load (ETL) jobs, where data had to be transformed into a specific schema (“schema on write”) before it could be stored in the enterprise data warehouse and made available for use in reporting and analytics. This data was often most naturally expressed as nested structures (e.g., a base record with two array-typed attributes), but warehouses were usually based on the relational model. Thus, the data needed to be pulled apart and “normalized” into flat relational tables in first normal form. Once stored in the warehouse, recovering the data’s natural structure required several expensive relational joins. Or, for the most common or business-critical applications, the data was “de-normalized,” in which formerly nested structures were reunited, but in a flat relational form with a lot of redundancy.

This is the context in which big data and the data lake arose. No single schema was imposed. Anyone could store their data in the data lake, in any structure (or no consistent structure). Naturally nested data was no longer stripped apart into artificially flat structures. Data owners no longer had to wait for the IT department to write ETL jobs before they could access and query their data. In place of the tyranny of schema on write, schema on read was born. Users could store their data in any schema, which would be discovered at the time of reading the data. Data storage was no longer the exclusive province of the DBAs and the IT departments. Data from multiple previously siloed teams could be stored in the same repository.

Where are we today?

Data lakes have ballooned. The same data, and aggregations of the same data, are often present redundantly (often many times redundant), as the same interesting data set is saved to the data lake by multiple teams, unknown to each other. Further, data scientists seeking to integrate data from multiple silos are unable to identify where the data resides in the lake. Once found, diverse data sets are very hard to integrate, since the data typically contains no documentation on the semantics of its attributes. Attributes on which data sets would be joined (e.g., customer billing ID) have been given different names by different teams. The rule of thumb is that data scientists spend 70% of their time finding, interpreting, and cleaning [...]
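
Schema on read, as described in the excerpt above, can be sketched in a few lines. This is an illustration only, not Comcast's system: the two raw records, their field names, and the function name are invented. Records are stored as-is, and the schema is discovered only when the data is read.

```python
import json

# Two teams wrote to the "lake" with no schema imposed. Note that they
# named the same join attribute differently and used different types.
raw_lake = [
    '{"customer_billing_id": 17, "plans": ["tv", "net"]}',
    '{"billing_id": "17", "devices": [{"kind": "modem"}]}',
]


def discover_schema(raw_records):
    """Infer, at read time, a mapping from field name to the set of
    Python type names observed for that field."""
    schema = {}
    for raw in raw_records:
        for field, value in json.loads(raw).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


schema = discover_schema(raw_lake)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

The discovered schema lists `customer_billing_id` and `billing_id` as unrelated fields, which is the integration problem the excerpt points to: nothing in the data says they mean the same thing.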

Do no harm


In the software world, we’re often ignorant of the harms we do because we don’t understand what we’re working with.

“First, do no harm.” That seems to have been a touchstone for the many discussions of data ethics that have been taking place. And, while that is a nice old saying that goes back to the Hippocratic oath, we need to think a lot more carefully about it.

First, let’s take this statement at its word. “Do no harm.” But doctors did nothing but harm until 200 or so years ago. “You have a fever. That’s caused by bad blood. Let me take some of it out.” It was a race to see whether the patient died of blood loss from repeated bleedings before their immune system did its job. The immune system frequently lost. “You’ve got a broken leg. Here, let me cut it off. Without washing my hands first.” Most of the patients probably died from the ensuing infection. Mortality in childbirth had more than a little to do with doctors who just didn’t see the need to scrub down. “Syphilis? Let’s try mercury. Or arsenic. Or both.” A medical school professor I know tells me that the arsenic might have worked, if it didn’t kill you first. Mercury wasn’t good at anything but killing you slowly.

The problem wasn’t that the centuries of doctors prior to the discovery of bacteria and antibiotics were uncaring, or in some way intended to do harm. On the whole, they were well-meaning and, for their time, fairly well educated. It’s that they didn’t understand “harm.” They were working with things (human bodies and their ailments) that they didn’t understand. Why would you wash your hands before surgery? What sort of idiot believes that invisible animals living on your skin will get into the patient and kill them?

That is the problem we face in the software world. Most of the people in our industry don’t want to do harm, but we’re often ignorant of the harms we do because we don’t understand what we’re working with: human psyches. So often, when we look back at the consequences of what we’ve done, we say “oh, that was obvious.” But at the time, it was no more obvious than bacteria.

I was reminded of this when I read about the connection between radicalization and engagement. When you see it spelled out like that, it’s a no-brainer. A generation of internet applications, certainly including YouTube and Facebook, were built to maximize engagement. How do you get people more engaged? Make them more passionate. Turn them into radicals. If you’re reading or watching something a little on the edge, what will m[...]

Real-world VR applications beyond entertainment



Jody Medich takes a look at our rapidly dematerializing world.

Continue reading Real-world VR applications beyond entertainment.


Building for the modern web is really, really hard


The 80+ sessions covering web performance, security, and accessibility at Fluent 2018 won’t make building for the web easy, but they’ll definitely make it easier.

When I started working on the web back in the mid-90s, I had no idea how easy we all had it. Sure, our crappy 14.4 modems crashed roughly 6.2 times per hour. And those early days of all-grey backgrounds were kind of dull. But in terms of skill mastery, all you had to do was hack together some HTML, paste some copy, choose a font from the dazzling array of 12 fonts available to you, add an image or two, push it all live, restart your modem (again), grab a Snapple, watch that dancing baby GIF and try to figure out what all the fuss was about, and call it a day.

The web has become mind-bogglingly complex

The average web page is creeping up on 4MB in size, according to the HTTP Archive. It contains hundreds of assets served from scores of different third (and fourth and fifth) parties, and it relies on a multitude of programming languages to function. And that’s just the beginning.

As developers, designers, and UX folks, we’re tasked with so much more than just making these huge, complex pages function. We’re tasked with making them function on an ever-increasing number of browsers and devices, both mobile and desktop. (In their latest annual device fragmentation report, OpenSignal identified 3,997 distinct Android devices alone downloading their app.) We’re also tasked with making these pages function reliably and consistently, regardless of the user’s connection type: from blazing-fast broadband speeds of 160 Mbps, to spotty 2G in developing countries, to hotels with mysteriously and arbitrarily execrable bandwidth. (See below. Ahem.)

Figure 1. I experienced 1 Mbps bandwidth in 2017. Tell my story.

Most important: we have to make these huge, complex pages fast, safe, and accessible for everyone. For almost a decade, I’ve been working deep in the world of web performance, but even from way down here in the trenches, I can tell you that while speed is a crucial pillar of the modern web, it’s just one pillar. Making the web fast without also making it accessible and secure is like building a high-performance race car and then neglecting to install doors and a seatbelt.

The Fluent Conference focuses on web performance, accessibility, and security

Over the past several months, our Fluent Conference program committee has worked hard to develop a program of 80+ sessions that will give you the breadth and[...]

Four short links: 21 March 2018


Incident Response, Google VPN, Deep Learning, and Neural Network Quine

  1. 10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use (Salesforce) -- high-level process, but solid.
  2. Outline -- self-hosted VPN from Google.
  3. nGraph -- open source C++ implementation of a framework-neutral deep neural network (DNN) model compiler that can target a variety of devices. Intel's software entry into the neural network space. Why Intel? Users can run these frameworks on several devices: Intel Architecture, GPU, and Intel Nervana Neural Network Processor (NNP).
  4. Neural Network Quine -- clever! I have a Python quine scarf, but I'm not sure I'm going to enjoy this so much. Here we describe how to build and train self-replicating neural networks. The network replicates itself by learning to output its own weights.
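
For readers who have not met the term in item 4: a quine is a program whose output is its own source. The sketch below holds a classic two-line Python quine in a string and checks that running it reproduces that string exactly. The `QUINE` constant and the `run_and_capture` helper are names invented for this illustration, and the example has no connection to the linked paper's code.

```python
import contextlib
import io

# A classic two-line Python quine, held as a string so it can be checked.
# Written out as a file, it reads:
#   s = 's = %r\nprint(s %% s)'
#   print(s % s)
QUINE = "s = 's = %r\\nprint(s %% s)'\nprint(s % s)\n"


def run_and_capture(source):
    """Execute Python source in a fresh namespace and return its stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue()


output = run_and_capture(QUINE)
print("self-replicating:", output == QUINE)
```

The trick is that the string `s` is a template for the whole program, and `%r` inserts the string's own quoted form back into that template. The neural network version in the linked work replaces "print your own source" with "output your own weights".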

Continue reading Four short links: 21 March 2018.


The evolution of systems requires an evolution of systems engineers


Systems and site reliability engineers, architects, and application developers must create new strategies to meet industry shifts and their constraints.

Over the last few weeks, we've been reflecting on changes in the technology industry from when we first started our careers up to now. We’ve been looking at the changes in two different but overlapping spheres: changes to technology and changes to methodology.

The systems we worked on when many of us first started out were the first generations of client-server applications. They were fundamentally different from the prior generation: terminals connecting to centralized apps running on mainframe or midrange systems. Engineers learned to care about the logic of their application client as well as the server powering it. Connectivity, the transmission of data, security, latency and performance, and the synchronization of state between the client and the server became issues that now had to be considered to manage those systems.

This increase in sophistication spawned commensurate changes to the complexity of the methodologies and skills required to manage those systems. New types of systems meant new skills, understanding new tools, frameworks, and programming languages. We can trace back to this moment the spawning of numerous new specializations that had previously been more concentrated in single roles: front-end engineers, back-end engineers, data scientists, designers, UX/UI specialists, and a myriad other specialities. We can perhaps also trace back to this period the construction of more siloed functions and the increased complexity in transitions between those silos. These are the silos that the DevOps and SRE communities are attempting to dismantle today.

Since the first generation of client-server systems, we’ve seen significant evolution. Much of it has been driven by the emergence of technology as being mission critical to doing business, for any business in every industry. This has been coupled with customer demand for fast, immediate functionality available on devices, delivered seamlessly across different geographies and fabrics. Take, for example, the evolution of renting videos from the corner video store to streaming on Netflix and Hulu and their peers. Our expectation of latency for the delivery of content has dropped from hours or minutes to seconds. Our expectation of the delivery of that content is that it’ll be available to us 24x7x365 on every device we own and in every location: from our homes and offi[...]

Four short links: 20 March 2018


Magic Leap, Autonomous Death, Computational Cognitive Neuroscience, and Analysis of Competing Hypotheses

  1. Creator -- Magic Leap released their SDK, with some details of the product, including: eye tracking, persistent location tracking across multiple locations, small field-of-vision, can't draw black, etc. (The HN commentary is informative)
  2. Uber Autonomous Car Hits Pedestrian -- first of many. This is where the rubber of the (faith-based) statistical argument "machines will kill fewer pedestrians than people do" meets the road of human emotional perception.
  3. Computational Cognitive Neuroscience -- This is a new wiki textbook, serving as a second edition to Computational Explorations in Cognitive Neuroscience.
  4. Open Synthesis -- an open platform for intelligence analysis. We're taking the best practices from the intelligence and business communities and adapting them to work at internet-scale. Open source, too.

Continue reading Four short links: 20 March 2018.