Learn how to create thread-safe instances with the singleton pattern in C#.
Continue reading How do I use the singleton pattern in C#?
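The linked post covers C#; as a loose, language-agnostic sketch of the same thread-safety idea (not the article's code), here is a singleton using double-checked locking in Python:

```python
import threading

class Singleton:
    """Thread-safe singleton via double-checked locking."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # First check avoids taking the lock on the fast path.
        if cls._instance is None:
            with cls._lock:
                # Second check: another thread may have created the
                # instance while we were waiting for the lock.
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance
```

Every call returns the same object, and the lock guarantees only one instance is ever created even under concurrent construction.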
Learn how to correctly implement the repository pattern in C#.
Continue reading How do I use the repository pattern in C#?
Car Security, Civ Math, Free Mindstorms, and Chinese AI Research
Continue reading Four short links: 20 Feb 2017.
Your company is probably already doing AI and machine learning, but it needs a road map.
Continue reading How to drive shareholder value with artificial intelligence.
Robot Governance, Emotional Labour, Predicting Personality, and Music History
Continue reading Four short links: 17 February 2017.
The O’Reilly Bots Podcast: Slack’s head of developer relations talks about what bots can bring to Slack channels.
In this episode of the O’Reilly Bots Podcast, Pete Skomoroch and I speak with Amir Shevat, head of developer relations at Slack and the author of the forthcoming O’Reilly book Designing Bots: Creating Conversational Experiences.
Continue reading Amir Shevat on workplace communication.
The O’Reilly Design Podcast: The guiding light of strategy, designing Allbirds, and what makes the magic of a brand identity.
In this week’s Design Podcast, I sit down with Simon Endres, creative director and partner at Red Antler. We talk about working from a single idea, how Red Antler is helping transform product categories, and the importance of having a point of view.
Continue reading Simon Endres on designing in an arms race of high-tech materials.
Continue reading Four short links: 16 February 2017.
Use Python's magic methods to amplify your code.
Continue reading How Python syntax works beneath the surface.
Docker Data, Smart Broadcasting, Open Source, and Cellphone Spy Tools
Continue reading Four short links: 15 Feb 2017.
The O’Reilly Security Podcast: The problem with perimeter security, rethinking trust in a networked world, and automation as an enabler.
In this episode, I talk with Doug Barth, site reliability engineer at Stripe, and Evan Gilman, Doug’s former colleague from PagerDuty who is now working independently on Zero Trust networking. They are also co-authoring a book for O’Reilly on Zero Trust networks. They discuss the problems with traditional perimeter security models, rethinking trust in a networked world, and automation as an enabler.
Continue reading Doug Barth and Evan Gilman on Zero Trust networks.
How to map out a plan for finding value in data.
Rapping Neural Network, H1B Research, Quantifying Controversy, Social Media Research Tools
Continue reading Four short links: 14 Feb 2017.
Urban Attractors, Millimetre-Scale Computing, Ship Small Code, and C++ Big Data
Continue reading Four short links: 13 Feb 2017.
David Beyer talks about AI adoption challenges, who stands to benefit most from the technology, and what's missing from the conversation.
Continue reading The dirty secret of machine learning.
Microsoft Graph Engine, Data Exploration, Godel Escher Bach, and Docker Secrets
Continue reading Four short links: 10 Feb 2017.
Alex Rice on the importance of inviting hackers to find vulnerabilities in your system, and how to measure the results of incorporating their feedback.
Continue reading Hacker quantified security.
The O'Reilly Radar Podcast: The value humans bring to AI, guaranteed job programs, and the lack of AI productivity.
This week, I sit down with Tom Davenport. Davenport is a professor of Information Technology and Management at Babson College, the co-founder of the International Institute for Analytics, a fellow at the MIT Center for Digital Business, and a senior advisor for Deloitte Analytics. He also pioneered the concept of “competing on analytics.” We talk about how his ideas have evolved since writing the seminal work on that topic, Competing on Analytics: The New Science of Winning; his new book Only Humans Need Apply: Winners and Losers in the Age of Smart Machines, which looks at how AI is impacting businesses; and we talk more broadly about how AI is impacting society and what we need to do to keep ourselves on a utopian path.
Continue reading Tom Davenport on mitigating AI's impact on jobs and business.
The O’Reilly Data Show Podcast: Jason Dai on BigDL, a library for deep learning on existing data frameworks.
In this episode of the Data Show, I spoke with Jason Dai, CTO of big data technologies at Intel, and co-chair of Strata + Hadoop World Beijing. Dai and his team are prolific and longstanding contributors to the Apache Spark project. Their early contributions to Spark tended to be on the systems side and included Netty-based shuffle, a fair-scheduler, and the “yarn-client” mode. Recently, they have been contributing tools for advanced analytics. In partnership with major cloud providers in China, they’ve written implementations of algorithmic building blocks and machine learning models that let Apache Spark users scale to extremely high-dimensional models and large data sets. They achieve scalability by taking advantage of things like data sparsity and Intel’s MKL software. Along the way, they’ve gained valuable experience and insight into how companies deploy machine learning models in real-world applications.
Continue reading Deep learning for Apache Spark.
2017-02-09T12:00:00Z
5 questions for Aarron Walter: Shaping products, growing teams, and managing through change.

I recently asked Aarron Walter, VP of design education at InVision and author of Designing for Emotion, to discuss what he has learned through his years of building and managing design teams. At the O’Reilly Design Conference, Aarron will be presenting a session, Hard-learned lessons in leading design.

Your talk at the upcoming O'Reilly Design Conference is titled Hard-learned lessons in leading design. Tell me what attendees should expect.

I had the unique opportunity of watching a company grow from just a handful of people to more than 550 over the course of eight years at MailChimp. When I started we had a few thousand customers, but when I left in February of 2016, there were more than 10 million worldwide. We saw tremendous growth, and I learned so much in my time there. In my talk, I'll be sharing the most salient lessons I learned along the way—how to shape a product, grow a team, how a company changes and how it changes people's careers, and a lot more.

What are some of the challenges that come along with building and leading a design team in a strong growth period?

As a company grows, the people who run it have to grow, too. There's a steep learning curve. When you're a small team, it's easy to make decisions and get things done. But when a company grows, clear processes are needed, more people need to be brought into the planning process, and rapport has to be developed between teams and key individuals. The trick is you never really know what stage the company is in, so there's always uncertainty about whether you're doing the right thing. Everyone has to adapt and change with each new stage, and that can be hard for some people.

What are some of the more memorable lessons you learned along the way?

Early on as the director of UX, I thought my most important job was designing a great product. That was true, but only until we needed to start building teams. Then my most important job was hiring great people. That remained my top priority for years to come, and I see it as my lasting legacy within the company. There are so many smart, talented people at MailChimp. I'm proud to have played a part in hiring and mentoring a number of people who've gone on to lead their own teams.

In the early years of the product, we were focused on the future, toward new features and new ideas. But as the product and company matured, we had to master the art of refinement. Feature production is a treadmill: there will always be something else you can build. But if those features are half-baked or unrefined, you can end up with a robust product that is too complicated or too broken to use. Phil Libin said it best: "The best product companies in the world have figured out how to make constant quality improvements part of their essential DNA."

You will be speaking about the importance of building a strong design practice. Can you explain what this looks like?

A strong design practice has these things going for it:

A product vision that makes it clear to everyone how the product fits into the lives of the audience.
A rigorous process for understanding the problem through research, customer interaction, and debate.
A culture of feedback where designers can continue to grow and the work gets pushed to its potential.
Strong relationships with other teams. Design is a continuum, not just a step in the process. You have to work with everyone in the process to produce great products.

You're speaki[...]
In-Memory Malware, Machine Ethics, Open Source Maintainer's Dashboard, and Cards Against Silicon Valley
Continue reading Four short links: 9 February 2017.
Becoming a Troll, Magic Paper, HTTPS Interception, and Deep NLP
Continue reading Four short links: 8 February 2017.
Learn how to allow for parallelization using the reduce algorithm, new in C++17.
Continue reading What is the new reduce algorithm in C++17?
Game Theory, Algorithms and Robotics, High School Not Enough, and RethinkDB Rises
Continue reading Four short links: 7 February 2017.
Understanding the FTC’s role in policing analytics.
Continue reading Staying out of trouble with big data.
Learn how to extract data from a structure correctly and efficiently using Python's slice notation.
In this tutorial, we will review the Python slice notation, and you will learn how to effectively use it. Slicing is used to retrieve a subset of values.
The basic slicing technique is to define the starting point, the stopping point, and the step size, also known as the stride.
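A minimal illustration of start, stop, and step (stride) on a plain list:

```python
# Demonstrating start, stop, and step (stride) in slice notation.
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(values[2:7])     # elements from index 2 up to (not including) 7
print(values[::2])     # every second element
print(values[7:2:-1])  # walking backwards with a negative stride
print(values[:3])      # omitted start defaults to 0
```

Omitting any of the three fields falls back to sensible defaults: start of the sequence, end of the sequence, and a stride of 1.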
Continue reading How do I use the slice notation in Python?
NPC AI, Deep Learning Math Proofs, Amazon Antitrust, and Code is Law
Continue reading Four short links: 6 February 2017.
Learn how to set up your configuration file to indicate the types of packages you want to install by using the “yum” command.
Continue reading How do you customize packages in a Kickstart installation?
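For flavor, a %packages section in a Kickstart file looks roughly like the sketch below (group and package names are illustrative examples, not taken from the article):

```
%packages
# Package groups use the @ prefix
@core
@web-server
# Individual packages are listed by name
vim-enhanced
# A leading minus excludes a package
-sendmail
%end
```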
Stream Alerting, Probabilistic Cognition, Migrations at Scale, and Interactive Machine Learning
Continue reading Four short links: 3 February 2017.
Sara M. Watson from Digital Asia Hub discusses the state of personalization and how it can become more useful for consumers.
Continue reading Personalization's big question: Why am I seeing this?
Learn how to create and make changes to a Kickstart configuration file using the anaconda-ks.cfg.
Continue reading How do you create a Kickstart file?
Learn how to handle array comparisons using the set_intersection algorithm in C++.
Continue reading How do I use the set_intersection algorithm in C++?
The O’Reilly Hardware Podcast: Powering connected devices with low-power networks.
In this episode of the O’Reilly Hardware Podcast, Brian Jepson and I speak with Mike Vladimer, co-founder of the Orange IoT Studio at Orange Silicon Valley. Vladimer discusses how Internet of Things devices could benefit from connectivity options other than those provided by well-known technologies (including cellular, WiFi, and Bluetooth), and explains the LoRa wireless protocol, which supports long-range and lower-power applications.
Continue reading Mike Vladimer on IoT connectivity.
How to use the wordcount example as a starting point (and you thought you’d escape the wordcount example).
While Spark ML pipelines have a wide variety of algorithms, you may find yourself wanting additional functionality without having to leave the pipeline model. In Spark MLlib, this isn't much of a problem—you can manually implement your algorithm with RDD transformations and keep going from there. For Spark ML pipelines, the same approach can work, but we lose some of the nicely integrated properties of the pipeline, including the ability to automatically run meta-algorithms, such as cross-validation parameter search. In this article, you will learn how to extend the Spark ML pipeline model using the standard wordcount example as a starting point (one can never really escape the intro to big data wordcount example).
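For reference outside Spark, the classic wordcount logic the article builds on is just a few lines in plain Python; the pipeline version wraps this same counting idea in a custom transformer stage (this sketch is not the article's Spark code):

```python
from collections import Counter

def wordcount(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

freqs = wordcount(["to be or not to be", "to see or not to see"])
print(freqs["to"])  # 4
```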
Continue reading Extend Spark ML for your own model/transformer types.
The O’Reilly Design Podcast: Building bridges across disciplines, universal vs. inclusive design, and what playground design can teach us about inclusion.
In this week’s Design Podcast, I sit down with Kat Holmes, principal design director, inclusive design at Microsoft. We talk about what she looks for in designers, working on the right problems to solve, and why both inclusive and universal design are important but not the same.
Physical Authentication, Crappy Robots, Immigration Game, and NN Flashcards
Continue reading Four short links: 2 February 2017.
Learn to use Kickstart to get the same look on multiple Red Hat Enterprise Linux system installations.
Continue reading What is a Kickstart installation and why would you use it?
Learn how to write shorter, better performing, and easier to read code using standard algorithms with object methods in C++.
Continue reading How do you use standard algorithms with object methods in C++?
Unhappy Developers, Incident Report, Compliance as Code, AI Ethics
Continue reading Four short links: 1 February 2017.
The O’Reilly Security Podcast: Saving the Network Time Protocol, recruiting and building future open source maintainers, and how speed and security aren’t at odds with each other.
In this episode, O’Reilly’s Mac Slocum talks with Susan Sons, senior systems analyst for the Center for Applied Cybersecurity Research (CACR) at Indiana University. They discuss how she initially got involved with fixing the open source Network Time Protocol (NTP) project, recruiting and training new people to help maintain open source projects like NTP, and how security needn’t be an impediment to organizations moving quickly.
2017-02-01T12:00:00Z
The adventures in deep learning and cheap hardware continue!

Yes, you can run TensorFlow on a $39 Raspberry Pi, and yes, you can run TensorFlow on a GPU-powered EC2 node for about $1 per hour. And yes, those options probably make more practical sense than building your own computer. But if you’re like me, you’re dying to build your own fast deep learning machine.

OK, a thousand bucks is way too much to spend on a DIY project, but once you have your machine set up, you can build hundreds of deep learning applications, from augmented robot brains to art projects (or at least, that’s how I justify it to myself). At the very least, this setup will easily outperform a $2,800 MacBook Pro on every metric other than power consumption and, because it’s easily upgraded, stay ahead of it for a few years to come.

I hadn’t built a computer since the ’80s, and I was pretty intimidated by dropping hundreds of dollars on something I might not be able to build (and might not really use), but I’m here to tell you it can be done! Also, it’s really fun, and you will end up with a great general-purpose computer that will generally do inference and learning 20 times faster than your laptop.

Here’s what you need to buy and some specific recommendations:

Motherboard
Motherboards come in different sizes. Since I didn’t want to use multiple GPUs, the cheapest and smallest standard size is called mini-ITX, which will be fine for this sort of project. My minimum requirements were a PCIe slot to plug the GPU into and two DDR4 slots to plug RAM into, and the board I went with was an ASUS Mini ITX DDR4 LGA 1151 B150I PRO GAMING/WIFI/AURA motherboard for $125 on Amazon. It comes with a WiFi antenna, which is actually super useful in my basement.

Case
Cases don’t matter much, but they’re pretty cheap, and since this market for DIY computers is dominated by gamers, they come in all kinds of fun shapes and colors. The size should match the motherboard, so it needs to have mini-ITX in the name. I bought a Thermaltake Core V1 Mini ITX Cube on Amazon for $50.

RAM
I can’t believe how cheap RAM has gotten! You need to buy DDR4 RAM to match the motherboard (that’s most of what you will find online), and the prices are all about the same. I bought two 8GB Corsair Vengeance modules for $129. I spent the extra $5 because of the Amazon review that stated, “For those who just cannot get enough LEDs crammed into their system, these are the perfect choice.” If you build a computer in your basement and you don’t embrace your inner Burning Man/teenager aesthetic, you are going to have a really hard time finding components.

CPU
I looked at speed comparison CPU tests online, and I think I would have been fine with a slower CPU, as very few things I do are CPU-limited (except training neural nets, and I’m going to use the GPU for that). But I couldn’t bring myself to build a whole computer with a CPU three gen[...]
Data governance is straightforward; data strategy is not.
Continue reading What’s a CDO to do?
Historic Language, Activist Security, Microcode Assembler, and PDP-10 ITS Source
Continue reading Four short links: 31 January 2017.
Elixir’s key organizational concept, the process, is an independent component built from functions that sends and receives messages.
Elixir is a functional language, but Elixir programs are rarely structured around simple functions. Instead, Elixir’s key organizational concept is the process, an independent component (built from functions) that sends and receives messages. Programs are deployed as sets of processes that communicate with each other. This approach makes it much easier to distribute work across multiple processors or computers, and also makes it possible to do things like upgrade programs in place without shutting down the whole system.
Taking advantage of those features, though, means learning how to create (and end) processes, how to send messages among them, and how to apply the power of pattern matching to incoming messages.
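The create/send/receive shape described above can be loosely sketched in Python with a worker thread and queues; this is an analogy to Elixir's process model, not Elixir itself:

```python
import threading
import queue

def worker(inbox, outbox):
    # Receive messages until told to stop, replying to each one.
    while True:
        msg = inbox.get()
        if msg == "stop":
            break
        outbox.put(f"got: {msg}")

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(inbox, outbox))
t.start()

inbox.put("hello")          # send a message to the "process"
reply = outbox.get()        # receive its reply
print(reply)                # got: hello

inbox.put("stop")           # ask the worker to end
t.join()
```

Elixir processes are far lighter-weight than OS threads and are isolated from each other, but the pattern of independent components communicating only through messages is the same.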
Continue reading Playing with processes in Elixir.
Toward a virtuous cycle between people, devices, and cloud.
You have a lot of options available when you’re building a smart, connected device. For example, in recent years, your hardware options have multiplied massively. Even the humble Raspberry Pi, originally designed as an educational tool for youth, is getting into the game with NEC’s announcement of Raspberry Pi Compute Module support in their commercial/industrial display panels.
And there have long been plenty of choices for those who want to roll their own devices from scratch. Every embedded hardware platform has some kind of evaluation board available that works as a starting point for your own designs. For example, you can prototype with an inexpensive reference module like MediaTek’s LinkIt ONE, and then design your own module that has only the parts you need.
Continue reading Prototyping and deploying IoT in the enterprise.
Liquid Lenses, SRE Book, MEGA Source, and Founder Game
Continue reading Four short links: 30 January 2017.
2017-01-30T12:00:00Z
Introducing the solar correlation map, and how to easily create your own.

An ancient curse haunts data analysis. The more variables we use to improve our model, the exponentially more data we need. By focusing on the variables that matter, however, we can avoid underfitting, and the need to collect a huge pile of data points.

One way of narrowing input variables is to identify their influence on the output variable. Here correlation helps—if the correlation is strong, then a significant change in the input variable results in an equally strong change in the output variable. Rather than using all available variables, we want to pick input variables strongly correlated to the output variable for our model.

There's a catch though—and it arises when the input variables have a strong correlation among themselves. As an example, suppose we want to predict parental education, and we find a strong correlation with country club membership, the number of household cars, and the cost of vacations in our data set. All of these luxuries grow from the same root: the family is rich. The true underlying correlation is that highly educated parents usually have a higher income. We can either use the household income to predict parental education, or use the array of variables above.

We call this type of correlation “intercorrelation.” Intercorrelation is the correlation between explanatory variables. Adding many variables, where one suffices, conjures up the curse of dimensionality, and requires large amounts of data. It is sometimes beneficial, therefore, to elect just one representative for a group of intercorrelated input variables. In this article, we’ll explore both correlation and intercorrelation with a “solar correlation map”—a new type of visualization created for this purpose, and we’ll show you how to simply create a solar correlation for yourself.

Using the solar correlation map on housing price data

We can use covariance and coefficient matrices to apply the solar correlation map to housing price data. As efficient as these tools are, however, they are hard to read. Thankfully, there are visualizations that can beautifully and succinctly represent the matrices to explore the correlations. The solar correlation map is designed for a dual purpose—it addresses:

the visual representation of the correlation of each input variable to the output variable
the intercorrelation of the input variables

Let's generate the solar correlation map for a standard data set and explore it. Carnegie Mellon University has collected data on Boston housing prices in the 1990s; it is one of the freely accessible data sets from the UCI (University of California Irvine) Machine Learning repository. Our goal in this data set is to predict the output vari[...]
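The variable-selection idea above rests on plain Pearson correlation. A minimal sketch (pure Python, toy data invented for illustration) of ranking candidate inputs by their correlation with the output:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy data: output y and two candidate input variables.
y  = [1, 2, 3, 4, 5]
inputs = {
    "x1": [2, 4, 6, 8, 10],  # perfectly correlated with y
    "x2": [5, 3, 4, 1, 2],   # weakly (negatively) correlated
}

# Rank inputs by absolute correlation with the output.
ranked = sorted(inputs, key=lambda k: abs(pearson(inputs[k], y)), reverse=True)
print(ranked[0])  # x1
```

The solar correlation map goes a step further by also visualizing the correlations among the inputs themselves, which a simple ranking like this cannot show.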
Ethics of AI, Vertically Integrated Internet, Assessing Empirical Observations, Battery Teardown
Continue reading Four short links: 27 January 2017.
HTTP/2 is still new and, although deploying it is relatively easy, there are a few things to be on the lookout for when enabling it.
HTTP/1.x (h1) was standardized in 1999; we've had years of experience deploying it, we understand how browsers and servers behave with it, and we've learned how to optimize for it, too. In contrast, it has been just 18 months since HTTP/2 (h2) was standardized, and there’s already widespread support for it in browsers, servers, and CDNs.
So what makes h2 different from h1, and what should you watch out for when enabling h2 support for a site? Here are five things to look out for along the way.
Continue reading Pitfalls of HTTP/2.
2017-01-27T11:00:00Z
Build regulatory compliance into development and operations, and write compliance checks and auditing into continuous delivery, so compliance becomes an integral part of how your DevOps team works.

DevOps can be followed to achieve what Justin Arbuckle at Chef calls “Compliance as Code”: building compliance into development and operations, and wiring compliance policies and checks and auditing into Continuous Delivery so that regulatory compliance becomes an integral part of how DevOps teams work on a day-to-day basis.

Chef Compliance
Chef Compliance is a tool from Chef that scans infrastructure and reports on compliance issues, security risks, and outdated software. It provides a centrally managed way to continuously and automatically check and enforce security and compliance policies. Compliance profiles are defined in code to validate that systems are configured correctly, using InSpec, an open source testing framework for specifying compliance, security, and policy requirements. You can use InSpec to write high-level, documented tests/assertions to check things such as password complexity rules, database configuration, whether packages are installed, and so on. Chef Compliance comes with a set of predefined profiles for Linux and Windows environments, as well as common packages like Apache, MySQL, and Postgres. When variances are detected, they are reported to a central dashboard and can be automatically remediated using Chef.

A way to achieve Compliance as Code is described in the “DevOps Audit Defense Toolkit”, a free, community-built process framework written by James DeLuccia, IV, Jeff Gallimore, Gene Kim, and Byron Miller.1 The Toolkit builds on real-life examples of how DevOps is being followed successfully in regulated environments, on the Security as Code practices that we’ve just looked at, and on disciplined Continuous Delivery. It’s written in case-study format, describing compliance at a fictional organization, laying out common operational risks and control strategies, and showing how to automate the required controls.

Defining Policies Upfront
Compliance as Code brings management, compliance, internal audit, the PMO, and infosec to the table, together with development and operations. Compliance policies and rules and control workflows need to be defined upfront by all of these stakeholders working together. Management needs to understand how operational risks and other risks will be controlled and managed through the pipeline. Any changes to these policies or rules or workflows need to be formally approved and documented; for example, in a Change Advisory Board (CAB) meeting. But instead of relying on checklists and procedures and meetings, the policies [...]
2017-01-26T17:05:00Z
How Project Jupyter got here and where we are headed.

In this post, we’ll look at Project Jupyter and answer three questions:

Why does the project exist? That is, what are our motivations, goals, and vision?
How did we get here?
Where are things headed next, in terms of both Jupyter itself and the context of data and computation it exists in?

Project Jupyter aims to create an ecosystem of open source tools for interactive computation and data analysis, where the direct participation of humans in the computational loop—executing code to understand a problem and iteratively refine their approach—is the primary consideration. Anchoring Jupyter around humans is key to the project; it helps us both narrow our scope in some directions (e.g., we are not building generic frameworks for graphical user interfaces) and generalize in others (e.g., our tools are language agnostic despite our team’s strong Python heritage).

In service of this goal, we:

Explore ideas and develop open standards that try to capture the essence of what humans do when using the computer as a companion to reasoning about data, models, or algorithms. This is what the Jupyter messaging protocol or the Notebook format provide for their respective problems, for example.

Build libraries that support the development of an ecosystem, where tools interoperate cleanly without everyone having to reinvent the most basic building blocks. Examples of this include tools for creating new Jupyter kernels (the components that execute the user’s code) or converting Jupyter notebooks to a variety of formats.

Develop end-user applications that apply these ideas to common workflows that recur in research, education, and industry. This includes tools ranging from the now-venerable IPython command-line shell (which continues to evolve and improve) and our widely used Jupyter Notebook to new tools like JupyterHub for organizations and our next-generation JupyterLab modular and extensible interface. We strive to build highly usable, very high-quality applications, but we focus on specific usage patterns: for example, the architecture of JupyterLab is optimized for a web-first approach, while other projects in our ecosystem target desktop usage, like the open source nteract client or the support for Jupyter Notebooks in the commercial PyCharm IDE.

Host a few services that facilitate the adoption and usage of Jupyter tools. Examples include NBViewer, our online notebook sharing system, or the free demonstration service try.jupyter.org. These services are themselves fully open source, enabling others to either deploy them in custom environments or build new te[...]
The O’Reilly Radar Podcast: AI on the hype curve, imagining nurturing technology, and gaps in the AI conversation.
This week, I sit down with anthropologist, futurist, Intel Fellow, and director of interaction and experience research at Intel, Genevieve Bell. We talk about what she’s learning from current AI research, why the resurgence of AI is different this time, and five things that are missing from the AI conversation.
The O’Reilly Data Show Podcast: Adam Gibson on the importance of ROI, integration, and the JVM.
As data scientists add deep learning to their arsenals, they need tools that integrate with existing platforms and frameworks. This is particularly important for those who work in large enterprises. In this episode of the Data Show, I spoke with Adam Gibson, co-founder and CTO of Skymind, and co-creator of Deeplearning4J (DL4J). Gibson has spent the last few years developing the DL4J library and community, while simultaneously building deep learning solutions and products for large enterprises.
Continue reading The key to building deep learning solutions for large enterprises.
The O’Reilly Bots Podcast: The 2017 bot outlook with one of the field’s early adopters.
In this episode of the O’Reilly Bots Podcast, Pete Skomoroch and I speak with Chris Messina, bot evangelist, creator of the hashtag, and, until recently, developer experience lead at Uber. We talk about the origins of MessinaBot, ruminate on the need for bots that truly exploit their medium rather than imitating older apps, and take a look at what’s ahead for bots in 2017.
Continue reading Chris Messina on conversational commerce.
2017-01-26T12:05:00ZData, algorithms, and better business results are key to developing AI.As I read this post from the World Economic Forum, This is why China has the edge in Artificial Intelligence, what struck me wasn't whether China has an edge in AI, or even if I care. What struck me is the proposed five building blocks required for AI development: Massive data Automatic data tagging systems Top scientists Defined industry requirements Highly efficient computing power It made me wonder, are these factors essential to building a solid foundation for AI? Does high performance in these areas give an edge to AI projects? And, overall, my answer was: somewhat, but misleading. Let me explain, by block: Massive data. IMHO, this is the red herring of AI. Too many believe "s/he who has the most data wins." Data is absolutely valuable, but volume alone does not bring value. Within volume, you can have data that is generic or redundant. Therefore, massive amounts of data only help you if it can be used for differentiation. Specifically, you’re able to drive better results from that data. And, three other V's define big data: variety, velocity, and veracity. Variety and velocity do not require "massive-ness." As for veracity, you know the value of massive amounts of garbage data. Finally, I'd add that massive data can quickly lead to tyranny of popularity (i.e., those instances with the most data win). We all have examples of when one nugget of information was the key; sometimes the small data should win. Bottom line: big data is a building block—check; massive data—misleading. Automatic data tagging systems. The automated tagging systems are AI, so we get caught in an infinite loop if we take this as a building block. Bottom line: automatic data tagging systems are sub-assemblies, not building blocks. Top scientists. First, none of this is possible without research. None. HT to Bengio(s), LeCun, Ng, Hinton, et al. 
And the WEF article calls out a combination of scientists and engineers, but with more of a waterfall approach than one based on requirements. The question must be what you are trying to build, and how important it is for you to create the algorithms yourself versus use algorithms conceived or created by others. You need to decide this for your business: where is science important, and where is implementation important? The two are different blocks, and both are critical. And you might have different answers for different parts of your problem. Bottom line: top scientists and/or experienced engineers create the building bl[...]
2017-01-26T12:00:00Z
5 questions for Noah Iliinsky: Solving real problems, measuring success, and adopting holistic thinking.

I recently asked Noah Iliinsky, senior UX architect at Amazon Web Services, co-editor of Beautiful Visualization, and co-author of Designing Data Visualizations, to discuss the principles for successful design, common missteps designers make, and why holistic thinking is an important skill for all designers. At the O'Reilly Design Conference, Noah will be presenting a session, Guaranteed successful design.

You're presenting a talk at the O'Reilly Design Conference called Guaranteed successful design. Tell me more about what folks can expect.

It's a survey of design techniques, approaches, and tenets that are either not well-known enough (Wardley mapping, design for human inaction) or are understood but not sufficiently practiced (draw the map or diagram). This talk originated as a lightning talk, where each topic was mostly just a headline and a single line of description. I'll be walking through them in the same order as before, but giving more depth and background for each technique.

You are covering 17 principles that can improve the odds of success. How do you measure success?

Great question. Success can be measured by subjective user experience (frustrating, easy, delightful, confusing, etc.) as well as by metrics around task completion rate, number of errors, etc. There's also the greater question of solving the right problem in the first place. Even if your design is perfect, it can't be a success if it isn't solving a real problem. Each of these topics is designed to guide the right sort of inquiry, to increase the likelihood of solving the right problem in a satisfying manner.

Conversely, what are some of the major missteps designers make when approaching their work?

There are two major classes of design-process error I see frequently.
The first is people providing solutions for problems that don't actually exist, or that exist only for a small subset of people (who are often similar to the solution-makers themselves). The second class of error is problem solvers falling in love with a particular implementation of a solution, rather than understanding that each implementation is one of many that can satisfy a particular requirement, and that each has different strengths and weaknesses. Not coincidentally, both of these topics are heavily addressed in my talk.

Why do you think it's so difficult for designers to think more holistically about the [...]
2017-01-26T12:00:00Z
Validating your data requires asking the right questions and using the right data.

Almost 10 years ago, I started as an intern on a data engineering team, working my way up to senior developer over dozens of projects and processes: simple queries, data warehousing, parsing raw logs, translations, aggregations, and creating products for final reports and analysis. After working with data for many years to create reliable reports and analysis, I've seen many people join our team who are processing and analyzing data for the first time (sometimes even after years of working only with pristine, reportable data), and who struggle at first to reliably maintain and use data that is rawer and more random. Often, they don't know how to ensure that only the right data is used, and the question "How would you test that?" has been difficult for them to even begin to answer. With experience in real data sets, though, the answer becomes more obvious.

To help answer this question, it's helpful to focus on boundaries and hard expectations within the data, specifically on the format and validity of the values being observed. Ask yourself these questions:

- Do I have all the data I started with?
- Are there nulls in the data that should have values?
- Are there duplicates in the data?

Other key things to look at when evaluating a data set's validity include trends and how different components of the data relate. For example, if you're testing a set of data that represents a shopping experience, with users, products, purchases, and carts, some key questions to answer may include:

- Do all purchases relate to valid products?
- Does every cart and purchase have a valid user?
- Is the total number of carts less than the total number of users? (Assuming each user should have at most one cart.)

Recently, I worked on a large project that encompassed almost all of the data processes and testing practices that we've developed over the past decade or so.
Therefore, I decided to use this project as a case study and create a guide for how I test data and think about the process. The case study also addresses the larger question, "Can I trust the data I'm using?", which goes beyond verifying the accuracy of data transformations and processes to ensuring the right data sets are used for analysis. The project started with processing and parsing raw data from log files and finished with optimized data tables for reporting in business intelligence tools. These[...]
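The validity questions above can be sketched as a small set of automated checks. This is a minimal, hypothetical sketch in plain Python; the record layout (dicts with `user_id`, `product_id`, `purchase_id` fields) is illustrative and not taken from the project described here.

```python
# Minimal data-validity checks for a hypothetical shopping data set
# (users, products, carts, purchases). Field names are assumptions.

def validate(users, products, carts, purchases, expected_rows):
    """Return a list of human-readable problems found in the data."""
    problems = []

    # Do I have all the data I started with?
    if len(purchases) != expected_rows:
        problems.append(f"row count {len(purchases)} != expected {expected_rows}")

    # Are there nulls in the data that should have values?
    if any(p["user_id"] is None or p["product_id"] is None for p in purchases):
        problems.append("purchase with null user_id or product_id")

    # Are there duplicates in the data?
    ids = [p["purchase_id"] for p in purchases]
    if len(ids) != len(set(ids)):
        problems.append("duplicate purchase_id values")

    # Do all purchases relate to valid products, and to valid users?
    user_ids = {u["user_id"] for u in users}
    product_ids = {pr["product_id"] for pr in products}
    for p in purchases:
        if p["product_id"] not in product_ids:
            problems.append(f"purchase {p['purchase_id']} references unknown product")
        if p["user_id"] not in user_ids:
            problems.append(f"purchase {p['purchase_id']} references unknown user")

    # Is the total number of carts at most the total number of users?
    if len(carts) > len(users):
        problems.append("more carts than users (each user should have at most one)")

    return problems
```

Running such checks after every transformation step, rather than only at the end, makes it much easier to pinpoint where bad data entered the pipeline.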
Soda Locker, Building Fabricator, Familied Traveler Advice, and Technically Competent Bosses
Continue reading Four short links: 26 January 2017.
2017-01-25T13:00:00Z
A peek into the clickstream analysis and production pipeline for processing tens of millions of daily clicks, for thousands of articles.

In the distributed age, news organizations are likely to see their stories shared more widely, potentially reaching thousands of readers in a short amount of time. At the Washington Post, we asked ourselves whether it was possible to predict which stories will become popular. For the Post newsroom, this would be an invaluable tool, allowing editors to more efficiently allocate resources to support a better reading experience and a richer story package, adding photos, videos, links to related content, and more, in order to more deeply engage the new and occasional readers clicking through to a popular story. Here's a behind-the-scenes look at how we approached article popularity prediction.

Data science application: Article popularity prediction

There has not been much formal work on article popularity prediction in the news domain, which made this an open challenge. For our first approach to the task, Washington Post data scientists identified the most-viewed articles on five randomly selected dates and monitored the number of clicks each received within 30 minutes of being published. These early clicks were used to predict how popular the articles would be at 24 hours. Using the clicks 30 minutes after publishing yielded poor results. Figures 1 through 5 show the click trajectories of five very popular articles (Credit: Shuguang Wang and Eui-Hong (Sam) Han, used with permission). Table 1 lists the actual number of clicks these five articles received 30 minutes and 24 hours after being published.
The takeaway: looking at how many clicks a story gets in the first 30 minutes is not an accurate way to measure its potential for popularity.

Table 1. Five popular articles.

Article          # clicks @ 30 mins    # clicks @ 24 hours
9/11 Flag        6,245                 67,028
Trump Policy     2,015                 128,217
North Carolina   1,952                 [...]
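The two complete rows of Table 1 are enough to illustrate the takeaway: the article leading at 30 minutes is not the one leading at 24 hours, and the 30-minute-to-24-hour growth factor varies widely between articles, so no simple ranking or multiplier recovers 24-hour popularity from early clicks. A quick sketch using only the numbers from the table:

```python
# Using the two fully reported rows of Table 1 to show why early
# clicks mislead: early rank and final rank disagree, and the
# growth factor from 30 minutes to 24 hours is far from constant.

articles = {
    "9/11 Flag":    {"clicks_30m": 6245, "clicks_24h": 67028},
    "Trump Policy": {"clicks_30m": 2015, "clicks_24h": 128217},
}

rank_30m = sorted(articles, key=lambda a: articles[a]["clicks_30m"], reverse=True)
rank_24h = sorted(articles, key=lambda a: articles[a]["clicks_24h"], reverse=True)
print(rank_30m)  # ['9/11 Flag', 'Trump Policy']
print(rank_24h)  # ['Trump Policy', '9/11 Flag']

# Growth factors differ by ~6x between the two articles, so no
# single scale-up from 30-minute clicks predicts the 24-hour total.
for name, c in articles.items():
    print(name, round(c["clicks_24h"] / c["clicks_30m"], 1))
```

This is exactly why the Post's first approach, predicting 24-hour popularity directly from 30-minute clicks, yielded poor results.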