I'm really excited to announce that I'm going to be joining Facebook to work on the infrastructure team. With all the growth Facebook has seen come unique challenges in scaling systems. I'm looking forward to working on this.
Of course, this also means saying goodbye to Google and reCAPTCHA. It's been over 3 years since we started working on the crazy idea of getting millions of people to digitize the world's books in their spare time. I'm looking forward to seeing reCAPTCHA continue to grow.
From time to time, reCAPTCHA will give users odd juxtapositions of words. I got quite a kick out of seeing this one: (image)
I think this would make a great show!
This cookie is a nice variation on the classic double chocolate chip.
Tip: I've found that I'll eat these cookies far quicker than I really should. To ration them, I mix up a double batch and freeze the balls of dough. Once the dough is frozen, you can place the balls in a ziplock bag, where they last a few weeks. The cookies can then be baked from frozen in small batches by adding 2 minutes to the baking time. I can't tell any difference between cookies made from frozen dough and fresh dough.
I'm trying to get my laptop to display on dual external monitors. I'd appreciate some guidance in getting this configuration set up.
What I have
What I want
I took a look at CloudFront today. They have really good intentions. The CDN space is quite a mess -- it could easily be a pay-as-you-go, self-service industry. However, players such as Akamai try to make a large profit. The CDN space is especially hard for small sites -- you can't get any reasonable pricing unless you are doing high levels of traffic.
Amazon wants to change all of that. However, I think they made a number of missteps in their initial offering.
I do hope that Amazon fixes up CloudFront. It's a fantastic concept. They have the power to force reason into the market.
It's oh so exciting to see that GWAP (games with a purpose) has launched. GWAP is part of the research on human computation that started the reCAPTCHA project. GWAP is a framework which allows researchers to create fun games which generate useful data. For example, "Matchin" is a game where you and a random stranger on the internet get a pair of images. You must agree on which one is "better" without talking with each other. The game is fast, fun, and very addictive. From this game, you can actually get quite a bit of useful information. Most importantly, it's possible to find the "good" photos on a site like Flickr.
What I find most exciting about GWAP is that it is a production service. Many researchers will write papers about ideas, maybe create a prototype or a mock-up. However, they don't really do anything to bring their ideas to fruition. Luis's work is different. For example, with reCAPTCHA, we've spent months developing systems for serving CAPTCHAs to the internet. Most of this time was spent making our code reliable, scalable, and fast. Our efforts really paid off. reCAPTCHA now serves CAPTCHAs on a wide range of sites including Ticketmaster, Facebook, StumbleUpon, Twitter and Bebo.
I think this focus on productionizing is going to pay off with GWAP. Lots of time was spent on things like UI design and scalability. The UI makes the games fun to play, and the scalability makes sure that the team can have a real-world impact.
Congrats to Mike and the rest of the GWAP team on a job well done.
I'm looking for a one-bedroom in the area below between May 1 and Aug 20 (so hopefully, something furnished). Closer to the green arrow thingy is better. Please email me at email@example.com if you have something like this.
The open source community is slowly growing a software stack that emulates a number of internal Google technologies. What's interesting is that the stack is being developed by a number of tech companies -- ranging from giants (mostly Yahoo!) to medium-sized firms (Facebook) to startup companies (Last.fm, PowerSet, Krugle, Veoh). So far, the following pieces of infrastructure have been developed:
Hadoop is an umbrella project that covers a number of technologies: the Hadoop Filesystem (replacing Google's GFS), Map-Reduce, and HBase, a database inspired by BigTable. In short, Hadoop covers storage technology as well as the infrastructure needed to compute over large amounts of storage.
Thrift is an RPC system initially developed by Facebook. While, in some senses, there is no lack of RPC systems out there, Thrift is different: it focuses on inter-language RPC, making interop between a large number of languages transparent. Thrift also avoids the bloat that comes with some of the older RPC systems, such as SOAP. Finally, Thrift doubles as a serialization system (sort of like Python's Pickle or Java serialization). Because the binary data it generates is compact, it can be used for logging or in a database. Thrift is partially modeled on Google's Protocol Buffers.
ZooKeeper is a service for coordination between distributed servers, simplifying functions such as master election, configuration lookup, and distributed locks. ZooKeeper was recently open sourced by Yahoo.
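The serialization half of Thrift is the part that tends to surprise people. The general idea -- a fixed binary layout instead of a text encoding -- can be sketched with Python's standard struct module (this is just an illustration of the idea; it is not Thrift's actual wire format):

```python
import struct

# A hypothetical log record: user id (int32), timestamp (int32), score (double).
# Packing it into a fixed binary layout takes 4 + 4 + 8 = 16 bytes -- far more
# compact than the equivalent text encoding, which is the same property that
# makes compact binary serialization usable for logging or database storage.
record = (42, 1202000000, 0.97)
packed = struct.pack("<iid", *record)

print(len(packed))                    # 16
print(struct.unpack("<iid", packed))  # (42, 1202000000, 0.97)
```

A real Thrift definition also carries field ids and types on the wire so that old and new readers can interoperate; the rigid struct layout above trades that flexibility for simplicity.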
These projects seem to be the start of a positive trend: a number of companies are realizing that in order to rapidly develop new services, they need to have infrastructure. In contrast to Google's typical strategy, these companies are sharing their work with each other. I think that this collaboration is critical to developing a robust infrastructure.
In some ways, I think these developments put Google in a tough spot. It's not unforeseeable that this open source stack will grow to the point where it is superior to Google's own. It would be unrealistic for Google to migrate to these new technologies, but if there's going to be an open version of Google's technology, it would be in Google's best interest for that implementation to be their own. Not only would Google benefit from external improvements to their stack, it might also make it easier for them to acquire new startups and integrate their technology into Google.
I'm looking at Silicon Mechanics for potentially buying a few rack mount servers. I'm wondering if anybody has good (or bad) experience dealing with them.
On paper, it looks like they have prices that are much cheaper than Dell (of course). Also, they have a nice thing where they can fit two servers in a 1U (16 cores in a 1U... that's a lot). I just want to see if anybody here has some real experience with them.
While being out of disk space on a server isn't generally much fun, this error message made it worth the while: (image)
Randall Munroe came to CMU to give a talk yesterday. It was the most attended talk of the semester. Randall's talk was fantastic -- he said that he is trying to fill part of his bedroom with plastic balls (like in one of his cartoons). He also drew two mini-cartoons while talking, which was very cool. Finally, he set two very useful standards. First, he declared that punctuation need not go inside quotation marks (e.g., did she say "I love you"?). Second, he declared himself president of the Internet (so that another evil person wouldn't take it).
I guess what I really like about Randall's talk (and about XKCD) is that it's a way to relax and reflect on being a tech person. It feels like The Daily Show for geeks.
2007-11-08T14:15:44.317-05:00
We're really proud to announce the release of two new APIs for the client side.
Internationalization
For users of our default theme, we've added a way to request that the user interface be displayed in a different language. We've translated our UI and help page into the following languages:
We're looking into getting translations for other languages. If your language isn't supported, you can use the theming API to customize the user interface.
You can take a look at an example of internationalization at:
Theming
The ability to customize the appearance of reCAPTCHA has been a frequent request. Our Theming API allows you to completely customize the way reCAPTCHA looks. For example, you can see a demo at:
In this demo, we show how you can completely remove the UI of reCAPTCHA and make a non-themed interface. You can use this as a starting point for blending reCAPTCHA into your site.
More information
To get more information about these new APIs, look at our client API guide:
Dear Ubuntu hackers,
Gutsy is out. I want it. But the update process for your wonderful operating system is broken. It seems that this process relies on getting files from a central machine. Said machines seem to be overwhelmed by other users who also want to upgrade -- they won't even respond to a TCP SYN packet.
In the future, it'd be really nice if this process used mirrors so that it wouldn't be as flooded.
If the file really does need to be centralized, choose the host carefully. S3 is very good at this stuff. Akamai does even better (though they'll charge you an arm and a leg).
PS: now that I got past accessing that server, Gutsy is downloading at 4 megabytes per second, thanks to Internet2. It would have been a lot faster if it had used the mirrors for that one small part.
2007-10-10T00:47:10.790-04:00
The hacker inside me found this funny
2007-09-21T20:00:06.349-04:00
Randy Pausch is a professor at Carnegie Mellon who believes in making programming, specifically for entertainment purposes, accessible to everybody. Sadly, Professor Pausch has an untreatable form of cancer which will take his life in the next few months. This week, he gave a talk which is truly inspirational. This is a person I wish I could have gotten to know.
2007-09-21T11:36:50.165-04:00
I really wish somebody had told me about this a long time ago:
Federico posted about some work he was doing on making Firefox not cache as many uncompressed bitmaps in memory. I was playing around with the cache stuff and noticed something: my Firefox cache is full of youtube videos. YouTube videos aren't exactly the best thing for Firefox to cache. My internet connection is fast enough that streaming the videos works just fine. I suspect that most people who use online video frequently do so on a connection that can support streaming (otherwise, YouTube would be painfully slow, and they'd go do something else).
Even worse is listening to a flash-based media player: the MP3s it downloads are cached just like anything else. So if you listen to 50 MB worth of music, your disk cache gets blown away.
LRU probably isn't the best technique to use here. I'm not sure how one would evaluate the various choices (what is a representative test set of browsing sessions?).
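For concreteness, the LRU policy in question can be sketched in a few lines of Python. This is a toy byte-budget cache (the names and sizes are made up), not Firefox's actual cache code, but it shows how one large one-shot media file evicts many small, frequently revisited pages:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache with a byte budget: the least recently used entries
    are evicted first, regardless of how likely they are to be revisited."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # url -> size, oldest first

    def add(self, url, size):
        if url in self.entries:
            self.used -= self.entries.pop(url)
        self.entries[url] = size
        self.used += size
        while self.used > self.capacity:
            _, evicted = self.entries.popitem(last=False)  # evict oldest
            self.used -= evicted

cache = LRUCache(50)
for page in ["a.html", "b.html", "c.html"]:
    cache.add(page, 10)
cache.add("video.flv", 40)   # one streamed video...
print(list(cache.entries))   # ['c.html', 'video.flv'] -- two pages evicted
```

A policy that weighted eviction by entry size or by estimated re-use probability would keep the small pages instead, which is roughly the complaint about media streams above.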
Today, we had our first drill with dual homing on reCAPTCHA. In Pittsburgh, the water main that serves the Carnegie Mellon area broke today, causing a complete water outage on campus. This has resulted in many servers being shut down. reCAPTCHA's servers were kept up, as they are production servers; however, we were told that it was possible they'd be shut down.
It's times like these when you just love having a backup. We have a DNS service that does automatic health checking and routes away from unplanned outages. However, with DNS it takes a few minutes for these sorts of changes to take effect. We proactively switched our traffic off of the Pittsburgh servers.
One of the funny things about using DNS for dual homing is how long it takes to really kick in. We're still getting requests to our Pittsburgh servers even hours after we made the switch. This is one reason it's important for DNS not to be the only load balancing solution (you need an L7 or L4 load balancer as well).
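A rough model of why the stragglers persist: if clients cached the DNS record at uniformly random times during the last TTL window, the fraction still holding the stale record decays linearly after the switch. (The TTL value below is an assumption for illustration; resolvers that ignore TTLs entirely account for the hours-long tail.)

```python
# Fraction of well-behaved clients still holding the old record,
# t seconds after the DNS change, assuming cache timestamps are
# spread uniformly over the previous TTL window.
def stale_fraction(t, ttl):
    return max(0.0, 1.0 - t / ttl)

ttl = 300  # assumed 5-minute TTL on the A record
print(stale_fraction(0, ttl))    # 1.0 -- everyone still stale
print(stale_fraction(150, ttl))  # 0.5
print(stale_fraction(300, ttl))  # 0.0 -- all honest caches have expired
```

The requests we saw hours later come from the caches this model excludes: resolvers and clients that clamp or ignore TTLs, which is exactly why a load balancer in front of the servers is still needed.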
So if your profile says you are single and looking for women, single women looking for men might soon get a higher ranking in search results? I'm not sure what other "intentions" Facebook might know about.
Of course, this could open up a whole new era of social networking: I'd call it AdFaces. If you feel that you are not showing up often enough in search results, you can bid for clicks on your profile with a CPC model. Or maybe Facebook can experiment with a cost-per-action model. Then we'll need a product like Facebook Analytics to improve profile conversion rates, and FaceSense to allow publishers to embed targeted profile advertisements on their websites.
Two interesting bugs from today.
First, you gotta be careful with order of operations. I wrote this code:
int someValue = ...; storePref(MY_PREF_NAME, "" + someValue + 1);
The code looks innocent enough. However, order of operations kicks in here. The compiler translates this as (("" + someValue) + 1), which is Integer.toString(someValue) + Integer.toString(1). So rather than adding one, we multiply by 10 and then add one :-). The fix is to parenthesize the arithmetic: "" + (someValue + 1). The fun part about this experience was that I had Neal Gafter sitting next to me to explain exactly what I'd done, and also to point out where this problem is discussed in his fantastic book Java Puzzlers (Neal gave me a copy, which I've been meaning to read).
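The same effect is easy to demonstrate outside Java. A Python sketch of what the buggy line actually stores, using a made-up someValue of 5:

```python
some_value = 5

# What the Java expression "" + someValue + 1 builds: string concatenation,
# evaluated left to right, so "1" is appended rather than added.
stored = str(some_value) + "1"

print(stored)          # 51 (the string "51")
print(int(stored))     # 51 == 10 * some_value + 1
print(some_value + 1)  # 6 -- what was actually intended
```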
BitDefender went a bit overboard in their claims about CAPTCHAs. Their statement about CAPTCHAs was issued as a press release (which has clearly met their goal of getting press -- regardless of the accuracy of their statements). The article states that about 500 accounts are being created per hour. That is about the output of one person solving CAPTCHAs. If they had actually broken the CAPTCHAs of Hotmail and Yahoo, there would be tens of thousands of accounts every hour. The article also mentions that about 15,000 accounts have been created. At 2 cents per CAPTCHA, that's a $300 investment to manually solve the CAPTCHAs (this rate is easily obtainable in some countries). It's extremely unlikely that one could hire someone to break the CAPTCHAs of Yahoo and Hotmail for this price. Also, if you're working on a virus-type program, one of the easiest ways to generate CAPTCHA solutions would be to use your infected users (e.g., make them type in a CAPTCHA once per day; if you integrate it into the web browser, it might not raise suspicion).
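The back-of-the-envelope arithmetic above, spelled out (using only the figures quoted in the press release):

```python
# 15,000 accounts at the quoted human-solver rate of 2 cents per CAPTCHA.
accounts = 15000
cost_per_captcha = 0.02             # dollars
print(accounts * cost_per_captcha)  # 300.0 -- total cost in dollars

# 500 accounts per hour works out to about 7 seconds per CAPTCHA:
# one person's typing pace, not the volume an automated OCR break produces.
print(3600 / 500)                   # 7.2 seconds per account
```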
The information that BitDefender has published actually suggests that these spammers/virus makers have not beaten CAPTCHAs using OCR.
This blog is pretty funny. It's sort of like what The Daily Show might say about Google -- the facts are mostly true (some are pretty outdated), but they're twisted in the opposite direction of how things actually are.
The blog entry got me thinking about what I like and don't like about an internship at Google. One of my favorite things is the freedom to set my own hours. I personally have an aversion to waking up any time before 10am. Usually, I wake up, read some blogs, check personal email and reCAPTCHA support email (I can't check Google email remotely as an intern), then I walk to work around 11:30-12:30. Having the free meals every day (I rarely get to take advantage of breakfast, which ends at 9:30) is a huge plus. The blog article hinted at the end at how big a factor the free food actually is. It's a relatively inexpensive perk that makes a huge difference.
The comments about how the developers' work areas are laid out are also really interesting. The first time I saw the Google layout, I was a bit surprised: "I thought I was getting an office!" I ended up really liking it in the end. Before Google, when I was working on the Mono project, the primary way to communicate with other people was IRC. When I had to ask a question, it wasn't always possible to get a response right away. At Google, my coworkers sit very close by. I can work something out on a whiteboard with them, and I don't have to walk a long way to their office.
One thing the article didn't mention (probably because it's a problem that's worse at MSFT than Google) is that going into a big environment like Google can be intimidating. With open source, building things was always easy: ./configure; make; make install. The process takes about 10 minutes the first time, and 2-3 minutes every day after that, depending on how much has changed. At Google (and, I'm sure, pretty much any place similar), checking things out can be an adventure. A simple build process is probably an advantage of working on an open source project, or at a smaller company.
At the end of the day, the thing I really enjoy about Google is the access to the vast repository of interesting code Google has to offer. Being able to see how a Google product works, under the hood, is just an amazing experience. I remember going snorkeling on an 8th grade trip to the Bahamas. The excitement of being able to see ocean life for the first time is very similar to my experience of being able to look into the moving parts of Google. Surely this isn't something unique to Google. I'm sure there are as many fascinating moving parts inside Microsoft, or many other large companies.
On another note, the reCAPTCHA launch went fantastically well. I was happy and relieved that we didn't have any embarrassing incidents like crashing under the load of Digg (our servers handled it just fine!). We've had some exciting customers adopting our product. I hope to write more soon.
2007-06-11T04:49:58.618-04:00
The New York Times is running an article today on CAPTCHAs. The article really misses some key points. For example, it talks about the CAPTCHAs on YouTube. YouTube's CAPTCHA is really, really bad: it is mis-designed, using different colors to attempt to provide security. I can't imagine solving it as a color-blind user; it must be nearly impossible. Most CAPTCHA providers have migrated to using a monochrome CAPTCHA (for example, Google, Yahoo and MSN). The way to create a challenging CAPTCHA today is to make segmentation difficult. This can be achieved without causing as much pain for humans.
Then there's this Asirra thing. Did anybody from the Times actually try it? Here's an unscaled image of what it looks like:
Now, you can hover over an image for a larger version. But to solve one of these CAPTCHAs, you've got to hover over 12 images and make a decision on each. Asirra is undeniably cute, but it's not clear that it's all that much easier than the current, well-designed CAPTCHAs. The security of Asirra is also unclear. It'd be interesting to see what happens if Asirra is ever put in front of a high value target (something that can be used to send email, host PageRank-gaining links, or host porn/warez). I have a feeling that some spammer would find a way to abuse a botnet and take advantage of some of the design issues in Asirra.
Many of us developers have a bashrc that has lines like:
I've always known that this isn't perfect -- that one should check that $LD_LIBRARY_PATH isn't empty -- but had always thought it was just a minor point. It turns out that the loader treats an empty entry as meaning the current working directory, which means it looks there for libraries.
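The pitfall is easy to demonstrate. A Python sketch, using a made-up library path (the same applies to any colon-separated search-path variable):

```python
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/me/lib
# When $LD_LIBRARY_PATH starts out unset, the shell expands this to
# ":/home/me/lib" -- the leading colon is an empty entry, which the
# dynamic loader treats as the current working directory.
old_value = ""                           # variable was unset or empty
buggy = old_value + ":" + "/home/me/lib"
print(buggy.split(":"))                  # ['', '/home/me/lib'] -- '' means cwd!

# The safe join drops empty components before gluing with ':'.
safe = ":".join(p for p in [old_value, "/home/me/lib"] if p)
print(safe)                              # /home/me/lib
```

In bash itself, the usual fix is to expand the old value only when it's non-empty, e.g. export LD_LIBRARY_PATH="/home/me/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}".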
The reason I noticed this is that I was using sshfs to mount something on my workstation in Pittsburgh from my laptop in California. When I ran any command (for example, "ls"), the loader would look for tons of libraries, executing a stat for each one. A round trip between Pittsburgh and California is 90ms... so you can imagine everything was quite slow.
Of course, there are security implications too. I'm not that worried about a rogue directory on my laptop, but on shared systems (such as some of the university ones), I can imagine this being a risk.