Meet the Data Brains Behind the Rise of Facebook

Jay Parikh sits at a desk inside Building 16 at Facebook’s headquarters in Menlo Park, California, and his administrative assistant, Genie Samuel, sits next to him. Every so often, Parikh will hear her giggle, and that means she just tagged him in some sort of semi-embarrassing photo uploaded to, yes, Facebook. Typically, her giggle is immediately followed by a notification on his Facebook page. If it’s not, he has some work to do.

Parikh is Facebook’s vice president of infrastructure engineering. He oversees the hardware and software that underpin the world’s most popular social network, and if that notification doesn’t appear within seconds, it’s his job to find out why. The trouble is that the Facebook infrastructure now spans four data centers in four separate parts of the world, tens of thousands of computer servers, and more software tools than you could list without taking a deep breath in the middle of it all. The cause of that missing notification is buried somewhere inside one of the largest operations on the net.

But that’s why Parikh and his team build tools like Scuba. Scuba is a new-age software platform that lets Facebook engineers instantly analyze data describing the length and breadth of the company’s massive infrastructure. Typically, when you crunch such enormous amounts of information, there’s a time lag. You might need hours to process it all. But Scuba is what’s called an in-memory data store. It keeps all that data in the high-speed memory systems running across hundreds of computer servers — not the hard disks, the memory systems — and this means you can query the data in near real time.
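
Scuba’s actual interface isn’t described here, but the core idea of an in-memory, scatter-gather data store can be sketched in a few lines of Python. Everything below, from the RAM-resident “leaf” shards to the query function, is a hypothetical illustration of the concept rather than Facebook’s real API.

    from collections import defaultdict

    class InMemoryLeaf:
        def __init__(self, rows):
            # Rows live entirely in RAM, e.g. {"dc": "prineville", "latency_ms": 40}.
            self.rows = rows

        def aggregate(self, group_by, value):
            # Compute a partial result locally, without ever touching disk.
            out = defaultdict(list)
            for row in self.rows:
                out[row[group_by]].append(row[value])
            return out

    def query(leaves, group_by, value):
        # Scatter the query to every leaf, then gather and merge the partial results.
        merged = defaultdict(list)
        for leaf in leaves:
            for key, values in leaf.aggregate(group_by, value).items():
                merged[key].extend(values)
        return {key: sum(vals) / len(vals) for key, vals in merged.items()}

    # Example: average notification latency per data center across two in-memory shards.
    leaves = [
        InMemoryLeaf([{"dc": "prineville", "latency_ms": 40}, {"dc": "forest_city", "latency_ms": 55}]),
        InMemoryLeaf([{"dc": "prineville", "latency_ms": 35}]),
    ]
    print(query(leaves, "dc", "latency_ms"))  # {'prineville': 37.5, 'forest_city': 55.0}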

“It gives us this very dynamic view into how our infrastructure is doing — how our servers are doing, how our network is doing, how the different software systems are interacting,” Parikh says. “So, when Genie tags me in a photo and it doesn’t show up within seconds, we can look to Scuba.”

In the nine years since Mark Zuckerberg launched Facebook out of his Harvard dorm room — Monday marks the anniversary of the service — it has evolved into more than just the world’s most popular social network. Zuckerberg and company have also built one of the most sophisticated engineering operations on the planet — largely because they had to. Facebook is faced with a uniquely difficult task — how to serve a personalized homepage to one billion different people, juggling one billion different sets of messages, photos, videos, and so many other data feeds — and this requires more tech talent than you might expect.

Yes, Facebook’s engineering army includes people like Lars Rasmussen who create web applications like the company’s Graph Search tool — the stuff you can see on your Facebook page. It includes other software engineers who fashion the tools and widgets needed to build, test, and deploy those web applications. And nowadays, it includes hardware engineers like Amir Michael who design custom servers, storage devices, and, yes, entire data centers.

But it also spans a team of top engineers who deal in data — an increasingly important part of modern online operations. Scuba is just one of many “Big Data” software platforms Facebook has fashioned to harness the information generated by its online operation — platforms that push the boundaries of distributed computing, the art of training hundreds or even thousands of computers on a single task.

Built by engineers such as Raghu Murthy, Avery Ching, and Josh Metzler, these tools not only troubleshoot problems inside Facebook’s data centers, they help Facebook data scientists analyze the effectiveness of the company’s online applications and the behavior of its users, and in some cases, they’re even feeding data directly to Facebook users, driving familiar web applications such as Facebook Messages.

Google’s Big Data platforms are still viewed as the web’s most advanced, but as Facebook strives to expand its own online empire, it isn’t far behind, and in contrast to Google, Facebook is intent on sharing much of its software with the rest of the world. Google often shares its big ideas, but Facebook also shares its code, hoping others will make good use of it. “Our mission as a company is to make the world more open and connected,” Parikh says, “and in building our infrastructure, we’re also contributing to that mission.”
The Tale of the Broken News Feed

Facebook’s data team was founded by a man named Jeff Hammerbacher. Hammerbacher was a contemporary of Mark Zuckerberg at Harvard, where he studied mathematics, and before taking a job at Facebook in the spring of 2006, he worked as a data scientist inside the big-name (but now defunct) New York financial house Bear Stearns.

Hammerbacher likes to say that the roots of Facebook’s data operation stretch back to an afternoon at Bear Stearns when the Reuters data feed suddenly went belly up. With the data feed down, no one could make trades — or make any money — and the feed stayed down for a good hour, because the one guy who ran the thing was out to lunch. For Hammerbacher, this snafu showed that data tools were just as important as data experts — if not more so.

“I realized that the delta between the data models that I generated and the models generated by a mathematician at another firm was going to be pretty small compared to the amount of money we lost during those two hours without the Reuters data feed,” Hammerbacher remembers. “I felt like there was an opportunity to build a complete system that starts with data ingest and runs all the way to data model building — and try to optimize that system at every point.”

‘I felt like there was an opportunity to build a complete system that starts with data ingest and runs all the way to data model building — and try to optimize that system at every point.’
— Jeff Hammerbacher

That’s basically what he did at Facebook. The company hired him as a data scientist — someone who could help make sense of the company’s operation through information analysis — but with that broken Reuters data feed in the back of his mind, he went several steps further. He built a team that would take control of the company’s data. The team would not only analyze data. It would build and operate the tools needed to collect and process that data.

When he first joined the company, it was still trying to juggle information using an old-school Oracle data warehouse. But such software wasn’t designed to accommodate an operation growing as quickly as Facebook, and Hammerbacher helped push the company onto Hadoop, an open source software platform that had only recently been bootstrapped by Yahoo.

Hadoop spreads data across a sea of commodity servers before using the collective power of those machines to transform the data into something useful. It’s attractive because commodity servers are cheap, and as your data expands, you just add more of them.
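
Hadoop’s programming model boils down to two functions the developer supplies, map and reduce; the framework takes care of splitting the input across machines and shuffling intermediate results between them. The toy Python below imitates that flow on a single machine purely to show the shape of the model. It is not Hadoop’s actual Java API, and the page-view records are invented for illustration.

    from collections import defaultdict
    from itertools import chain

    def map_phase(record):
        # Emit (key, 1) for each page view; in Hadoop this runs close to where the data lives.
        country, _url = record.split(",")
        yield country, 1

    def reduce_phase(key, values):
        # Combine all values for one key; Hadoop routes matching keys to the same reducer.
        return key, sum(values)

    records = ["US,/home", "BR,/photos", "US,/messages"]

    # The "shuffle" step: group every emitted value by its key.
    shuffled = defaultdict(list)
    for key, value in chain.from_iterable(map_phase(r) for r in records):
        shuffled[key].append(value)

    print(dict(reduce_phase(k, v) for k, v in shuffled.items()))  # {'US': 2, 'BR': 1}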

Yahoo used Hadoop to build an index for its web search engine, but Hammerbacher and Facebook saw it as a means of empowering the company’s data scientists — a way of analyzing much larger amounts of information than it could stuff into an Oracle data warehouse. The company went to work on a tool called Hive — which would let analysts crunch data atop Hadoop using something very similar to the structured query language (SQL) that has been widely used since the ’80s — and this soon became its primary tool for analyzing the performance of online ads, among other things.
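
The article doesn’t reproduce any of Facebook’s ad queries, so the snippet below is only a guess at the kind of HiveQL statement an analyst might run, wrapped in a small Python script. The table and column names (ad_impressions, campaign_id, clicked, ds) are invented for illustration, and it assumes a working Hive command-line client is installed and pointed at a Hadoop cluster.

    import subprocess

    # Hypothetical HiveQL: count impressions and clicks per campaign for one day.
    # Hive compiles a statement like this into one or more Hadoop MapReduce jobs.
    HIVEQL = """
    SELECT campaign_id,
           COUNT(*)               AS impressions,
           SUM(IF(clicked, 1, 0)) AS clicks
    FROM ad_impressions
    WHERE ds = '2013-02-04'
    GROUP BY campaign_id
    """

    # Submit the query through the Hive command-line client.
    subprocess.run(["hive", "-e", HIVEQL], check=True)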

Hammerbacher left the company in the fall of 2008 to help found Cloudera, a startup intent on bringing Hadoop to businesses beyond the web. But the die was cast. Before he left, Hammerbacher even graced the Facebook data team with its own theme song.

Source: Wired.com
