updates


8 May 2013

How to Choose a NoSQL Database

The world of NoSQL databases is a very noisy (and confusing) space.  Matt Aslett at the 451 Group has done an amazing job of cataloging various databases (including NoSQL) in his Database Landscape Map.

[Image: Database Landscape Map]

To simplify the NoSQL world, let's take a look at the top 3 databases in terms of current popularity and how they compare to Apache Accumulo, which is at the core of our product, Sqrrl Enterprise.

MongoDB:  It is a wonderfully easy-to-use document store that many select as a flexible replacement for a SQL database, as it (like all NoSQL databases) does not require pre-defined schemas.   However, MongoDB has difficulty scaling to very large datasets (e.g., 100+ TB) and does not natively work with your Hadoop cluster.  It also does not possess fine-grained security controls.

Cassandra:  This is an excellent choice if your data is too big for MongoDB and you require multi-datacenter replication.  Although Cassandra was not originally designed to run natively on your Hadoop cluster, it now has integrations with MapReduce, Pig, and Hive.  It does not possess fine-grained security controls.

HBase:  HBase natively integrates with Hadoop, and it can handle very large datasets.  However, it does not have fine-grained security controls. 

Accumulo:  Accumulo has an architecture most similar to HBase, which allows it also to natively plug into your Hadoop cluster.  It is far more scalable than MongoDB, and with reported cluster sizes in the multiple thousands of nodes within the Intelligence Community it is also significantly more scalable than HBase and Cassandra.  Accumulo is the only NoSQL database with cell-level security capabilities.  Accumulo also has other features that could lead one to choose it over HBase or Cassandra for reasons other than security or scalability.  For example, Accumulo has a powerful server-side programming mechanism called Iterators, which provides it with the capability to do a variety of real-time aggregations and analytics.
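
To give a rough feel for the kind of aggregation an Iterator can perform, here is a minimal Python sketch that sums the values of identical keys in a key-sorted stream. It is only a conceptual illustration, not Accumulo's Java Iterator API; the function name summing_combiner and the sample page-hit entries are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def summing_combiner(sorted_entries):
    """Collapse a key-sorted stream of (key, value) pairs into one summed pair per key."""
    for key, group in groupby(sorted_entries, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


# Hypothetical entries, already sorted by key (groupby relies on that ordering).
entries = [
    (("page:/home", "hits"), 3),
    (("page:/home", "hits"), 5),
    (("page:/pricing", "hits"), 2),
]
combined = list(summing_combiner(entries))
```

Because the input is already sorted, the aggregation needs only a single pass and never holds more than one key's running total in memory, which is what makes this style of computation a good fit for running next to the data.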

These high-level differences between MongoDB, Cassandra, HBase, and Accumulo are summarized in the decision tree diagram below.  Of course, there are many more detailed technical differences, which will be explored in a later post.  This decision tree can be summarized with a few simple statements:

  • If you need a quick, simple solution and have “small” Big Data (e.g., a few dozen terabytes), MongoDB may be the answer.
  • If you need cell-level security or multi-petabyte scalability, Accumulo is the right answer.
  • If you have data that is too big for MongoDB and don’t need cell-level security or massive scalability, we would recommend testing HBase, Cassandra, and Accumulo for your specific workloads.  Each has its own nuanced advantages and disadvantages.
  • If you don’t need real-time analytics, you are probably on the wrong decision tree and can stick with the Hadoop Distributed File System and batch analytics.
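
The four statements above can be sketched as a small Python function. The size thresholds are our own assumptions for illustration (the post only says "a few dozen terabytes" and "multi-petabyte"), and the function and constant names are hypothetical.

```python
# Assumed cut-offs, not figures from any benchmark.
SMALL_DATA_TB = 50          # stand-in for "a few dozen terabytes"
MULTI_PETABYTE_TB = 2000    # stand-in for "multi-petabyte"


def recommend_store(needs_real_time, needs_cell_security, data_tb):
    """Return the suggestion from the decision tree for one workload."""
    if not needs_real_time:
        return "HDFS with batch analytics"
    if needs_cell_security or data_tb >= MULTI_PETABYTE_TB:
        return "Accumulo"
    if data_tb <= SMALL_DATA_TB:
        return "MongoDB"
    return "Test HBase, Cassandra, and Accumulo"


suggestion = recommend_store(needs_real_time=True, needs_cell_security=False, data_tb=300)
```

The security check comes before the size check on purpose: a small dataset that still needs cell-level security lands on Accumulo rather than MongoDB.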

[Image: NoSQL decision tree diagram]

It is worth noting that the NoSQL databases above are all open source databases.  Sqrrl Enterprise builds upon Accumulo and adds a number of features, including streaming ingest, JSON, encryption, identity management integrations, full-text search, SQL queries, graph search, and statistics.  We believe that these features set Sqrrl Enterprise apart from other Big Data platforms.



6 May 2013

The History of Sqrrl

Interested in the history of Sqrrl?  Check out this podcast with Ely Kahn from Sqrrl, Luke Fretwell from FedScoop, and Gunnar Hellekson (Red Hat’s Public Sector Chief Technology Strategist).

http://fedscoop.com/fedoss-sqrrl-brings-open-source-big-data/



2 May 2013

CSO Article on Securing Big Data Infrastructure

CSO discusses the difficulty of bolting security onto Big Data infrastructure.  Sqrrl Enterprise is the only Big Data platform with security baked in from the start.

Read more here:

http://www.csoonline.com/article/732342/big-data-can-be-a-big-headache-for-data-defenders



5 April 2013

Sqrrl April Newsletter

New CEO, New Website, New Jobs.  Check out our April newsletter here.



26 March 2013

Sqrrl's Take on the Big Data Ecosystem

We are often asked where Sqrrl resides in the Big Data ecosystem.  This is a great question since there is so much buzz (and confusion) about Big Data, and the larger picture often gets lost.

We view the ecosystem in 11 large buckets (which have many similarities to the buckets in Dave Feinleib’s ecosystem map in Forbes):

  1. Hardware providers:  Big Data software runs on both commodity disks and flash/SSD.
  2. Services providers:  These folks help with both strategy and implementation of Big Data solutions.
  3. Cloud providers:  Many organizations run their Big Data solutions in public, private, or hybrid clouds.
  4. Enterprise Data Warehouse (EDW) vendors:  These are traditional EDW vendors and the relational databases that typically sit on top of them.
  5. Data Integration vendors:  These companies sell the tools that assist in getting data into Hadoop or Scale-Out databases.
  6. Hadoop vendors:  These folks license commercial distributions of the Hadoop Distributed File System and related Apache projects (or in some cases, just sell support services around them).
  7. Security vendors:  They sell security tools, such as encryption and key management, specifically designed for Big Data.
  8. Scale-Out Database vendors:  Includes both NoSQL (unstructured and semi-structured data) and NewSQL (structured data) databases.
  9. Horizontal Big Data Platforms:  These are application development platforms often built on top of Hadoop and/or scale-out platforms and provide additional analytical capabilities beyond what the underlying database can natively provide.
  10. Vertical Big Data Platforms:  Similar to Horizontal Big Data Platforms, but these are specialized applications for a specific industry vertical.
  11. Business Intelligence and Analytical Tools:  Focused on static reporting, analytics, and dashboards for data held in Hadoop.

As depicted in the diagram above, today Sqrrl falls in the intersection of four of these boxes:  Hadoop, Security, Scale-Out Databases, and Horizontal Platforms.  This is because our solution, Sqrrl Enterprise, consists of the following:

  • Hadoop:  Although we prefer to partner with Hadoop vendors, we can also ship our solution with open source HDFS.
  • Security:  We are the only Big Data solution with cell-level security, including fine-grained access controls and encryption.
  • Scale-Out Database:  At our core is Apache Accumulo, which is a NoSQL database with scalability to the tens of petabytes.
  • Horizontal Big Data Platform:  Sqrrl Enterprise powers real-time Big Data applications (aka “Big Apps”), and we do this by layering a number of real-time analytic APIs on top of Accumulo, including full-text search, statistics, and graph analysis.

Hope this helps folks trying to make sense of the Big Data landscape.



24 March 2013

Big Apps > Big Data

The team here at Sqrrl has coined a new term:  Big Apps™.  We are seeing an important trend in the marketplace in that many organizations want to move beyond storing and querying Big Data.  More and more, organizations now want to build real-time applications on top of Big Data.  We refer to these applications as Big Apps.

Big Apps could be used for a variety of use cases, ranging from clinical analysis, stock trade analysis, energy trading, and immigration analysis to cybersecurity.  In all of these cases, organizations need to bridge the gap between traditional OLAP and OLTP capabilities and build applications that can process and analyze petabytes of data in real time.

To read more about how Sqrrl can help organizations create Big Apps using Apache Accumulo, click here.



20 March 2013

New sqrrl CEO


sqrrl is excited to announce that we have a new CEO, Mark Terenzoni from F5 Networks.  Mark brings a wealth of knowledge around growing technology startups into large successful companies.  Check out some of the media coverage here:

http://siliconangle.com/blog/2013/03/19/sqrrl-appoints-new-ceo-to-spearhead-big-data-security-push/

http://www.bizjournals.com/boston/blog/startups/2013/03/f5-networks-database-startup-sqrrl.html

Full press release here:

http://www.prweb.com/releases/2013/3/prweb10559578.htm



7 March 2013

sqrrl and Technica Team Up

Today sqrrl and Technica Corporation announced a reseller partnership that makes it easier for government agencies to license sqrrl’s software product, sqrrl enterprise.  

sqrrl is now on Technica’s NASA SEWP IV contract vehicle.  This means that both DoD and civilian agencies can utilize the SEWP IV procurement process to quickly and easily procure sqrrl products.  

You can read more about it here.



20 February 2013

Another Step Forward... Accumulo on Amazon Elastic MapReduce

Today, Amazon has released some important new work related to Apache Accumulo.  Now Accumulo users can easily spin up Accumulo clusters utilizing Amazon’s Elastic MapReduce (EMR) Framework. 

sqrrl is excited about this for a few reasons.  First, we strongly support any effort to decrease the friction associated with installing and using Accumulo.  Our engineering team includes many of the original developers of Accumulo, and we are eager to further increase the Accumulo user base.  Second, our software product, sqrrl enterprise, runs on top of Apache Accumulo and Apache Hadoop, and Amazon’s efforts with EMR provide our customers with another use pattern for our product.

If you are interested in taking sqrrl enterprise for a spin on AWS, send us a note at info@sqrrl.com.



4 February 2013

sqrrl Software and Services Now Available Via GSA

Today we officially announced a partnership with Triad Technology Partners to offer sqrrl software and services to government customers via Triad’s GSA schedule.

The GSA schedule provides government clients with a simple and fast way of purchasing sqrrl products.  You can read more about the announcement here:

http://www.prnewswire.com/news-releases/triad-technology-partners-expands-gsa-schedule-with-sqrrl-189641841.html



15 January 2013

Quick Accumulo Install

Since joining sqrrl, I’ve been introducing many people to Apache Accumulo. While everyone is eager to take advantage of Accumulo’s unique technical strengths, inevitably their first question is “How do I get started?” Even though all of the steps are documented, it can be intimidating — especially if you haven’t even used Hadoop before.

So, I assembled this guide to getting Accumulo running quickly on a single machine. Most of these steps are documented in the Hadoop Single Node Setup Guide, the ZooKeeper Getting Started Guide, and the README installed with Accumulo. Take a look at these three documents if you would like to learn more about the steps below.

If you have questions or suggestions, contact me on Twitter at @jbpopp. Our team at sqrrl is always working on making it even easier to get started with Accumulo, so we’d love to hear your feedback. If this manual walk-through isn’t your bag, you may want to download our pre-canned Accumulo 1.4.2 VirtualBox VM or check out sqrrl’s Accumulo setup shell script that accomplishes the same steps.

-Ben Popp, Director of Engineering at sqrrl


Install single-node Accumulo in minutes

The following instructions will:

  • Install Apache Hadoop 1.0.4
  • Install Apache ZooKeeper 3.3.6
  • Install Apache Accumulo 1.4.2

Prerequisites:

  • This guide assumes you are running Linux.
  • Java 1.6.x must be installed and the ‘java’ command must be on the path.
  • ssh must be installed and sshd must be running so that the Hadoop scripts will be able to manage various processes. You need to be able to ssh to localhost without using a passphrase. The Hadoop Single Node Setup Guide has directions for this if needed.

Step 1: Install Hadoop 1.0.4

Download hadoop-1.0.4-bin.tar.gz from an Apache mirror and unpack the archive.

In the distribution, edit the conf/hadoop-env.sh file to define JAVA_HOME to be the root of your Java installation.

Even though we’re running on a single node, we’ll install in “Pseudo-Distributed Operation” where each Hadoop daemon runs in a separate Java process. Edit the Hadoop configuration files to include the following. Make sure that the parent directory of dfs.data.dir and dfs.name.dir already exists.

conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/lib/hadoop/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

Format a new distributed filesystem:

$ bin/hadoop namenode -format

Start the Hadoop daemons:

$ bin/start-all.sh

View the web interface for the NameNode (http://localhost:50070/) and the JobTracker (http://localhost:50030/) to confirm that the processes are launched.

Step 2: Install ZooKeeper 3.3.6

Download zookeeper-3.3.6.tar.gz from an Apache mirror and unpack the archive.

Create a new conf/zoo.cfg file with the following contents. Make sure to choose a valid local path as a value for dataDir.

tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
maxClientCnxns=100

Start ZooKeeper with the command

$ bin/zkServer.sh start

Use the ZooKeeper shell to validate that ZooKeeper is running as intended. Start the shell with the command

$ bin/zkCli.sh

Enter “ls /” to see the contents of ZooKeeper (not much at this point), and type “quit” to exit the shell.

Step 3: Install Accumulo 1.4.2

Download accumulo-1.4.2-dist.tar.gz from an Apache mirror and unpack the archive.

Copy the example configuration to set up your Accumulo environment. For testing on a single computer, use a fairly small configuration:

$ cp conf/examples/512MB/standalone/* conf

Edit conf/accumulo-env.sh to set your JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME.

Create the Accumulo logs directory at the default location, a ‘logs’ directory inside the Accumulo home directory.

Run “bin/accumulo init” to create the HDFS directory structure and initial ZooKeeper settings. Choose a name and root password for your instance when prompted.

Start Accumulo using the bin/start-all.sh script.

Visit the Accumulo monitor page at http://localhost:50095 to confirm that you’re live!
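
If you would rather check everything from a terminal than open each page in a browser, a small Python script along these lines can confirm that something is listening on each port used in this guide. It is only a sketch: the port numbers come from the steps above, the display labels and function names are our own, and an open port shows only that a process is listening, not that the service is healthy.

```python
import socket

# Ports taken from this guide; the labels are just for display.
SERVICES = {
    "HDFS NameNode web UI": 50070,
    "MapReduce JobTracker web UI": 50030,
    "ZooKeeper client port": 2181,
    "Accumulo monitor": 50095,
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost"):
    """Map each service label to whether its port accepted a connection."""
    return {name: is_listening(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'listening' if up else 'not reachable'}")
```

If a service shows as not reachable, the log files for that component are usually the quickest place to look next.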

Use the “bin/accumulo shell -u root” command to run an Accumulo shell as the Accumulo root user. (Use the root password you just chose.) Now you have full access to your instance.

Congratulations!

After you get Accumulo up and running, jump into the Accumulo User Manual to learn more, and get involved with the project.



3 January 2013

Congress Recognizes Accumulo as a Successful Open Source Software Project

Yesterday the President of the United States signed the 2013 National Defense Authorization Act (NDAA).


sqrrl and our friends worked effectively with Congressional leaders to change language that would have attempted to limit the use of Apache Accumulo within the Department of Defense. In the end Congress rightfully recognized Accumulo as a successful open source project and as an example of the best of government innovation. This marked a significant victory for open source software.

Accumulo is proven to address the confluence of challenges faced by those hoping to take advantage of Big Data: security, scalability, and analytic adaptability. These unique strengths resolve some of the technical limitations of existing Big Data technologies and also enable our direct efforts to build analytic applications that solve the pressing issues of our time.

We applaud the efforts of all who have contributed to bring about this successful result and look forward to working further to advance open source software.





20 December 2012

sqrrl Joins Open Source Software Institute

We are excited to announce that sqrrl has joined the Open Source Software Institute (OSSI) as a platinum member.  The mission of OSSI is to promote the development and implementation of open source software solutions within U.S. Federal, state, and local government agencies.  

sqrrl has participated in OSSI panels previously, and we have always been impressed by the crowds that they bring together and the thought leadership that they show.  As a platinum member of OSSI, we will have the ability to help shape OSSI’s agenda and work hand-in-hand with them to grow the enthusiasm for open source software in government.

As a first action, we are working with OSSI to put together a DHS Industry Day on January 14th in Maryland near the BWI airport.  You can sign up here (no charge for government employees):

http://oss-institute.org/component/events/event/30

Hope to see you there!



18 December 2012

Security And Adaptability: Unlocking The Full Potential Of Big Data And The Cloud

Enthusiasm for and investment in Big Data and the Cloud is spurring innovation in a suite of new technologies that seek to transform information into knowledge at reduced costs. But the potential of Big Data and the Cloud is threatened by security, privacy, legal and regulatory constraints which prevent data integration and information sharing.

While the costs to capture, store and exploit data are declining, the costs of mishandling data are rising for every enterprise and threaten to extend the data-poor environments in which we have long operated, forcing continued inferences and limits on data insights.

Read More at the GoGrid Blog



10 December 2012

sqrrl at Theory Day

As Director of Data Science for sqrrl, I’m always looking to push the limits of data theory and application.  Recently I attended the 30th meeting of New York Area Theory Day, a semi-annual seminar put on by Columbia University’s Department of Computer Science. This year’s schedule included four interesting talks, which I’ve summarized below. At @sqrrl_inc we’ll be thinking about how to apply them to help our customers make the most of their Big Data efforts.

Professor Daniel Spielman of Yale gave a talk about graph sparsifiers. For those of you unfamiliar, a sparsifier utilizes techniques to take a large graph G and create a graph H with the same nodes but with many fewer edges, O(n log n) or O(n) for example. The graph H has some very nice properties, like having the same communities, eigenvalues, etc. of the original graph G. Sparsifiers are quite relevant to web-scale data. Huge graphs can be analyzed by reducing the graph in scale while retaining the essential properties of the graph. Dr. Spielman didn’t talk much about the scalability of these algorithms to web-scale graphs, but we’re thinking about this actively for practical applications.
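
To make the idea of "same nodes, far fewer edges" concrete, here is a deliberately crude Python sketch: keep each edge with a fixed probability and scale up the weight of the survivors so that total edge weight is preserved in expectation. This is plain uniform sampling, not the construction from the talk, and it carries none of the guarantees about eigenvalues or communities; the function name and the toy graph are hypothetical.

```python
import random


def sample_sparsifier(edges, keep_prob, seed=None):
    """Keep each weighted edge with probability keep_prob, scaling kept weights by 1/keep_prob."""
    if not 0 < keep_prob <= 1:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = random.Random(seed)
    return [(u, v, w / keep_prob) for u, v, w in edges if rng.random() < keep_prob]


# Hypothetical example: a complete graph on 6 nodes with unit weights.
nodes = range(6)
graph = [(u, v, 1.0) for u in nodes for v in nodes if u < v]
sparse = sample_sparsifier(graph, keep_prob=0.5, seed=42)
```

The node set is untouched because only the edge list is sampled. A real sparsifier would choose sampling probabilities per edge rather than using one global value, which is where the interesting theory lives.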

Professor Vijay Vazirani of Georgia Tech gave a talk about solving complicated Nash games in the context of both economic consumption (the usual case) and production (a less studied case) and the associated complexity of these solutions. I believe that game theory has wide reaching applications for our customers at sqrrl. I think there could be some very interesting applications of high-dimensional game theory to the realm of practical data science.

Professor Maria Chudnovsky from Columbia followed with a talk about some interesting results in translating local graph properties (“this graph has none of this small graph in it”) to global results (“this graph has a clique or stable set far larger than log |V|”). This area of theory encompasses the Erdős-Hajnal Conjecture and its derivatives, and most general cases of these statements are not yet proven. However, the theory has a very rich and interesting set of proved lemmas and open problems. I’m still unsure how to apply these results to practical data problems, but will be actively following the space to see how others begin to make such applications.

Professor Sanjeev Khanna of the University of Pennsylvania finished with a talk about the state of the art in the edge-disjoint paths problem: given a set of source-sink pairs, we wish to connect as many pairs as possible with paths through the graph that don’t share an edge. Many practical problems can potentially be cast as variants of this problem. The problem is NP-hard even in very restricted cases, but there are some practical approximations for the problem, especially when a limited number of shared edges are allowed. In the construction of these approximation algorithms there are some interesting constructs that could by themselves be useful for practical graph theory and data science. An example would be grid embeddings present in planar graphs that are used to construct approximate answers.
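
For readers who want to experiment with the problem, here is a minimal Python sketch of a simple greedy heuristic: route the pairs one at a time along a shortest path, then delete the edges that path used. This is a baseline for intuition only, not one of the approximation algorithms from the talk, and it can do badly when an early path blocks several later ones. The function names and the small example graph are hypothetical.

```python
from collections import deque


def shortest_path(adj, source, sink):
    """Breadth-first search; returns the node list from source to sink, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        # Sorting neighbors keeps the result deterministic (nodes must be comparable).
        for neighbor in sorted(adj.get(node, ())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def greedy_edge_disjoint_paths(edges, pairs):
    """Route pairs in order along shortest paths, removing each path's edges after use."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    routed = {}
    for source, sink in pairs:
        path = shortest_path(adj, source, sink)
        if path is None:
            continue  # this pair cannot be connected with the edges that remain
        for u, v in zip(path, path[1:]):
            adj[u].discard(v)
            adj[v].discard(u)
        routed[(source, sink)] = path
    return routed


# Hypothetical undirected graph and two source-sink pairs.
edges = [(1, 2), (2, 3), (3, 4), (1, 5), (5, 4)]
routed = greedy_edge_disjoint_paths(edges, [(1, 4), (2, 3)])
```

The order in which pairs are processed matters for a greedy approach like this, which is one reason the more careful approximation algorithms are interesting.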

A lot of great theory was packed into just a few hours at Theory Day. If anyone would like to share their take on the event, please send me tweets @_SecretStache_ or send me an email at chris@sqrrl.com



19 November 2012

Register for Accumulo Workshops

sqrrl is excited to deliver Accumulo training sessions in partnership with UMBC.  We are offering workshops for Managers, Developers, and Administrators.  Registration is now open for sessions in December, so sign up today!

http://www.umbc.edu/trainctr/it/hadoop.html



18 November 2012

sqrrl Sponsors 1st Hack / Reduce Hackathon

Just a week after launching, Hack / Reduce opened its doors to the public for its 1st Hackathon.  A packed house of data scientists worked throughout the day on a number of diverse data sets to produce some very compelling insights and applications. The sqrrl team was on hand to motivate and lend technical support to the participants. 



9 November 2012

sqrrl and UMBC Announce Training Partnership for Accumulo

This week sqrrl and UMBC announced that sqrrl-developed Accumulo training classes will be available beginning in December at UMBC’s training facility in Maryland.  Classes will include a half-day managers session, a 2-day administrators workshop, and a 3-day developers workshop.

Read the press release here:

http://umbc.edu.resultsnetwork.com/trainctr/pr/sqrrl-partnership.aspx



7 November 2012

Boston preps big kickoff for Big Data hub

Hack/reduce will launch its cool new work space near Cambridge’s Kendall Square on Thursday. The goal of the effort is to bring together the best big data people from private and public sectors and academia to train up the next generation of data scientists…

…The facility can accommodate 150 dedicated hackers and is fielding 50 applications per week for spots. The first residents are Sqrrl, a big data startup launched by former National Security Agency technologists. “These 7 young men out of NSA spent 5 years building a big data store in Washington and now we have it in Boston,” said Lynch, who co-founded Vertica.

Read more of Barb Darrow’s article on GigaOM



6 November 2012

Breaking Through Big Data Barriers with Cell-Level Security

sqrrl enables organizations to securely leverage all of their data and build powerful real-time Big Data applications using Apache Accumulo.  These applications are applicable to a wide range of industries, including finance, healthcare, energy, consumer Internet, and government. Solutions in these industries demand fine-grained access controls that promote data integration and sharing without impacting performance or analytic adaptability. 

Apache Accumulo (supported by sqrrl) is the only non-relational database with cell-level security, and it provides organizations with entirely new Big Data capabilities.  These capabilities include:  

  • Secure information sharing. Organizations can integrate disparate data sets and user communities within a single data store, being assured that only authorized users can access appropriate data. This allows for improved sharing of information within and across organizations.
  • Deeper analytical insights. By increasing the amount of data available to analysts, and breaking down barriers around crude security schemes, organizations can conduct analyses that previously were not possible.
  • Simplified application development environment and greater analytic innovation. Organizations no longer need to fracture data across many databases. Apache Accumulo can serve as a central data store that securely feeds data to hundreds if not thousands of applications.

Other databases have explored the concept of data-centric security through table, document, column, and row-level restrictions, but these are not sufficient approaches.

Table- or document-level security is a blunt-force security approach that requires locking down an entire document or table that may hold a variety of different data types.

Column-level security is only sufficient when the data schema is static, well known, and aligned with security concerns.

Row-level security can sometimes account for variations in accessibility of data from different sources, but breaks down when a single record conveys multiple levels of information or tables become more complicated than simple event logs.

Cell-level security introduces a powerful orthogonal dimension, supporting data-centric security independent of table design.  

Using Apache Accumulo, data providers can finely control data through simple, explicit encoding of existing policies and requirements.  We have found that this model is an infinitely extensible language that effectively and efficiently scales to tera- and petabyte amounts of data.
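
To illustrate what encoding a policy on each cell can look like, here is a minimal Python sketch that checks a user's authorizations against a boolean label built from & (and), | (or), and parentheses. It is a conceptual stand-in, not Accumulo's own implementation; the function names, the label names, and the choice to reject unparenthesized mixes of & and | rather than pick a precedence are all assumptions of this sketch.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    """Split a label expression into names, operators, and parentheses."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def can_read(expression, authorizations):
    """Return True if the authorizations satisfy the cell's label expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # in this sketch, an empty label means no restriction
    auths = set(authorizations)
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in auths

    def parse_expr():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# Hypothetical label on one cell, checked against two different users.
label = "admin&(finance|audit)"
auditor_admin_ok = can_read(label, {"admin", "audit"})
auditor_only_ok = can_read(label, {"audit"})
```

Because the label travels with each individual cell, two cells in the same row can carry different policies, and the check is the same regardless of how the table is designed.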

Cell-level security opens up new possibilities within the Big Data and Hadoop ecosystem.  Using Apache Accumulo, organizations are no longer constrained by security and privacy requirements in conducting Big Data analytics.  With Accumulo, organizations can move past the concern that NoSQL = No Security.


More updates at blog.sqrrl.com