The world of NoSQL databases is a very noisy (and confusing) space. Matt Aslett at the 451 Group has done an amazing job of cataloging various databases (including NoSQL) in his Database Landscape Map.
To simplify the NoSQL world, let's take a look at the top three databases in terms of current popularity and how they compare to Apache Accumulo, which is at the core of our product, Sqrrl Enterprise.
MongoDB: It is a wonderfully easy-to-use document store that many select as a flexible replacement for a SQL database, as it (like all NoSQL databases) does not require pre-defined schemas. However, MongoDB has difficulty scaling to very large datasets (e.g., 100+ TB) and does not natively work with your Hadoop cluster. It also does not possess fine-grained security controls.
Cassandra: This is an excellent choice if your data is too big for MongoDB and you require multi-datacenter replication. Although Cassandra was not originally designed to run natively on your Hadoop cluster, it now has integrations with MapReduce, Pig, and Hive. It does not possess fine-grained security controls.
HBase: HBase natively integrates with Hadoop, and it can handle very large datasets. However, it does not have fine-grained security controls.
Accumulo: Accumulo has an architecture most similar to HBase, which also allows it to plug natively into your Hadoop cluster. It is far more scalable than MongoDB, and with reported cluster sizes in the multiple thousands of nodes within the Intelligence Community, it is also significantly more scalable than HBase and Cassandra. Accumulo is the only NoSQL database with cell-level security capabilities. Accumulo also has other features that could lead one to choose it over HBase or Cassandra for reasons beyond security or scalability. For example, Accumulo has a powerful server-side programming mechanism called Iterators, which enables a variety of real-time aggregations and analytics (see the sketch below).
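To make the Iterator idea concrete, here is a minimal sketch using the Accumulo 1.4-era Java client API. It attaches the built-in SummingCombiner so that repeated writes to the same cell are summed by the tablet servers at scan and compaction time. The instance, table, and credential names are placeholders of our own, not from any particular deployment.

import java.util.Collections;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;

public class IteratorSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder instance name, ZooKeeper address, and credentials.
        Connector conn = new ZooKeeperInstance("myInstance", "localhost:2181")
                .getConnector("root", "secret".getBytes());

        if (!conn.tableOperations().exists("events"))
            conn.tableOperations().create("events");

        // Attach a SummingCombiner (priority 10) to the "counts" column
        // family; Accumulo then sums colliding values server-side instead
        // of requiring the client to read-modify-write.
        IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
        SummingCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
        SummingCombiner.setColumns(setting,
                Collections.singletonList(new IteratorSetting.Column("counts")));
        conn.tableOperations().attachIterator("events", setting);
    }
}

Because the aggregation runs inside the tablet servers, increment-heavy workloads avoid a client-side read-modify-write round trip entirely.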
These high-level differences between MongoDB, Cassandra, HBase, and Accumulo are summarized in the decision tree diagram below. Of course, there are a wide variety of more detailed technical differences that will be explored in a later post. This decision tree can be summarized with a few simple statements:
If your data is modest in scale and you value ease of use and schema flexibility, choose MongoDB.
If your data is too big for MongoDB and you need multi-datacenter replication, choose Cassandra.
If you need native Hadoop integration for very large datasets, choose HBase.
If you need cell-level security, the largest scale, or server-side analytics via Iterators, choose Accumulo.
It is worth noting that the NoSQL databases above are all open source. Sqrrl Enterprise builds upon Accumulo, adding a number of features including streaming ingest, JSON support, encryption, identity management integrations, full-text search, SQL queries, graph search, and statistics. We believe these features set Sqrrl Enterprise apart from other Big Data platforms.
Interested in the history of Sqrrl? Check out this podcast with Ely Kahn from Sqrrl, Luke Fretwell from FedScoop, and Gunnar Hellekson (Red Hat’s Public Sector Chief Technology Strategist).
CSO discusses here the difficulty of bolting security onto Big Data infrastructure. Sqrrl Enterprise is the only Big Data platform with security baked in from the start.
Read more here:
We are often asked where Sqrrl resides in the Big Data ecosystem. This is a great question, since there is so much buzz (and confusion) about Big Data, and the larger picture often gets lost.
We view the ecosystem in 11 large buckets (which have many similarities to the buckets in Dave Feinleib's ecosystem map in Forbes):
As depicted in the diagram above, today Sqrrl falls at the intersection of four of these buckets: Hadoop, Security, Scale-Out Databases, and Horizontal Platforms. This is because our solution, Sqrrl Enterprise, combines Apache Accumulo at its core with the security and analytics extensions (streaming ingest, encryption, full-text search, SQL queries, graph search, and more) that run on top of Hadoop.
Hope this helps folks trying to make sense of the Big Data landscape.
The team here at Sqrrl has coined a new term: Big Apps™. We are seeing an important trend in the marketplace: many organizations want to move beyond simply storing and querying Big Data. More and more, organizations now want to build real-time applications on top of Big Data. We refer to these applications as Big Apps.
Big Apps can serve a variety of use cases, ranging from clinical analysis and stock trade analysis to energy trading, immigration analysis, and cybersecurity. In all of these cases, organizations need to bridge the gap between traditional OLAP and OLTP capabilities and build applications that can process and analyze petabytes of data in real time.
To read more about how Sqrrl can help organizations create Big Apps using Apache Accumulo, click here.
sqrrl is excited to announce that we have a new CEO, Mark Terenzoni, who joins us from F5 Networks. Mark brings a wealth of knowledge about growing technology startups into large, successful companies. Check out some of the media coverage here:
Full press release here:
Today sqrrl and Technica Corporation announced a reseller partnership that makes it easier for government agencies to license sqrrl’s software product, sqrrl enterprise.
sqrrl is now on Technica’s NASA SEWP IV contract vehicle. This means that both DoD and civilian agencies can utilize the SEWP IV procurement process to quickly and easily procure sqrrl products.
You can read more about it here.
Today Amazon released some important new work related to Apache Accumulo. Accumulo users can now easily spin up Accumulo clusters using Amazon's Elastic MapReduce (EMR) framework.
sqrrl is excited about this for a few reasons. First, we strongly support any effort to decrease the friction associated with installing and using Accumulo. Our engineering team includes many of the original developers of Accumulo, and we are eager to further grow the Accumulo user base. Second, our software product, sqrrl enterprise, runs on top of Apache Accumulo and Apache Hadoop, and Amazon's efforts with EMR provide our customers with another usage pattern for our product.
If you are interested in taking sqrrl enterprise for a spin on AWS, send us a note at email@example.com.
Today we officially announced a partnership with Triad Technology Partners to offer sqrrl software and services to government customers via Triad's GSA schedule.
The GSA schedule provides government clients with a simple and fast way of purchasing sqrrl products. You can read more about the announcement here:
Since joining sqrrl, I've been introducing many people to Apache Accumulo. While everyone is eager to take advantage of Accumulo's unique technical strengths, inevitably their first question is "How do I get started?" Even though all of the steps are documented, they can be intimidating, especially if you haven't used Hadoop before.
So, I assembled this guide to getting Accumulo running quickly on a single machine. Most of these steps are documented in the Hadoop Single Node Setup Guide, the ZooKeeper Getting Started Guide, and the README installed with Accumulo. Take a look at these three documents if you would like to learn more about the steps below.
If you have questions or suggestions, contact me on Twitter at @jbpopp. Our team at sqrrl is always working on making it even easier to get started with Accumulo, so we'd love to hear your feedback. If this manual walk-through isn't your bag, you may want to download our pre-canned Accumulo 1.4.2 VirtualBox VM or check out sqrrl's Accumulo setup shell script that accomplishes the same steps.
-Ben Popp, Director of Engineering at sqrrl
Install single-node Accumulo in minutes
The following instructions will walk you through installing and configuring single-node instances of Hadoop 1.0.4, ZooKeeper 3.3.6, and Accumulo 1.4.2.
Step 1: Install Hadoop 1.0.4
Download hadoop-1.0.4-bin.tar.gz from an Apache mirror and unpack the archive.
In the distribution, edit the conf/hadoop-env.sh file to define JAVA_HOME to be the root of your Java installation.
Even though we’re running on a single node, we’ll install in “Pseudo-Distributed Operation,” where each Hadoop daemon runs in a separate Java process. Edit the Hadoop configuration files to include the following. Make sure that the parent of dfs.data.dir and dfs.name.dir is a directory that already exists.

In conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

In conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/lib/hadoop/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

In conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the Hadoop daemons:
$ bin/start-all.sh
View the web interface for the NameNode (http://localhost:50070/) and the JobTracker (http://localhost:50030/) to confirm that the processes are launched.
Step 2: Install ZooKeeper 3.3.6
Download zookeeper-3.3.6.tar.gz from an Apache mirror and unpack the archive.
Create a new conf/zoo.cfg file with the following contents. Make sure to choose a valid local path as a value for dataDir.
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
maxClientCnxns=100
Start ZooKeeper with the command
$ bin/zkServer.sh start
Use the ZooKeeper shell to validate that ZooKeeper is running as intended. Start the shell with the command
$ bin/zkCli.sh
Enter “ls /” to see the contents of ZooKeeper (not much at this point), and type “quit” to exit the shell.
Step 3: Install Accumulo 1.4.2
Download accumulo-1.4.2-dist.tar.gz from an Apache mirror and unpack the archive.
Copy the example configuration to set up your Accumulo environment. For testing on a single computer, use a fairly small configuration:
$ cp conf/examples/512MB/standalone/* conf
Edit conf/accumulo-env.sh to set your JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME.
Create the Accumulo logs directory at the default location, a ‘logs’ directory inside the Accumulo home directory:
$ mkdir logs
Run “bin/accumulo init” to create the HDFS directory structure and initial ZooKeeper settings. Choose a name and root password for your instance when prompted.
Start Accumulo using the bin/start-all.sh script.
Visit the Accumulo monitor page at http://localhost:50095 to confirm that you’re live!
Use the “bin/accumulo shell -u root” command to run an Accumulo shell as the Accumulo root user. (Use the instance password you just chose above.) Now you have full access to your instance.
After you get Accumulo up and running, jump into the Accumulo User Manual to learn more, and get involved with the project.
Yesterday the President of the United States signed the 2013 National Defense Authorization Act (NDAA).
sqrrl and our friends worked with Congressional leaders to change language that would have limited the use of Apache Accumulo within the Department of Defense. In the end, Congress rightfully recognized Accumulo as a successful open source project and as an example of the best of government innovation. This marked a significant victory for open source software.
Accumulo is proven to address the confluence of challenges faced by those hoping to take advantage of Big Data: security, scalability, and analytic adaptability. These unique strengths resolve some of the technical limitations of existing Big Data technologies and also enable our direct efforts to build analytic applications that solve the pressing issues of our time.
We applaud the efforts of all who have contributed to bring about this successful result and look forward to working further to advance open source software.
We are excited to announce that sqrrl has joined the Open Source Software Institute (OSSI) as a platinum member. The mission of OSSI is to promote the development and implementation of open source software solutions within U.S. Federal, state, and local government agencies.
sqrrl has participated in OSSI panels previously, and we have always been impressed by the crowds that they bring together and the thought leadership that they show. As a platinum member of OSSI, we will have the ability to help shape OSSI’s agenda and work hand-in-hand with them to grow the enthusiasm for open source software in government.
As a first action, we are working with OSSI to put together a DHS Industry Day on January 14th in Maryland near the BWI airport. You can sign up here (no charge for government employees):
Hope to see you there!
Enthusiasm for and investment in Big Data and the Cloud are spurring innovation in a suite of new technologies that seek to transform information into knowledge at reduced cost. But the potential of Big Data and the Cloud is threatened by security, privacy, and legal and regulatory constraints that prevent data integration and information sharing.
While the costs to capture, store, and exploit data are declining, the costs of mishandling data are rising for every enterprise, threatening to prolong the data-poor environments in which we have long operated and to force continued reliance on inference and limits on data insights.
Read More at the GoGrid Blog
As Director of Data Science for sqrrl, I’m always looking to push the limits of data theory and application. Recently I attended the 30th meeting of New York Area Theory Day, a semi-annual seminar put on by Columbia University’s Department of Computer Science. This year’s schedule included four interesting talks, which I’ve summarized below and which we at @sqrrl_inc will be thinking about applying to help our customers make the most of their Big Data efforts.
Professor Daniel Spielman of Yale gave a talk about graph sparsifiers. For those of you unfamiliar, a sparsifier takes a large graph G and creates a graph H with the same nodes but many fewer edges, O(n log n) or even O(n) of them. The graph H approximately preserves some very nice properties of the original graph G, like its communities and eigenvalues. Sparsifiers are quite relevant to web-scale data: huge graphs can be analyzed by reducing the graph in scale while retaining its essential properties. Dr. Spielman didn’t talk much about the scalability of these algorithms to web-scale graphs, but we’re thinking about this actively for practical applications.
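To make the “same properties” claim precise in one common formulation (my paraphrase of the standard definition, not Dr. Spielman’s slides): H is an ε-spectral sparsifier of G if their graph Laplacians L_G and L_H satisfy

(1 − ε) · xᵀ L_G x ≤ xᵀ L_H x ≤ (1 + ε) · xᵀ L_G x for every vector x ∈ ℝⁿ.

Since cut sizes and eigenvalues are functions of these quadratic forms, a spectral sparsifier approximately preserves both.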
Professor Vijay Vazirani of Georgia Tech gave a talk about solving complicated Nash games in the context of both economic consumption (the usual case) and production (a less studied case) and the associated complexity of these solutions. I believe that game theory has wide reaching applications for our customers at sqrrl. I think there could be some very interesting applications of high-dimensional game theory to the realm of practical data science.
Professor Maria Chudnovsky from Columbia followed with a talk about some interesting results in translating local graph properties (“this graph contains no copy of this small graph”) into global ones (“this graph must contain a clique or stable set of size at least log |V|”). This area of theory encompasses the Erdős–Hajnal Conjecture and its derivatives, and the most general cases of these statements are not yet proven. However, the area has a very rich and interesting set of proved lemmas and open problems. I’m still unsure how to apply these results to practical data problems, but I will be actively following the space to see how others begin to make such applications.
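For reference, the Erdős–Hajnal Conjecture itself states (a standard formulation, not taken from the talk): for every graph H there is a constant δ(H) > 0 such that every n-vertex graph containing no induced copy of H has a clique or stable set of size at least n^δ(H). By contrast, general graphs are only guaranteed a clique or stable set of logarithmic size, which is what makes the H-free improvement striking.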
Professor Sanjeev Khanna of the University of Pennsylvania finished with a talk about the state of the art in the edge-disjoint paths problem: given a set of source-sink pairs, we wish to route as many of the pairs as possible through the graph along paths that don’t share an edge. Many practical problems can potentially be cast as variants of this problem. The problem is NP-hard even in very restricted cases, but there are some practical approximations, especially when a limited number of shared edges is allowed. The construction of these approximation algorithms involves some interesting constructs that could by themselves be useful for practical graph theory and data science, such as the grid embeddings present in planar graphs that are used to construct approximate answers.
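Stated formally (my summary of the standard formulation): given a graph G = (V, E) and pairs (s₁, t₁), …, (sₖ, tₖ), find a maximum-size subset of the pairs that can be simultaneously connected by pairwise edge-disjoint paths.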
A lot of great theory was packed into just a few hours at Theory Day. If anyone would like to share their take on the event, please send me tweets @_SecretStache_ or send me an email at firstname.lastname@example.org
Just a week after launching, Hack/Reduce opened its doors to the public for its first hackathon. A packed house of data scientists worked throughout the day on a number of diverse datasets to produce some very compelling insights and applications. The sqrrl team was on hand to motivate and lend technical support to the participants.
This week sqrrl and UMBC announced that sqrrl-developed Accumulo training classes will be available beginning in December at UMBC’s training facility in Maryland. Classes will include a half-day managers’ session, a two-day administrators’ workshop, and a three-day developers’ workshop.
Read the press release here:
Hack/reduce will launch its cool new work space near Cambridge’s Kendall Square on Thursday. The goal of the effort is to bring together the best big data people from private and public sectors and academia to train up the next generation of data scientists…
…The facility can accommodate 150 dedicated hackers and is fielding 50 applications per week for spots. The first residents are Sqrrl, a big data startup launched by former National Security Agency technologists. “These 7 young men out of NSA spent 5 years building a big data store in Washington and now we have it in Boston,” said Lynch, who co-founded Vertica.
sqrrl enables organizations to securely leverage all of their data and build powerful real-time Big Data applications using Apache Accumulo. These applications span a wide range of industries, including finance, healthcare, energy, consumer Internet, and government, where solutions demand fine-grained access controls that promote data integration and sharing without impacting performance or analytic adaptability.
Apache Accumulo (supported by sqrrl) is the only non-relational database with cell-level security, and it provides organizations with entirely new Big Data capabilities.
Other databases have explored the concept of data-centric security through table-, document-, column-, and row-level restrictions, but these approaches are not sufficient.
Table- or document-level security is a blunt-force approach that requires locking down an entire document or table that may hold a variety of different data types.
Column-level security is only sufficient when the data schema is static, well known, and aligned with security concerns.
Row-level security can sometimes account for variations in accessibility of data from different sources, but breaks down when a single record conveys multiple levels of information or tables become more complicated than simple event logs.
Cell-level security introduces a powerful orthogonal dimension, supporting data-centric security independent of table design.
Using Apache Accumulo, data providers can finely control access to data through simple, explicit encoding of existing policies and requirements. We have found this model to be a highly extensible labeling language that scales effectively and efficiently to terabytes and petabytes of data. A brief sketch of what this looks like in code follows.
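Here is a minimal sketch of cell-level security using the Accumulo 1.4-era Java client API. The instance, table, label, and credential names are placeholders of our own, not from any particular deployment.

import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

public class CellLevelSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder instance name, ZooKeeper address, and credentials.
        Connector conn = new ZooKeeperInstance("myInstance", "localhost:2181")
                .getConnector("root", "secret".getBytes());
        if (!conn.tableOperations().exists("patients"))
            conn.tableOperations().create("patients");

        // Each cell carries its own visibility expression: this diagnosis
        // is readable only by users holding the "doctor" or "billing" label.
        BatchWriter writer = conn.createBatchWriter("patients", 1000000L, 60000L, 2);
        Mutation m = new Mutation(new Text("patient-001"));
        m.put(new Text("record"), new Text("diagnosis"),
              new ColumnVisibility("doctor|billing"), new Value("...".getBytes()));
        writer.addMutation(m);
        writer.close();

        // Scans pass the caller's authorizations; cells whose expressions
        // are not satisfied are filtered out server-side, so table design
        // never has to change to accommodate a new security policy.
        Scanner scan = conn.createScanner("patients", new Authorizations("billing"));
        for (Entry<Key, Value> e : scan)
            System.out.println(e.getKey() + " -> " + e.getValue());
    }
}

Note that the scanning user must first be granted the matching labels by an administrator (for example, via conn.securityOperations().changeUserAuthorizations), so authorizations remain centrally controlled.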
Cell-level security opens up new possibilities within the Big Data and Hadoop ecosystem. Using Apache Accumulo, organizations are no longer constrained by security and privacy requirements in conducting Big Data analytics. With Accumulo, organizations can move past the concern that NoSQL = No Security.