Tuesday, September 6, 2011

NoSQL Now Talk:Finding the Right Data Solution for Your Application in the Data Storage Haystack

Following are the slides for my NoSQL Now talk. Talk walk through different NoSQL choices and make concreate recommendations on which one to should be used when. End of the day, it provides tables that give you the choices directly. You can find the abstract from here.


Following table depicts the key idea of the talk. The table provide data store recommendations for given usecase (application) based on three properties. This only covers structured data, and slides have tables for unstructured and semi-structured data.  Thee properties are
  1. type of Search needed by the application(Different Columns, colored Blue)
  2. amount of Scale needed by the application (colored green)
  3. amount of consistency required by the application (colored brown)
Here we are presenting 3D data using a 2D table. Repeated columns under each type of scale column sets takes care of that. For example, "DB" in the forth row forth column says "if you need where clause like search, with small scale and transactions, use a DB". Other cells use the same idea.
The notation I use is KV: Key-Value Systems, CF: Column Families, Doc: document based Systems. Questions marks in the table means, it might work, but you should verify. The table only put the recommendations and does not exactly say how I come up with the recommendations. More details are in the slides, and I will get out a writeup soon. Some of the key ideas are

  1. Transactions and Joins does not scale great
  2. KV scale most, then CF and Doc models, then DB. So if KV is good enough, go for that. 
  3. Offline case have time to do MapReduce and walk through the data. 
  4. If you need transactions or Joins with scale, you have to try partitioned DBs. But you have to try and see, and it might not work either. If it does not, you are out of luck. 
Small (1-3 nodes)
Scalable (10 nodes)
Highly Scalable (1000s nodes)
Loose
Consistency
Operation Consistency
ACID
Transactions
Loose
Consistency
Operation Consistency
ACID
Transactions
Loose
Consistency
Operation Consistency
ACID
Transactions
Primary Key
DB/ KV/ CF
DB/ KV/ CF
DB
KV/CF
KV/CF
DB?
KV/CF
KV/CF
No
Where
DB/  CF/Doc
DB/  CF/Doc
DB
CF/Doc(?)
CF/Doc (?)
DB?
CF/Doc
CF/Doc
No
JOIN
DB
DB
DB
??
??
??
No
No
No
Offline
DB/CF/Doc
DB/CF/Doc
DB/CF/Doc
CF/Doc
CF/Doc
No
CF/Doc
CF/Doc
No

Survey of System Management Frameworks


Following is couple of tables I took out of Survey I did on system management frameworks. A detailed writeup of this survey is in my thesis under related works. I want to write up a survey paper on this, but have not got to that yet.

Lets start with brief outline on rational for the choice of matrices to compare different implementations. System management frameworks monitor a system, take decisions on how to fix things, and execute those decisions. They often consist of three parts: Sensors, a brain that make decisions, and set of actuators to execute decisions. From 10,000-foot view, most architectures follow this model, but differ on how they collect data, make decisions, and execute them. First table compares them on this aspect.

System management frameworks themselves have to scale and provide High availability. To do those, they have to be composed of multiple servers (managers). Taking decisions with multiple managers (coordination) is one of the key challenges in system management framework design. The second table compares and contrast different coordination models using different architectural styles.

The 3rd table discusses decision models—in other words, implementation of the brain. More details can find from http://people.apache.org/~hemapani/research/system-management-survey.html.

Systems based on Functionality/Design



Sensors->Brain-> (Actuators)? InfoSpect(Prolog),WRMFDS (AI), Rule-basedDignosis (Hierachy) (Monitoring & Dignosis)
Ref Archi, Autopilot, Policy2Actions (Centralized)
JADE, JINI-Fed, ReactiveSys, (Managers)
Tivoli (Managers + Manual Coordinator)
Sensors->Gauges->Brain-> (Actuators)? Paradyn MRNet (Monitoring Only)
Rainbow (Archi Model*), ACME-CMU, CoordOfSysMngServices, Code
InfraMonMangGrid(Centralized)
eXtreme(KX),Galaxy* (Managers)
Sensors->DataModel->Brain->Actuators DealingWithScale(CIM), Marvel( Centralized)
Scalable Management (PLDB/Managers), Nester (Distributed directory service), SelfOrgSWArchi
Sensors->ECA->Actuators DREAM, SmartSub, IrisLog, HiFi, Sophia (Prolog) - (Managers)
Management Hierarchy NaradaBrokerMng, WildCat, BISE,ReflexEngine
Queries on demand ACME , Decentralizing Network Management, Mng4P2P (Managers)
Workflow systems/ Resource management Unity, RecipeBased (Centralized),
SelfMngWFEngine, Automate Accord, ScWAreaRM, P2PDesktopGrid (Distributed)
Fault tolerant systems WSRF-Container, GEMS (Distributed)
Decentralized DMonA,Guerrilla, K-Componets
Deployment frameworks SmartFrog , Plush, CoordApat
Monitoring Inca (Centralized), PIPER (multicast over DHT), MDS (LDAP, Events), NWS(LDAP), Scalable Info management, Astorable (Gossip), Monitoring with accuracy, RGMA (DB like access)

Systems based on Coordination and Communication model


Monitoring Only One Manager without coordination One Manager with coordination Managers without coordination Managers with coordination
Pull / events InfoSpect
Sophia
Rainbow, Unity, Ref Archi, Autopilot, InfraMonMangGrid CoordOfSysMngServices JADE DMonA, Tivoli (Manual)
Pub/Sub hierarchy
DealingWithScale, ACME-CMU
Scalable Management, DREAM, SmartSub NaradaBrokerMng,
P2P PIPER,

Automate Accord, Mng4P2P, WSRF-Container,
Hierarchy Gangila, Globus MDS, Paradyn MRNet, WRMFDS

IrisLog, HiFi, JINI-Fed, eXtreme(KX), ScWAreaRM WildCat
Gossip Astorable, GEMS

Galaxy
Spanning Tree (Network/P2P)


Scalable Info management, ACME
Group Communication


ReactiveSys, Galaxy SelfOrgSWArchi
Distributed Queue
SelfMngWFEngine


Decision Models in management Systems


Rules Conflict resolution Verifier Batch Mode used? Meta-model used? Planning
DIOS++ If/then yes, use priority No results of rules applied in next iteration Yes No
Rainbow If/then
No
Yes No
InfoSpect Prolog Like Only Monitoring No No Yes No
Marvel(1995) PreCnd->action->PostCond
Yes No Yes No
CoordOfSysMngServices
Yes (Detect ->Human Help) Yes Yes No Yes
Sophia Prolog like
No No
No
RecipeBased Java Code Yes
No No No
HiFi, DREAM pub/sub Filter->action No No No No No
IrisLog DB triggers No No No No No
ACME (timer/sensor/completion)
conditions->action
No No possible with timer conditions Yes No
ReactiveSys (1993) if/then No No No Yes No
Policy2Actions (2007) Policy- name, conditions (name of method to run + parameters), actions, target component Yes, based on runtime state + history There are tests associated with each actions that decide should action need to run - No Yes

Friday, August 19, 2011

WSO2 Stratos in contrast to Other Java PasS Offerings

In the article "Java PaaS shootout" Michael J. Yuan provides a pretty nice comparison of Google AppEngine, Amazon Elastic Beanstalk, and CloudBees RUN@Cloud. Following table provides a summary of that while adding WSO2 Stratos to the comparison.

App Engine
Amazon beanstalk
 CloudBee's Run@Cloud
WSO2 Stratos
What is it?
Users can upload servlets. AppEngine hosts them and manage them. 
Managed Tomcat
Expensive
Tomcat, load balancer. Integrated with SVN. Can change source code and update all deployment aspects.
SOA platform as a service.
Fully multi-tenant.
Java support
Yes, does not support some I/O and network operations
Full java
Full Java
Yes, but File access is limited
Outbound connections
Time out in 10 seconds
OK
OK
OK
Support for standard java Libs
Have problems when they use unsupported APIs
Yes
Yes
Yes (java security manager limits file accesses)
Performance and scalability
Auto scale, High scalability, but have bit high latency.
Swapping the app out might slow down first request
Auto scale by creating EC2 instances
Can swap unused processes out of JVM. Can load balance multiple tomcats in the same EC2 instance
Can lazy load services and other artifacts.
Auto scale (up down) by monitoring the load and creating new nodes. LB route the requests.
Storage
Support Big Table and Hosted MySQL. 
However, search support in BigTable case is limited. 
e.g.  Each query can only have 100 results.
Support RDS (relational) , SimpleDB (NoSQL) or can run with your own DB
Has managed MySQL databases and provide a console to manage them
Support Cassandra as a Service, managed MySQL, and HDFS. 
Cassandra and HDFS support native multi-tenancy
Import/ export data
No (hard due to 30 sec limit)
Can write code to automate
Can write code to automate
Can write code to automate
Integration with others
Integrate well with other Google services
SQS, SES (email service), payment APIs
S3, SQS, SES etc.
With Google auth model and other WSO2 services. Also S3, SQS, SES etc. 
Session handling
Store sessions to storage and handles them seamlessly
Only sticky sessions
Transparent session management
Only sticky sessions
Multi-tenancy
Yes
No
No
Yes


Also let me listout couple of key differentiators of WSO2 Stratos.

  1. All these offerings support Web App Hosting as a Service. WSO2 Stratos supports that, but provides much more. In addition to Web App Hosting, it supports hosting Axis2 based services, Mediation, and Workflow hosting as a Service. It is real SOA platform as a Service, the only one to does that. 
  2. It let you move your Axis2 based Web Services (.aar files) and workflows to the Cloud (to WSO2 Statos Live) without any change to them. If you have some Axis2 based services, chances are that you can upload them to WSO2 Stratos and it will just work. 
  3. WSO2 Stratos provides real multi-tenancy support. That is different tenants will think that he has his own server, while actually all are served from one Java Server. In other words, Isolation is done at Java level, not at Virtualization level. That means it can provide greater sharing and provide "Pay as you go" and "Pay for what you use" better than VM based model. Only AppEngine does that out of other three PaaS offering. More details are in following papers. 
    • A. Azeez, S. Perera, D. Gamage et al. (2010) Multi-tenant SOA Middleware for Cloud Computing, 458–465. In 2010 IEEE 3rd International Conference on Cloud Computing.
    • A. Azeez and S. Perera et al., WSO2 Stratos: An Industrial Stack to Support Cloud Computing, IT: Methods and Applications of Informatics and Information Technology Journal, the special Issue on Cloud Computing, 2011.
    • Milinda Pathirage, Srinath Perera, Sanjiva Weerawarana, Indika Kumara, A Multi-tenant Architecture for Business Process Execution, 9th International Conference on Web Services (ICWS), 2011

What can you do with WSO2 Platform?


We have explained WSO2 platform in many ways. Here I am trying to take a more scenario driven approach.

1. Implementing Business Logic

If you are looking to build a SOA based architecture, first step is to implement your business logic as a set of services.  Users can use WSO2 Application Server (AS) to do this, and AS enables users to implement a business logic using any of the following methods.
  1. Using Java Code (WSO2 AS, any Axis2 .aar file works)
  2.  Exposing data in a data source (e.g. Relational Database, CSV file etc., see WSO2 Data services Server)
  3.  Using Business Rules (Supports Drools, see WSO2 BRS)
  4.   Using a script language (e.g. Javascript, Phython, see WSO2 AS)
Once the service is created, users can secure them using HTTPs or WS-Security. Furthermore, he can enable most of the WS-* support also using WSO2 Application Server. 

2. Mediating Messages

Often SOA deployment has third party services, legacy applications and other types of frication that force the architecture to mediate messages that flow between services.   Users can use WSO2 Enterprise Service Bus (ESB) to do this. Mediation can be of several forms. (* there are others, and following are only some of them.)
  1. Message transformations
  2. Message editing
  3.  Message filtering
  4.  Route (e.g. load balancing, Pub/Sub)
  5.  Support Enterprise Integration Patterns (EIP)

3. Creating the User Interface Layer

Often these services have user facing components. WSO2 platform provides two choices to do this.
  1.  Java Web Applications (.war file) in WSO2 AS
  2.  Google Gadget support in WSO2 Gadget Server

4. Composing Services

If the target application is complicated, specially when service execution flows are dynamic and themselves contains business logic, users may opt to compose those services to create business processes. WSO2 platform provides three choices for doing this.
  1. Create the process as a BPEL document, deploy, and run the business process using WSO2 Business Process Server
  2. Create the process as a mediation sequence using WSO2 ESB
  3. Create the process using java scripts with WSO2 Mashup Server

5. Scaling

If some of the services receive higher loads that exceed single node capacity, users would want to scale the system by clustering those bottleneck service. Exact details are usecase specific, but WSO2 platform support clustering through Axis2 clustering
Cross Cutting Concerns 
In addition, WSO2 platform supports two cross cutting aspects: namely governance and business actively monitoring.
1. Govern SOA
If a SOA deployment has many services and they goes through frequent business logic and configurations changes, ad-hoc management of the deployment could be very complex. WSO2 Governance registry let users store all service WSDL in a single repository and manage their configurations and lifecycles from a single point. 
2. Monitoring the Business
Most businesses needs a detailed close to real time view of their activities. WSO2 Business Activity Monitoring (BAM) let users define trace points in activities (service executions). At the runtime, BAM  and monitors data collected at those trace points, and present them to end users through Gadgets that aggregate and visualize them.

Wednesday, August 17, 2011

WSO2 Conference 2011

Tuesday, August 16, 2011

Useful data sets

If you are doing a performance test, it always a good thing to do that using a real dataset. Following are several useful datasets.

Often you can find data in the CSV format, and then parsing and using it is pretty easy.
  1. DLPB catalog - http://kdl.cs.umass.edu/data/dblp/dblp-info.html - this is data about publications. About 900MB raw size. 
  2. Google Fusion tables, http://www.google.com/fusiontables/Home
     - this has several useful datasets as CSV.
  3. Federal reserve economic data - http://research.stlouisfed.org/fred2/  
  4. Amazon public datasets - aws.amazon.com/publicdatasets
There are lot more. Following are some of them. If anyone knows list giving a sizes of datasets and nature of datasets, please let me know. 
  • http://dvn.iq.harvard.edu/dvn/dv/cid
  • http://www.thejanuarist.com/9-fascinating-datasets-available-online-for-free/
  • http://bios.dfg.ca.gov/dataset_index.asp
  • http://www.uic.edu/orgs/rin/dataset.html
  • http://www.nas.nasa.gov/Resources/datasets.html
  • http://www.datawrangling.com/some-datasets-available-on-the-web
  • http://news.ycombinator.com/item?id=2165497
  • https://bitly.com/bundles/hmason/1
  • http://www.gutenberg.org/wiki/Gutenberg:Feeds
  • http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
  • http://getthedata.org/
  • http://news.ycombinator.com/item?id=2165497



No comments:

Post a Comment