Welcome to RubiX’s documentation!¶
RubiX is a lightweight data caching framework for big data engines. Through plugins, RubiX can be extended to support any engine that accesses data in cloud stores through the Hadoop FileSystem interface. The same plugin mechanism also allows RubiX to be extended to work with any cloud store.
Use Case¶
RubiX provides disk or in-memory caching of data that would otherwise be read over the network from a cloud store, which improves performance.
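The general idea of serving repeated reads from a local cache instead of the network can be illustrated with a toy read-through cache. This is only a sketch of the concept and not RubiX's implementation; the class name, the `fake_remote_read` function, and the example path are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Toy in-memory read-through cache keyed by (path, offset, length)."""

    def __init__(self, remote_read: Callable[[str, int, int], bytes]) -> None:
        self._remote_read = remote_read
        self._blocks: Dict[Tuple[str, int, int], bytes] = {}
        self.hits = 0
        self.misses = 0

    def read(self, path: str, offset: int, length: int) -> bytes:
        key = (path, offset, length)
        if key in self._blocks:
            # Served locally: no network access needed.
            self.hits += 1
            return self._blocks[key]
        # Not cached yet: fetch from the remote store and remember the result.
        self.misses += 1
        data = self._remote_read(path, offset, length)
        self._blocks[key] = data
        return data


def fake_remote_read(path: str, offset: int, length: int) -> bytes:
    # Hypothetical stand-in for a slow read from a cloud store.
    return bytes((offset + i) % 256 for i in range(length))


cache = ReadThroughCache(fake_remote_read)
first = cache.read("s3://example-bucket/file.orc", 0, 4)   # goes to the "remote" store
second = cache.read("s3://example-bucket/file.orc", 0, 4)  # served from the cache
print(cache.hits, cache.misses)
```

The second read of the same range never calls the remote function, which is the effect a cache is meant to achieve for data that lives in a cloud store.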
Supported Engines and Cloud Stores¶
- Presto: Amazon S3
- Spark: Amazon S3
- Any engine that uses hadoop-2 or hadoop-1 (for example, Hive): Amazon S3
Installation Guide¶
This section provides instructions to install RubiX and use it with Presto or Hive. The instructions were tested on an EMR 5.X cluster. If you need help installing on-premises or on other distributions, please contact the community with your questions or issues.
The instructions require RubiX Admin.
Start and connect to master¶
Start an EMR Cluster¶
Create an EMR cluster. Use the ‘advanced option’ to create a cluster with the required spec.
- Specify at least 30GB of space for the Root device EBS volume.
- Choose an EC2 key pair so that you can log in to the cluster.
Connect to Master Node¶
Log in to the master node and all worker nodes as the “hadoop” user, and install the required libraries.
ssh <key> hadoop@master-public-ip
Set up and check passwordless SSH between the cluster machines:
ssh hadoop@localhost
ssh hadoop@worker-public-ip
Install and start RubiX¶
Install RubiX Admin¶
pip install rubix_admin
Update Config File¶
rubix_admin -h
This creates the rubix-admin config file at ~/.radminrc with the following format:
hosts:
- localhost
- worker-ip1
- worker-ip2
..
remote_packages_path: /tmp/rubix_rpms
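For clusters with many workers, the host list can be generated rather than typed by hand. The following is a minimal sketch that renders text in the format shown above; it assumes only the two keys that appear there (`hosts` and `remote_packages_path`), and the function name `render_radminrc` is hypothetical. Writing the result to ~/.radminrc is left to the reader.

```python
from typing import List


def render_radminrc(hosts: List[str],
                    remote_packages_path: str = "/tmp/rubix_rpms") -> str:
    """Render a config in the format shown above (hosts list plus package path)."""
    lines = ["hosts:"]
    lines += [f"  - {host}" for host in hosts]
    lines.append(f"remote_packages_path: {remote_packages_path}")
    return "\n".join(lines) + "\n"


# Placeholder host names taken from the format example above.
config_text = render_radminrc(["localhost", "worker-ip1", "worker-ip2"])
print(config_text)
```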
Install RubiX¶
rubix_admin installer install
This installs the latest version of RubiX. To install a specific version of RubiX:
rubix_admin installer install --rpm-version <RubiX Version>
To install from an RPM file:
rubix_admin installer install --rpm <path-to-rubix-rpm>
To enable debug logging and see RubiX activity, create the file /usr/lib/presto/etc/log.properties with the following config:
com.qubole=DEBUG
Start RubiX Daemons¶
rubix_admin daemon start --debug
# To verify that the daemons are up, check that process IDs are listed for both
# BookKeeperServer and LocalDiscoveryServer.
sudo jps -m
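The check above can also be scripted. The sketch below scans captured `jps -m` output for the two daemon names; the function name `missing_daemons` and the sample output are made up for illustration, and real output will contain different process IDs and arguments.

```python
from typing import Iterable, Set

REQUIRED_DAEMONS = ("BookKeeperServer", "LocalDiscoveryServer")


def missing_daemons(jps_output: str,
                    required: Iterable[str] = REQUIRED_DAEMONS) -> Set[str]:
    """Return the required daemon names that do not appear in `jps -m` output."""
    tokens = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line starts with a numeric process ID followed by the
        # main class (or jar) and, with -m, the arguments passed to main.
        if len(parts) >= 2 and parts[0].isdigit():
            tokens.update(parts[1:])
    # Substring match so that fully qualified class names also count.
    return {name for name in required
            if not any(name in token for token in tokens)}


# Hypothetical sample output: only one of the two daemons is running.
sample_output = "4242 BookKeeperServer\n4310 Jps -m\n"
print(missing_daemons(sample_output))
```

In practice the text would come from running the command itself, for example with `subprocess.run(["sudo", "jps", "-m"], capture_output=True, text=True).stdout`.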
Apache Hive¶
Add RubiX jars to Hadoop ClassPath¶
Copy jars to Hadoop Lib
cp /usr/lib/rubix/lib/rubix-* /usr/lib/hadoop/lib
OR
Add the RubiX jars from the Hive CLI
add jar /usr/lib/rubix/lib/rubix-bookkeeper.jar
add jar /usr/lib/rubix/lib/rubix-core.jar
add jar /usr/lib/rubix/lib/rubix-hadoop2.jar
Restart Hive Metastore Server¶
hive --service metastore --stop
hive --service metastore --start
Configure Apache Hive to use RubiX FileSystem¶
hive --hiveconf fs.rubix.impl=com.qubole.rubix.hadoop2.CachingNativeS3FileSystem \
  --hiveconf fs.rubix.awsAccessKeyId=<AWS ACCESS KEY> \
  --hiveconf fs.rubix.awsSecretAccessKey=<AWS SECRET ACCESS KEY>
(Advanced) Configure Apache Hive to use RubiX FileSystem for S3 and S3A schemes¶
If you use this option, all tables with their location in AWS S3 will automatically start using RubiX.
hive --hiveconf fs.s3n.impl=com.qubole.rubix.hadoop2.CachingNativeS3FileSystem \
  --hiveconf fs.s3.impl=com.qubole.rubix.hadoop2.CachingNativeS3FileSystem \
  --hiveconf fs.s3a.impl=com.qubole.rubix.hadoop2.CachingS3AFileSystem
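When the Hive client is launched from a script, the overrides can be kept in one mapping and expanded into arguments. This is a small sketch, assuming each property is passed with its own `--hiveconf` flag (the form used in the Start Hive Client command); the helper name `hive_command` is hypothetical.

```python
import shlex
from typing import Dict, List


def hive_command(conf: Dict[str, str]) -> List[str]:
    """Build a hive argument list with one --hiveconf flag per property."""
    argv = ["hive"]
    for key, value in conf.items():
        argv += ["--hiveconf", f"{key}={value}"]
    return argv


# The three scheme overrides listed above.
s3_overrides = {
    "fs.s3n.impl": "com.qubole.rubix.hadoop2.CachingNativeS3FileSystem",
    "fs.s3.impl": "com.qubole.rubix.hadoop2.CachingNativeS3FileSystem",
    "fs.s3a.impl": "com.qubole.rubix.hadoop2.CachingS3AFileSystem",
}
argv = hive_command(s3_overrides)
print(shlex.join(argv))  # shell-quoted form, suitable for logging
```

The resulting list can be passed directly to `subprocess.run` without going through a shell.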
Run your first query using RubiX¶
Start Hive Client¶
hive --hiveconf hive.metastore.uris="" --hiveconf fs.rubix.impl=com.qubole.rubix.hadoop2.CachingNativeS3FileSystem
Create External Table¶
CREATE EXTERNAL TABLE wikistats_orc_rubix
(language STRING, page_title STRING,
hits BIGINT, retrived_size BIGINT)
STORED AS ORC
LOCATION 'rubix://emr.presto.airpal/wikistats/orc';
Run Query (Presto or Hive CLI)¶
SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats_orc_rubix
WHERE language = 'en'
AND page_title NOT IN ('Main_Page', '404_error/')
AND page_title NOT LIKE '%index%'
AND page_title NOT LIKE '%Search%'
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;
RubiX Stats (supported on Presto only)¶
The cache statistics are pushed to an MBean named rubix:name=stats. To check the stats, run:
SELECT Node, CachedReads,
ROUND(extrareadfromremote,2) as ExtraReadFromRemote,
ROUND(hitrate,2) as HitRate,
ROUND(missrate,2) as MissRate,
ROUND(nonlocaldataread,2) as NonLocalDataRead,
NonLocalReads,
ROUND(readfromcache,2) as ReadFromCache,
ROUND(readfromremote, 2) as ReadFromRemote,
RemoteReads
FROM jmx.current."rubix:name=stats";
Metrics¶
These are the metrics currently available for RubiX.
RubiX Metric | Description |
---|---|
rubix.bookkeeper.live_workers.gauge | The number of workers currently reporting to the master node. |
rubix.bookkeeper.cache_eviction.count | The number of entries evicted from the local cache. |
rubix.bookkeeper.cache_hit_rate.gauge | The percentage of cache hits for the local cache. |
rubix.bookkeeper.cache_miss_rate.gauge | The percentage of cache misses for the local cache. |
rubix.bookkeeper.cache_size.gauge | The current size of the local cache in MB. |
rubix.bookkeeper.local_request.count | The number of requests made for data cached locally. |
rubix.bookkeeper.nonlocal_request.count | The number of requests made for data cached on another node. |
rubix.bookkeeper.remote_request.count | The number of requests made for data not currently cached. |
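To make the relationship between the request counters and the rate gauges concrete, here is one possible reading of them as arithmetic. It assumes that a hit is any request served from a cache (local or on another node) and that a miss is a request for data not currently cached. This is an assumption made for illustration, not a statement of how RubiX computes its gauges, and the function name `cache_rates` is hypothetical.

```python
from typing import Tuple


def cache_rates(local: int, nonlocal_: int, remote: int) -> Tuple[float, float]:
    """Return (hit_rate, miss_rate) as percentages of all requests.

    Assumption: local and non-local requests count as hits,
    remote requests count as misses.
    """
    total = local + nonlocal_ + remote
    if total == 0:
        # No requests yet, so report zero rather than dividing by zero.
        return 0.0, 0.0
    hit_rate = 100.0 * (local + nonlocal_) / total
    miss_rate = 100.0 * remote / total
    return hit_rate, miss_rate


hit_rate, miss_rate = cache_rates(local=60, nonlocal_=20, remote=20)
print(hit_rate, miss_rate)
```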
Contribution Guidelines¶
This section provides guidelines for contributing to the project through code, issues, and documentation.
Developer Environment¶
RubiX is a Maven project and uses Java 8. It uses JUnit as the testing framework. Ensure that you have a development environment that supports this configuration.
Fork your own copy of RubiX into your GitHub account by clicking the “Fork” button.
Navigate to your account and clone that copy to your development box.
git clone https://github.com/<username>/rubix
Run tests in the RubiX root directory.
mvn test
Add Qubole RubiX as upstream.
git remote add upstream https://github.com/qubole/rubix.git
git fetch upstream
How to contribute code on GitHub¶
1. Create a branch and start working on your change.¶
cd rubix
git checkout -b new_rubix_branch
2. Code¶
- Adhere to code standards.
- Include tests and ensure they pass.
3. Commit¶
For every commit, please write a short summary (max 72 characters) on the first line, followed by a blank line and then a more detailed description of the change.
Don’t forget a prefix!
More details are in the Commit Message section.
4. Update your branch¶
git fetch upstream
git rebase upstream/master
5. Push to remote¶
git push -u origin new_rubix_branch
6. Issue a Pull Request¶
- Navigate to the RubiX repository you just pushed to (e.g. https://github.com/your-user-name/rubix)
- Click Pull Request.
- Write your branch name in the branch field (this is filled with master by default)
- Click Update Commit Range.
- Ensure the changesets you introduced are included in the Commits tab.
- Ensure that the Files Changed incorporate all of your changes.
- Fill in some details about your potential patch including a meaningful title.
- Click Send pull request.
7. Respond to feedback¶
The RubiX team may recommend adjustments to your code. Part of interacting with a healthy open-source community requires you to be open to learning new techniques and strategies; don’t get discouraged! Remember: if the RubiX team suggests changes to your code, they care enough about your work that they want to include it, and hope that you can assist by implementing those revisions on your own.
8. Postscript¶
Once all the changes are approved, one contributor will push the change to the upstream code.
Coding conventions¶
- Indent with two spaces, no tabs.
- No trailing whitespace; blank lines should contain no spaces.
- Do not mix multiple fixes into a single commit.
- Add comments for your future selves and for your current/future peers
- Do not make whitespace changes as part of your regular/feature commits.
- If you feel whitespace issues need to be fixed, please push a separate commit for the same. It will be approved quickly without any discussion.
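The whitespace rules in the list above are easy to check before committing. The following is a minimal sketch that flags tab characters and trailing whitespace (including blank lines that contain spaces); it does not verify indentation depth, and the function name `whitespace_violations` and the sample text are hypothetical.

```python
from typing import List, Tuple


def whitespace_violations(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for tabs and trailing whitespace."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((number, "tab character"))
        if line != line.rstrip():
            # Also catches blank lines that contain only spaces.
            problems.append((number, "trailing whitespace"))
    return problems


# Line 2 uses a tab and ends with a space; line 5 is blank but contains spaces.
sample = "class A {\n\tint x; \n\n  int y;\n   \n}\n"
for number, problem in whitespace_violations(sample):
    print(f"line {number}: {problem}")
```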
Commit Message¶
Commits are used as a source of truth for various reports. A couple of examples are:
- Release Notes
- Issues resolved for QA to plan the QA cycle.
To be able to generate these reports, uniform commit messages are required. All your commits should follow the following convention:
For every commit, please write a short summary (max 72 characters) on the first line, followed by a blank line and then a more detailed description of the change.
Format of summary:
ACTION: AUDIENCE: COMMIT_MSG
Description:
ACTION is one of 'chg', 'fix', 'new'
Is WHAT the change is about.
'chg' is for refactor, small improvement, cosmetic changes...
'fix' is for bug fixes
'new' is for new features, big improvement
AUDIENCE is one of 'dev', 'usr', 'pkg', 'test', 'doc'
Is WHO is concerned by the change.
'dev' is for developers (API changes, refactors...)
'usr' is for final users
You will use your environment’s default editor (EDITOR=vi|emacs) to compose the commit message. Do NOT use the command line git commit -m "my mesg", as this tends to produce a single-line message that most of the time turns out to be useless to others reading or reviewing your commit.
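The summary format and the blank-line rule can be checked mechanically, for example from a local commit hook. The sketch below is one way to do that and is not part of the RubiX tooling. It assumes lowercase ACTION and AUDIENCE tokens, each followed by a colon and a single space, and the function name `check_commit_message` is hypothetical.

```python
import re
from typing import List

ACTIONS = ("chg", "fix", "new")
AUDIENCES = ("dev", "usr", "pkg", "test", "doc")
MAX_SUMMARY_LENGTH = 72

# ACTION: AUDIENCE: COMMIT_MSG
SUMMARY_RE = re.compile(
    r"^(?P<action>[a-z]+): (?P<audience>[a-z]+): (?P<message>\S.*)$"
)


def check_commit_message(message: str) -> List[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    errors = []
    lines = message.splitlines()
    summary = lines[0] if lines else ""
    if len(summary) > MAX_SUMMARY_LENGTH:
        errors.append(
            f"summary is {len(summary)} characters, limit is {MAX_SUMMARY_LENGTH}"
        )
    match = SUMMARY_RE.match(summary)
    if match is None:
        errors.append("summary does not match 'ACTION: AUDIENCE: COMMIT_MSG'")
    else:
        if match.group("action") not in ACTIONS:
            errors.append(f"unknown ACTION {match.group('action')!r}")
        if match.group("audience") not in AUDIENCES:
            errors.append(f"unknown AUDIENCE {match.group('audience')!r}")
    # The summary must be separated from the description by a blank line.
    if len(lines) > 1 and lines[1].strip():
        errors.append("second line must be blank")
    return errors


print(check_commit_message("fix: usr: handle empty cache directory"))
print(check_commit_message("fixed stuff"))
```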
Example¶
new: dev: #124: report liveness metric for BookKeeper daemon (#139)

Add a liveness gauge that the daemon is up & alive. Right now, this
is a simple check that a thread (reporter to be added in a subsequent
commit) is alive. In the future, this simple framework will be used
to add more comprehensive health checks. Ref: #140
The above example shows that the commit summary is:
- a single line composed of four columns
- column 1 tells us the nature of the change or ACTION: new
- a short one-line summary of WHAT the commit is doing
The description, or body, of the commit message goes into more detail and serves as a history for developers on the team of how the code is evolving. The description also has more immediate uses. When you raise pull requests to contribute to the project, your commit descriptions explain WHY you fixed an issue; HOW you fixed it is already explained by the code. This is also where peer reviewers begin to understand your code. An unclear commit message causes a lot of back and forth and frustration between reviewers and committers.
Reference: http://chris.beams.io/posts/git-commit/
How to report issues¶
A bug report means something is broken, preventing normal/typical use of RubiX.
Make sure the bug isn’t already resolved. Search for similar issues.
Make sure you have clear instructions to reproduce your problem.
If possible, submit a Pull Request with a failing test or, if you’d rather take matters into your own hands, try to fix the bug yourself.
Make a report of everything you know about the bug so far by opening an issue about it.
When the bug is fixed, you can usually expect to see an update posted on the reporting issue.
Documentation Style Guide¶
- Documentation uses the Sphinx documentation generator.
- Documentation is hosted on ReadTheDocs.
- File issues if you notice bugs in the documentation or want to request more information. Label these issues with doc.
- Contributions to the documentation are accepted as Pull Requests.
- Choose Markdown if you are adding new pages.
- Choose reStructuredText (rst) for indexes or if the documentation needs tables.