Category Archives: HBase|Hive

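Common Sqoop Operations

April 2, 2015 by debugo · 14 Comments

This script was put together by colleagues at 香打.
First, make sure HDFS and HiveServer2 are running normally. The cluster runs on three hosts: debugo01, debugo02, and debugo03.

1. Prepare the MySQL data

On the MySQL instance on debugo03, create a test database and a test table, employee_salary.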

mysql -uroot -p
mysql> create database test_sqoop;
Query OK, 1 row affected (0.00 sec)
mysql> use test_sqoop;
SET FOREIGN_KEY_CHECKS=0;
DROP TABLE IF EXISTS `employee_salary`;
CREATE TABLE `employee_salary` (
  `name` text,
  `id` int(8) NOT NULL AUTO_INCREMENT,
  `salary` int(8) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;
INSERT INTO `employee_salary` VALUES ('zhangsan', '1', '5000');
INSERT INTO `employee_salary` VALUES ('lisi', '2', '5500');
commit;

CREATE USER 'test'@'%' IDENTIFIED BY 'test';
GRANT ALL PRIVILEGES ON test_sqoop.* TO 'test'@'%';

Continue reading →

Posted in BigData, HBase|Hive.

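OpenTSDB Deployment Notes

March 13, 2015 by debugo · 10 Comments

OpenTSDB is a platform built on HBase for collecting and displaying real-time monitoring information. It collects metrics at second-level granularity, stores them permanently in HBase, can be used for capacity planning, and plugs easily into existing monitoring systems. OpenTSDB can gather metrics from a large number of devices and then store, index, and serve them, which makes the data easier to understand, for example through web-based and graphical views. Continue reading →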

Posted in BigData, HBase|Hive, NoSQL, Tools.

HBase Directory Structure and Compaction

March 12, 2015 by debugo · 2 Comments

Let's first look at how HBase is stored in HDFS. Several directories can be found there:

hdfs dfs -ls -R /hbase

Temporary files: /hbase/.tmp
Archive: /hbase/archive
WAL logs: /hbase/WALs/debugo01 …
Data: /hbase/data// Continue reading →

Posted in BigData, HBase|Hive.

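Loading Data with the importtsv Command

Bulkload is a way to load data into HBase in bulk. It prepares the data directly as HFiles and hands those files straight to the RegionServers, which performs far better than loading the data through a MapReduce/Spark job. The detailed flow is as follows:
1. Extract the data into a file with a fixed format, such as CSV.
2. Convert the data into HFiles. This requires a MapReduce job: you can implement the Map method yourself and let HBase take care of the Reducer side. At the end, one HFile per region is created in the output directory.
3. Load the generated HFiles into HBase and register them on all RegionServers. This is the Complete Bulkload phase.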
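
The steps above can be modeled in plain Python to show why step 2 ends up with one file per region. This is only a toy sketch under stated assumptions: the region start keys, the CSV rows, and the helper name assign_to_regions are made up for illustration, and no HBase API is involved. A real job would write actual HFiles from MapReduce.

```python
import bisect
import csv
import io

# Hypothetical region start keys (assumption): three regions covering
# ["", "g"), ["g", "n") and ["n", end of key space).
REGION_START_KEYS = ["", "g", "n"]

# Step 1: data extracted into a fixed-format file (CSV, made-up rows).
CSV_DATA = "zhangsan,5000\nlisi,5500\nwangwu,6000\nalice,4000\n"


def assign_to_regions(rows, start_keys):
    """Group (rowkey, value) pairs by region and sort each group by row key.

    This mimics the output of the preparation job in step 2: one sorted
    batch of rows per region, each of which would become one HFile.
    """
    batches = {i: [] for i in range(len(start_keys))}
    for rowkey, value in rows:
        # The owning region is the last one whose start key is <= rowkey.
        region = bisect.bisect_right(start_keys, rowkey) - 1
        batches[region].append((rowkey, value))
    for batch in batches.values():
        batch.sort()
    return batches


rows = [tuple(r) for r in csv.reader(io.StringIO(CSV_DATA))]
batches = assign_to_regions(rows, REGION_START_KEYS)
for region, batch in batches.items():
    print(region, batch)
```

Because every batch is already sorted and falls inside a single region, step 3 only has to hand each file to the RegionServer that owns that region, with no per-row writes.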
Continue reading →

Posted in BigData, HBase|Hive, NoSQL.

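HBase Access Control

March 10, 2015 by debugo · 10 Comments

HBase's permission management relies on coprocessors. We therefore need to set hbase.security.authorization=true and make both hbase.coprocessor.region.classes and hbase.coprocessor.master.classes include org.apache.hadoop.hbase.security.access.AccessController, which provides the access control. The following parameters need to be set: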

<property>
  <name>hbase.superuser</name>
  <value>hbase</value>
</property>
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.coprocessor.master.classes</name>
  <value>org.apache.hadoop.hbase.security.access.AccessController</value>
</property>
<property>
  <name>hbase.rpc.engine</name>
  <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value>
</property>
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
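
Since a missing or misspelled property silently leaves authorization off, a small check of the configuration file can help. The sketch below uses only Python's standard library. The sample XML string, the <configuration> wrapper around the properties, and the helper names load_properties and check_authorization are assumptions made for illustration. They are not part of HBase.

```python
import xml.etree.ElementTree as ET

ACCESS_CONTROLLER = "org.apache.hadoop.hbase.security.access.AccessController"

# Sample hbase-site.xml content (assumption: properties sit under a
# <configuration> root element, as in Hadoop-style site files).
SAMPLE_SITE_XML = """
<configuration>
  <property>
    <name>hbase.coprocessor.region.classes</name>
    <value>org.apache.hadoop.hbase.security.access.AccessController</value>
  </property>
  <property>
    <name>hbase.coprocessor.master.classes</name>
    <value>org.apache.hadoop.hbase.security.access.AccessController</value>
  </property>
  <property>
    <name>hbase.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style <property> entries."""
    root = ET.fromstring(xml_text.strip())
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # skip malformed entries without a <name>
        props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_authorization(props):
    """Return a list of problems; an empty list means the three settings are present."""
    problems = []
    if props.get("hbase.security.authorization", "").lower() != "true":
        problems.append("hbase.security.authorization is not true")
    for key in ("hbase.coprocessor.region.classes",
                "hbase.coprocessor.master.classes"):
        # These properties hold comma-separated class lists.
        classes = [c.strip() for c in props.get(key, "").split(",")]
        if ACCESS_CONTROLLER not in classes:
            problems.append(key + " does not include AccessController")
    return problems


problems = check_authorization(load_properties(SAMPLE_SITE_XML))
print(problems or "authorization settings look complete")
```

To check a real file, read its text and pass it to load_properties. The check only confirms that the settings are present, not that the cluster has been restarted with them.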

Continue reading →

Posted in BigData, HBase|Hive, NoSQL.

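Commonly Used HBase Parameters

March 10, 2015 by debugo · 2 Comments

1. General and master configuration

hbase.rootdir
Default: file:///tmp/hbase-${user.name}/hbase
The root data directory for the region servers, where HBase persists its data. For example, to point to the '/hbase' directory in HDFS with the namenode running on port 8020 of debugo01, set it to hdfs://debugo01:8020/hbase. This setting is required: the default value, under /tmp on the local filesystem, is only usable in standalone mode. Continue reading →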

Posted in BigData, HBase|Hive, NoSQL.

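Common HBase Shell Operations

March 10, 2015 by debugo · 19 Comments

HBase Shell is HBase's command-line tool, which we can use to carry out maintenance operations on HBase. Run sudo -u hbase hbase shell to enter the HBase shell.
Inside the shell, status, version, and whoami return the current service status, the version, and the logged-in user together with its authentication method, respectively.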

> status
3 servers, 1 dead, 1.3333 average load
> version
0.98.6-cdh5.3.1, rUnknown, Tue Jan 27 16:43:50 PST 2015
> whoami
hbase (auth:SIMPLE)
groups: hbase

The help command in the HBase shell is very powerful: use help to list all commands, and help 'command_name' to get detailed information about a specific command. Continue reading →

Posted in BigData, HBase|Hive, NoSQL.

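Hive Notes – Query Practice

May 4, 2014 by debugo · 1 Comment

This post goes through some query techniques in Hive.

Initializing the data

First, create the database we need:

CREATE DATABASE Sales;
use Sales;

The first table is a date dimension table. From a date ID you can look up that date's year-month, year, month, day, day of the week, week number, quarter, ten-day period, half-month, and so on.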

CREATE TABLE DateList (
DateID string,
theyearmonth string,
theyear string,
themonth string,
thedate string,
theweek string,
theweeks string,
thequot string,
thetenday string,
thehalfmonth string
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ;

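The second table holds the order information. Its main fields are the order number, the transaction location ID, and the transaction date ID.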

CREATE TABLE OrderList(
ordernumber STRING,
locationid STRING,
dateID string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ;

The third table holds the order details: order number, row number, item, quantity, price, and amount.

CREATE TABLE OrderDetails(
ordernumber STRING,
rownum int,
itemid STRING,
qty INT,
price int,
amount int
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ;

Now load the data into the three tables:

LOAD DATA LOCAL INPATH '/var/lib/hive/DateList.txt' INTO TABLE DateList;
LOAD DATA LOCAL INPATH '/var/lib/hive/OrderList.txt' INTO TABLE OrderList;
LOAD DATA LOCAL INPATH '/var/lib/hive/OrderDetails.txt' INTO TABLE OrderDetails;

Check the data:

0: jdbc:hive2://debugo02:10000> select count(*) from OrderDetails;
| 287950  |

Data validation with HQL

ETL can leave some incorrect data behind, for example mismatches between OrderDetails and OrderList. We can find such records with HQL statements.

select count(*) from sales.OrderList a,sales.OrderDetails b where a.ordernumber=b.ordernumber;
...... 
287950
select count(*) from sales.OrderList a, sales.OrderDetails b,DateList c where a.ordernumber=b.ordernumber and a.dateid=c.dateid;
 ...... 
287942

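The second count is 8 rows lower, which suggests that some orders carry a dateid with no matching row in DateList. We can find those orders with a NOT IN query: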
select a.* from sales.OrderList a where a.dateid not in (select dateid from sales.DateList);
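
The NOT IN query above is an anti-join: it keeps the rows of one table whose key has no match in the other. A minimal Python sketch of the same logic follows. The sample rows and the function name orders_with_unknown_date are made up for illustration and do not come from the real data files.

```python
# Made-up sample rows (assumption): (ordernumber, locationid, dateid).
order_list = [
    ("A001", "L1", "20140101"),
    ("A002", "L2", "20140230"),  # this dateid is missing from date_ids
    ("A003", "L1", "20140102"),
]

# Made-up set of dateid values present in the date dimension table.
date_ids = {"20140101", "20140102"}


def orders_with_unknown_date(orders, known_date_ids):
    """Keep only the orders whose dateid has no match in the date table."""
    return [order for order in orders if order[2] not in known_date_ids]


bad_orders = orders_with_unknown_date(order_list, date_ids)
print(bad_orders)
```

One difference to keep in mind: in standard SQL, NOT IN returns no rows at all if the subquery yields a NULL, whereas this Python version has no such special case.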

Report statistics with HQL

For all orders, compute the number of orders and the total sales amount for each year.

select c.theyear,count(distinct a.ordernumber),sum(b.amount) 
from sales.orderlist a, sales.orderdetails b, sales.datelist c 
where a.ordernumber=b.ordernumber and a.dateid=c.dateid 
group by c.theyear 
order by c.theyear;

The top n quarters by sales across all orders

select c.theyear,c.thequot,sum(b.amount) as sumofamount 
from sales.orderlist a,sales.orderdetails b,sales.datelist c 
where a.ordernumber=b.ordernumber 
and a.dateid=c.dateid 
group by c.theyear,c.thequot 
order by sumofamount desc 
limit 3;

List the orders whose sales amount exceeds 100000

select a.ordernumber,sum(b.amount) as amount 
from sales.orderlist a,sales.orderdetails b 
where a.ordernumber=b.ordernumber 
group by a.ordernumber 
having amount>100000;

Find the best-selling item of each year
Step 1: aggregate the sales amount by year and item ID

CREATE VIEW IF NOT EXISTS v_yearItemSummary as
select c.theyear,b.itemid,sum(b.amount) as amount
from sales.orderlist a,sales.orderdetails b,sales.datelist c 
where a.ordernumber=b.ordernumber and a.dateid=c.dateid 
group by c.theyear,b.itemid;

Step 2: find the highest item sales amount for each year

select theyear, max(amount) as maxamount
from v_yearItemSummary
group by theyear;

Step 3: join back to the summary to get the itemid

select distinct  v.theyear,v.itemid,f.maxamount 
from v_yearItemSummary v , (select theyear, max(amount) as maxamount from  v_yearItemSummary group by theyear) f 
where v.theyear=f.theyear and v.amount=f.maxamount 
order by v.theyear;
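
The query above follows a common pattern: compute a per-group maximum, then join it back to the summary to recover the rows that reach it. The Python sketch below mirrors that logic. The sample rows and the function name best_sellers are made up for illustration.

```python
# Made-up rows standing in for the v_yearItemSummary view (assumption):
# (theyear, itemid, amount).
year_item_summary = [
    ("2013", "item1", 300),
    ("2013", "item2", 500),
    ("2014", "item1", 700),
    ("2014", "item3", 700),  # ties with item1 in 2014
    ("2014", "item2", 100),
]


def best_sellers(summary):
    """Return (theyear, itemid, amount) rows that reach the yearly maximum."""
    # Per-year maximum, like the subquery aliased as f.
    max_by_year = {}
    for year, _item, amount in summary:
        if year not in max_by_year or amount > max_by_year[year]:
            max_by_year[year] = amount
    # Join back on year and amount; the set plays the role of DISTINCT.
    matches = {row for row in summary if row[2] == max_by_year[row[0]]}
    return sorted(matches)


for row in best_sellers(year_item_summary):
    print(row)
```

As in the HQL version, a tie produces more than one row for that year. The sketch sorts by year and then by item, while the query only orders by year.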

Posted in BigData, HBase|Hive.

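Connecting to Hive with SQuirreL

May 3, 2014 by debugo · 2 Comments

SQuirreL SQL Client is a database client written in Java. Because it accesses every database through JDBC, a single user interface can work with MySQL, MSSQL, Greenplum, Hive, and any other database that supports JDBC access. It is very convenient to use.
1. Installation
Download: http://squirrel-sql.sourceforge.net/ Latest installer: squirrel-sql-3.5.3-standard.jar, plus the driver jars. We also need the relevant driver jars under $HIVE_HOME/lib. Installation requires a JRE. Double-click squirrel-sql-3.5.3-standard.jar to run the installer, and choose plugins as needed during installation.

2. Add the JDBC driver
After installation, run squirrel-sql.bat to open the GUI. We still need to add the Hive driver. Add a new driver in the Driver tab. After adding the jars in the external tab, use list driver to select jdbc.HiveDriver.

3. Log in to Hive
Once that is done, create a shortcut alias for the Hive connection in the alias tab.

After that, we can browse Hive's schema and run queries from the SQL tab.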

Posted in BigData, HBase|Hive.

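Hive Notes – Turning One Row into Many (split)

May 1, 2014 by debugo · 2 Comments

Tagged data often packs several values into a single field. Being able to expand those values into multiple rows directly in HQL is very handy. A simple example:

create table test_tag(
    tag string,
    value string
);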

We insert one row:
tag1 value1,value2,value3
A plain query returns:

select * from test_tag;
+---------------+-----------------------+--+
| test_tag.tag  |    test_tag.value     |
+---------------+-----------------------+--+
| tag1          | value1,value2,value3  |
+---------------+-----------------------+--+

Using split to expand the field into multiple rows is very convenient:

select tag, v from test_tag lateral view explode(split(value,',')) adtable as v;  
+-------+---------+--+
|  tag  |    v    |
+-------+---------+--+
| tag1  | value1  |
| tag1  | value2  |
| tag1  | value3  |
+-------+---------+--+
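
For readers less familiar with lateral view, the query above behaves like a nested loop: for each input row, split produces an array, explode turns each element into its own row, and the other columns of the input row are repeated alongside. The Python sketch below mirrors that behavior. The function name lateral_view_explode is made up for illustration.

```python
def lateral_view_explode(rows, sep=","):
    """Emit one (tag, v) pair per element of the split value field."""
    out = []
    for tag, value in rows:
        # split(value, ',') yields the array; explode emits one row per element.
        for v in value.split(sep):
            out.append((tag, v))
    return out


rows = [("tag1", "value1,value2,value3")]
exploded = lateral_view_explode(rows)
for row in exploded:
    print(row)
```

One difference from Hive: Hive's split treats its second argument as a regular expression, while Python's str.split takes a literal separator. For a plain comma the two behave the same.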

Posted in BigData, HBase|Hive.
