Published 2023-11-26

Sqoop Incremental Sync (auto-updating last-value)

When using Sqoop for incremental sync, we need to specify a last value. But data sync is usually automated, so there has to be somewhere that automatically records, and fills in, the last value of the previous incremental sync.

Maintaining this last value by hand is tedious and error-prone. After checking the Sqoop documentation, I found that Sqoop job provides exactly this capability. I'm recording it here for future reference.

The official Sqoop documentation describes this in detail; a few points worth calling out:

1. Purpose

The job tool allows you to create and work with saved jobs. Saved jobs remember the parameters used to specify a job, so they can be re-executed by invoking the job by its handle.

The key point is re-execution:

If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows.

This is the recommended way to do incremental sync.

2. Usage

Command to submit:

$ sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]

$ sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]

Arguments:

--create <job-id>   Define a new saved job with the specified job-id (name).
                    A second Sqoop command line, separated by a --, should be
                    specified; this defines the saved job.
--delete <job-id>   Delete a saved job.
--exec <job-id>     Given a job defined with --create, run the saved job.
--show <job-id>     Show the parameters for a saved job.
--list              List all saved jobs.
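As a sketch of how these subcommands fit together, here is a dry run of the full job lifecycle. The `sqoop` shell function below is a stand-in that only echoes its invocation, so the sequence can be shown without a live cluster; the job name and connect string are made up:

```shell
#!/bin/sh
# Stand-in for the real sqoop binary: echo the invocation instead of
# running it (dry run). Remove this function to talk to a real cluster.
sqoop() { echo "sqoop $*"; }

# Define a saved job; everything after the bare -- is the import it runs.
sqoop job --create myjob1 -- import \
  --connect "jdbc:mysql://host:3306/db" --table category

sqoop job --list          # list all saved jobs
sqoop job --show myjob1   # inspect the stored parameters
sqoop job --exec myjob1   # run it; re-runnable, last-value is updated
sqoop job --delete myjob1 # remove it when no longer needed
```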

Overriding arguments at execution time:

The exec action allows you to override arguments of the saved job by supplying them after a --. For example, if the database were changed to require a username, we could specify the username and password with:

$ sqoop job --exec myjob -- --username someuser -P

Enter password:

3. Metastore-related parameters

When executing a sqoop job, by default you need to specify --meta-connect and enter a password. So if we don't want to type the password and the metastore connection address every time the job runs, what can we do?

Sqoop provides two parameters:

sqoop.metastore.client.enable.autoconnect=true

sqoop.metastore.client.record.password=true

With these set, you no longer have to enter the password on every execution.
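Rather than passing them on the command line, the two properties, sqoop.metastore.client.enable.autoconnect and sqoop.metastore.client.record.password, can also be set globally in sqoop-site.xml. A sketch of the fragment (adjust the conf directory to your installation):

```xml
<!-- in $SQOOP_HOME/conf/sqoop-site.xml -->
<property>
  <name>sqoop.metastore.client.enable.autoconnect</name>
  <value>true</value>
</property>
<property>
  <name>sqoop.metastore.client.record.password</name>
  <value>true</value>
</property>
```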

4. Notes on incremental sync

Incremental imports are performed by comparing the values in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.

In short, when incremental sync is done through a sqoop job, Sqoop's metastore automatically records the last value of the previous sync, so subsequent runs of the task pull the newest data without specifying last value. Very handy!
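In an automated pipeline you may still want to read the stored last value back, e.g. for monitoring. A small sketch, assuming the `sqoop job --show` output contains a line of the form `incremental.last.value = ...`; here it is fed from a canned sample rather than a live metastore:

```shell
#!/bin/sh
# Canned sample of what `sqoop job --show myjob1` might print; a live run
# would pipe the real command's output instead.
show_output='Job: myjob1
Tool: import
incremental.mode = DateLastModified
incremental.last.value = 2019-06-12 12:39:43
db.connect.string = jdbc:mysql://xxx:3306/xxx'

# Keep only the text after "incremental.last.value = ".
last_value=$(printf '%s\n' "$show_output" \
  | sed -n 's/^incremental\.last\.value = //p')
echo "$last_value"
```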

5. Example

sqoop job \
--create myjob1 \
-- import \
--connect jdbc:mysql://xxx:3306/xxx \
--username root \
--password root \
--table category \
--target-dir /user/survey/sqoop1 \
--incremental lastmodified \
--check-column updated_at \
--last-value "2019-06-12 12:39:43" \
--append \
-m 1 \
--fields-terminated-by "\001"

A couple of details worth spelling out:

At this point a new file is generated under /user/survey/sqoop1 on HDFS. It contains two records: the newly inserted row and the modified row; the file produced by the earlier full import still holds the pre-modification data.

Note: the --append here is not the same thing as append mode. In this context append is the merge type used under lastmodified mode, whose allowed values are append | merge-key.

If you replace the --append above with --merge-key id, no new file appears on HDFS; instead a single file, merged by a reduce step, is generated, containing the updated old rows together with the new rows, merged on id.
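For reference, the merge-key variant of the same job would look like the following fragment (same placeholder connection string as above; it needs a live cluster to actually run):

```shell
sqoop job \
--create myjob2 \
-- import \
--connect jdbc:mysql://xxx:3306/xxx \
--username root \
--password root \
--table category \
--target-dir /user/survey/sqoop1 \
--incremental lastmodified \
--check-column updated_at \
--last-value "2019-06-12 12:39:43" \
--merge-key id \
-m 1 \
--fields-terminated-by "\001"
```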

Updates:

Sqoop's lastmodified mode with --append cannot be used together with a Hive table; it throws an error. Importing with --target-dir works fine.

Sqoop's --merge-key appears not to take effect for Hive; it cannot merge the updated rows.