LeetCode 1225. Report Contiguous Dates (MySQL + Hive UDTF versions)


I. Original Problem

Table: Failed

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| fail_date    | date    |
+--------------+---------+
Primary key for this table is fail_date.
Failed table contains the days of failed tasks.

Table: Succeeded

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| success_date | date    |
+--------------+---------+
Primary key for this table is success_date.
Succeeded table contains the days of succeeded tasks.

A system is running one task every day. Every task is independent of the previous tasks. The tasks can fail or succeed.

Write an SQL query to generate a report of period_state for each continuous interval of days in the period from 2019-01-01 to 2019-12-31.

period_state is 'failed' if tasks in this interval failed, or 'succeeded' if tasks in this interval succeeded. Intervals of days are retrieved as start_date and end_date.

Order result by start_date.

The query result format is in the following example:

Failed table:
+-------------------+
| fail_date         |
+-------------------+
| 2018-12-28        |
| 2018-12-29        |
| 2019-01-04        |
| 2019-01-05        |
+-------------------+

Succeeded table:
+-------------------+
| success_date      |
+-------------------+
| 2018-12-30        |
| 2018-12-31        |
| 2019-01-01        |
| 2019-01-02        |
| 2019-01-03        |
| 2019-01-06        |
+-------------------+

Result table:
+--------------+--------------+--------------+
| period_state | start_date   | end_date     |
+--------------+--------------+--------------+
| succeeded    | 2019-01-01   | 2019-01-03   |
| failed       | 2019-01-04   | 2019-01-05   |
| succeeded    | 2019-01-06   | 2019-01-06   |
+--------------+--------------+--------------+

The report ignored the system state in 2018 as we care about the system in the period 2019-01-01 to 2019-12-31. From 2019-01-01 to 2019-01-03 all tasks succeeded and the system state was "succeeded". From 2019-01-04 to 2019-01-05 all tasks failed and system state was "failed". From 2019-01-06 to 2019-01-06 all tasks succeeded and system state was "succeeded".

II. Brief Translation

Two tables, `Failed` and `Succeeded`, record the daily failures and successes of a system task that runs once a day. Return a result table that lists, in order, the contiguous periods over which the task kept failing or kept succeeding, together with each period's start and end dates.

III. SQL Analysis

0. Table DDL and seed data

CREATE TABLE IF NOT EXISTS `Failed`(
   `fail_date` DATE,
   PRIMARY KEY ( `fail_date` )
)ENGINE=INNODB DEFAULT CHARSET=utf8;

CREATE TABLE IF NOT EXISTS `Succeeded`(
   `success_date` DATE,
   PRIMARY KEY ( `success_date` )
)ENGINE=INNODB DEFAULT CHARSET=utf8;

INSERT INTO Succeeded(success_date) VALUES('2018-12-30'),('2018-12-31'),('2019-01-01'),('2019-01-02'),('2019-01-03'),('2019-01-06');

INSERT INTO Failed(fail_date) VALUES('2018-12-28'),('2018-12-29'),('2019-01-04'),('2019-01-05'),('2019-01-07');

(Note: this seed data adds '2019-01-07' to Failed on top of the LeetCode sample, so the final result will contain one extra row.)

1. Group the rows of each table by whether their dates are contiguous. The user variable @date carries the previous row's date and @rank the current group number: whenever the current date is not exactly one day after @date, @rank is incremented, so every date in a contiguous run gets the same rank (see the worked example after the queries). On MySQL 8.0 this assignment-in-SELECT trick still works but is deprecated, and RANK became a reserved word, hence the backticks below.

# Failed table
SELECT 
  fail_date,
  IF(
    @date = DATE_SUB(fail_date, INTERVAL 1 DAY),
    @rank := @rank,          -- contiguous: keep the same group number
    @rank := @rank + 1       -- gap: start a new group
  ) AS `rank`,               -- backticks: RANK is reserved in MySQL 8.0+
  @date := fail_date         -- remember this row's date for the next row
FROM
  Failed,
  (SELECT 
    @date := NULL,
    @rank := 0) r 
WHERE fail_date > '2018-12-31' 
  AND fail_date < '2020-01-01' 
ORDER BY fail_date;

# Succeeded table
SELECT 
  success_date,
  IF(
    @date = DATE_SUB(success_date, INTERVAL 1 DAY),
    @rank := @rank,
    @rank := @rank + 1
  ) AS `rank`,
  @date := success_date 
FROM
  Succeeded,
  (SELECT 
    @date := NULL,
    @rank := 0) r 
WHERE success_date > '2018-12-31' 
  AND success_date < '2020-01-01' 
ORDER BY success_date;
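With the seed data above, the Failed query yields the following (the helper column @date := fail_date is omitted here): the contiguous run 2019-01-04/2019-01-05 shares rank 1, and the isolated 2019-01-07 gets rank 2.

+------------+------+
| fail_date  | rank |
+------------+------+
| 2019-01-04 |    1 |
| 2019-01-05 |    1 |
| 2019-01-07 |    2 |
+------------+------+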

2. For each rank group from step 1, take MIN() and MAX() as the interval's start_date and end_date, UNION ALL the succeeded and failed result sets, and sort by start_date.

SELECT 
  * 
FROM
  (SELECT 
    'succeeded' AS period_state,
    MIN(success_date) AS start_date,
    MAX(success_date) AS end_date 
  FROM
    (SELECT 
      success_date,
      IF(
        @date = DATE_SUB(success_date, INTERVAL 1 DAY),
        @rank := @rank,
        @rank := @rank + 1
      ) AS `rank`,
      @date := success_date 
    FROM
      Succeeded,
      (SELECT 
        @date := NULL,
        @rank := 0) r 
    WHERE success_date > '2018-12-31' 
      AND success_date < '2020-01-01' 
    ORDER BY success_date) t1 
  GROUP BY t1.`rank` 
  UNION ALL 
  SELECT 
    'failed' AS period_state,
    MIN(fail_date) AS start_date,
    MAX(fail_date) AS end_date 
  FROM
    (SELECT 
      fail_date,
      IF(
        @date = DATE_SUB(fail_date, INTERVAL 1 DAY),
        @rank := @rank,
        @rank := @rank + 1
      ) AS `rank`,
      @date := fail_date 
    FROM
      Failed,
      (SELECT 
        @date := NULL,
        @rank := 0) r 
    WHERE fail_date > '2018-12-31' 
      AND fail_date < '2020-01-01' 
    ORDER BY fail_date) t2 
  GROUP BY t2.`rank`) temp 
ORDER BY start_date;
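With the seed data above the query returns the following (the 2019-01-07 row comes from the extra seed date, not the LeetCode sample):

+--------------+------------+------------+
| period_state | start_date | end_date   |
+--------------+------------+------------+
| succeeded    | 2019-01-01 | 2019-01-03 |
| failed       | 2019-01-04 | 2019-01-05 |
| succeeded    | 2019-01-06 | 2019-01-06 |
| failed       | 2019-01-07 | 2019-01-07 |
+--------------+------------+------------+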

IV. Hive Version

0.1 DDL statements (omitted)

0.2 Maven dependencies (pom.xml excerpt)

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>

        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
        </dependency>

        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012_u6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.7.2</version>
        </dependency>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter-api</artifactId>
            <version>RELEASE</version>
            <scope>compile</scope>
        </dependency>
    </dependencies>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.apache.hive</groupId>
                <artifactId>hive-exec</artifactId>
                <version>2.1.0</version>
            </dependency>
            <dependency>
                <groupId>junit</groupId>
                <artifactId>junit</artifactId>
                <version>4.12</version>
            </dependency>
            <dependency>
                <groupId>jdk.tools</groupId>
                <artifactId>jdk.tools</artifactId>
                <version>1.8</version>
            </dependency>
        </dependencies>
    </dependencyManagement>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

1. Replace step 1 of Section III with a custom UDTF. The UDTF keeps the previous date as state while rows stream through process(): it starts a new interval on the first date, emits a (start_date, end_date) row whenever the incoming date is not exactly one day after the last one, and flushes the final open interval in close(). It therefore assumes its input arrives sorted by date within a single task.

package com.hive.udf;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

/**
 * Collapses a date-sorted stream of 'yyyy-MM-dd' strings into contiguous
 * intervals, forwarding one (start_date, end_date) row per interval.
 */
public class TestGenericUDTF extends GenericUDTF {
    private Date startDate = null;     // first date of the interval being built
    private Date lastDate = null;      // most recent date seen
    private Date lastEmitDate = null;  // start of the last interval already forwarded
    private static SimpleDateFormat s = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    public void close() throws HiveException {
        // Flush the interval that is still open when the input ends.
        if (startDate != null && lastDate != null && lastEmitDate != startDate) {
            String[] result = new String[]{s.format(startDate), s.format(lastDate)};
            forward(result);
        }
    }

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args)
            throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentLengthException("TestGenericUDTF takes only one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentException("TestGenericUDTF takes string as a parameter");
        }

        // Output schema: two string columns, start_date and end_date.
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("start_date");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("end_date");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        String dateStr = args[0].toString();
        Date date = null;
        try {
            date = s.parse(dateStr);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        if (date == null) {
            // Unparseable input acts as a separator: flush the open interval and reset.
            if (startDate != null && lastDate != null && lastEmitDate != startDate) {
                // (Fixed from the original, which forwarded startDate.toString()
                // in the default Date format instead of yyyy-MM-dd.)
                String[] result = new String[]{s.format(startDate), s.format(lastDate)};
                forward(result);
                lastEmitDate = startDate;
            }
            startDate = null;
            lastDate = null;
            return;
        }
        if (startDate == null) {
            startDate = date;  // first date of a new interval
        } else if (lastDate != null) {
            // If the incoming date is not lastDate + 1 day, the run is broken:
            // emit the finished interval and start a new one at the current date.
            GregorianCalendar calendar = new GregorianCalendar();
            calendar.setTime(lastDate);
            calendar.add(Calendar.DATE, 1);
            Date time = calendar.getTime();
            if (!time.equals(date)) {
                String[] result = new String[]{s.format(startDate), s.format(lastDate)};
                forward(result);
                lastEmitDate = startDate;
                startDate = date;
            }
        }
        lastDate = date;
    }
}

2. Build the fat jar (with the assembly plugin bound to the package phase, `mvn clean package` produces the jar-with-dependencies artifact), upload it to the server, and register the UDTF:

ADD JAR /home/hadoop/hive-dateudtf-jar-with-dependencies.jar;

CREATE TEMPORARY FUNCTION dateUDTF AS 'com.hive.udf.TestGenericUDTF';
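A usage sketch, not from the original post: it assumes Hive tables failed(fail_date STRING) and succeeded(success_date STRING) mirroring the MySQL DDL above. A Hive UDTF must be the only expression in its SELECT list, hence the extra nesting to attach period_state; correctness also relies on each sorted subquery being consumed by a single UDTF instance.

SELECT period_state, start_date, end_date
FROM
  (SELECT 'succeeded' AS period_state, t1.start_date, t1.end_date
   FROM
     (SELECT dateUDTF(success_date) AS (start_date, end_date)
      FROM
        (SELECT success_date FROM succeeded
         WHERE success_date > '2018-12-31' AND success_date < '2020-01-01'
         ORDER BY success_date) s) t1
   UNION ALL
   SELECT 'failed' AS period_state, t2.start_date, t2.end_date
   FROM
     (SELECT dateUDTF(fail_date) AS (start_date, end_date)
      FROM
        (SELECT fail_date FROM failed
         WHERE fail_date > '2018-12-31' AND fail_date < '2020-01-01'
         ORDER BY fail_date) f) t2) report
ORDER BY start_date;

Fed the sorted failed dates 2019-01-04, 2019-01-05, 2019-01-07, the UDTF forwards ('2019-01-04', '2019-01-05') when it sees the gap before 2019-01-07, and close() flushes the remaining ('2019-01-07', '2019-01-07').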