PySpark Plaso
Release 2019
A tool for distributed extraction of timestamps from various files using extractors adapted from the Plaso engine to Apache Spark.
Public Member Functions
def create_files_rdd(cls, sc, hdfs_uri)
Static Public Member Functions
def get_java_rdd_helpers_package(sc)
def list_files(hdfs_uri)
def transform_files_rdd_to_extracted_events_rdd(sc, files_rdd)
def action_events_rdd_by_saving_into_halyard(sc, events_rdd, table_name, hbase_zk_quorum, hbase_zk_port)
def action_events_rdd_by_collecting_into_json(sc, events_rdd)
Provides methods to create, modify, and evaluate RDDs for the Plaso extraction.
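The create, transform, and action methods are meant to be chained. The following is a minimal sketch of that chain, using only the method names and parameter lists from the member list. The `run_extraction` helper and its `plaso_cls` parameter are hypothetical, introduced so the sketch does not depend on a particular import path or a running cluster. It also assumes that `create_files_rdd` is a class method, as its `cls` parameter suggests.

```python
def run_extraction(plaso_cls, sc, hdfs_uri):
    """Chain the create -> transform -> action steps and return the JSON string.

    plaso_cls is expected to be the PySparkPlaso class (or any object exposing
    the same three methods); sc is a Spark Context. Both are passed in, so this
    hypothetical helper makes no assumption about how they are obtained.
    """
    # 1. Create: RDD of HDFS URIs for every file under the base URI.
    files_rdd = plaso_cls.create_files_rdd(sc, hdfs_uri)
    # 2. Transform: RDD of (HDFS URI, event) pairs produced by the extractors.
    events_rdd = plaso_cls.transform_files_rdd_to_extracted_events_rdd(sc, files_rdd)
    # 3. Action: collect the pairs into a JSON document collection (a string).
    return plaso_cls.action_events_rdd_by_collecting_into_json(sc, events_rdd)
```

Swapping the last call for `action_events_rdd_by_saving_into_halyard` would store the events in Halyard instead of returning them to the driver.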
def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.action_events_rdd_by_collecting_into_json(sc, events_rdd) [static]
Collects the content of a given RDD of (HDFS URI, event) pairs into a JSON document collection.
:param sc: Spark Context of the RDD
:param events_rdd: the RDD of (HDFS URI, event) pairs to collect
:return: the JSON document collection as a string
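The layout of the returned JSON is not specified here. As an illustration only, the sketch below serializes a list of (HDFS URI, event) pairs into a single JSON array string. The `pairs_to_json` name and the `"uri"`/`"event"` keys are assumptions, not the library's actual output format, and events are assumed to be plain dictionaries.

```python
import json

def pairs_to_json(pairs):
    """Serialize (HDFS URI, event) pairs into one JSON array string.

    The "uri"/"event" keys are an assumed layout for illustration only.
    """
    documents = [{"uri": uri, "event": event} for uri, event in pairs]
    # sort_keys makes the output deterministic, which helps when diffing runs.
    return json.dumps(documents, sort_keys=True)
```

With Spark, the `pairs` argument would typically come from `events_rdd.collect()`, which pulls every pair to the driver, so this kind of action suits small result sets.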
def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.action_events_rdd_by_saving_into_halyard(sc, events_rdd, table_name, hbase_zk_quorum, hbase_zk_port) [static]
Saves the content of a given RDD of (HDFS URI, event) pairs into a given Halyard table.
:param sc: Spark Context of the RDD
:param events_rdd: the RDD of (HDFS URI, event) pairs to save
:param table_name: Halyard repository HBase table name
:param hbase_zk_quorum: HBase Zookeeper quorum of HBase config path
:param hbase_zk_port: the Zookeeper client port
def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.create_files_rdd(cls, sc, hdfs_uri)
Create a new RDD with HDFS URIs to all files (recursively) from the HDFS base URI.
:param sc: Spark Context of the RDD
:param hdfs_uri: the HDFS base URI
:return: the RDD of HDFS URIs
def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.get_java_rdd_helpers_package(sc) [static]
Access a JVM gateway of the Spark Context and get Java package jvm.tarzan.helpers.rdd.
:param sc: Spark Context
:return: the Java package
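In PySpark, the driver's JVM is reachable through the Spark Context's private `_jvm` attribute (a Py4J view), and Java packages are resolved by plain attribute access. One plausible way to reach the package named above is sketched below. The `get_helpers_package` name is hypothetical, and the library's own implementation may differ.

```python
def get_helpers_package(sc):
    """Return the tarzan.helpers.rdd Java package through the Py4J gateway.

    sc._jvm is PySpark's view of the driver JVM; each attribute access walks
    one level down the Java package hierarchy.
    """
    return sc._jvm.tarzan.helpers.rdd
```

The lookup can only resolve classes in that package if the JAR containing them is on the Spark driver's classpath.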
def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.list_files(hdfs_uri) [static]
Get all files (recursively) from a given HDFS URI.
:param hdfs_uri: the HDFS URI to get files from
:return: a list of HDFS URIs of the files under the given HDFS base directory
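The recursive traversal can be illustrated on a local directory tree. The sketch below is a stand-in that uses the local filesystem instead of HDFS, so it shows the shape of the result (a flat list of file paths, directories excluded) rather than how the library talks to HDFS. The `list_files_local` name is hypothetical.

```python
import os

def list_files_local(base_dir):
    """Return the paths of all regular files under base_dir, recursively.

    Local-filesystem stand-in for an HDFS listing; the result is sorted so
    that repeated runs return the files in the same order.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)
```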
def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.transform_files_rdd_to_extracted_events_rdd(sc, files_rdd) [static]
Transform an RDD of HDFS URIs into a new RDD of pairs produced by the extractors, where each pair consists of the HDFS URI of a file and one event extracted from that file.
:param sc: Spark Context of the RDD
:param files_rdd: the RDD of HDFS URIs
:return: the RDD of (HDFS URI, event) pairs
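Because one file can yield many events, this transformation is a flat map: each input URI expands into zero or more (URI, event) pairs. The sketch below shows that shape with plain Python lists in place of RDDs. `extract_events` is a hypothetical stub extractor that returns placeholder events, and `to_event_pairs` is likewise an illustrative name, not part of the library.

```python
from itertools import chain

def extract_events(uri):
    """Hypothetical extractor: return the events found in one file.

    A real extractor would open the file and parse timestamps; this stub
    returns two placeholder events per file.
    """
    return [{"source": uri, "seq": 0}, {"source": uri, "seq": 1}]

def to_event_pairs(uris, extractor=extract_events):
    """Flat-map URIs to (URI, event) pairs, one pair per extracted event."""
    return list(chain.from_iterable(
        ((uri, event) for event in extractor(uri)) for uri in uris
    ))
```

On a real RDD the same shape could be written with `files_rdd.flatMap(...)`. Whether the library performs this step in Python or through its Java helpers is not stated here. A file with no recognizable events simply contributes no pairs.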