PySpark Plaso  Release 2019
A tool for distributed extraction of timestamps from various files using extractors adapted from the Plaso engine to Apache Spark.
plaso.tarzan.app.pyspark_plaso.PySparkPlaso Class Reference

Public Member Functions

def create_files_rdd (cls, sc, hdfs_uri)
 

Static Public Member Functions

def get_java_rdd_helpers_package (sc)
 
def list_files (hdfs_uri)
 
def transform_files_rdd_to_extracted_events_rdd (sc, files_rdd)
 
def action_events_rdd_by_saving_into_halyard (sc, events_rdd, table_name, hbase_zk_quorum, hbase_zk_port)
 
def action_events_rdd_by_collecting_into_json (sc, events_rdd)
 

Detailed Description

Provides methods to create, modify, and evaluate RDDs for the Plaso extraction.
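
The three stages named above (create, modify, evaluate) can be chained as in the following minimal sketch. The helper name `extract_events_as_json` and its `plaso` parameter are hypothetical; the sketch assumes `create_files_rdd` is a class method (as its `cls` parameter suggests) and that `sc` is an already running Spark Context.

```python
def extract_events_as_json(plaso, sc, hdfs_uri):
    """Run the three documented stages in order and return the JSON string.

    `plaso` is any object exposing the methods listed in this reference
    (normally the PySparkPlaso class itself); `sc` is a Spark Context.
    """
    # Create: RDD of HDFS URIs for every file under the base URI.
    files_rdd = plaso.create_files_rdd(sc, hdfs_uri)
    # Modify: RDD of (HDFS URI, event) pairs produced by the extractors.
    events_rdd = plaso.transform_files_rdd_to_extracted_events_rdd(sc, files_rdd)
    # Evaluate: collect the pairs into a string of JSON documents.
    return plaso.action_events_rdd_by_collecting_into_json(sc, events_rdd)
```

Going by the qualified class name, the class would presumably be imported from the module `plaso.tarzan.app.pyspark_plaso` and passed as `plaso`. Passing it in as a parameter also lets the helper be exercised with a stand-in object when no cluster is available.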

Member Function Documentation

◆ action_events_rdd_by_collecting_into_json()

def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.action_events_rdd_by_collecting_into_json (   sc,
  events_rdd 
)
static
Collect the content of a given RDD of (HDFS URI, event) pairs into a collection of JSON documents.
:param sc: Spark Context of the RDD
:param events_rdd: the RDD of (HDFS URI, event) pairs to collect
:return: the collection of JSON documents as a string
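
This reference does not specify the layout of the returned JSON. As an illustration only, the sketch below serializes plain Python (URI, event) pairs into a JSON array. The function name `pairs_to_json` and the keys `"uri"` and `"event"` are assumptions made for the example, not the format PySparkPlaso actually produces.

```python
import json


def pairs_to_json(pairs):
    """Serialize (uri, event) pairs as a JSON array of documents.

    ASSUMPTION: one object per pair with the hypothetical keys "uri" and
    "event"; the real layout produced by PySparkPlaso may differ.
    """
    documents = [{"uri": uri, "event": event} for uri, event in pairs]
    return json.dumps(documents, sort_keys=True)
```

Because the documented return value is a string, a caller can parse it back into Python objects with `json.loads` before further processing.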

◆ action_events_rdd_by_saving_into_halyard()

def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.action_events_rdd_by_saving_into_halyard (   sc,
  events_rdd,
  table_name,
  hbase_zk_quorum,
  hbase_zk_port 
)
static
Save the content of a given RDD of (HDFS URI, event) pairs into a given Halyard table.
:param sc: Spark Context of the RDD
:param events_rdd: the RDD of (HDFS URI, event) pairs to save
:param table_name: Halyard repository HBase table name
:param hbase_zk_quorum: HBase Zookeeper quorum or HBase config path
:param hbase_zk_port: the Zookeeper client port
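
A call with all five parameters might be wrapped as in the sketch below. The wrapper name `save_events_to_halyard` and its default values are assumptions: `"localhost"` is a placeholder quorum, and 2181 is ZooKeeper's conventional client port. Whether the port is expected as an integer or a string is not stated in this reference.

```python
def save_events_to_halyard(plaso, sc, events_rdd, table_name,
                           hbase_zk_quorum="localhost", hbase_zk_port=2181):
    """Forward the arguments, in the documented order, to the Halyard action.

    ASSUMPTION: the defaults are placeholders for a single-node test setup;
    `plaso` is any object exposing the documented static method.
    """
    plaso.action_events_rdd_by_saving_into_halyard(
        sc, events_rdd, table_name, hbase_zk_quorum, hbase_zk_port
    )
```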

◆ create_files_rdd()

def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.create_files_rdd (   cls,
  sc,
  hdfs_uri 
)
Create a new RDD of the HDFS URIs of all files found (recursively) under the HDFS base URI.
:param sc: Spark Context of the RDD
:param hdfs_uri: the HDFS base URI
:return: the RDD of HDFS URIs

◆ get_java_rdd_helpers_package()

def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.get_java_rdd_helpers_package (   sc)
static
Access the JVM gateway of the Spark Context and get the Java package jvm.tarzan.helpers.rdd.
:param sc: Spark Context
:return: the Java package
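
In PySpark, the Py4J view of the driver JVM is exposed on the context as `sc._jvm`, and attribute access on it walks Java package names. A minimal sketch of such a lookup follows. It assumes the package is reachable as `tarzan.helpers.rdd` under that view, and the function name is hypothetical.

```python
def java_rdd_helpers_package(sc):
    """Return the Java package object for tarzan.helpers.rdd.

    ASSUMPTION: `sc` is a PySpark SparkContext whose JVM has the Tarzan
    helper classes on its classpath.
    """
    # Each attribute access resolves one more component of the package name.
    return sc._jvm.tarzan.helpers.rdd
```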

◆ list_files()

def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.list_files (   hdfs_uri)
static
Get all files (recursively) under a given HDFS URI.
:param hdfs_uri: the HDFS URI to get files from
:return: a list of the HDFS URIs of all files under the given base directory
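
The recursive traversal can be illustrated on a local file system, which stands in for HDFS here. The sketch below is not the library's implementation: `list_files_local` is a hypothetical name, and it uses `os.walk` rather than any HDFS client.

```python
import os


def list_files_local(base_dir):
    """Return every regular file under base_dir, recursively, as sorted paths."""
    found = []
    # os.walk visits base_dir and each of its subdirectories in turn.
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Only files are returned; directories are traversed but do not appear in the result, matching the "all files" wording above.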

◆ transform_files_rdd_to_extracted_events_rdd()

def plaso.tarzan.app.pyspark_plaso.PySparkPlaso.transform_files_rdd_to_extracted_events_rdd (   sc,
  files_rdd 
)
static
Transform an RDD of HDFS URIs into a new RDD of pairs produced by extractors, where
each pair consists of the HDFS URI of a file and one event extracted from that file.
:param sc: Spark Context of the RDD
:param files_rdd: the RDD of HDFS URIs
:return: the RDD of (HDFS URI, event) pairs
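
Since one file can yield many events, the transformation has the shape of a flat map: each URI expands into zero or more (URI, event) pairs. The plain-Python sketch below shows that shape on lists. The function `extract_pairs` and its `extractor` callback are hypothetical, and whether the real method uses Spark's `flatMap` is an assumption.

```python
def extract_pairs(file_uris, extractor):
    """Flat-map each URI to zero or more (uri, event) pairs.

    `extractor` is a hypothetical callable returning an iterable of events
    for one URI; files with no events contribute no pairs.
    """
    return [(uri, event) for uri in file_uris for event in extractor(uri)]
```

On an RDD, the equivalent shape would be `files_rdd.flatMap(lambda uri: ((uri, e) for e in extractor(uri)))`.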

The documentation for this class was generated from the following file: