Thursday, 11 April 2019

CH5/3 Records and Row type



Records and Rows:

scala> df.take(4)
res19: Array[org.apache.spark.sql.Row] = Array([United States,Romania,15], [United States,Croatia,1], [United States,Ireland,344], [Egypt,United States,15])

As shown above, commands that return individual rows to the driver will always return one or more Row objects when we are working with DataFrames.

  • We can create rows by manually instantiating a Row object with the values that belong in each column. It is important to note that only DataFrames have schemas; Rows themselves do not. This means that if you create a Row manually, you must specify the values in the same order as the schema of the DataFrame to which they might be appended (we will cover creating DataFrames later, but a small sketch follows the example below).

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val myRow = Row("Hello", null, 1, false)
myRow: org.apache.spark.sql.Row = [Hello,null,1,false]
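
Since a Row carries no schema, the schema has to be supplied when the Row is turned into a DataFrame. The following is a minimal sketch (the field names and types are illustrative, assuming the usual spark session available in spark-shell) showing how a manually built Row must line up, in order, with the StructType of the DataFrame it lands in:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, BooleanType}

// Hypothetical schema; the Row's values must appear in the same order as these fields
val mySchema = StructType(Seq(
  StructField("greeting", StringType,  nullable = true),
  StructField("comment",  StringType,  nullable = true),
  StructField("count",    IntegerType, nullable = true),
  StructField("flag",     BooleanType, nullable = true)))

val myDF = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("Hello", null, 1, false))),
  mySchema)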

Note that the Row companion object has an apply method, which allows us to instantiate a Row object simply by writing Row(...).
Remember that Row is a trait and also has a companion object.

  • The following are important methods available on the Row companion object (a short sketch of the constructors follows this list):

def apply(values: Any*): Row
This method can be used to construct a Row with the given values.
val empty: Row
Returns an empty row.
def fromSeq(values: Seq[Any]): Row
This method can be used to construct a Row from a Seq of values.
def fromTuple(tuple: Product): Row
This method can be used to construct a Row from a Tuple.
def merge(rows: Row*): Row
Merge multiple rows into a single row, one after another.
def unapplySeq(row: Row): Some[Seq[Any]]
This method can be used to extract fields from a Row object in a pattern match. Example:
import org.apache.spark.sql._

val pairs = sql("SELECT key, value FROM src").rdd.map {
  case Row(key: Int, value: String) =>
    key -> value
}
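
A quick sketch of the other companion-object constructors listed above (the values are purely illustrative):

import org.apache.spark.sql.Row

val r1 = Row.fromSeq(Seq("Hello", 1, true))   // equivalent to Row("Hello", 1, true)
val r2 = Row.fromTuple(("Hello", 1, true))    // builds a Row from a Tuple (a Product)
val r3 = Row.merge(Row("a", 1), Row("b", 2))  // [a,1,b,2] -- rows concatenated one after another
val r4 = Row.empty                            // a Row with no fields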

  • Accessing data in a Row object is easy; we just need to specify the position we would like. The following are some of the methods available on the Row trait for accessing elements:

def apply(i: Int): Any

Returns the value at position i. If the value is null, null is returned. The following is a mapping between Spark SQL types and return types:

BooleanType -> java.lang.Boolean
ByteType -> java.lang.Byte
ShortType -> java.lang.Short
IntegerType -> java.lang.Integer
FloatType -> java.lang.Float
DoubleType -> java.lang.Double
StringType -> String
DecimalType -> java.math.BigDecimal

DateType -> java.sql.Date
TimestampType -> java.sql.Timestamp

BinaryType -> byte array
ArrayType -> scala.collection.Seq (use getList for java.util.List)
MapType -> scala.collection.Map (use getJavaMap for java.util.Map)
StructType -> org.apache.spark.sql.Row
def getString(i: Int): String
Returns the value at position i as a String object.
def getLong(i: Int): Long
Returns the value at position i as a primitive long.
def getAs[T](i: Int): T
Returns the value at position i. For primitive types, if the value is null it returns the 'zero value' specific to that primitive (e.g. 0 for Int); use isNullAt to ensure that the value is not null.


scala> myRow(0)
res20: Any = Hello

scala> myRow(0).asInstanceOf[String]
res26: String = Hello

scala> myRow(1)
res21: Any = null

scala> myRow.getString(0)
res22: String = Hello

scala> myRow.getLong(2)
java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long
  at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:105)
  at org.apache.spark.sql.Row$class.getLong(Row.scala:231)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getLong(rows.scala:166)
  ... 51 elided

scala> myRow.getInt(2)
res24: Int = 1

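getAs and isNullAt are not shown in the transcript above; a short sketch (continuing with myRow) might look like this:

myRow.getAs[String](0)                        // "Hello"
myRow.isNullAt(1)                             // true -- position 1 holds null
if (myRow.isNullAt(2)) 0 else myRow.getInt(2) // guard before reading a primitive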
