Tuesday, 3 May 2016

Schema evolution with Avro

What is schema evolution?
Schema evolution describes how the store behaves when a schema is changed after data has been written to the store using an older version of that schema.
The modifications you can safely make to a schema are:
> A field with a default value is added.
> A field that was previously defined with a default value is removed.
> A field's doc attribute is changed, added or removed.
> A field's order attribute is changed, added or removed.
> A field's default value is added, or changed.
> Field or type aliases are added, or removed.

Rules for Changing Schema:

1. For best results, always provide a default value for the fields in your schema. This makes it possible to delete fields later if you decide it is necessary. If you do not provide a default value for a field, you cannot delete that field from your schema.

2. You cannot change a field's data type. If you decide that a field should use a different data type from the one it was created with, add a whole new field to your schema that uses the appropriate data type.

3. You cannot rename an existing field. However, if you want to access the field by a name other than the one it was created with, add and use aliases for the field.

4. A non-union type may be changed to a union that contains only the original type, or vice versa.
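
The rules above can be turned into a quick check between two versions of a schema. The following is a minimal sketch in plain Python (standard library only, not the Avro library). The function names `normalize` and `unsafe_changes` and the sample `User` schemas are made up for illustration, and the check only looks at top-level record fields.

```python
def normalize(avro_type):
    # Rule 4: a union holding exactly one type is treated like that type.
    if isinstance(avro_type, list) and len(avro_type) == 1:
        return avro_type[0]
    return avro_type


def unsafe_changes(old_schema, new_schema):
    """List the changes between two record schemas that break the rules above."""
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    problems = []
    kept = set()  # old field names that still exist (by name or by alias)

    for field in new_schema["fields"]:
        # Rule 3: a field may be reached under a new name through an alias.
        names = [field["name"]] + field.get("aliases", [])
        match = next((n for n in names if n in old_fields), None)
        if match is None:
            # Rule 1: a newly added field needs a default.
            if "default" not in field:
                problems.append("added without default: " + field["name"])
            continue
        kept.add(match)
        # Rule 2: the data type must stay the same.
        if normalize(old_fields[match]["type"]) != normalize(field["type"]):
            problems.append("type changed: " + field["name"])

    for name, field in old_fields.items():
        # Rule 1: only fields that had a default may be removed.
        if name not in kept and "default" not in field:
            problems.append("removed without default: " + name)
    return problems


# Hypothetical schemas, trimmed to what the check needs.
old = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},
    {"name": "nick", "type": "string", "default": ""},
    {"name": "age", "type": "int"},
    {"name": "city", "type": "string"},
]}
new = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": ["long"]},      # allowed by rule 4
    {"name": "age", "type": "long"},       # type change
    {"name": "email", "type": "string"},   # added without a default
]}

problems = unsafe_changes(old, new)
for p in problems:
    print(p)
```

Here `nick` is dropped silently because it had a default, while `age`, `email` and `city` are each reported.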

How Avro handles schema evolution:

Schema evolution is the automatic transformation of values from one version of an Avro schema to another. This transformation is between the version of the schema that the client is using (its local copy), and what is currently contained in the store. When the local copy of the schema is not identical to the schema used to write the value (that is, when the reader schema is different from the writer schema), this data transformation is performed. When the reader schema matches the schema used to write the value, no transformation is necessary.
Schema evolution is applied only during deserialization. If the reader schema is different from the value's writer schema, then the value is automatically modified during deserialization to conform to the reader schema. To do this, default values are used.
There are two cases to consider when using schema evolution: when you add a field and when you delete a field. Schema evolution takes care of both scenarios, so long as you originally assigned default values to the fields that were deleted, and assigned default values to the fields that were added.
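
To make the two cases concrete, here is a simplified simulation of that deserialization step in plain Python. It is a sketch of the idea, not the Avro library's resolver: the function `resolve` and the `User` schemas are hypothetical, records are plain dicts, and only top-level fields are handled.

```python
def resolve(record, writer_schema, reader_schema):
    """Reshape a record written with writer_schema so it conforms to reader_schema."""
    written = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            # Field exists on both sides: copy the stored value.
            result[name] = record[name]
        elif "default" in field:
            # Field is unknown to the writer: fall back to the reader's default.
            result[name] = field["default"]
        else:
            raise ValueError("no value and no default for field " + repr(name))
    # Fields that only the writer knows about are simply left out.
    return result


v1 = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "nickname", "type": "string", "default": ""},
]}
# v2 deletes "nickname" and adds "email"; both carry defaults.
v2 = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": "unknown"},
]}

# Old data read by a client on the new schema: "email" is filled in.
upgraded = resolve({"name": "Ann", "nickname": "annie"}, v1, v2)
# New data read by a client still on the old schema: "nickname" is filled in.
downgraded = resolve({"name": "Bob", "email": "bob@example.com"}, v2, v1)
print(upgraded)
print(downgraded)
```

The second call shows why the deleted field needs its default: a client that has not yet picked up the new schema still expects `nickname` and has nothing else to fill it with.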

Avro schemas can be written in two ways, either in a JSON format:
{
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName",        "type": "string"},
        {"name": "favouriteNumber", "type": ["null", "long"]},
        {"name": "interests",       "type": {"type": "array", "items": "string"}}
    ]
}
…or in an IDL:

record Person {
    string               userName;
    union { null, long } favouriteNumber;
    array<string>        interests;
}

The following are the key advantages of Avro:
* Schema evolution – Avro requires schemas when data is written or read. Most interesting is that you can use different schemas for serialization and deserialization, and Avro will handle the missing/extra/modified fields.
* Untagged data – Providing a schema with binary data allows each datum to be written without per-field overhead. The result is more compact data encoding, and faster data processing.
* Dynamic typing – This refers to serialization and deserialization without code generation. It complements the code generation, which is available in Avro for statically typed languages as an optional optimization.
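
To illustrate the "untagged data" point, the sketch below hand-encodes one Person record following my reading of Avro's binary encoding rules (zig-zag varints for numbers and lengths, a branch index for unions, counted blocks for arrays). It is plain Python, not the Avro library; the helper names and the sample values are made up for illustration.

```python
def zigzag_varint(n):
    """Encode an integer as Avro does for int/long: zig-zag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag maps small negatives to small positives
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    # A string is its UTF-8 byte length followed by the bytes.
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_person(user_name, favourite_number, interests):
    """Encode the fields in schema order; no field names or type tags are written."""
    out = encode_string(user_name)
    # Union ["null", "long"]: branch index first, then the value (null writes nothing).
    if favourite_number is None:
        out += zigzag_varint(0)
    else:
        out += zigzag_varint(1) + zigzag_varint(favourite_number)
    # Array: one block with the item count and the items, then a zero count to end.
    if interests:
        out += zigzag_varint(len(interests))
        for item in interests:
            out += encode_string(item)
    out += zigzag_varint(0)
    return out


encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded), encoded.hex())
```

The byte string contains the values and their lengths only, which is why a reader cannot decode it without the writer's schema.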

Example code to handle the schema evolution

-- Create an external table with an Avro schema (a managed table works too)

CREATE EXTERNAL TABLE avro_external_table
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  LOCATION 'hdfs://quickstart.cloudera:8020/user/cloudera/avro_data/'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs://localhost:8020/user/cloudera/old_schema.avsc');

-- Select query to check the data
SELECT * FROM avro_external_table;

-- Alter table statement to alter the schema file, to check the schema evolution
ALTER TABLE avro_external_table SET TBLPROPERTIES ('avro.schema.url'='hdfs://localhost:8020/user/cloudera/new_schema.avsc');

-- Select query
SELECT * FROM avro_external_table;

Old_schema:
=========

{
  "type": "record",
  "name": "Meetup",
  "fields": [{
    "name": "name",
    "type": "string"
  }, {
    "name": "meetup_date",
    "type": "string"
  }, {
    "name": "going",
    "type": "int"
  }, {
    "name": "organizer",
    "type": "string",
    "default": "unknown"
  }, {
    "name": "topics",
    "type": {
      "type": "array",
      "items": "string"
    }
  }]
}


New_schema:
(renames going to attendance through an alias, adds a new field “location” with a default, and drops organizer and topics):
=================================================

{
  "type": "record",
  "name": "Meetup",
  "fields": [{
    "name": "name",
    "type": "string"
  }, {
    "name": "meetup_date",
    "type": "string",
    "java-class": "java.util.Date"
  }, {
    "name": "attendance",
    "type": "int",
    "aliases": ["going"]
  }, {
    "name": "location",
    "type": "string",
    "default": "unknown"
  }]
}
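
As a rough picture of what the second SELECT would be expected to return, the sketch below applies the new schema to a row written with the old one. It is plain Python rather than Hive or the Avro library; the function `resolve_with_aliases` and the sample row are made up, and the schemas are trimmed to field names, aliases and defaults.

```python
def resolve_with_aliases(record, writer_schema, reader_schema):
    """Read a record written with writer_schema through reader_schema."""
    written = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        # Look for the field under its own name first, then under its aliases.
        candidates = [field["name"]] + field.get("aliases", [])
        source = next((n for n in candidates if n in written), None)
        if source is not None:
            result[field["name"]] = record[source]
        elif "default" in field:
            result[field["name"]] = field["default"]
        else:
            raise ValueError("no value and no default for " + repr(field["name"]))
    return result


old_schema = {"type": "record", "name": "Meetup", "fields": [
    {"name": "name", "type": "string"},
    {"name": "meetup_date", "type": "string"},
    {"name": "going", "type": "int"},
    {"name": "organizer", "type": "string", "default": "unknown"},
    {"name": "topics", "type": {"type": "array", "items": "string"}},
]}
new_schema = {"type": "record", "name": "Meetup", "fields": [
    {"name": "name", "type": "string"},
    {"name": "meetup_date", "type": "string"},
    {"name": "attendance", "type": "int", "aliases": ["going"]},
    {"name": "location", "type": "string", "default": "unknown"},
]}

# Hypothetical row as it would have been written with the old schema.
old_row = {
    "name": "Hadoop meetup",
    "meetup_date": "2016-05-03",
    "going": 42,
    "organizer": "unknown",
    "topics": ["avro", "hive"],
}

new_row = resolve_with_aliases(old_row, old_schema, new_schema)
print(new_row)
```

The value stored under `going` shows up as `attendance`, `location` falls back to its default, and `organizer` and `topics` no longer appear in the result.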