Hadoop を使った基盤地図情報の GeoJSON タイル変換 Pt. 1

本物の MapReduce を使った基盤地図情報の GeoJSON タイル変換の流れができた。ソースデータごとに GeoJSONL に変換するステップの後、一度に MapReduce を使って変換する方法である。ソースデータの一部が変更された時に、変更を受けていないソースデータに対応する GeoJSONL 変換をしなくて済むようになっている。

Hadoop のインストール（OS X）

MapReduce といっても、特段分散計算機環境を設定しない、HomeBrew のデフォルト状態の Hadoop を使っている。それでも、特にソースデータが大規模の場合に、Unix sort よりは高速になる。Hadoop をきちんと設定して高速化する作業は、あとの楽しみにとっておく。

$ brew install mapreduce

現在、手元では hadoop の 2.5.1 がインストールされている。
OS X の Java はターミナルで日本語を CP932 で出力して文字化けさせてしまう癖がある。これを回避して日本語を綺麗に表示するためには、次のようにする。Haddop のステータス出力には日本語が混じるので、設定しておいたほうがよい。

$ export _JAVA_OPTIONS="-Dfile.encoding=UTF-8"

また、大規模なデータセットを変換する場合には、heap space 不足でエラーを出す場合があるので、これを予防するために、次のように環境変数を設定しておくと良い。

$ export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"

これらは、.bash_profile あたりに書き込んでおくとよい。

ディレクトリの構成

ディレクトリの構成は、次のようにしている。

. -+- source/ : 基盤地図情報XMLファイルを格納
   +- input/ : convert.rb を使って基盤地図情報をGeoJSONL.gzにしたものを格納
   +- output/ : hadoop を使って、GeoJSONLをタイルごとにreduceしたファイルを格納
   +- dst/ : deploy.rb を使って、output をタイルに展開したものを格納
   +- Rakefile : タスクを記述
   +- convert.rb
   +- reduce.rb
   +- deploy.rb

Rakefile

task :default do
  sh "rm -r output" if File.directory?('output')
  Dir.glob('source/*.xml') {|path|
    dst_path = "#{path.sub('source', 'input').sub('xml', 'geojsonl.gz')}"
    next if File.exist?(dst_path)
    sh "ruby convert.rb #{path} | gzip -c > #{dst_path}"
  }
  sh "hadoop jar /usr/local/Cellar/hadoop/2.5.1/libexec/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -input input -output output -mapper cat -reducer reduce.rb"
  sh "ruby deploy.rb"
end

convert.rb

現在のところ、数値標高データ（DEM5A及びDEM10B）についてXMLデータをGeoJSONLに変換する部分ができている。

# coding: utf-8
require 'json'

# Meshcode is probably Japanese English.
module Meshcode
  def self.width(code)
    case code.size
    when 8
      45.0 / 60 / 60
    else
      raise 'not implemented.'
    end
  end

  def self.height(code)
    case code.size
    when 8
      30.0 / 60 / 60
    else
      raise 'not implemented.'
    end
  end

  def self.lefttop(code)
    case code.size
    when 6
      [(code[2..3].to_f + code[5].to_f / 8) + 100, 
       (code[0..1].to_f + (code[4].to_f + 1) / 8) * 2 / 3]
    when 8
      [(code[2..3].to_f + code[5].to_f / 8 + code[7].to_f/ 80) + 100, 
       (code[0..1].to_f + code[4].to_f / 8 + (code[6].to_f + 1) / 80) * 2 / 3]
    else
      raise 'not implemented.'
    end
  end
end

module Math
  def self.sec(x)
    1.0 / cos(x)
  end
end

class DEM
  def parse(params)
    (left, top) = Meshcode::lefttop(params[:meshcode])
    skip = true
    count = 0
    File.foreach(params[:path], encoding: 'cp932') {|l|
      if l.include?('<gml:tupleList>')
        skip = false
        next
      elsif l.include?('</gml:tupleList>')
        skip = true
        next
      elsif !skip
        (i, j) = [count % @n_lng, count / @n_lng]
        lng = left + @d_lng * (i + 0.5)
        lat = top - @d_lat * (j + 0.5)
        lat_rad = lat * 2 * Math::PI / 360
        n = 2 ** params[:z]
        x = n * ((lng + 180) / 360)
        y = n *
          (1 - (Math::log(Math::tan(lat_rad) + Math::sec(lat_rad)) /
          Math::PI)) / 2
        x = x.to_i
        y = y.to_i
        (type, height) = l.encode('UTF-8').strip.split(',')
        next unless type
        f = {:type => 'Feature', 
          :geometry => {:type => 'Point', :coordinates => [lng, lat]},
          :properties => {
            :type => type, :height => height.to_f,
            :datePublished => params[:datePublished]
        }}
        print "#{params[:z]}/#{x}/#{y}.geojson\t#{JSON::dump(f)}\n"
        count += 1
      end
    }
  end
end

class DEM5A < DEM
  def initialize
    @n_lng = 225
    @n_lat = 150
    @d_lng = 1.0 / 80 / @n_lng
    @d_lat = 2.0 / 3 / 80 / @n_lat
  end
end

class DEM10B < DEM
  def initialize
    @n_lng = 1125
    @n_lat = 750
    @d_lng = 1.0 / 8 / @n_lng
    @d_lat = 2.0 / 3 / 8 / @n_lat
  end
end

ARGV.each {|path|
  next unless /xml$/.match path
  r = File.basename(path, '.xml').split('-')
  r.pop if r[-1].size == 4
  next unless r.shift == 'FG'
  next unless r.shift == 'GML'
  datePublished = r.pop.insert(4, '-').insert(7, '-')
  type = r.pop
  meshcode = r.join
  params = {:path => path, :type => type, :meshcode => meshcode, :z => 18,
    :datePublished => datePublished}
  case type
  when 'DEM5A'
    Kernel.const_get(type).new.parse(params)
  when 'dem10b'
    DEM10B.new.parse(params)
  else
    # print "converter for #{type} not implemented.\n"
  end
}

reduce.rb

hadoop の reducer に当てて、タイルごとに GeoJSONL のカンマ区切りリストを作成するスクリプト。reduce.rb で GeoJSON に組み立てることも可能であるが、reduce を多段に実施する場合を考えて、FeatureCollection を当てて GeoJSON に組み立てる作業は deploy.rb に移譲している。

#!/usr/bin/env ruby
require 'fileutils'
require 'json'

last = nil
geojsonl = nil

def write(geojsonl, path)
  print "#{path}\t#{JSON.dump(geojsonl)[1..-2]}\n"
end

while gets
  r = $_.strip.split("\t")
  current = r[0]
  if current != last
    write(geojsonl, last) unless last.nil?
    geojsonl = []
  end
  geojsonl << JSON.parse(r[1])
  last = current
end
write(geojsonl, last)

deploy.rb

上記 reduce の結果をファイルシステムにタイルセットとして配備するスクリプト。ここで Feature のカンマ区切りリストを FeatureCollection に組み上げ、valid な GeoJSON ファイルにしている。

require 'fileutils'
Dir.glob('output/part*') {|path|
  File.foreach(path) {|l|
    (path, geojsonl) = l.split("\t")
    path = "dst/#{path}"
    FileUtils.mkdir_p(File.dirname(path))
    File.open(path, 'w') {|w|
      geojson = <<-EOS
{"type": "FeatureCollection", "features": [#{geojsonl}]}
      EOS
      print "#{geojson.size} characters to #{path}\n"
      w.print geojson
    }
  }
}