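14. More Complex Operations in Polars

Let's move on to more complex operations, such as filtering and labelling rows with more involved conditions.

Trying out `pl.when`

pl.when() is used when the value you want to write depends on a condition.

Here we classify penguins as long, mid, or short by flipper length.

Condition	Value
`flipper_length_mm > 190`	`long`
`flipper_length_mm > 185`	`mid`
Otherwise	`short`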

import polars as pl

df = pl.read_csv("examples/penguins.csv")

df_flipper_group = df.with_columns(
    pl.when(pl.col("flipper_length_mm") > 190)
    .then(pl.lit("long"))
    .when(pl.col("flipper_length_mm") > 185)
    .then(pl.lit("mid"))
    .otherwise(pl.lit("short"))
    .alias("flipper_group")
)

print(df_flipper_group.select(["species", "flipper_length_mm", "flipper_group"]).head())

.otherwise() supplies the value used when none of the conditions match.

If you omit .otherwise(), rows that match none of the conditions become null.

import polars as pl

df = pl.read_csv("examples/penguins.csv")

df_flipper_group = df.with_columns(
    pl.when(pl.col("flipper_length_mm") > 190)
    .then(pl.lit("long"))
    .alias("flipper_group")
)

print(df_flipper_group.select(["species", "flipper_length_mm", "flipper_group"]).head())

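Using namespaces

Polars provides "namespaces", groups of methods dedicated to a particular data type. They make specialized operations on that type concise and efficient to write.

The most commonly used namespaces are:

Syntax	Applies to	Examples
`.str`	Strings	Character count, splitting, regular expressions
`.dt`	Dates and times	Year, month, day, weekday
`.list`	Lists	First element, length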

Strings

String columns expose their methods through .str.

import polars as pl

people = pl.DataFrame({
    "full_name": [
        "Ada Lovelace (UK)",
        "Alan Turing (UK)",
        "Grace Hopper (US)",
    ]
})
print(people)

people = people.with_columns([
    pl.col("full_name").str.len_chars().alias("name_length"),
    pl.col("full_name").str.split(" ").list.first().alias("first_name"),
    pl.col("full_name").str.extract(r"\((.*?)\)", 1).alias("country"),
])

print("                         ")
print(people)

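str.split(" ") splits each string on spaces and returns a list, which is why .list.first() can then pick out the first name.

str.extract() returns a capture group from a regular-expression match; here group 1 is the text inside the parentheses.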

Dates

import polars as pl

logs = pl.DataFrame({
    "date": ["2024-01-05", "2024-02-12", "2024-03-20"],
    "value": [10, 20, 15],
})

print(logs)

logs = logs.with_columns(
    pl.col("date").str.to_date().alias("date")
).with_columns([
    pl.col("date").dt.year().alias("year"),
    pl.col("date").dt.month().alias("month"),
    pl.col("date").dt.day().alias("day"),
])

print("                         ")
print(logs)

str.to_date() parses the strings into a date column.

After that, .dt.year(), .dt.month(), and .dt.day() extract the year, month, and day.

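Categorical type

A column in which the same strings occur many times can be easier to work with after conversion to the categorical type.

Internally, each distinct string is mapped to an integer.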

import polars as pl

df = pl.read_csv("examples/penguins.csv")

print(df.select("species"))
print(df.select("species").schema)

df = df.with_columns(
    pl.col("species").cast(pl.Categorical).alias("species_category")
)

print("                         ")
print(df.select("species_category"))
print(df.select("species_category").schema)

A column like species, where a fixed set of strings repeats over and over, is a good candidate for the categorical type.

Joining DataFrames

Let's go through the different kinds of joins one by one.

A join connects multiple DataFrames on one or more key columns.

students

student_id	name
1	Ada
2	Alan
3	Grace
4	Linus
5	Guido

scores

student_id	score
1	90
3	85
4	70
6	60
7	95

We join these on student_id as the key.

Inner join

Keeps only the keys that exist in both DataFrames.

import polars as pl

students = pl.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["Ada", "Alan", "Grace", "Linus", "Guido"],
})

scores = pl.DataFrame({
    "student_id": [1, 3, 4, 6, 7],
    "score": [90, 85, 70, 60, 95],
})

joined = students.join(scores, on="student_id", how="inner")

print(joined)

The result looks like this. Only the student_id values common to both frames remain.

student_id	name	score
1	Ada	90
3	Grace	85
4	Linus	70

Left outer join

Keeps every row of the left DataFrame.

import polars as pl

students = pl.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["Ada", "Alan", "Grace", "Linus", "Guido"],
})

scores = pl.DataFrame({
    "student_id": [1, 3, 4, 6, 7],
    "score": [90, 85, 70, 60, 95],
})

joined = students.join(scores, on="student_id", how="left")

print(joined)

Every row of the left side (students) remains; where the right side has no matching key, the value is null.

student_id	name	score
1	Ada	90
2	Alan	null
3	Grace	85
4	Linus	70
5	Guido	null

Full outer join

Keeps every key that exists on either side.

import polars as pl

students = pl.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["Ada", "Alan", "Grace", "Linus", "Guido"],
})

scores = pl.DataFrame({
    "student_id": [1, 3, 4, 6, 7],
    "score": [90, 85, 70, 60, 95],
})

joined = students.join(scores, on="student_id", how="full", coalesce=True)

print(joined)

Every key present on either side remains, and missing values become null.

student_id	name	score
1	Ada	90
2	Alan	null
3	Grace	85
4	Linus	70
5	Guido	null
6	null	60
7	null	95

Cross join

Builds every combination of rows from the two DataFrames.

models

model
small
large
huge

learning_rates

lr
0.01
0.001
0.0001

import polars as pl

models = pl.DataFrame({"model": ["small", "large", "huge"]})
learning_rates = pl.DataFrame({"lr": [0.01, 0.001, 0.0001]})

grid = models.join(learning_rates, how="cross")

print(grid)

This produces 3 × 3 = 9 combinations.

model	lr
small	0.01
small	0.001
small	0.0001
large	0.01
large	0.001
large	0.0001
huge	0.01
huge	0.001
huge	0.0001

Joining on multiple keys

Multiple columns can also serve as the key.

preds

student_id	task	pred
1	A	0.8
1	B	0.4
2	A	0.7
2	B	0.6
3	A	0.9

labels

student_id	task	label
1	A	1
1	B	0
2	A	1
2	B	1
3	A	0

We join on two keys, student_id and task.

import polars as pl

preds = pl.DataFrame({
    "student_id": [1, 1, 2, 2, 3],
    "task": ["A", "B", "A", "B", "A"],
    "pred": [0.8, 0.4, 0.7, 0.6, 0.9],
})

labels = pl.DataFrame({
    "student_id": [1, 1, 2, 2, 3],
    "task": ["A", "B", "A", "B", "A"],
    "label": [1, 0, 1, 1, 0],
})

joined = preds.join(labels, on=["student_id", "task"], how="inner")

print(joined)

Only rows where both student_id and task match remain.

student_id	task	pred	label
1	A	0.8	1
1	B	0.4	0
2	A	0.7	1
2	B	0.6	1
3	A	0.9	0

Semi join

Keeps only the left rows whose key exists on the right.

students

student_id	name
1	Ada
2	Alan
3	Grace
4	Linus
5	Guido

submitted

student_id
1
3
5

import polars as pl

students = pl.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["Ada", "Alan", "Grace", "Linus", "Guido"],
})

submitted = pl.DataFrame({"student_id": [1, 3, 5]})

matched = students.join(submitted, on="student_id", how="semi")

print(matched)

Only rows whose student_id appears in submitted remain. No columns from the right side are added.

student_id	name
1	Ada
3	Grace
5	Guido

Anti join

Keeps only the left rows whose key does not exist on the right.

import polars as pl

students = pl.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["Ada", "Alan", "Grace", "Linus", "Guido"],
})

submitted = pl.DataFrame({"student_id": [1, 3, 5]})

not_submitted = students.join(submitted, on="student_id", how="anti")

print(not_submitted)

Only rows whose student_id does not appear in submitted remain.

student_id	name
2	Alan
4	Linus

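Asof join

An asof join is for keys, typically timestamps, that do not match exactly: each left row is paired with the right row at the closest earlier (or equal) time. Both DataFrames must be sorted by the key.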

events

time	event
1	start
5	middle
10	end
15	resume
20	stop

measurements

time	value
0	100
3	120
8	130
12	140
18	150

import polars as pl

events = pl.DataFrame({
    "time": [1, 5, 10, 15, 20],
    "event": ["start", "middle", "end", "resume", "stop"],
})

measurements = pl.DataFrame({
    "time": [0, 3, 8, 12, 18],
    "value": [100, 120, 130, 140, 150],
})

joined = events.join_asof(measurements, on="time")

print(joined)

Each events.time is paired with the value at the largest measurements.time that is less than or equal to it.

time	event	value
1	start	100
5	middle	120
10	end	130
15	resume	140
20	stop	150

Non-equi join

Instead of testing for equality, you can also join on conditions such as inequalities.

sales

amount
80
120
250
350
500

discounts

min_amount	discount_rate
100	0.05
200	0.10
400	0.20

We keep every combination that satisfies amount >= min_amount.

import polars as pl

sales = pl.DataFrame({
    "amount": [80, 120, 250, 350, 500],
})

discounts = pl.DataFrame({
    "min_amount": [100, 200, 400],
    "discount_rate": [0.05, 0.10, 0.20],
})

joined = sales.join_where(
    discounts,
    pl.col("amount") >= pl.col("min_amount"),
)

print(joined)

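Every combination that satisfies the condition remains. amount=500 meets all three discount thresholds, so it yields three rows; amount=80 meets none, so it does not appear at all.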

amount	min_amount	discount_rate
120	100	0.05
250	100	0.05
250	200	0.10
350	100	0.05
350	200	0.10
500	100	0.05
500	200	0.10
500	400	0.20

Vertical concatenation

To stack DataFrames that have the same columns on top of each other, use pl.concat().

train

split	score
train	0.80
train	0.90
train	0.85

test

split	score
test	0.70
test	0.75
test	0.72

import polars as pl

train = pl.DataFrame({
    "split": ["train", "train", "train"],
    "score": [0.80, 0.90, 0.85],
})

test = pl.DataFrame({
    "split": ["test", "test", "test"],
    "score": [0.70, 0.75, 0.72],
})

all_scores = pl.concat([train, test])

print(all_scores)

The rows are stacked vertically.

split	score
train	0.80
train	0.90
train	0.85
test	0.70
test	0.75
test	0.72

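Summary

pl.when().then().otherwise() expresses conditional logic
Omitting .otherwise() leaves unmatched rows as null
Namespaces such as .str, .dt, and .list are available
Strings support length, splitting, and regular-expression extraction
Dates can be parsed and then broken down into year, month, and day
The categorical type is a candidate for columns where the same strings repeat
join() combines DataFrames on key columns
pl.concat() stacks DataFrames vertically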