Se Hyeon Kim

Email Validation

Configurations

Docker Image

We use the official Airflow image. To install the necessary libraries and packages into the Airflow container, we create a Dockerfile:

    FROM apache/airflow:2.10.2
    USER airflow
    COPY requirements.txt /requirements.txt
    RUN pip install -r /requirements.txt

requirements.txt:

    confluent-kafka
    cassandra-driver
    pymongo

This Dockerfile starts from the apache/airflow:2.10.2 image and then installs all the necessary libraries listed in the requirements.
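One way to confirm that the image built from this Dockerfile really contains the libraries is to check, inside the container, that each one can be imported. The sketch below is an illustration under assumptions, not part of the original setup: `missing_modules` is a hypothetical helper, and `confluent_kafka`, `cassandra`, and `pymongo` are the import names that the three packages in requirements.txt provide.

```python
import importlib.util

# Import names provided by the packages listed in requirements.txt
# (confluent-kafka, cassandra-driver, pymongo).
REQUIRED_MODULES = ["confluent_kafka", "cassandra", "pymongo"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running this script with the container's Python interpreter should report no missing modules if the `pip install` step succeeded.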

Crawler

Chapter 9: Design a web crawler

A web crawler is also known as a robot or spider. It is widely used by search engines to discover new or updated content on the web. Content can be a web page, an image, a video, a PDF file, etc. A web crawler starts by collecting a few web pages and then follows links on those pages to collect new content.
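The "start from a few pages, then follow links" loop can be sketched as a breadth-first traversal. This is a minimal illustration, not the design from the chapter: the `FAKE_WEB` dictionary stands in for real HTTP fetching and link extraction, and the names `crawl` and `fake_fetch_links` are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def fake_fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: visit the seeds, then follow links to unseen pages."""
    seeds = list(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    frontier = deque(seeds)                 # URLs waiting to be visited
    seen = set(seeds)                       # URLs already queued or visited
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    for page in crawl(["https://example.com/"], fake_fetch_links):
        print(page)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, and `max_pages` bounds the crawl. A real crawler would also need politeness delays, robots.txt handling, and error handling, which are left out here.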

Airflow_on_kubernetes

Install Helm chart

    brew install helm

Install the Chart

    {seilylook} 💎 minikube start
    {seilylook} 💎 helm repo add apache-airflow https://airflow.apache.org
    "apache-airflow" has been added to your repositories
    {seilylook} 💎 helm repo list
    NAME            URL
    apache-airflow  https://airflow.apache.org

Upgrade the Chart

    {seilylook} 💎 helm upgrade --install airflow apache-airflow/airflow --namespace airflow --create-namespace
    {seilylook} 💎 ~/Development/Devlog main ± kubectl get pods -n airflow -o wide
    NAME                  READY  STATUS   RESTARTS  AGE    IP   NODE  NOMINATED NODE  READINESS GATES
    airflow-postgresql-0  1/1    Running  0         9m10s  10.
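The `kubectl get pods` listing is how you tell whether the release came up. If you would rather check this from a script, kubectl can emit the same information as JSON with `-o json`. The sketch below is one possible way to do that, under assumptions: `pods_not_running` and `get_pods` are hypothetical helpers, and the sample data is made up for illustration (only the pod name `airflow-postgresql-0` comes from the output above).

```python
import json
import subprocess


def pods_not_running(pod_list):
    """Return (name, phase) for every pod whose phase is not Running or Succeeded."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in ("Running", "Succeeded"):
            problems.append((name, phase))
    return problems


def get_pods(namespace="airflow"):
    """Run kubectl and parse its JSON output (needs kubectl configured for the cluster)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Illustrative sample in the shape kubectl returns; not real cluster output.
SAMPLE = {
    "items": [
        {"metadata": {"name": "airflow-postgresql-0"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "airflow-example-pod"}, "status": {"phase": "Pending"}},
    ]
}

if __name__ == "__main__":
    print(pods_not_running(SAMPLE))
```

Against a live cluster you would call `pods_not_running(get_pods())` instead of using the sample. Note that the pod phase is coarser than the READY column: a pod can be in the Running phase while its containers are not yet ready.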

Create_nodes

Install and start Minikube

Install Minikube

    curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-darwin-amd64
    sudo install minikube-darwin-amd64 /usr/local/bin/minikube

Start the minikube cluster and check the status

    {seilylook} 🚀 minikube start
    😄  minikube v1.33.0 on Darwin 14.6.1 (arm64)
    ✨  Using the docker driver based on existing profile
    👍  Starting "minikube" primary control-plane node in "minikube" cluster
    🚜  Pulling base image v0.

Scrapy

Introduction

Apart from when I first learned Django as an undergraduate, this is the first chance I have had in a long time to do web crawling. I could have used BeautifulSoup or Selenium, but after looking around I found that Scrapy is widely used for data crawling in big data and deep learning work, so I decided to try it this time and built a simple data crawler.

Installation and getting started

    pip install scrapy

    scrapy startproject arxiv_crawling

Entering the startproject command makes Scrapy automatically generate a template project folder. Move into the generated project directory and create a project that matches the target URL.
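To see what `scrapy startproject` produced, you can compare the directory against the files in Scrapy's default project template. The sketch below assumes that default template (a `scrapy.cfg` at the root plus a package named after the project); `expected_files` and `missing_files` are hypothetical helpers, not part of Scrapy.

```python
from pathlib import Path


def expected_files(project_name):
    """Files created by Scrapy's default project template, relative to the project root."""
    pkg = Path(project_name)
    return [
        Path("scrapy.cfg"),
        pkg / "__init__.py",
        pkg / "items.py",
        pkg / "middlewares.py",
        pkg / "pipelines.py",
        pkg / "settings.py",
        pkg / "spiders" / "__init__.py",
    ]


def missing_files(root, project_name):
    """Return the expected template files that do not exist under root."""
    root = Path(root)
    return [
        path.as_posix()
        for path in expected_files(project_name)
        if not (root / path).is_file()
    ]


if __name__ == "__main__":
    # Run from inside the generated arxiv_crawling directory.
    print(missing_files(".", "arxiv_crawling"))
```

An empty list means the template is complete. Spiders you write later go in the `spiders` package.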

Docker

Introduction

I have studied Kubernetes, but it bothered me that I had never properly studied Docker itself, so I am taking this opportunity to understand Docker thoroughly and make it my own. As I read the official documentation, I am writing down mainly the parts I did not understand and the core code I will need to keep using.

Creating a Docker image

    # Spark Docker
    # builder step used to download and configure spark environment
    FROM openjdk:11.