Here we will create a SQL table from external kaggle dataset of Nobel Laureates. It is available in csv format.
Connect with SQL database
import sqlalchemy as db
import pandas as pd
# Create an engine to the `test` database
# The typical form of a database URL is `dialect+driver://username:password@host:port/database`
engine = db.create_engine('mysql+pymysql://root:root@localhost:3306/test')
# Print the table names
engine.table_names()
['student_data', 'world']
Handy function for future use.
def ex_df(query):
"""Execute query and return result in pandas.DataFrame"""
data = engine.execute(query).fetchall()
cols = [x[0] for x in engine.execute(query).cursor.description]
return pd.DataFrame(data=data, columns=cols)
Create table from external csv
Read csv file
df_nobel = pd.read_csv("./nobel_laureates/archive.csv")
df_nobel.head(2)
Year | Category | Prize | Motivation | Prize Share | Laureate ID | Laureate Type | Full Name | Birth Date | Birth City | Birth Country | Sex | Organization Name | Organization City | Organization Country | Death Date | Death City | Death Country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | 160 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Male | Berlin University | Berlin | Germany | 1911-03-01 | Berlin | Germany |
1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | 569 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | Male | NaN | NaN | NaN | 1907-09-07 | Châtenay | France |
The dataset has column names with space. It would be difficult to refer such columns in sql. Hence remove those space and keep all columns in lowercase.
df_nobel.columns = [s.replace(' ', '_').lower() for s in df_nobel.columns.tolist()]
Create a SQL tabel with name nobel
df_nobel.to_sql('nobel', engine, index=False)
engine.table_names()
['nobel', 'student_data', 'world']
View top two records
SELECT *
FROM nobel
LIMIT 2
year | category | prize | motivation | prize_share | laureate_id | laureate_type | full_name | birth_date | birth_city | birth_country | sex | organization_name | organization_city | organization_country | death_date | death_city | death_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | 160 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Male | Berlin University | Berlin | Germany | 1911-03-01 | Berlin | Germany |
1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | 569 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | Male | None | None | None | 1907-09-07 | Châtenay | France |
Change data type of date columns to DATE
ALTER TABLE nobel
MODIFY birth_date DATE;
ALTER TABLE nobel
MODIFY death_date DATE;
Check nobel table columns and their details
DESCRIBE nobel
Field | Type | Null | Key | Default | Extra | |
---|---|---|---|---|---|---|
0 | year | year(4) | YES | None | ||
1 | category | text | YES | None | ||
2 | prize | text | YES | None | ||
3 | motivation | text | YES | None | ||
4 | prize_share | text | YES | None | ||
5 | laureate_id | bigint(20) | YES | None | ||
6 | laureate_type | text | YES | None | ||
7 | full_name | text | YES | None | ||
8 | birth_date | date | YES | None | ||
9 | birth_city | text | YES | None | ||
10 | birth_country | text | YES | None | ||
11 | sex | text | YES | None | ||
12 | organization_name | text | YES | None | ||
13 | organization_city | text | YES | None | ||
14 | organization_country | text | YES | None | ||
15 | death_date | date | YES | None | ||
16 | death_city | text | YES | None | ||
17 | death_country | text | YES | None |
View last two records of table.
SELECT *
FROM nobel
ORDER BY year DESC
LIMIT 2
year | category | prize | motivation | prize_share | laureate_id | laureate_type | full_name | birth_date | birth_city | birth_country | sex | organization_name | organization_city | organization_country | death_date | death_city | death_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016 | Chemistry | The Nobel Prize in Chemistry 2016 | "for the design and synthesis of molecular mac... | 1/3 | 931 | Individual | Jean-Pierre Sauvage | 1944-10-21 | Paris | France | Male | University of Strasbourg | Strasbourg | France | None | None | None |
1 | 2016 | Chemistry | The Nobel Prize in Chemistry 2016 | "for the design and synthesis of molecular mac... | 1/3 | 932 | Individual | Sir J. Fraser Stoddart | 1942-05-24 | Edinburgh | United Kingdom | Male | Northwestern University | Evanston, IL | United States of America | None | None | None |
Who are the living nobel laureate
There are some records with early 19 with death_date as none. Seem like it is incomplete information as those records even does not have birth_date. So I will consider the laureate as living if they have birth entry but not death entry.
SELECT year,
full_name,
birth_date,
ROUND(DATEDIFF(NOW(), birth_date) / 365.25, 0) AS Age
FROM nobel
WHERE birth_date IS NOT NULL
AND death_date IS NULL
ORDER BY age
year | full_name | birth_date | Age | |
---|---|---|---|---|
0 | 2001 | A. Michael Spence | 1943-00-00 | None |
1 | 2014 | Malala Yousafzai | 1997-07-12 | 22 |
2 | 2011 | Tawakkol Karman | 1979-02-07 | 40 |
3 | 2010 | Konstantin Novoselov | 1974-08-23 | 45 |
4 | 2011 | Leymah Gbowee | 1972-02-01 | 47 |
5 | 2011 | Adam G. Riess | 1969-12-16 | 49 |
6 | 2011 | Adam G. Riess | 1969-12-16 | 49 |
7 | 2011 | Brian P. Schmidt | 1967-02-24 | 52 |
8 | 2014 | Stefan W. Hell | 1962-12-23 | 56 |
9 | 2014 | Stefan W. Hell | 1962-12-23 | 56 |
10 | 2014 | May-Britt Moser | 1963-01-04 | 56 |
11 | 2001 | Eric A. Cornell | 1961-12-19 | 57 |
12 | 2012 | Shinya Yamanaka | 1962-09-04 | 57 |
13 | 2012 | Shinya Yamanaka | 1962-09-04 | 57 |
14 | 2014 | Edvard I. Moser | 1962-04-27 | 57 |
15 | 2006 | Craig C. Mello | 1960-10-19 | 58 |
16 | 2009 | Carol W. Greider | 1961-04-15 | 58 |
17 | 2009 | Barack H. Obama | 1961-08-04 | 58 |
18 | 2014 | Eric Betzig | 1960-01-13 | 59 |
19 | 2014 | Hiroshi Amano | 1960-09-11 | 59 |
20 | 1992 | Rigoberta Menchú Tum | 1959-01-09 | 60 |
21 | 2002 | Koichi Tanaka | 1959-08-03 | 60 |
22 | 2006 | Andrew Z. Fire | 1959-04-27 | 60 |
23 | 2010 | Andre Geim | 1958-10-21 | 60 |
24 | 2015 | Takaaki Kajita | 1959-03-09 | 60 |
25 | 2001 | Wolfgang Ketterle | 1957-10-21 | 61 |
26 | 2011 | Bruce A. Beutler | 1957-12-29 | 61 |
27 | 2011 | Bruce A. Beutler | 1957-12-29 | 61 |
28 | 2003 | Roderick MacKinnon | 1956-02-19 | 63 |
29 | 2003 | Roderick MacKinnon | 1956-02-19 | 63 |
... | ... | ... | ... | ... |
293 | 1990 | Harry M. Markowitz | 1927-08-24 | 92 |
294 | 1994 | George A. Olah | 1927-05-22 | 92 |
295 | 2002 | Vernon L. Smith | 1927-01-01 | 92 |
296 | 2002 | Sydney Brenner | 1927-01-13 | 92 |
297 | 1975 | Ben Roy Mottelson | 1926-07-09 | 93 |
298 | 1980 | Paul Berg | 1926-06-30 | 93 |
299 | 1982 | Aaron Klug | 1926-08-11 | 93 |
300 | 2000 | Paul Greengard | 1925-12-11 | 93 |
301 | 2002 | Masatoshi Koshiba | 1926-09-19 | 93 |
302 | 1973 | Leo Esaki | 1925-03-12 | 94 |
303 | 2002 | Jimmy Carter | 1924-10-01 | 94 |
304 | 2005 | Roy J. Glauber | 1925-09-01 | 94 |
305 | 1974 | Antony Hewish | 1924-05-11 | 95 |
306 | 1977 | Roger Guillemin | 1924-01-11 | 95 |
307 | 1977 | Philip Warren Anderson | 1923-12-13 | 95 |
308 | 1981 | Torsten N. Wiesel | 1924-06-03 | 95 |
309 | 1987 | Robert M. Solow | 1924-08-23 | 95 |
310 | 1973 | Henry A. Kissinger | 1923-05-27 | 96 |
311 | 1986 | Stanley Cohen | 1922-11-17 | 96 |
312 | 1992 | Rudolph A. Marcus | 1923-07-21 | 96 |
313 | 2000 | Arvid Carlsson | 1923-01-25 | 96 |
314 | 1957 | Chen Ning Yang | 1922-09-22 | 97 |
315 | 1988 | Leon M. Lederman | 1922-07-15 | 97 |
316 | 1989 | Hans G. Dehmelt | 1922-09-09 | 97 |
317 | 1972 | Kenneth J. Arrow | 1921-08-23 | 98 |
318 | 1988 | Jack Steinberger | 1921-05-25 | 98 |
319 | 1981 | Nicolaas Bloembergen | 1920-03-11 | 99 |
320 | 1992 | Edmond H. Fischer | 1920-04-06 | 99 |
321 | 1997 | Jens C. Skou | 1918-10-08 | 100 |
322 | 1997 | Paul D. Boyer | 1918-07-31 | 101 |
323 rows × 4 columns
Data correction
Laureate A. Michael Spence has wrong birth date. Lets correct it.
UPDATE nobel
SET birth_date = '1943-11-07'
WHERE full_name = 'A. Michael Spence'
Check if it is updated.
SELECT year,
full_name,
birth_date
FROM nobel
WHERE full_name = 'A. Michael Spence'
year | full_name | birth_date | |
---|---|---|---|
0 | 2001 | A. Michael Spence | 1943-11-07 |
Also some of the entries are repeated as some laureates got more than one nobel prize.
SELECT DISTINCT year,
full_name,
birth_date,
ROUND(DATEDIFF(NOW(), birth_date)/365.25, 0) AS Age
FROM nobel
WHERE birth_date IS NOT NULL
AND death_date IS NULL
ORDER BY age DESC
year | full_name | birth_date | Age | |
---|---|---|---|---|
0 | 1997 | Paul D. Boyer | 1918-07-31 | 101 |
1 | 1997 | Jens C. Skou | 1918-10-08 | 100 |
2 | 1981 | Nicolaas Bloembergen | 1920-03-11 | 99 |
3 | 1992 | Edmond H. Fischer | 1920-04-06 | 99 |
4 | 1972 | Kenneth J. Arrow | 1921-08-23 | 98 |
5 | 1988 | Jack Steinberger | 1921-05-25 | 98 |
6 | 1957 | Chen Ning Yang | 1922-09-22 | 97 |
7 | 1988 | Leon M. Lederman | 1922-07-15 | 97 |
8 | 1989 | Hans G. Dehmelt | 1922-09-09 | 97 |
9 | 1973 | Henry A. Kissinger | 1923-05-27 | 96 |
10 | 1986 | Stanley Cohen | 1922-11-17 | 96 |
11 | 1992 | Rudolph A. Marcus | 1923-07-21 | 96 |
12 | 2000 | Arvid Carlsson | 1923-01-25 | 96 |
13 | 1974 | Antony Hewish | 1924-05-11 | 95 |
14 | 1977 | Philip Warren Anderson | 1923-12-13 | 95 |
15 | 1977 | Roger Guillemin | 1924-01-11 | 95 |
16 | 1981 | Torsten N. Wiesel | 1924-06-03 | 95 |
17 | 1987 | Robert M. Solow | 1924-08-23 | 95 |
18 | 1973 | Leo Esaki | 1925-03-12 | 94 |
19 | 2002 | Jimmy Carter | 1924-10-01 | 94 |
20 | 2005 | Roy J. Glauber | 1925-09-01 | 94 |
21 | 1975 | Ben Roy Mottelson | 1926-07-09 | 93 |
22 | 1980 | Paul Berg | 1926-06-30 | 93 |
23 | 1982 | Aaron Klug | 1926-08-11 | 93 |
24 | 2000 | Paul Greengard | 1925-12-11 | 93 |
25 | 2002 | Masatoshi Koshiba | 1926-09-19 | 93 |
26 | 1957 | Tsung-Dao (T.D.) Lee | 1926-11-24 | 92 |
27 | 1967 | Manfred Eigen | 1927-05-09 | 92 |
28 | 1977 | Andrew V. Schally | 1926-11-30 | 92 |
29 | 1987 | K. Alexander Müller | 1927-04-20 | 92 |
... | ... | ... | ... | ... |
260 | 2014 | Kailash Satyarthi | 1954-01-11 | 65 |
261 | 2014 | Shuji Nakamura | 1954-05-22 | 65 |
262 | 2012 | Brian K. Kobilka | 1955-05-30 | 64 |
263 | 2012 | Mo Yan | 1955-02-02 | 64 |
264 | 2003 | Roderick MacKinnon | 1956-02-19 | 63 |
265 | 2010 | Liu Xiaobo | 1955-12-28 | 63 |
266 | 2013 | Thomas C. Südhof | 1955-12-22 | 63 |
267 | 2001 | Wolfgang Ketterle | 1957-10-21 | 61 |
268 | 2011 | Bruce A. Beutler | 1957-12-29 | 61 |
269 | 1992 | Rigoberta Menchú Tum | 1959-01-09 | 60 |
270 | 2002 | Koichi Tanaka | 1959-08-03 | 60 |
271 | 2006 | Andrew Z. Fire | 1959-04-27 | 60 |
272 | 2010 | Andre Geim | 1958-10-21 | 60 |
273 | 2015 | Takaaki Kajita | 1959-03-09 | 60 |
274 | 2014 | Eric Betzig | 1960-01-13 | 59 |
275 | 2014 | Hiroshi Amano | 1960-09-11 | 59 |
276 | 2006 | Craig C. Mello | 1960-10-19 | 58 |
277 | 2009 | Barack H. Obama | 1961-08-04 | 58 |
278 | 2009 | Carol W. Greider | 1961-04-15 | 58 |
279 | 2001 | Eric A. Cornell | 1961-12-19 | 57 |
280 | 2012 | Shinya Yamanaka | 1962-09-04 | 57 |
281 | 2014 | Edvard I. Moser | 1962-04-27 | 57 |
282 | 2014 | May-Britt Moser | 1963-01-04 | 56 |
283 | 2014 | Stefan W. Hell | 1962-12-23 | 56 |
284 | 2011 | Brian P. Schmidt | 1967-02-24 | 52 |
285 | 2011 | Adam G. Riess | 1969-12-16 | 49 |
286 | 2011 | Leymah Gbowee | 1972-02-01 | 47 |
287 | 2010 | Konstantin Novoselov | 1974-08-23 | 45 |
288 | 2011 | Tawakkol Karman | 1979-02-07 | 40 |
289 | 2014 | Malala Yousafzai | 1997-07-12 | 22 |
290 rows × 4 columns
There are multiple laureates who are no more, but database is not updated. e.g. Paul D. Boyer died on 2 June 2018 wikipedia. Let’s update records for top few.
UPDATE nobel
SET death_date = CASE full_name
WHEN 'Paul D. Boyer' THEN '2018-06-02'
WHEN 'Jens C. Skou' THEN '2018-05-28'
WHEN 'Nicolaas Bloembergen' THEN '2017-09-05'
WHEN 'Kenneth J. Arrow' THEN '2017-02-21'
WHEN 'Leon M. Lederman' THEN '2018-10-03'
ELSE death_date
END
WHERE full_name IN ( 'Paul D. Boyer', 'Jens C. Skou', 'Nicolaas Bloembergen',
'Kenneth J. Arrow', 'Leon M. Lederman' )
Check if those values got updated
SELECT year,
full_name,
birth_date,
death_date
FROM nobel
WHERE full_name IN ( 'Paul D. Boyer', 'Jens C. Skou', 'Nicolaas Bloembergen',
'Kenneth J. Arrow', 'Leon M. Lederman' )
year | full_name | birth_date | death_date | |
---|---|---|---|---|
0 | 1972 | Kenneth J. Arrow | 1921-08-23 | 2017-02-21 |
1 | 1981 | Nicolaas Bloembergen | 1920-03-11 | 2017-09-05 |
2 | 1988 | Leon M. Lederman | 1922-07-15 | 2018-10-03 |
3 | 1997 | Paul D. Boyer | 1918-07-31 | 2018-06-02 |
4 | 1997 | Jens C. Skou | 1918-10-08 | 2018-05-28 |
Ideally I should write an utility to scrap wikipedia page for laureate and find if he/she is alive and update the records accrodingly.
Indian Nobel laureates
SELECT full_name,
year,
category,
motivation
FROM nobel
WHERE birth_country = 'India'
OR organization_country = 'India'
OR death_country = 'India'
full_name | year | category | motivation | |
---|---|---|---|---|
0 | Ronald Ross | 1902 | Medicine | "for his work on malaria, by which he has show... |
1 | Rabindranath Tagore | 1913 | Literature | "because of his profoundly sensitive, fresh an... |
2 | Sir Chandrasekhara Venkata Raman | 1930 | Physics | "for his work on the scattering of light and f... |
3 | Har Gobind Khorana | 1968 | Medicine | "for their interpretation of the genetic code ... |
4 | Mother Teresa | 1979 | Peace | None |
5 | Amartya Sen | 1998 | Economics | "for his contributions to welfare economics" |
6 | Venkatraman Ramakrishnan | 2009 | Chemistry | "for studies of the structure and function of ... |
7 | Kailash Satyarthi | 2014 | Peace | "for their struggle against the suppression of... |
Knights of the realm. List the winners, year and category where the winner starts with Sir.
SELECT full_name,
year,
category
FROM nobel
WHERE full_name LIKE ( 'Sir%' )