- `?` acts as a separator: it marks the end of the URL resource path and the start of the parameters.
- `groups=top_1000` specifies what the page will be about.
- `&ref_adv_prv` takes us to the next or the previous page. The reference is the page we’re currently on.
- `adv_nxt` and `adv_prv` are the two possible values. They translate to advance to next page and advance to previous page.
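To see the split between the resource path and the parameters, a minimal sketch with the standard library follows. The host name and the exact `ref_=` spelling of the parameter are assumptions made for illustration, not the real page address.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL: the host and the "ref_=" spelling are assumptions.
url = "https://www.example.com/search/title/?groups=top_1000&ref_=adv_prv"

parts = urlsplit(url)
params = parse_qs(parts.query)

print(parts.path)  # the resource path, everything before the "?"
print(params)      # the parameters, everything after the "?"
```

Each parameter value comes back as a list because a key may legally appear more than once in a query string.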
- `url` is the variable we create and assign the URL to.
- `results` is the variable we create to store the response from our `requests.get` call.
- `requests.get(url, headers=headers)` is the method we use to grab the contents of the URL.
- The `headers` part tells our scraper to bring us English, based on our previous line of code.
- `soup` is the variable we assign the `BeautifulSoup` object to. It parses the results with the HTML parser, which allows Python to read the components of the page rather than treating it as one long string.
- `print(soup.prettify())` will print what we’ve grabbed in a more structured tree format, making it easier to read.
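The pieces above might fit together roughly as follows. This is a sketch under assumptions: the `Accept-Language` value and the helper name `fetch_soup` are illustrative, and the demonstration parses a tiny hand-written page so it runs without a network connection.

```python
import requests
from bs4 import BeautifulSoup

# Assumed header value; the text only says it asks for English content.
headers = {"Accept-Language": "en-US, en;q=0.5"}


def fetch_soup(url):
    """Hypothetical helper: download a page and parse it with the HTML parser."""
    results = requests.get(url, headers=headers, timeout=10)
    results.raise_for_status()  # stop early on an HTTP error status
    return BeautifulSoup(results.text, "html.parser")


# Offline demonstration on hand-written markup (not the real page).
sample_html = "<html><body><h3><a href='#'>Parasite</a></h3></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
print(soup.prettify())
```

Calling `fetch_soup(url)` with a real address would return the same kind of `soup` object for the live page.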
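Each of those pieces of data gets its own empty list, written as `[]`. Remember the list of information we wanted to grab from each movie from earlier: title, year, runtime, IMDb rating, Metascore, votes, and US gross.

Set the `print` function aside until we need to use it again.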
There are `div` elements to the right with a `class` attribute that has two values: `lister-item` and `mode-advanced`.

If we search for that class (`lister-item mode-advanced`), we’ll see 50 matches representing the 50 movies displayed on a single page. We now know all the information we seek lies within this specific `div` tag.
We can tell the scraper to find all of the `lister-item mode-advanced` divs. Breaking `find_all` down:

- `movie_div` is the variable we’ll use to store all of the div containers with a class of `lister-item mode-advanced`.
- The `find_all()` method extracts all the div containers that have a class attribute of `lister-item mode-advanced` from what we have stored in our variable `soup`.
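A small sketch of that call is below. The HTML is a hand-written stand-in for the real page, with two movie containers and one unrelated `div`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page.
sample_html = """
<div class="lister-item mode-advanced"><h3><a>Parasite</a></h3></div>
<div class="lister-item mode-advanced"><h3><a>Jojo Rabbit</a></h3></div>
<div class="sidebar">not a movie</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# The full string matches divs whose class attribute is exactly this value.
movie_div = soup.find_all("div", class_="lister-item mode-advanced")
print(len(movie_div))
```

The result behaves like a list, which is why it can be looped over and measured with `len()`.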
Once we’ve grabbed the items from one `lister-item mode-advanced` `div` container, we need the scraper to loop to the next `lister-item mode-advanced` `div` container and grab those movie items too. And then it needs to loop to the next one and so on, 50 times for each page. For this to execute, we’ll need to wrap our scraper in a `for` loop.
- A `for` loop is used for iterating over a sequence. Our sequence is every `lister-item mode-advanced` `div` container that we stored in `movie_div`.
- `container` is the name of the variable that enters each div. You can name this whatever you want (`x`, `loop`, `banana`, `cheese`), and it won’t change the function of the loop.
The title sits in an `<a>` tag. This tag is nested within a header tag, `<h3>`. The `<h3>` tag is nested within a `<div>` tag. This `<div>` is the third of the `div`s nested in the container of the first movie.
- `name` is the variable we’ll use to store the title data we find.
- `container` is what we used in our `for` loop. It holds the current div on each pass.
- `.h3` and `.a` are attribute notation and tell the scraper to access each of those tags.
- `text` tells the scraper to grab the text nested in the `<a>` tag.
- `titles.append(name)` tells the scraper to take what we found and stored in `name` and to add it into our empty list called `titles`, which we created in the beginning.
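Put together, the loop and the title lookup could look like this sketch. The markup is hand-written for illustration and only imitates the nesting described above.

```python
from bs4 import BeautifulSoup

# Hand-written markup imitating the nesting: div > div > h3 > a.
sample_html = """
<div class="lister-item mode-advanced">
  <div><h3><a href="#">Parasite</a></h3></div>
</div>
<div class="lister-item mode-advanced">
  <div><h3><a href="#">Jojo Rabbit</a></h3></div>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
movie_div = soup.find_all("div", class_="lister-item mode-advanced")

titles = []
for container in movie_div:
    # Dot notation walks to the first <h3>, then to the first <a> inside it.
    name = container.h3.a.text
    titles.append(name)

print(titles)
```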
The year sits in a `<span>` tag below the `<a>` tag that contains the title of the movie. The dot notation, which we used for finding the title data (`.h3.a`), worked because it was the first `<a>` tag after the `h3` tag. Since the `<span>` tag we want is the second `<span>` tag, we have to use a different method to reach that `<span>`. We’ll use the `find()` method, which is similar to `find_all()` except it only returns the first match.
- `year` is the variable we’ll use to store the year data we find.
- `container` is what we used in our `for` loop. It holds the current div on each pass.
- `.h3` is attribute notation, which tells the scraper to access that tag.
- `find()` is a method we’ll use to access this particular `<span>` tag.
- (`'span', class_='lister-item-year'`) is the distinctive `<span>` tag we want.
- `years.append(year)` tells the scraper to take what we found and stored in `year` and to add it into our empty list called `years` (which we created in the beginning).
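The sketch below shows why dot notation is not enough here and how `find()` reaches the right tag. The markup is hand-written, and the class name on the first `<span>` is an assumption used only to make the two spans distinguishable.

```python
from bs4 import BeautifulSoup

# Hand-written markup; "lister-item-index" is an assumed class name.
sample_html = """
<div class="lister-item mode-advanced">
  <h3>
    <span class="lister-item-index">1.</span>
    <a href="#">Parasite</a>
    <span class="lister-item-year">(2019)</span>
  </h3>
</div>
<div class="lister-item mode-advanced">
  <h3>
    <span class="lister-item-index">12.</span>
    <a href="#">The Lighthouse</a>
    <span class="lister-item-year">(I) (2019)</span>
  </h3>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
movie_div = soup.find_all("div", class_="lister-item mode-advanced")

years = []
for container in movie_div:
    # Search by class, because .h3.span would stop at the first <span>.
    year = container.h3.find("span", class_="lister-item-year").text
    years.append(year)

first_span = movie_div[0].h3.span.text  # the first <span>, not the year
print(years, first_span)
```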
The runtime sits in a `<span>` tag with a class of `runtime`. Like we did with year, we can do something similar:

- `runtime` is the variable we’ll use to store the time data we find.
- `container` is what we used in our `for` loop. It holds the current div on each pass.
- `find()` is a method we’ll use to access this particular `<span>` tag.
- (`'span', class_='runtime'`) is the distinctive `<span>` tag we want.
- `if container.p.find('span', class_='runtime') else '-'` says if there’s data there, grab it, but if the data is missing, then put a dash there instead.
- `text` tells the scraper to grab that text in the `<span>` tag.
- `time.append(runtime)` tells the scraper to take what we found and stored in `runtime` and to add it into our empty list called `time` (which we created in the beginning).
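A sketch of the fallback follows, on hand-written markup where the second movie has no runtime. Storing the tag in a variable first (here `runtime_tag`, a name introduced for illustration) avoids searching twice; it is a small variation on the one-line form described above, not a different technique.

```python
from bs4 import BeautifulSoup

# Hand-written markup: the second container has no runtime span.
sample_html = """
<div class="lister-item mode-advanced">
  <p><span class="runtime">132 min</span></p>
</div>
<div class="lister-item mode-advanced">
  <p><span class="genre">Drama</span></p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
movie_div = soup.find_all("div", class_="lister-item mode-advanced")

time = []
for container in movie_div:
    runtime_tag = container.p.find("span", class_="runtime")
    # find() returns None when nothing matches, so fall back to a dash.
    runtime = runtime_tag.text if runtime_tag else "-"
    time.append(runtime)

print(time)
```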
The IMDb rating sits in a `<strong>` tag. Since I don’t see any other `<strong>` tags, we can use attribute notation (dot notation) to grab this data.

- `imdb` is the variable we’ll use to store the IMDb ratings data it finds.
- `container` is what we used in our `for` loop. It holds the current div on each pass.
- `.strong` is attribute notation that tells the scraper to access that tag.
- `text` tells the scraper to grab that text.
- The `float()` function turns the text we find into a float, which is a decimal number.
- `imdb_ratings.append(imdb)` tells the scraper to take what we found and stored in `imdb` and to add it into our empty list called `imdb_ratings` (which we created in the beginning).
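As a sketch, on hand-written markup with one `<strong>` tag per container:

```python
from bs4 import BeautifulSoup

# Hand-written markup with a single <strong> tag per movie.
sample_html = """
<div class="lister-item mode-advanced"><div><strong>8.6</strong></div></div>
<div class="lister-item mode-advanced"><div><strong>8.0</strong></div></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
movie_div = soup.find_all("div", class_="lister-item mode-advanced")

imdb_ratings = []
for container in movie_div:
    # The text is the string "8.6"; float() turns it into the number 8.6.
    imdb = float(container.strong.text)
    imdb_ratings.append(imdb)

print(imdb_ratings)
```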
The Metascore sits in a `<span>` tag that has a class that says `metascore favorable`. For other movies the class says `metascore mixed`. Since these tags are different, it’d be safe to tell the scraper to use just the class `metascore` when scraping:

- `m_score` is the variable we’ll use to store the Metascore-rating data it finds.
- `container` is what we used in our `for` loop. It holds the current div on each pass.
- `find()` is a method we’ll use to access this particular `<span>` tag.
- (`'span', class_='metascore'`) is the distinctive `<span>` tag we want.
- `text` tells the scraper to grab that text.
- `if container.find('span', class_='metascore') else '-'` says if there is data there, grab it, but if the data is missing, then put a dash there.
- The `int()` function turns the text we find into an integer.
- `metascores.append(m_score)` tells the scraper to take what we found and stored in `m_score` and to add it into our empty list called `metascores` (which we created in the beginning).
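Searching on the shared class alone could be sketched like this. The markup is hand-written; the third container has no Metascore so the dash fallback is exercised, and `score_tag` is a name introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written markup: two different class pairs and one missing score.
sample_html = """
<div class="lister-item mode-advanced"><span class="metascore favorable">96 </span></div>
<div class="lister-item mode-advanced"><span class="metascore mixed">58 </span></div>
<div class="lister-item mode-advanced"><span class="genre">Drama</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
movie_div = soup.find_all("div", class_="lister-item mode-advanced")

metascores = []
for container in movie_div:
    # A single class name matches any tag that lists it among its classes.
    score_tag = container.find("span", class_="metascore")
    # int() ignores the trailing space in text such as "96 ".
    m_score = int(score_tag.text) if score_tag else "-"
    metascores.append(m_score)

print(metascores)
```

Mixing integers and dashes in one list is why the column still needs cleaning once it reaches pandas.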
The votes and the gross each sit in a `<span>` tag that has a `name` attribute that equals `nv` and a `data-value` attribute that holds the values of the distinctive number we need for each.

- `nv` is an entirely new variable we’ll use to hold both the votes and the gross `<span>` tags.
- `container` is what we used in our `for` loop. It holds the current div on each pass.
- `find_all()` is the method we’ll use to grab both of the `<span>` tags.
- (`'span', attrs={'name': 'nv'}`) is how we can grab attributes of that specific tag.
- `vote` is the variable we’ll use to store the votes we find in the `nv` tag.
- `nv[0]` tells the scraper to go into the `nv` tag and grab the first item in the list, which is the votes because votes come first in our HTML code (Python lists are indexed from 0, not 1).
- `text` tells the scraper to grab that text.
- `votes.append(vote)` tells the scraper to take what we found and stored in `vote` and to add it into our empty list called `votes` (which we created in the beginning).
- `grosses` is the variable we’ll use to store the gross we find in the `nv` tag.
- `nv[1]` tells the scraper to go into the `nv` tag and grab the second item in the list, which is the gross because gross comes second in our HTML code.
- `nv[1].text if len(nv) > 1 else '-'` says if the length of `nv` is greater than one, then grab the second item that’s stored. But if the length of `nv` isn’t greater than one, meaning the gross is missing, then put a dash there.
- `us_gross.append(grosses)` tells the scraper to take what we found and stored in `grosses` and to add it into our empty list called `us_gross` (which we created in the beginning).
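A sketch of the two-span lookup follows. The markup is hand-written and simplified: the first container has only a votes span, so the length check falls back to a dash.

```python
from bs4 import BeautifulSoup

# Hand-written, simplified markup; the first movie has no gross span.
sample_html = """
<div class="lister-item mode-advanced">
  <span name="nv" data-value="282699">282,699</span>
</div>
<div class="lister-item mode-advanced">
  <span name="nv" data-value="142517">142,517</span>
  <span name="nv">$0.35M</span>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
movie_div = soup.find_all("div", class_="lister-item mode-advanced")

votes = []
us_gross = []
for container in movie_div:
    # attrs takes a dict; "name" cannot be passed as a plain keyword here
    # because find_all() already uses that keyword for the tag name.
    nv = container.find_all("span", attrs={"name": "nv"})
    vote = nv[0].text                              # index 0: votes come first
    grosses = nv[1].text if len(nv) > 1 else "-"   # index 1: gross, if present
    votes.append(vote)
    us_gross.append(grosses)

print(votes, us_gross)
```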
['Parasite', 'Jojo Rabbit', '1917', 'Knives Out', 'Uncut Gems', 'Once Upon a Time... in Hollywood', 'Joker', 'The Gentlemen', 'Ford v Ferrari', 'Little Women', 'The Irishman', 'The Lighthouse', 'Toy Story 4', 'Marriage Story', 'Avengers: Endgame', 'The Godfather', 'Blade Runner 2049', 'The Shawshank Redemption', 'The Dark Knight', 'Inglourious Basterds', 'Call Me by Your Name', 'The Two Popes', 'Pulp Fiction', 'Inception', 'Interstellar', 'Green Book', 'Blade Runner', 'The Wolf of Wall Street', 'Gone Girl', 'The Shining', 'The Matrix', 'Titanic', 'The Silence of the Lambs', 'Three Billboards Outside Ebbing, Missouri', "Harry Potter and the Sorcerer's Stone", 'The Peanut Butter Falcon', 'The Handmaiden', 'Memories of Murder', 'The Lord of the Rings: The Fellowship of the Ring', 'Gladiator', 'The Martian', 'Bohemian Rhapsody', 'Watchmen', 'Forrest Gump', 'Thor: Ragnarok', 'Casino Royale', 'The Breakfast Club', 'The Godfather: Part II', 'Django Unchained', 'Baby Driver']
['(2019)', '(2019)', '(2019)', '(2019)', '(2019)', '(2019)', '(2019)', '(2019)', '(2019)', '(2019)', '(2019)', '(I) (2019)', '(2019)', '(2019)', '(2019)', '(1972)', '(2017)', '(1994)', '(2008)', '(2009)', '(2017)', '(2019)', '(1994)', '(2010)', '(2014)', '(2018)', '(1982)', '(2013)', '(2014)', '(1980)', '(1999)', '(1997)', '(1991)', '(2017)', '(2001)', '(2019)', '(2016)', '(2003)', '(2001)', '(2000)', '(2015)', '(2018)', '(2009)', '(1994)', '(2017)', '(2006)', '(1985)', '(1974)', '(2012)', '(2017)']
['132 min', '108 min', '119 min', '131 min', '135 min', '161 min', '122 min', '113 min', '152 min', '135 min', '209 min', '109 min', '100 min', '137 min', '181 min', '175 min', '164 min', '142 min', '152 min', '153 min', '132 min', '125 min', '154 min', '148 min', '169 min', '130 min', '117min', '180 min', '149 min', '146 min', '136 min', '194 min', '118 min', '115 min', '152 min', '97 min', '145 min', '132 min', '178 min', '155 min', '144 min', '134 min', '162 min', '142 min', '130 min', '144 min', '97 min', '202 min', '165 min', '113 min']
[8.6, 8.0, 8.5, 8.0, 7.6, 7.7, 8.6, 8.1, 8.2, 8.0, 8.0, 7.7, 7.8, 8.0, 8.5, 9.2, 8.0, 9.3, 9.0, 8.3, 7.9, 7.6, 8.9, 8.8, 8.6, 8.2, 8.1, 8.2, 8.1,8.4, 8.7, 7.8, 8.6, 8.2, 7.6, 7.7, 8.1, 8.1, 8.8, 8.5, 8.0, 8.0, 7.6, 8.8, 7.9, 8.0, 7.9, 9.0, 8.4, 7.6]
['96 ', '58 ', '78 ', '82 ', '90 ', '83 ', '59 ', '51 ', '81 ', '91 ', '94 ', '83 ', '84 ', '93 ', '78 ', '100 ', '81 ', '80 ', '84 ', '69 ', '93 ', '75 ', '94 ', '74 ', '74 ', '69 ', '84 ', '75 ', '79 ', '66 ', '73 ', '75 ', '85 ', '88 ', '64 ', '70 ', '84 ', '82 ', '92 ', '67 ', '80 ', '49 ', '56 ', '82 ', '74 ', '80 ', '62 ', '90 ', '81 ', '86 ']
['282,699', '142,517', '199,638', '195,728', '108,330', '396,071', '695,224', '42,015', '152,661', '65,234', '249,950', '77,453', '160,180', '179,887', '673,115', '1,511,929', '414,992', '2,194,397', '2,176,865', '1,184,882', '178,688', '76,291', '1,724,518', '1,925,684', '1,378,968', '293,695', '656,442', '1,092,063', '799,696', '835,496', '1,580,250', '994,453', '1,191,182', '383,958', '595,613', '34,091', '92,492', '115,125', '1,572,354', '1,267,310', '715,623', '410,199', '479,811', '1,693,344', '535,065', '555,756', '330,308', '1,059,089', '1,271,569', '398,553']
['-', '$0.35M', '-', '-', '-', '$135.37M', '$192.73M', '-', '-', '-', '-', '$0.43M', '$433.03M', '-', '$858.37M', '$134.97M', '$92.05M', '$28.34M', '$534.86M', '$120.54M', '$18.10M', '-', '$107.93M', '$292.58M', '$188.02M', '$85.08M', '$32.87M', '$116.90M', '$167.77M', '$44.02M', '$171.48M', '$659.33M', '$130.74M', '$54.51M', '$317.58M', '$13.12M', '$2.01M', '$0.01M', '$315.54M', '$187.71M', '$228.43M', '$216.43M', '$107.51M', '$330.25M', '$315.06M', '$167.45M', '$45.88M', '$57.30M', '$162.81M', '$107.83M']
- `movies` is what we’ll name our DataFrame.
- `pd.DataFrame` is how we initialize the creation of a DataFrame with pandas.
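Building the DataFrame from the lists might look like the sketch below, using the first three entries of the printed lists. The column names `movie` and `imdb` are assumptions; the other column names are the ones the text uses later.

```python
import pandas as pd

# First three entries of each scraped list.
titles = ["Parasite", "Jojo Rabbit", "1917"]
years = ["(2019)", "(2019)", "(2019)"]
time = ["132 min", "108 min", "119 min"]
imdb_ratings = [8.6, 8.0, 8.5]
metascores = ["96 ", "58 ", "78 "]
votes = ["282,699", "142,517", "199,638"]
us_gross = ["-", "$0.35M", "-"]

# Each dictionary key becomes a column name ("movie" and "imdb" are assumed).
movies = pd.DataFrame({
    "movie": titles,
    "year": years,
    "timeMin": time,
    "imdb": imdb_ratings,
    "metascore": metascores,
    "votes": votes,
    "us_grossMillions": us_gross,
})

print(movies.dtypes)
```

Printing `movies.dtypes` is a quick way to see which columns still hold text and need converting.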
Our `year`, `timeMin`, `metascore`, and `votes` show they’re objects when they should be integer data types, and our `us_grossMillions` is an object instead of a `float` data type. How did this happen?
The word `cheese` and the phrase `I ate 10 blocks of cheese` are both strings. If we were to get rid of everything except the `10` from the `I ate 10 blocks of cheese` string, it’s still a string, but now it’s one that only says `10`.
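The same point in plain Python, as a small sketch: stripping a string down to its digits leaves a string, and only an explicit conversion makes it a number.

```python
phrase = "I ate 10 blocks of cheese"

# Keep only the digit characters; the result is still a string.
digits = "".join(ch for ch in phrase if ch.isdigit())
print(repr(digits), type(digits).__name__)

# An explicit conversion is what turns it into an integer.
number = int(digits)
print(number + 1)
```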
- `movies['year']` tells pandas to go to the column `year` in our `DataFrame`.
- `.str.extract('(\d+)')` is the method we apply: `('(\d+)')` says to extract the digits in the string.
- The `.astype(int)` method converts the result to an integer.

Adding `print(movies['year'])` to the bottom of our program shows what our year data looks like.
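A sketch of the year cleanup on three sample values is below. It passes `expand=False` so that `extract` returns a single column rather than a one-column DataFrame, and uses a raw string for the pattern; both are choices made for this example.

```python
import pandas as pd

movies = pd.DataFrame({"year": ["(2019)", "(I) (2019)", "(1972)"]})

# (\d+) captures the first run of digits in each string.
movies["year"] = movies["year"].str.extract(r"(\d+)", expand=False).astype(int)

print(movies["year"])
```

Note that `(I) (2019)` still yields the year, because the pattern skips anything that is not a digit.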
- `movies['votes']` is our votes data in our movies `DataFrame`. We’re assigning our new cleaned up data to our `votes` column.
- `.str.replace(',', '')` grabs the string and uses the `replace` method to replace the commas with an empty string (nothing).
- The `.astype(int)` method converts the result into an integer.
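As a sketch on two sample values, with `regex=False` added to make clear the comma is a literal character and not a pattern:

```python
import pandas as pd

movies = pd.DataFrame({"votes": ["282,699", "1,511,929"]})

# Remove the thousands separators, then convert the text to integers.
movies["votes"] = movies["votes"].str.replace(",", "", regex=False).astype(int)

print(movies["votes"])
```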
- `movies['us_grossMillions']` is our gross data in our movies `DataFrame`. We’ll be assigning our new cleaned up data to our `us_grossMillions` column.
- `movies['us_grossMillions']` tells pandas to go to the column `us_grossMillions` in our `DataFrame`.
- The `.map()` function calls the specified function for each item of an iterable.
- `lambda x: x` is an anonymous function in Python (one without a name). Normal functions are defined using the `def` keyword.
- `lstrip('$').rstrip('M')` is the body of our function. This tells our function to strip the `$` from the left side and strip the `M` from the right side.
- `movies['us_grossMillions']` is stripped of the elements we don’t need, and now we’ll assign the converted data to it to finish it up.
- `pd.to_numeric` is a function we can use to change this column to a float. The reason we use this is because we have a lot of dashes in this column, and we can’t just convert it to a float using `.astype(float)`, which would raise an error.
- `errors='coerce'` will transform the nonnumeric values, our dashes, into NaN (not-a-number) values because we have dashes in place of the data that’s missing.
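The two steps could be sketched like this on three sample values, one of which is a dash standing in for a missing gross:

```python
import pandas as pd

movies = pd.DataFrame({"us_grossMillions": ["-", "$0.35M", "$135.37M"]})

# Step 1: strip the "$" on the left and the "M" on the right of each string.
movies["us_grossMillions"] = movies["us_grossMillions"].map(
    lambda x: x.lstrip("$").rstrip("M")
)

# Step 2: convert to floats; anything nonnumeric (the dash) becomes NaN.
movies["us_grossMillions"] = pd.to_numeric(
    movies["us_grossMillions"], errors="coerce"
)

print(movies["us_grossMillions"])
```

The dash passes through step 1 unchanged, since it has neither character to strip, and is only turned into NaN in step 2.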
movies.to_csv('movies.csv')
Give the file a name with a `.csv` extension. I named mine `movies.csv`, as you can see above, but feel free to name it whatever you like. Just make sure to change the code above to match it.
Name the file with a `.csv` extension. Then, add the code to the end of your program:
movies.to_csv('the_name_of_your_csv_here.csv')
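To check what ends up in the file, a sketch that writes a one-row DataFrame to a temporary folder and reads it back is below. The temporary folder is only there to keep the example from leaving files behind; in the tutorial the file is saved next to the program.

```python
import os
import tempfile

import pandas as pd

movies = pd.DataFrame({"movie": ["Parasite"], "year": [2019]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "movies.csv")
    movies.to_csv(csv_path)  # the row index is written as the first column
    with open(csv_path) as f:
        csv_text = f.read()

print(csv_text)
```

If the extra index column is not wanted, `to_csv` also accepts `index=False`.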