This was in my drafts from 1/2010. It appears to be from a class that I took in college, but either way it might be helpful. *I am not the original author of the content below.*
World Wide Web
- System of interlinked hypertext documents accessed via the Internet. Navigate from page to page via hyperlinks.
- Al Gore really did invent it. Kind of. Al Gore created legislation (the Gore Bill) which funded the High-Performance Computing and Communications Initiative and led to Mosaic, the first widely used graphical web browser.
- It’s all very hyper. HyperCard. HyperText. HyperLink. HyperText Markup Language. HyperText Transfer Protocol. Here, “hyper” does not refer to the frantic pace of the web. Rather, it simply means “linked”.
- It’s ginormous
- Terabyte = 1,000 gigabytes = 1,000,000 megabytes
- Library of Congress: 11 terabytes
- Indexed web: 167 terabytes
- “Deep web”: 7,500 – 91,000 terabytes
The indexed web is the part of the web that search engines know about, the part you can find via Google. The deep web is the part that is not easily accessible. A page may be unindexed because no other page links to it. It may also be hidden because it is generated by a dynamic web page that requires input.
About HTTP Requests
- HyperText Transfer Protocol: All requests for web pages are made via HTTP. HTTP is the “language” that computers use to talk to each other.
- Client/Server: The client is the party making the request for content. It’s typically a web browser, though it could be a spider or a program written in any programming language. The server is the computer that has what the client wants. Spiders: Spiders are automated programs, or bots, that crawl the web looking for things. The most common example is Google’s crawler. A more malicious example is a spider that harvests e-mail addresses to spam. They are called spiders because of the way they “crawl the web”.
- Stateless model: This has nothing to do with political boundaries. In this case, state means “the way something is with respect to its main attributes”, as in “the current state of knowledge”, “his state of health”, or “in a weak financial state”. When we say HTTP is stateless, we mean each request stands by itself: the server does not remember earlier requests.
- Uniform Resource Locators (URLs) and their parts: URLs consist of a scheme and an address. http is the scheme for most web requests.
- Domain names: A domain name identifies a computer on the Internet. A Top Level Domain (TLD) is the suffix on a web site’s name (often 3 letters), such as .com, .edu, or .uk. Domain names, the part just to the left of the TLD (‘scc-fl’), are typically purchased through domain registrars. Subdomains are the parts to the left of the domain name and are used to further specify which computer on the network you want (‘www’).
- Headers: They define various characteristics of the data that is requested or the data that has been provided. The request method (such as GET or POST) is the “action” word of the HTTP request.
- GET request: Requests a representation of the specified resource. By far the most common method used on the Web today.
- POST request: Submits data to be processed (e.g. from an HTML form) to the identified resource. The data is included in the body of the request.
- Sample HTTP request:

      GET /tward/week_1.htm HTTP/1.1
      Host: www2.cccc.edu
      User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:184.108.40.206) ...
      Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9, ...
      Accept-Language: en-us,en;q=0.5
      Accept-Encoding: gzip,deflate
      Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
      Keep-Alive: 300
      Connection: keep-alive
      Referer: http://www2.ccc.edu/scssule.htm
      Cookie: [private cookies go here]

- Sample HTTP response headers:

      HTTP/1.x 200 OK
      Content-Length: 1457
      Content-Type: text/html
      Last-Modified: Thu, 17 Jan 2008 03:05:43 GMT
      Accept-Ranges: bytes
      Etag: "28f1fdd8b558c81:d0bd"
      Server: Microsoft-IIS/6.0
      X-Powered-By: ASP.NET
      IISExport: This web site was exported using IIS Export v4.2
      MicrosoftOfficeWebServer: 5.0_Pub
      Date: Thu, 17 Jan 2008 03:08:23 GMT

- HTTP over SSL (Secure Sockets Layer): Technically not its own protocol. It is the same HTTP protocol, but with encryption to keep the data secure between the client and server. Once the data has been decrypted, there is no difference between HTTP and HTTPS.
- How secure? It’s very secure in protecting the data while it’s in transit, but once data arrives at its destination, it’s only as safe as the computer it’s on. Gene Spafford states that it is like “using an armored truck to transport rolls of pennies between someone on a park bench and someone doing business from a cardboard box.”
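
To make the URL and request items above concrete, here is a minimal Python sketch that splits a URL into its parts and builds the text of a bare GET request. The URL is assembled from the Host and path in the sample request above. The helper name `build_get_request` and the `Connection: close` header are illustrative assumptions, and a real browser sends many more headers than this.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Return the raw text of a minimal HTTP/1.1 GET request (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path or "/"          # an empty path means the site root
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",       # method, resource, protocol version
        f"Host: {parts.netloc}",      # required header in HTTP/1.1
        "Connection: close",          # illustrative choice, not from the sample
    ]
    # Header lines end with CRLF; a blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


url = "http://www2.cccc.edu/tward/week_1.htm"
parts = urlsplit(url)
scheme = parts.scheme                 # the part before "://"
host = parts.hostname                 # the full host name
# Naive split on dots: good enough for this host, wrong for names like example.co.uk.
subdomain = host.split(".")[0]
tld = host.split(".")[-1]
request_text = build_get_request(url)
print(scheme, host, subdomain, tld)
print(request_text)
```

Nothing is sent over the network here; the sketch only shows how the pieces of a URL map onto the first lines of a request.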
Terms Associated with HTTP requests
- Cookies: A cookie is plain text sent by the web server to the web browser. The web browser saves the text to a file and sends it back to that server, and only that server, each time it makes a request. It’s a way to work around the stateless nature of HTTP. Only the server you visit can send you a cookie, and your browser will only send cookies back to the server that originally created them. However, you can still pick up a lot of tracking cookies by way of ads, because ads are loaded from the advertiser’s own servers.
- Cache: A way to store things locally to increase performance. While it can greatly speed up web requests, it can frustrate developers when users don’t see the new version of a page. There is more than one potential cache location for a web request.
- Content Type: The way the server tells the browser what kind of data it’s sending and how it should handle the data.
- Character encoding – bits and bytes: A bit is the lowest level of data storage in computers. It can be either a 1 or a 0. 8 bits make a byte. In early computing, a byte represented one character (a letter). With 8 bits, you have a maximum of 256 different characters. For early computing (in English) that’s plenty. However, many languages have more than 256 characters. In addition, people wanted to be able to type other kinds of symbols. So many different character maps appeared, and more than one byte is now often used to represent one character. The computer therefore needs to know which encoding is used so it knows whether 01010100… is a ‘T’ or some other character. If you see “funky” characters on the screen, or question marks where there should be something else, there are character encoding issues. The biggest culprit is people pasting text from Microsoft Word, which typically uses Windows-1252 (Microsoft’s extension of ISO-8859-1), into something meant for the web, which typically uses UTF-8.
- Transmission Control Protocol / Internet Protocol
- Official Protocol of the Internet™
- Packets: Information sent over TCP/IP is split up into packets. You do not receive the entire web site in one “blob”; rather, it gets split into chunks, or packets. Each packet is self-contained. Packets consist of a header and the payload. The header contains where the packet is going, where it came from, a sequence number that lets the receiver put the packets back in order, and a checksum that lets the receiver validate that the contents of the packet did not get corrupted. Packets allow traffic to be load balanced, sending half of the packets along one route and the other half a different way. Time to download is more dependent on the number of packets than on the size of the file. Sometimes a packet will be lost and will need to be resent.
- Other common TCP/IP Protocols
- FTP – File Transfer Protocol: A method of transferring files across the Internet. HTTP is mostly used to download files, while FTP is designed for moving files in both directions.
- DNS – Domain Name System: The “phonebook” of the Internet. A DNS server is a specialized server that converts the domain names you enter into your browser into the numeric IP address where the computer is located.
- IMAP and POP – e-mail Protocols
- SSH – Secure Shell: Used as a way to get access to the server and execute commands on it.
- Market share: Netscape won initially by being first to market. However, their lack of resources allowed them to be overtaken by Internet Explorer. Netscape released their code base to give other browsers a working start. Browsers based on Netscape’s code base are called “Gecko”-based browsers. Mozilla, using this code base, has gained a great deal of ground on Internet Explorer but is unlikely to ever beat it out due to Internet Explorer’s integration with the Windows operating system.
- Test in all: You never know which browser your visitor, customer, or instructor will be using to visit your site. Make sure it works in all browsers as best you can.
- The Browser War: In the 90s, Netscape and IE were fighting for control of the marketplace. In doing so, they were both trying to push the market forward as fast as they could. They were able to push the technology forward at a very rapid pace, but there were significant growing pains.
- Browser differences – Standards Compliance: In their rush to push technology forward, each browser developed proprietary elements that the other browser did not have. A number of these were good ideas and have stuck around. A good number of these were bad ideas and are no longer used (hopefully). The rapid development cycle and the rush to be different from each other meant that many web sites that took advantage of these new features only worked in one browser, hence the infamous “Best viewed in Internet Explorer 4.0” image. Finally, the best interests of the web won out and standards were created. There are still many differences between the browsers, but for the most part they are minor and not in areas of HTML that are used frequently.
- Internet Explorer: The most popular browser. It comes with every Windows computer and thus requires the least effort to use, which has made it the most popular. However, its status as the most popular caused its development to lag. Features that were appearing in many browsers were absent from IE for a number of years until the release of IE7. In addition, IE is the least standards compliant of the browsers. Microsoft sometimes adopts the attitude that developers need to conform to its standards.
- Firefox: Firefox is based on Netscape’s code. It’s an open source project. Open source means anyone in the world can contribute code to the project. This allows a program run by a very small team to add features at least as quickly as a large company like Microsoft. They’re essentially leveraging volunteer work from all around the world. What makes their browser stand out is its extensions.
- Safari: Safari was originally written for the Macintosh. However, as part of the iPhone launch Apple ported Safari to Windows and released a Windows version of the browser. While Firefox is slightly faster than Internet Explorer, Safari is much faster than either.
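
Returning to the character encoding item in the list above, the following Python sketch shows how the same text becomes different bytes under UTF-8 and ISO-8859-1, and what happens when bytes are read with the wrong encoding. The sample word is an arbitrary choice for illustration, not something from the original notes.

```python
# "café", written with an escape so the source file itself stays plain ASCII.
text = "caf\u00e9"

utf8_bytes = text.encode("utf-8")           # the accented letter takes two bytes
latin1_bytes = text.encode("iso-8859-1")    # the accented letter takes one byte

# Reading UTF-8 bytes as ISO-8859-1 produces the classic "funky" characters.
garbled = utf8_bytes.decode("iso-8859-1")

# Reading ISO-8859-1 bytes as UTF-8 fails on the accented letter; with
# errors="replace" Python substitutes U+FFFD, the replacement character.
replaced = latin1_bytes.decode("utf-8", errors="replace")

print(utf8_bytes, len(utf8_bytes))
print(latin1_bytes, len(latin1_bytes))
# ascii() keeps the output printable on any terminal.
print(ascii(garbled))
print(ascii(replaced))
```

Decoding with the encoding that was actually used to write the bytes gives back the original text, which is why the server needs to tell the browser which one it used.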
HyperText Markup Language – HTML
- HTML “Markup”: HTML was designed as a way to lightly style text. HTML was designed a long time ago, before the reach of the web was imagined. HTML was not designed to allow people to create documents and pages for presentation, as you see today. Originally, HTML was used to share scholarly papers and research. It had a very Spartan appearance. During the browser war of the 90s the focus shifted to creating presentation-style web pages that looked pretty. As a result, a lot of new tags were introduced and a lot of bad ways of doing things were created. XHTML and CSS attempt to restore HTML to being a markup language and separate the style from the page content. XHTML gives very narrow definitions of what is valid markup. Perhaps too narrow.
- “Hello World” HTML

      <html>
        <head>
          <title>Hello World</title>
        </head>
        <body>
          <h2>Hello world!</h2>
          <p>Thank you for visiting</p>
        </body>
      </html>

- Any text editor can create HTML. A word processor can also create the markup, as long as the file is saved as plain text. Once it’s saved to the hard drive, a web browser on that computer can read it and display it as a page.
- Valid XHTML: XHTML is a stricter, XML-based version of HTML that is much less forgiving about what is allowed and what is not. XHTML defines what is a valid tag and which tags are allowed to have which attributes.
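
One way to get a feel for how strict XHTML is: because XHTML must be well-formed XML, an XML parser will reject markup that a browser would happily display as HTML. The sketch below uses Python’s standard `xml.etree.ElementTree` on the “Hello World” page. The helper name `is_well_formed` is an assumption made for this example, and well-formedness is only one part of validity; a real validator also checks tags and attributes against the XHTML definition.

```python
import xml.etree.ElementTree as ET

HELLO = (
    "<html><head><title>Hello World</title></head>"
    "<body><h2>Hello world!</h2><p>Thank you for visiting</p></body></html>"
)

# The same page with the closing </p> left out. Browsers tolerate this in HTML,
# but it is not well-formed XML, so it cannot be valid XHTML.
SLOPPY = HELLO.replace("</p>", "")


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(HELLO))
print(is_well_formed(SLOPPY))
```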
Life cycle of an HTTP request
- Request to DNS server: The browser sends a request to the DNS server and asks for the IP address of the server it’s trying to connect to.
- DNS server converts the domain name into an IP address: The DNS server looks up the IP address and sends it back to the client. The network of DNS servers relies on each other to keep an accurate record. DNS changes can sometimes take a long time to propagate, depending on your ISP.
- Request to web server: Using the IP address it got from the DNS server, the browser creates an HTTP request and sends it on to the web server.
- Response of server, hopefully containing HTML: The web server attempts to fill the request. Once completed, success or failure, the server sends a status code back to the browser. If the request was successful, it will also send the requested content.
- Subsequent requests of additional media: The browser reads the HTML. As it encounters things it needs to display the page (images, etc.), it makes additional requests to the web server. Remember, HTTP is stateless, so each request goes through this life cycle and knows nothing of previous or future requests. The many requests needed to display a page are why “hits” are a bad way to measure traffic to a web site. Each page may require a dozen or more hits to the server.
- Browser displays content: The browser displays the HTML and additional media to the user, according to the browser’s own rules. This is where browser differences show.
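
The “subsequent requests” step above can be sketched in Python by scanning a page for the extra files a browser would need to fetch. This is a simplified illustration using the standard `html.parser` module: the class name `ResourceCounter`, the sample page, and its file names are all made up for the example, and the sketch ignores caching and many other resource types.

```python
from html.parser import HTMLParser


class ResourceCounter(HTMLParser):
    """Collect URLs a browser would likely request after the HTML itself."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Images and scripts are fetched from their src attribute.
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        # Stylesheets are fetched from the href of a <link rel="stylesheet">.
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])


# Hypothetical page: two images, one stylesheet, one script, one hyperlink.
PAGE = """
<html><head>
  <title>Hello World</title>
  <link rel="stylesheet" href="style.css">
  <script src="menu.js"></script>
</head><body>
  <img src="logo.gif"><img src="photo.jpg">
  <a href="week_2.htm">Next week</a>
</body></html>
"""

counter = ResourceCounter()
counter.feed(PAGE)
# One request for the page itself plus one per extra resource.
hits = 1 + len(counter.resources)
print(counter.resources)
print(hits)
```

The hyperlink is not counted because the browser only requests it when the visitor clicks it. Even this tiny page produces several hits for a single page view.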