 Last 7 days

news.ycombinator.com news.ycombinator.com

afternoons are spent reading/researching/online classes. This has really helped me avoid burn out. I go into the weekend less exhausted and more motivated to return on Monday and implement new stuff. It has also helped generate some inspiration for weekend/personal projects.
Learning at work as a solution to burnout and a source of inspiration for personal projects


sloanreview.mit.edu sloanreview.mit.edu

Nothing truly novel, nothing that matters, is ever learned with ease
If you don't struggle you don't learn, right?

We want to learn, but we worry that we might not like what we learn. Or that learning will cost us too much. Or that we will have to give up cherished ideas.
I believe it is normal to worry about how new domain-based knowledge will be used

Talented people flock to employers that promise to invest in their development whether they will stay at the company or not.
Cannot agree more on that

Leaders in every sector seem to agree: Learning is an imperative, not a cliché. Without it, careers derail and companies fail.
Don't stop learning


arxiv.org arxiv.org

(a) PASCAL 2012

Figure 4

Figure 2


www.kdnuggets.com www.kdnuggets.com

Figure 2


arxiv.org arxiv.org

Figure 1


news.ycombinator.com news.ycombinator.com

I also recently took about 10 months off of work, specifically to focus on learning. It was incredible, and I don’t regret it financially. I would often get up at 6 in the morning or even earlier (which I never do) just from excitement about what I was going to learn about and accomplish in the day. Spending my time focused only on what I was most interested in was incredibly rewarding.
Approach of taking 10 months off from work just to learn something new


paulgraham.com paulgraham.com


lsc.cornell.edu lsc.cornell.edu

The Cornell Notetaking System
The Cornell Notetaking System resembles a combination of active learning and spaced repetition, just like Anki

 Dec 2019


(there’s an argument to be made that social media has exacerbated these tendencies, as partisan complaint is often the most “engaged” and therefore most “valuable” content on social networks).
Technology as "phases," not endpoints.
Regarding discussing technology. Very interesting that this sort of behavior might leak back into daily life from social media. Will we end up speaking in "sound bites"?

 Nov 2019

www.edsurge.com www.edsurge.com

While online courses can certainly reach more students than their lecture hall counterparts, colleges don’t always scale up staff to compensate. That can make it difficult for librarians to provide timely assistance to patrons.


teachonline.asu.edu teachonline.asu.edu

Integrating Technology with Bloom’s Taxonomy
This article was published by a team member of the ASU Online Instructional Design and New Media (IDNM) team at Arizona State University. This team shares instructional design methods and resources on the TeachOnline site for online learning. "Integrating Technology with Bloom's Taxonomy" describes practices for implementing 6 principles of Bloom's Digital Taxonomy in online learning. These principles include Creating, Evaluating, Analyzing, Applying, Understanding, and Remembering. The purpose of implementing this model is to create more meaningful and effective experiences for online learners. The author guides instructors in the selection of digital tools that drive higher-order thinking, active engagement, and relevancy. Rating 9/10


www.opm.gov www.opm.gov

Training and Development Policy Wiki
This webpage, under the Office of Personnel Management (OPM) .gov site, provides an extensive list of technology resources that can be and have been implemented into a variety of employee development programs. These tools allow for more personalized learning, active participation, collaboration, and communication. In the first section of the site, examples of Web 2.0 tools are listed that can promote collaboration and constructive learning. You can also find technologies that are used in specific sectors, such as the Federal Government and the Private Sector. Clicking on the links redirects you to additional resources on the tech tools, including how to use them effectively and professionally for employee training. Rating 10/10



Using Technology to Enhance Teaching & Learning
This website provides technology teaching resources as part of the Southern Methodist University (SMU) Center for Teaching Excellence. Users can find informational links to various technology tools that can be used for enhancing teaching and learning in online, hybrid, or face-to-face courses. On the right of the page under "Technology," users can click on the tech tools for additional resources/research on their implementation. Examples of these technologies include Blackboard LMS, PowerPoint presentation software, Google Suite products, blogs, and social media sites. Rating 8/10


www.ncbi.nlm.nih.gov www.ncbi.nlm.nih.gov

Empowering Education: A New Model for Inservice Training of Nursing Staff
This research article explores an andragogical method of learning for the inservice training of nurses. In a study of a training period for 35 nurses, research found an empowering model of education that was characterized by self-directed learning and practical learning. This model suggests active participation, motivation, and problem-solving as key indicators of effective training for nurses. Rating 8/10


lincs.ed.gov lincs.ed.gov

Digital Literacy Initiatives
This website outlines digital literacy initiatives provided by the Literacy Information and Communication System (LINCS). The U.S. Department of Education, Office of Career, Technical, and Adult Education (OCTAE) implements these initiatives to aid adult learners in the successful use of technology in their education and careers. Students have free access to learning material on different subjects under the "LINCS Learner Center" tab. Teachers and tutors also have access to resources on implementing educational technology for professional development and effective instruction. Rating 8/10


digitalcommons.macalester.edu digitalcommons.macalester.edu

Engaging Adults Learners with Technology Through
Instruction Librarians from the Twin Cities Campus Library created this instructional guide as a workshop for implementing technology for adult learning. First, the authors describe key characteristics of adult learners as identified in the theory of andragogy. Examples of these characteristics include the need to know, learner responsibility, past experiences, and motivation to learn. The authors then suggest instructional practices and activities to meet the needs of adult learners. Finally, they provide examples of technology tools for effectively engaging adult learners. Rating 10/10

Designed to be used in a workshop setting, the content provides an understanding of adult learning theory and its application of best practices in both face-to-face and e-learning environments. Participants are provided a list of web tools to facilitate learning.
6/10: the format is a bit difficult to access out of context


www.iste.org www.iste.org

ISTE Standards Transform learning and teaching.
This resource is the website for the International Society for Technology in Education (ISTE), which serves educators and professionals in the implementation of technology in education. The site provides open access readings, learning guides, and membership material for educators' development with technology. You can also find ISTE Standards for teachers, students, technology coaches, and educational leaders/administrators. These standards serve as the skills and knowledge each group should obtain for effective teaching and learning with technology.


www.citejournal.org www.citejournal.org

This article, developed by faculty members at NAU, provides research behind and practices for technology-infused professional development (PD) programs. The authors first emphasize the importance of designing professional development for teachers around how they and their students learn best. Many PD programs have taken a one-size-fits-all approach in which learners take a more passive role in absorbing standardized information. The authors in this article suggest the need for a more effective model, one in which teachers play an active role in learning in ways that they find most effective for them and their students. Technology can support this PD through interactive and learner-centered instruction. Rating: 9/10


www.nap.edu www.nap.edu

Advantages of Online Professional Development
This chapter, "Advantages of Online Professional Development," describes the benefits of online teacher professional development (OTPD), which implements technology to deliver training and learning in an online environment. OTPD allows teachers to participate in a flexible, self-directed, and collaborative learning community. They can interact with other teachers synchronously and asynchronously, or take professional development courses on their own schedule.


www.advanced.org www.advanced.org

Training for Transformation: Teachers, Technology, and the Third Millennium
This article emphasizes the importance of preparing educators for the effective implementation of technology in a rapidly advancing digital society. Institutions have taken measures to ensure that students are prepared to use educational technology and understand how it can supplement and enhance learning. However, it is just as important to ensure that teachers are prepared and to consider how these tools impact their practices. This article outlines examples of training programs and models that teachers can use for professional development in technology implementation. Rating: 9/10


www.angelo.edu www.angelo.edu

Section 1.3 Theories of Education and the Online Environment
This website is part of Angelo State University's online teaching training course for faculty members. This section outlines three prominent theories of education (Behaviorism, Social Cognitive Theory, and Constructivism) and applies them to online learning. Instructional Designers and course instructors can use this guide for the construction of meaningful and active learning environments for students. Rating: 10/10
Tags
 active learning
 educational theories
 collaborative learning
 e-learning
 etc556
 adult learning
 etcnau
 online teaching
 instructional design
 edtech
 Behaviorism
 technology integration
 Social Cognitive Theory
 higher education
 self-directed learning
 andragogy
 Constructivism
 Angelo State University
 professional development
 online instruction
 adult education


www.angelo.edu www.angelo.edu

Section 1.5 Online Learner Characteristics, Technology and Skill Requirements
This website outlines Section 1.5 of Angelo State University's guide to instructional design and online teaching. Section 1.5 describes key characteristics of online learners, as well as the technology and computer skills that research has identified as being important for online learners. Successful online learners are described as self-directed, motivated, well-organized, and dedicated to their education. The article also notes that online learners should understand how to use technology such as multimedia tools, email, internet browsers, and LMS systems. This resource serves as a guide to effective online teaching. Rating 10/10


www.learningtheories.com www.learningtheories.com

E-Learning Theory (Mayer, Sweller, Moreno)
This website outlines key principles of the E-Learning Theory developed by Mayer, Sweller, and Moreno. E-Learning Theory describes how the implementation of educational technology can be combined with key principles of how we learn for better outcomes. This site describes those principles as a guide to more effective instructional design. Users can also find other learning theories under the "Categories" link at the top of the page. Examples include Constructivist theories, Media & Technology theories, and Social Learning theories. Rating: 8/10


www.youtube.com www.youtube.com

www.youtube.com www.youtube.com

www.youtube.com www.youtube.com


journal.alt.ac.uk journal.alt.ac.uk

A multimedia approach to affective learning and training can result in more lifelike trainings which replicate scenarios and thus provide more targeted feedback, interventions, and experience to improve decision making and outcomes. Rating: 7/10


www.cleveroad.com www.cleveroad.com

What’s the Difference Between AI, Machine Learning and Data Science?


learnuseast1prodfleet01xythos.s3.useast1.amazonaws.com learnuseast1prodfleet01xythos.s3.useast1.amazonaws.com

Adult Learning in the Workplace: Emotion Work or Emotion Learning?

The chapter examines learning and emotion at work and how emotional intelligence and emotion work affect wellbeing, identity development, and power relations. The chapter also considers how human resource development and emotion interact in learning, training, and change initiatives.


www.instructionaldesign.org www.instructionaldesign.org

Learning Domains
This website provides several examples of domains adults may learn in or engage with. By clicking on each type, you are redirected to a detailed description of the domain. Descriptions include, but are not limited to, definitions, theories and research behind the topic, and real-world examples. You can also find references used in the description, which can be helpful for further exploration. This InstructionalDesign.org website also provides extensive lists of learning concepts (e.g., motivation, personalized learning, storyboard) and theories (e.g., Adult Learning Theory, Social Learning, Constructivism). Each learning theory link provides a theoretical definition, applications, examples, key principles, references, and related websites. Rating 10/10.


ignitedlabs.education.asu.edu ignitedlabs.education.asu.edu

Tech Literacy Resources
This website is the "Resources" archive for the IgniteED Labs at Arizona State University's Mary Lou Fulton Teachers College. The IgniteED Labs allow students, staff, and faculty to explore innovative and emerging learning technology such as virtual reality (VR), artificial intelligence (AI), 3D printing, and robotics. The left side of this site provides several resources on understanding and effectively using various technologies available in the IgniteED labs. Each resource directs you to external websites, such as product tutorials on YouTube, setup guides, and the products' websites. The right column, "Tech Literacy Resources," contains a variety of guides on how students can effectively and strategically use different technologies. Resources include "how-to" user guides, online academic integrity policies, and technology support services. Rating: 9/10


Local file Local file

We accept as axiomatic that students learn by doing
While I personally agree that "learning by doing" is perhaps one of the or even the most powerful forms of pedagogy, a very large part of current and historical pedagogy does not really engage doing. So either not ALL learning involves doing, or the majority of education that happens without doing doesn't involve learning.


www.ncbi.nlm.nih.gov www.ncbi.nlm.nih.gov

The text discusses the implementation of videoconferencing to provide mental health services to children with a history of emotional and behavioral concerns. 89% of sessions were delivered via telehealth videoconferencing. Rating: 7/10 Need a follow-up on impact of telehealth services on student behaviors/outcomes.


www.insidehighered.com www.insidehighered.com

Using Technology to Help First-Gen Students
This article highlights the need for and benefits of implementing more technology tools to support first-generation college students' learning, engagement, and success. For many first-gen students, especially those from low-income backgrounds, the transition to college can be challenging; this leads to lower retention rates, performance, and confidence. The authors, drawing on research, suggest mobile devices and Web 2.0 technologies to address these challenges. Examples of such tools include dictionary and annotation apps that are readily accessible and aid in students' understanding of material. First-gen students can also use social media apps (Twitter, Facebook, etc.) to maintain supportive connections with family, peers, and mentors. Rating: 8/10


intra.ece.ucr.edu intra.ece.ucr.edu

We construct a graph from the unlabeled data to represent the underlying structure, such that each node represents a data point, and edges represent the inter-relationships between them. Thereafter, considering the flow of beliefs in this graph, we choose those samples for labeling which minimize the joint entropy of the nodes of the graph.
Interesting approach



A blended learning approach to emotional intelligence is the methodology for the course created by Google and implemented in various institutions around the globe. Rating: 8/10


www.fastcompany.com www.fastcompany.com

An emotional intelligence course initiated by Google became a tool to improve mindfulness, productivity, and emotional IQ. The course has since expanded into other businesses which report that employees are coping better with stressors and challenges. Rating: 7/10 Key questions: what is the format of the course, and what tools does it use?


europepmc.org europepmc.org

Online mind-body training (MBT) is evaluated and results suggest that improvements in psychological wellbeing can be achieved with online MBT. Rating: 7/10


blogs.edweek.org blogs.edweek.org

In the text "10 Current and Emerging Trends in Adult Education," ten current trends are briefly reviewed. Among these are the emphasis on effort, growth, and social-emotional learning. In terms of technology, real-life simulations and AI are being used to better prepare learners for their professional encounters and responsibilities. In terms of what is on the horizon for adult learning, one can expect mastery to be emphasized rather than degrees. As a result of the information economy, income inequality is expected to grow, so advocacy for adult learners and continued growth opportunities for working adults will be needed to mitigate the negative consequences. Rating: 7/10


www.iftf.org www.iftf.org

The text examines the ways virtual reality can facilitate embodied learning.


journals.sagepub.com journals.sagepub.com

Preservice teachers can benefit from the use of simulations that reproduce classroom environments, student behaviors and profiles, and academic outcomes to guide their craft as educators. In this text, simSchool is briefly evaluated by student teachers to determine its usefulness. While the study had significant limitations of volunteer test subjects in a one-time usage of the tool, simSchool still was given some high marks for its purpose and realistic depiction of student profiles and classroom environment. Findings suggest simulations like simSchool can continue to improve and, with long-term use, would be effective at developing skills for educators. Rating: 8/10


wwwchroniclecom.libproxy.nau.edu wwwchroniclecom.libproxy.nau.edu

Technology
This website explores technology news within the field of higher education. The site contains a wide variety of news articles on current issues, trends, and research surrounding the integration of technology in universities and colleges. This includes technology's prevalence in teaching and learning, institutional decisions, and societal trends of higher education. The articles are published by authors for "The Chronicle of Higher Education," a leading newspaper and website for higher education journalism. Rating: 7/10


journals.uair.arizona.edu journals.uair.arizona.edu

Issues and Trends in Learning Technologies
This website covers "Issues and Trends in Learning Technologies (ITLT)," a peer-reviewed open-access journal published by the University of Arizona's Learning Technology program. This online journal features articles that explore theories, practices, and research surrounding educational technology. This includes discourse around the application and assessment of various learning technologies in educational settings. The "Archives" tab at the top of the site lists each volume of ITLT, which feature articles such as research, reviews, and graduate student work. As an online publication, each article is accessible in PDF and HTML format free of charge. Rating: 10/10


www.srdc.org www.srdc.org

This article is a great example of a research model in measuring outcomes of adult learning.


lincs.ed.gov lincs.ed.gov

The article explains three theories of adult learning: andragogy, SDL, and transformational learning. The authors additionally provide practical application of the theories in the classroom.
8/10


www.armyupress.army.mil www.armyupress.army.mil

The article explains the shift in military training to implement practices that align with Kolb's experiential learning theory. More specifically, Pierson discusses how competency-based education can best be used to improve Army training programs.
9/10



ATD, a nonprofit organization that promotes training and development in the workplace, outlines the three primary learning theories that instructional designers need to know to provide effective corporate training.
10/10


elearningindustry.com elearningindustry.com

Pappas provides a brief breakdown of the characteristics of adult learners to help instructional designers develop effective content.
6/10


www.hraljournal.com www.hraljournal.com

Flores examines the current research as it relates to distance learning. She explores technology integration and learning theory. Throughout, she stresses the importance of professional development for instructors to equip them to provide quality distance education.
10/10


nsuworks.nova.edu nsuworks.nova.edu

The authors detail their development of a professional learning community to advance technology integration at Nova Southeastern University. After a literature review of the key components of online learning, they discuss the method of implementing the PLC and the major outcomes and then offer recommendations for starting a PLC within institutions of higher ed.
10/10


Local file Local file

Rossiter and Garcia evaluate the use of digital storytelling in adult learning classrooms, primarily through the use of "autobiographical learning" where learners share personal experiences and connections with the content. They outline "three key dimensions" that make storytelling valuable in adult learning: voice, creativity, and self-direction.
10/10


digitalpromise.org digitalpromise.org

The authors present the benefits of coaching in professional development for educators in today's technologically advanced classrooms. Of particular interest is the explanation of the different methods of coaching: executive, co-active, cognitive, and instructional. They suggest that coaching provides more successful outcomes than single workshops and stress that finding the correct method for each situation and organization is crucial.
10/10


journals.uair.arizona.edu journals.uair.arizona.edu

To optimize learners' experience and the efficacy of learning outcomes, instructors need to consider how technology can offer approaches better suited to adult learning.
This website from the University of Arizona provides a list of trends and issues in learning technologies.
Rating 9/10


faculty.londondeanery.ac.uk faculty.londondeanery.ac.uk

Teaching and learning methods: preparing for teaching; facilitating the integration of knowledge, skills and attitudes; teaching and learning in groups; facilitating learning and setting ground rules; explaining; group dynamics; managing the group; lectures; small group teaching methods and discussion techniques; seminars and tutorials; computer based teaching and learning (information technology and the World Wide Web); introducing problem based learning; case based learning and clinical scenarios
This website consists of available resources.
Rating: 9/10



Employers can engage in the creation and classroom integration of tech tools
LINCS.ed.gov is a website dedicated to supplying evidence-based practice in adult education.
Rating: 6/10


joitskehulsebosch.blogspot.com joitskehulsebosch.blogspot.com

The three major prominent learning theories are known as behaviourist, cognitivist and constructivist, though Siemens later developed the connectivism theory as a learning theory for the digital age.
The eLearning Learning website is a collection of peer articles from around the world. It gathers high-quality articles, blogs, and journals.
Rating: 7/10


www.physiology.org www.physiology.org

The main objectives of this article are to present the theoretical evidence for the design and delivery of instructional materials and to provide a practical framework for implementing those theories in the classroom and laboratory.
The American Journal of Physiology website (physiology.org) is dedicated to publishing journals and books on the functions of living organisms.
Rating: 9/10


www.makerbot.com www.makerbot.com

The Definitive Guide to 3D Printing in the Classroom
MakerBot is a 3D printer manufacturer, the first to make 3D printing affordable and accessible to educators and professionals.
Rating: 10/10


Local file Local file

In the article "Keys to success: Self-directed learning," authors Fellows, Culver, and Beston discuss the components of Grow's self-directed learning (SDL) model. Learners and instructors fit into a matrix which can be used to determine optimal instructional strategies to meet the readiness of the learner. The authors discuss how SDL is implemented in multiple institutions for higher education. Instructional methods are shared to address foundational SDL skills as well as issues that arose when learners were having difficulty transitioning from one stage of readiness to another. Overall, holistic learner skills were enhanced with SDL. Rating: 9/10


districtadministration.com districtadministration.com

In the text by Jennifer Herseim, virtual reality (VR) is identified as a tool to help with teacher training. Teachers can embark on a learning process in a secure environment with a diverse set of student avatars operated in part by a real individual. Staff can explore their teaching methods and styles with recorded and measured skills and responses for future review and reflection. Rating 7/10


www.asu.edu www.asu.edu

Recognized by U.S. News & World Report as the country’s most innovative school, Arizona State University is where students and faculty work with NASA to develop, advance and lead innovations in space exploration.
Arizona State University is a leading university nationally and around the world. It is known for providing successful online services for online learners. Educators and potential educators should explore its site for ideas and inspiration for their own innovation.
Rating: 10/10


ppsd.smapply.io ppsd.smapply.io

Private postsecondary institutions that provide educational services in the State of New Mexico are subject to either the New Mexico Post-Secondary Educational Institution Act (Section 21-23-1 et seq. NMSA 1978) or the Interstate Distance Education Act (Section 21-23B-1 et seq. NMSA 1978) and can use this site to apply for State Authorization or submit other required applications to comply with State regulations. Students may request transcripts of closed schools where the New Mexico Higher Education Department is the designated custodian of records or may file complaints against any postsecondary institution that provides educational services in our State.
The NMHE website provides academic, financial, and policy information to New Mexico public higher education institutions and the community.


www.forbes.com www.forbes.com

Many of the world’s top universities have embraced Massive Open Online Courses (known as MOOCs).
Forbes is a business website that also focuses on innovation and technology
Rating: 8/10


www.techlearning.com www.techlearning.com

This web page can be used in many ways because it covers theories in education and technology from the old to the new, and settings from institutions to working environments and the military. You will find George Bush, Steve Jobs, and Seymour Papert from MIT, just to name a few.
It is really nice to see new and not so new perspectives of people that do not provide learning theories, but combine learning theories with technology, which to me is relevant for today's educators and learners. 5/10


www.youtube.com www.youtube.com

This video describes how the entire state of Kentucky integrated technology into its adult education programs with the help of a KYAE Technology Consultant.
The consultant uses the SAMR Model by Dr. Ruben Puentedura (Substitution, Augmentation, Modification, Redefinition) to help programs use technology in new ways that redefine learning and engage students and educators.
A large part of technology integration is using the devices students already own. Teachers must engage in this process, however; it actually starts with them. The speaker asks educators to start small in using technology with their students: not in the old way of teaching, but, as the methods endorsed across the state suggest, using it together with their students. They also talk about surveying instructors and students about their experiences to see how the integration program is measuring up. For example, are teachers using smart boards, or did they try them and go back to not using them, and why?
Success and needs for improvement are measured with rubrics, point surveys, and a three-year technology plan whose goal is total technology integration. Overcoming the hurdles of device and internet access is addressed as well.
I think that this touches on learning environments, adult learning, and a possible profession for educational technology students, as it is from the perspective of a technology consultant. 9/10


journalssagepubcom.libproxy.nau.edu journalssagepubcom.libproxy.nau.edu

In this article we learn about the transition for the disabled student to life beyond high school. Initially, students with disabilities in school are assigned an Individualized Education Program (IEP) to evaluate skills and determine services needed for the success, progression, and learning of the student. Once students are 16 and older or leave school, how do they deal with work, home, or even continuing education? The article provides details on implementing simulations in the "acquisition of functional skills," and how, "when paired" with technology or digital simulations, the student can practice more and maintain skills better. The article offers useful charts of technology-based software, multimedia, and training activities to try with students with disabilities, along with their outcomes. 10/10


reader.elsevier.com reader.elsevier.com

The text documents a yearlong research project into experiential learning in teacher professional development. Teachers participated in experiential learning themselves to then begin to implement it into their own classrooms to serve their students. By and large, teachers were receptive, had misconceptions addressed, and changed their practices with their colleagues and students to develop more engaging and active classrooms. Essentially, a shift from teacher-centered learning to student-centered learning was achieved in small increments by using experiential learning and reflection to facilitate teacher growth, thereby creating new pathways for student learning. Given the nature of the traditional methods predominantly used, this study seems to reflect some elements of transformative learning in which teacher conventions and ideas were challenged and adjusted through heterogeneous groups and personal reflection. Rating: 9/10


www.cal.org www.cal.org

Problem-based learning (PBL) is a growing trend in approaching adult learning, particularly in ESL/ELL classrooms. In this text, the basic principles and methods of PBL for ELL/ESL classes are covered for instructors to implement. Key aspects of PBL include relevance to student lives and the opportunity to practice English in a heterogeneous group, with the end goal being application to another area of life. Multiple resources are helpful for implementation of PBL, including technology. The benefits of PBL are summarized, as are drawbacks, with embedded suggestions to resolve possible difficulties. Rating: 8/10


testingjavascript.com testingjavascript.com

www.leadinglearning.com www.leadinglearning.com

Author Jeff Cobb features guest Celisa to discuss trends in the field of lifelong learning. The speakers note twelve existing trends such as MOOCs, micro-credentials, neuroscience, and self-directed learning. Both private and public sectors are contributing to existing and emerging trends. Lifelong learning is transforming as providers explore free and paid services to extend learning to more populations.


leanpub.com leanpub.com

Local file Local file

In this text, authors Kit Kacirek and Michael Miller explore adult learning for mature adults, or those identified as senior citizens. Research into mature adult learning programs centered around leisure activities reveals situational pedagogy in which some traditional adult learning theory may need to be adapted to suit the cognitive changes in adults of advanced age. A brief description of the research methods reveals that adults of advanced age prefer lecture, use of media, and field trips. The implications of such a study are useful as the population of mature adults grows due to advancements in medicine, and thus the demand for learning opportunities increases as well.



Section 508 compliance is discussed to support instructors' knowledge of Section 508 and how to begin the process of ensuring instructional content is 508 compliant. Section 508 of the federal Rehabilitation Act governs access of media to all persons whether they have a disability or not. Including captions, audio description, and accessible video players is vital to compliance. Compliance with 508 is necessary given data that illustrate the percentage of employees who need accommodations to support their learning. This brief article seems highly related to Universal Design for Learning. Rating: 10/10



Author Douglas Lieberman provides insights into how to use text to improve learning. Suggestions for type of text, volume of text, animations, and graphics are discussed to maximize their usefulness and convey information to learners and/or facilitate discussion among learners. Rating: 6/10


humanservices.ucdavis.edu humanservices.ucdavis.edu

The Northwest Center of Public Health Practice's toolkit, titled "Effective Adult Learning: A toolkit for teaching adults," is a highly comprehensive resource for instructional design for adult learning instructors. Sections include course or training design, objectives of adult learning, various tools to help in the process of course design, and brief overviews of adult learning methods and theory. The embedded section review charts make quick reference easier. Rating: 10/10

To be effective in teaching adults, it’s important to know your audience and have a general understanding of how adults learn
This literature is a resource to assist in adult teaching. The first section of the reading defines who your audience is (background, whether your selected audience needs more training, learning objectives). It then explains the learning objectives in more detail and how to develop effective learning objectives (Specific, Measurable, Achievable, Relevant, and Time-bound); if needed, the ABCD model (Audience, Behavior, Condition, Degree) can be utilized. The second section covers developing training content, and the last covers delivering your training. The article is very good. Rating: 5/5


digitalpromise.org digitalpromise.org

Universal design for learning (UDL) is the topic of focus for this webinar hosted by Digital Promise. Multiple experts discuss UDL for adult learners and strategies for UDL. Rating: 6/10


blog.cathymoore.com blog.cathymoore.com

conventional learning objectives can work against us.
Cathy Moore discusses the love-hate relationship with learning objectives. Objectives can be a critical tool to guide instruction; however, we can miss the boat when it comes to meaningful, applicable, and relevant learning. In the text, Moore is critical of objectives that are used merely to ensure a learner knows content. It is preferable, and superior instruction, to ensure a learner can exercise the knowledge with observable actions in context. Rating: 9/10


www.cael.org www.cael.org

The Council for Adult and Experiential Learning (CAEL) provides opportunities for professional development for adult learning instructors and organizations that serve adult learners. CAEL has launched its first live stream of the conference to allow people to attend remotely. While the conference has since passed, this resource could be useful to calendar for the coming year. Included on the site are a blog, a newsletter sign-up, and resources for higher education, employers, and workforce development. Rating 8/10


www.cvadult.org www.cvadult.org

The lesson plan template provided is a helpful tool for designing a basic lesson with adult learning concepts. Some of the lesson plan template is also a part of pedagogy, but some key elements reflect adult learning theory. For example, the section on Practice and Application encourages activities to transfer skills to new situations and concludes with a reflection activity. Given adult learners may have various goals for their learning, the segment addresses adult learning theory. The template could be used or adapted to begin designing around technological tools used for instruction as well. The template does seem to reflect a model of synchronous, face-to-face learning, given it suggests the instructor move around the room to monitor progress and assist learners. Rating: 6/10


www.aabri.com www.aabri.com

The use of online instructional delivery methods continues to grow as technological and societal changes have enabled and encouraged this growth.
The article was written to help the reader understand how adult learners comprehend lessons and their learning styles. The type of learning method that is used in this article is the andragogical process model (eight element process). The article is an interesting view of how the andragogical process model can be used to explore how the adult mind understands how to use online learning to educate themselves. Rating 3/5


digitalpromise.org digitalpromise.org

The use of technology to support learning for K-12 students is gaining popularity, leading many to ask whether there might be similar solutions for low-skilled adults.
This article emphasizes how adult learning is hindered by technology and how to teach an adult learner, using five theories: 1) shared experience, 2) problem-solving scenarios, 3) reflection on experience, 4) owning their learning, 5) having an aha moment. Adults all learn differently, and all want the opportunity to have a new learning moment. Rating 5/5


www.scienceintheclassroom.org www.scienceintheclassroom.org

We combined these data
Here, the authors have brought together several batches of data, where each batch represents the relative abundances of isotopes present in fluid inclusions inside diamonds, like helium, Pb-Sr, trace elements, and carbon isotopes. After collecting these data, they have plotted them in multiple graphs to highlight comparisons between them. This is very common in scientific research and requires training in data analysis and graph plotting. This is also recommended in Science Practice 5 of the AP Physics 2 Course and Exam Description.

After establishing the sublithospheric origin for our diamonds, we measured helium isotopes of the fluid inclusions.
The helium isotopes were measured from the fluid inclusions using mass spectrometry. Mass Spectrometry is a specialized technique, which is used to determine relative abundances of isotopes from a sample. See Essential Knowledge 1.D.2 in the AP Chemistry Course and Exam Description.


elearningindustry.com elearningindustry.com

Pappas breaks down Knowles's theory of andragogy and provides practical applications to computer training that are easily carried over to implementing new technology.
8/10


digitalcommons.fiu.edu digitalcommons.fiu.edu

Drawing from constructivist principles, the authors address how emotions affect motivation and learning for adults. They then provide practical application for instructors to implement to create productive learning environments where adult learners feel safe to explore new knowledge and learn from their experiences.
9/10: while most of the application is to learning in general, the strategies are still applicable to technology in the classroom


theelearningcoach.com theelearningcoach.com

Transformative learning theory and methods to support it are discussed in this text. Andragogy is initially reviewed in order for the reader to become acclimated to basic principles of adult learning. Transformative learning segments emphasize the methods and environments needed to achieve such deep and challenging learning. Due to the intensive personal nature of transformative learning, one must understand the readiness of the learner. The text notes that learners in transition are more apt to engage in transformative learning if given an opportunity to develop self-awareness and a willingness to be in discomfort in open, non-hierarchical environments.


www.shiftelearning.com www.shiftelearning.com

In this text, instructional designers are given brief synopses of three adult learning theories (andragogy, transformational learning, and experiential learning) in order to understand how adults best learn and apply learning. The text is structured as brief paragraphs with numbered descriptors and/or bullet points for reader convenience. Suggestions for learning activities are also provided for the instructional designer to consider in their course design. In the segment on transformative learning, a link gives the instructional designer more specific methods to incorporate. At the end of the text, diagrams are provided to visualize the core aspects and flow of each learning theory's process. Rating: 7/10


digitalpromise.org digitalpromise.org

The Digital Promise article presents four major factors to consider when implementing technology for adult learning purposes. The factors include the flexibility and benefits of blended learning, data use to support development of instruction, environments with diverse technology available to support various learners, and allowing the instructor's role to change to meet learner needs. Issues related to each factor are shared and suggestions for resolutions are provided. Rating: 7/10. A good resource for an introduction to factors and issues in adult learning via technology.


onlineeducator.pbworks.com onlineeducator.pbworks.com

As online learning matures, it is important for both theorists and practitioners to understand how to apply new and emerging educational practices and technologies that foster a sense of community and optimize the online learning environment.
The article describes the design theory elements (goals, values, methods) and how they can assist with defining new tools for online learning. Rating 5/5


www.ncbi.nlm.nih.gov www.ncbi.nlm.nih.gov

An understanding of adult learning theories (ie, andragogy) in healthcare professional education programs is important for several reasons.
The author of this article articulates the instrumental learning theories in the healthcare industry. The information provided is more like a speedy way for students and healthcare providers to understand the learning theories. Rating: 4/5


files.eric.ed.gov files.eric.ed.gov

n. Key to this model is the assumption that online education has evolved as a subset of learning in general rather than a subset of distance learning
This article helps the reader understand the major theories that are related to technology, using the learning theories, theoretical frameworks, and models. Rate: 4/5


evolllution.com evolllution.com

Twitter offers two distinct benefits to engaging learners. First of all, it allows learners to respond to classroom discussions in a way that feels right for them, offering shy or introverted students a chance to participate in the class discussion without having to speak in a public forum. Secondly, it allows students to continue the conversation after class is completed, posting relevant links to course material, and reaching out to you (the educator) with additional thoughts or questions.
The article explains how social media, student learning through digital experience, and Learning Management Systems can be beneficial to the learner/student. Article Rating: 3/5


www.gettingsmart.com www.gettingsmart.com

Some of our adult-ed students take their courses virtually, with students checking in with teachers via Skype or by email, but a majority spend at least some time in a classroom.
This article expresses how learning can be taught using the internet and one does not have to be in class to learn.

 Oct 2019

neuralnetworksanddeeplearning.com neuralnetworksanddeeplearning.com

As a prototype it hits a sweet spot: it's challenging (it's no small feat to recognize handwritten digits), but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.

Perceptrons

What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons; in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly.
But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output. In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

\begin{eqnarray} \mbox{output} & = & \left\{ \begin{array}{ll} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold} \end{array} \right. \tag{1}\end{eqnarray}

That's all there is to how a perceptron works!

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors: Is the weather good? Does your boyfriend or girlfriend want to accompany you? Is the festival near public transit? (You don't own a car).
We can represent these three factors by corresponding binary variables $x_1$, $x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of $5$ for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting $1$ whenever the weather is good, and $0$ whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.

By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of $3$. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.

Obviously, the perceptron isn't a complete model of human decision-making!
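The cheese-festival example can be checked with a few lines of Python. This is only a minimal sketch of the threshold rule in equation (1); the names `perceptron`, `decide`, and `WEIGHTS` are made up for illustration and do not come from the quoted text.

```python
from itertools import product


def perceptron(inputs, weights, threshold):
    """Equation (1): output 1 if the weighted sum exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


# Weights from the example: weather, companion, public transit.
WEIGHTS = (6, 2, 2)


def decide(weather, companion, transit, threshold=5):
    """Hypothetical helper: 1 means go to the festival, 0 means stay home."""
    return perceptron((weather, companion, transit), WEIGHTS, threshold)


# Print the decision for every input combination under both thresholds.
for x in product((0, 1), repeat=3):
    print(x, "threshold 5:", decide(*x), "threshold 3:", decide(*x, threshold=3))
```

With a threshold of 5 the output simply follows the weather input; with a threshold of 3 the output is also 1 when the weather is bad but both other inputs are 1.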
But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

In this network, the first column of perceptrons (what we'll call the first layer of perceptrons) is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\mbox{threshold}$.
Using the bias instead of the threshold, the perceptron rule can be rewritten:

\begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } w\cdot x + b \leq 0 \\ 1 & \mbox{if } w\cdot x + b > 0 \end{array} \right. \tag{2}\end{eqnarray}

You can think of the bias as a measure of how easy it is to get the perceptron to output a $1$. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a $1$. But if the bias is very negative, then it's difficult for the perceptron to output a $1$. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.

I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:

Then we see that input $00$ produces output $1$, since $(-2)*0+(-2)*0+3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs $01$ and $10$ produce output $1$. But the input $11$ produces output $0$, since $(-2)*1+(-2)*1+3 = -1$ is negative. And so our perceptron implements a NAND gate!

The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all.
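The NAND arithmetic above is easy to verify mechanically. The following is a small sketch of the bias form of the rule, equation (2), using the weights and bias given in the text; the function names `perceptron` and `nand` are illustrative assumptions rather than anything defined in the quoted passage.

```python
from itertools import product


def perceptron(inputs, weights, bias):
    """Equation (2): output 1 if w . x + b > 0, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0


def nand(x1, x2):
    """Perceptron with two inputs, each with weight -2, and an overall bias of 3."""
    return perceptron((x1, x2), (-2, -2), 3)


# Build the full truth table over the four possible binary inputs.
truth_table = {(x1, x2): nand(x1, x2) for x1, x2 in product((0, 1), repeat=2)}
print(truth_table)
```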
The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:

One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to $3$, and a single weight of $-4$, as marked:

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons (the input layer) to encode the inputs:

This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs.
Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$.

The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.

The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

Sigmoid neurons

Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem.
For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem.
But it's not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just $0$ or $1$, these inputs can also take on any values between $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not $0$ or $1$. Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the sigmoid function*, and is defined by:

\begin{eqnarray} \sigma(z) \equiv \frac{1}{1+e^{-z}}. \tag{3}\end{eqnarray}

*Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology.

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias $b$ is

\begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)}. \tag{4}\end{eqnarray}

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it.
In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x+b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x+b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x +b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x+b$ is of modest size that there's much deviation from the perceptron model.

What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important; what really matters is the shape of the function when plotted.
Here's the shape:

[Figure: plot titled "sigmoid function", with z on the horizontal axis (ticks from -4 to 4) and the output on the vertical axis (ticks from 0.0 to 1.0)]

This shape is a smoothed out version of a step function:

[Figure: plot titled "step function", with z on the horizontal axis (ticks from -4 to 4) and the output on the vertical axis (ticks from 0.0 to 1.0)]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w\cdot x+b$ was positive or negative. (Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.) By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form.
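The contrast between the two shapes can be checked numerically. The sketch below is only an illustration: `step` follows the step function described above (0 for negative z, 1 otherwise), and the probe values of z and the nudge size `eps` are arbitrary assumptions.

```python
import math

def step(z):
    """Step function: 0 for z < 0, otherwise 1."""
    return 0.0 if z < 0 else 1.0

def sigmoid(z):
    """Sigmoid function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Far from z = 0 the two functions nearly agree.
for z in (-10.0, 10.0):
    print(z, step(z), sigmoid(z))

# Near z = 0 a tiny nudge flips the step function completely,
# while the sigmoid output barely moves.
eps = 1e-3
step_change = step(eps) - step(-eps)
sigmoid_change = sigmoid(eps) - sigmoid(-eps)
print(step_change, sigmoid_change)
```

The jump in `step_change` is the discontinuity that makes perceptrons hard to adjust gradually; `sigmoid_change` shrinks in proportion to `eps`.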
The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by

\begin{eqnarray} \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}\end{eqnarray}

where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial b$ denote partial derivatives of the $\mbox{output}$ with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)?
In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.

How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output $0$ or $1$. They can have as output any real number between $0$ and $1$, so values such as $0.173\ldots$ and $0.689\ldots$ are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a $0$ or a $1$, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a "9", and any output less than $0.5$ as indicating "not a 9".
I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.

Exercises

Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$. Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem: a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \rightarrow \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

The architecture of neural networks

In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs.
The term "hidden" perhaps sounds a little mysterious (the first time I heard the term I thought it must have some deep philosophical or mathematical significance), but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$. The output layer will contain just a single neuron, with output values of less than $0.5$ indicating "input image is not a 9", and values greater than $0.5$ indicating "input image is a 9".

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network.
We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network: information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely used feedforward networks.

A simple network to classify handwritten digits

Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two subproblems.
First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image into six separate images. We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:

The input layer of the network contains neurons encoding the values of the input pixels.
As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the $784$ input neurons in the diagram above. The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a $0$. If the second neuron fires then that will indicate that the network thinks the digit is a $1$. And so on. A little more precisely, we number the output neurons from $0$ through $9$, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number $6$, then our network will guess that the input digit was a $6$. And so on for the other output neurons.

You might wonder why we use $10$ output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use $10$ neurons instead? Isn't that inefficient?
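The three-layer design described above (784 inputs, a hidden layer of $n = 15$ sigmoid neurons, and 10 outputs read off by highest activation) can be sketched in plain Python. This is a hedged illustration only: the helper names `layer_output` and `random_layer` are hypothetical, the weights are random placeholders rather than trained values, and the "image" is random noise, so the resulting guess carries no meaning.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_output(weights, biases, activations):
    """Activations of one layer of sigmoid neurons.

    weights[j] holds the weights feeding neuron j of this layer.
    """
    return [sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_out, n_in, rng):
    """Placeholder (untrained) weights and biases, drawn from a Gaussian."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    return weights, biases

rng = random.Random(0)
n_hidden = 15
hidden_w, hidden_b = random_layer(n_hidden, 784, rng)
output_w, output_b = random_layer(10, n_hidden, rng)

# Stand-in for a 28 by 28 greyscale image, flattened to 784 values in [0, 1).
pixels = [rng.random() for _ in range(784)]

hidden = layer_output(hidden_w, hidden_b, pixels)
outputs = layer_output(output_w, output_b, hidden)

# The network's guess is the index of the output neuron with the highest activation.
guess = max(range(10), key=lambda k: outputs[k])
print(guess)
```

Reading the answer off as the index of the largest output is what makes the 10-neuron encoding convenient: no thresholding of individual bits is needed.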
The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons. But that leaves us wondering why using $10$ output neurons works better. Is there some heuristic that would tell us in advance that we should use the $10$-output encoding instead of the $4$-output encoding?

To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use $10$ output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a $0$. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

As you may have guessed, these four images together make up the $0$ image that we saw in the line of digits shown earlier:

So if all four of these hidden neurons are firing then we can conclude that the digit is a $0$. Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$; we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a $0$.

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have $10$ outputs from the network, rather than $4$.
If we had $4$ outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only $4$ output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first $3$ layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least $0.99$, and incorrect outputs have activation less than $0.01$.

Learning with gradient descent

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from, a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology.
Here are a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.

We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector. For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$.
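The vector notation above translates directly into code. The following is a small sketch, assuming images arrive as 28 by 28 nested lists of grey values; the helper names `desired_output` and `flatten_image` are hypothetical and are not taken from the text.

```python
def desired_output(digit):
    """Return y(x) as a list of 10 entries: 1.0 at index `digit`, 0.0 elsewhere."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    return [1.0 if k == digit else 0.0 for k in range(10)]

def flatten_image(image):
    """Flatten a 28 by 28 nested list of grey values into a 784-entry vector."""
    return [pixel for row in image for pixel in row]

# Desired output for a training image depicting a 6.
y = desired_output(6)

# A blank placeholder image (all pixels white), used only to show the shape.
x = flatten_image([[0.0] * 28 for _ in range(28)])

print(y)
print(len(x))
```

Plain Python lists do not distinguish row from column vectors, so the transpose in $y(x)$ has no counterpart here.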
To quantify how well we're achieving this goal we define a cost function*:

\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \tag{6}\end{eqnarray}

*Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large: that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.

Why introduce the quadratic cost?
After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications.
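As a rough illustration of Equation (6) and of the smoothness point, the sketch below computes the quadratic cost and the number of correct classifications for two made-up sets of network outputs. The function names, the three-neuron output size, and all numbers are assumptions invented for the example; they are not data from the text.

```python
def quadratic_cost(outputs, targets):
    """Equation (6): C = 1/(2n) * sum over training inputs of ||y(x) - a||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += sum((yk - ak) ** 2 for yk, ak in zip(y, a))
    return total / (2 * n)

def number_correct(outputs, targets):
    """Count inputs whose highest-activation output matches the desired one."""
    def argmax(v):
        return max(range(len(v)), key=lambda k: v[k])
    return sum(argmax(a) == argmax(y) for a, y in zip(outputs, targets))

# Two made-up training examples with three output neurons each.
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
before = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
after = [[0.7, 0.3, 0.1], [0.5, 0.45, 0.1]]  # outputs nudged slightly

print(quadratic_cost(before, targets), number_correct(before, targets))
print(quadratic_cost(after, targets), number_correct(after, targets))
```

The nudge lowers the cost while the count of correct classifications stays the same, which is why the cost gives a usable signal for small adjustments and the count does not.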
However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.

Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed: the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.

Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function; we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:

What we'd like is to find where $C$ achieves its global minimum.
Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables: the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function?
We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$; those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously: we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:

\begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}\end{eqnarray}

We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley.
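Equation (7) can be checked numerically on a simple example. The function `C` below, the starting point, and the step sizes are all hypothetical choices made for illustration; the text does not specify any particular function.

```python
def C(v1, v2):
    """A hypothetical valley-shaped function of two variables."""
    return v1 ** 2 + 3.0 * v2 ** 2

def partials(v1, v2):
    """Partial derivatives of C with respect to v1 and v2, worked out by hand."""
    return 2.0 * v1, 6.0 * v2

v1, v2 = 1.0, 2.0       # current position of the ball
dv1, dv2 = 0.01, -0.02  # a small move

dC_dv1, dC_dv2 = partials(v1, v2)

# Right-hand side of Equation (7): the linear approximation to the change in C.
approx_change = dC_dv1 * dv1 + dC_dv2 * dv2

# The actual change in C for the same move.
exact_change = C(v1 + dv1, v2 + dv2) - C(v1, v2)

print(approx_change, exact_change)
```

For a move this small the two numbers agree closely, and both are negative, so this particular choice of $\Delta v_1$ and $\Delta v_2$ moves the ball downhill.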
To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$. We denote the gradient vector by $\nabla C$, i.e.:

\begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}\end{eqnarray}

In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object (the vector defined above) which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector".
There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for $\Delta C$ can be rewritten as

\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v. \tag{9}\end{eqnarray}

This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose

\begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray}

where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10).
(Within, of course, the limits of the approximation in Equation (9).) This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

\begin{eqnarray} v \rightarrow v' = v - \eta \nabla C. \tag{11}\end{eqnarray}

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until, we hope, we reach a global minimum.

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

[Figure]

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now".
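
As a minimal sketch of the "go down, right now" rule in Equation (11), the Python below applies $v \rightarrow v - \eta \nabla C$ repeatedly to a two-variable function. The cost `C`, the starting point, the learning rate `eta=0.1` and the step count are all hypothetical values picked for illustration, not taken from the text.

```python
def C(v):
    # Hypothetical cost with its minimum at (0, 0); not from the text.
    v1, v2 = v
    return v1**2 + 2 * v2**2

def grad_C(v):
    # Gradient of the hypothetical C: (dC/dv1, dC/dv2).
    v1, v2 = v
    return (2 * v1, 4 * v2)

def gradient_descent(v, eta=0.1, steps=100):
    """Repeatedly apply v -> v - eta * grad C, recording the cost after each step."""
    costs = [C(v)]
    for _ in range(steps):
        g = grad_C(v)
        v = (v[0] - eta * g[0], v[1] - eta * g[1])
        costs.append(C(v))
    return v, costs

v_final, costs = gradient_descent((1.0, -0.5))
print(v_final, costs[-1])
```

For this $C$ and this $\eta$ each step multiplies $v_1$ by $0.8$ and $v_2$ by $0.6$, so the recorded costs never increase and the position approaches the minimum at the origin.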
That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.

I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1,\ldots,v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is

\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray}

where the gradient $\nabla C$ is the vector

\begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T. \tag{13}\end{eqnarray}

Just as for the two-variable case, we can choose

\begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray}

and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule

\begin{eqnarray} v \rightarrow v' = v - \eta \nabla C. \tag{15}\end{eqnarray}

You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work: several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible.
It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.

Exercises

Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C/ \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives! (Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$. Still, you get the point.) That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation.
But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have

\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{17}\end{eqnarray}

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6), $C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2$.
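
The component-wise update rules (16) and (17) can be sketched in Python as follows. This is only an illustration under stated assumptions: the "network" is a single hypothetical linear neuron with output $a = wx + b$, and the training data, the learning rate `eta` and the iteration count are made up for the example rather than taken from the text.

```python
# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x + 1 for x in xs]
n = len(xs)

def cost(w, b):
    # Quadratic cost of Equation (6) for the assumed output a = w*x + b.
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / (2 * n)

def gradients(w, b):
    # Partial derivatives of the cost, averaged over all n training inputs.
    dC_dw = -sum((y - (w * x + b)) * x for x, y in zip(xs, ys)) / n
    dC_db = -sum((y - (w * x + b)) for x, y in zip(xs, ys)) / n
    return dC_dw, dC_db

w, b, eta = 0.0, 0.0, 0.1   # arbitrary starting point and learning rate
for _ in range(2000):
    dC_dw, dC_db = gradients(w, b)
    w = w - eta * dC_dw     # Equation (16)
    b = b - eta * dC_db     # Equation (17)

print(w, b, cost(w, b))
```

Each pass through the loop sums over every training input to form the gradient, so the cost of one update grows with the size of the training set.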
Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average
