Understanding minimizing cost correctly

I cannot wrap my head around this simple concept.

Suppose we have a linear regression, and there is a single parameter theta to be optimized (for simplicity purposes):

$h(x) = \theta \cdot x$

The error cost function could be defined as $J(\theta) = \frac{1}{m} \cdot \sum (h(x) - y(x))^2$, for each $x$.

Then, theta would be updated as:

$\theta = \theta - \alpha \cdot \frac{1}{m} \cdot \sum (h(x) - y(x)) \cdot x$, for each $x$.

From my understanding, the multiplier after the $\alpha$ term is the derivative of the error cost function $J$ (up to a constant factor of 2, which can be absorbed into $\alpha$). This term tells us the direction to head in, in order to arrive at the minimum by making one small step at a time. I think I understand the concept of "hill climbing" correctly.
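The update rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the data, the learning rate `alpha`, the step count, and the function name `gradient_descent` are all assumptions made up for the example, not part of the original question.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, steps=200, theta=0.0):
    """Repeat theta <- theta - alpha * (1/m) * sum((theta * x - y) * x)."""
    m = len(x)
    for _ in range(steps):
        # The multiplier after alpha in the update rule.
        grad = np.sum((theta * x - y) * x) / m
        theta = theta - alpha * grad
    return theta

# Made-up data lying exactly on y = 3x, so the best theta is 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x
theta_hat = gradient_descent(x, y)
print(theta_hat)
```

Each pass moves theta a little further toward the value that minimizes $J$. If `alpha` is chosen too large for the data, the steps overshoot and the iteration diverges instead.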

Here is what I can't wrap my head around:

If the form of the error function is known (as in our case: we could visually plot the function if we take enough values of theta and plug them into the model), why can't we take the first derivative and set it to zero (the partial derivatives if the function has multiple thetas)? This way we would find all the stationary points of the function. Then, with the second derivative, we could determine whether each one is a min or a max.

I've seen this done in calculus for simple functions like $y = x^2 + 5x + 2$ (many years ago, maybe I am wrong), so what is stopping us from doing the same thing here?
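For reference, the calculus procedure described above, worked for that quadratic (a short illustration, not part of the original post):

$$\frac{dy}{dx} = 2x + 5 = 0 \;\Rightarrow\; x = -\frac{5}{2}, \qquad \frac{d^2y}{dx^2} = 2 > 0$$

The second derivative is positive, so $x = -\frac{5}{2}$ is a minimum, with $y = \frac{25}{4} - \frac{25}{2} + 2 = -\frac{17}{4}$.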

Sorry for asking such a silly question.

Thank you.

edited 2 days ago

Siong Thye Goh

1,332419

asked 2 days ago

zafirzarya

132

New contributor

linear-regression cost-function


1 Answer
Consider differentiating the squared error and setting the gradient to zero: $$\nabla_\theta \|X\theta - y\|^2 = 2X^T(X\theta - y) = 0$$

Rearranging gives the normal equations: $$X^TX\theta = X^Ty$$

Solving this system would, in theory, give us the optimal solution. However, numerical stability is an issue, and so is computational complexity: solving a linear system with a direct method is cubic in the number of unknowns.
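As a rough illustration of the closed-form route, the sketch below solves the normal equations with NumPy on made-up, noise-free data. The data, the variable names, and the choice of `np.linalg.lstsq` as a comparison are assumptions for the example only. The comparison is included because `lstsq` works on $X$ itself rather than on $X^TX$, and the condition number of $X^TX$ is the square of that of $X$, which is one source of the stability concern.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 made-up samples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta                        # noise-free targets

# Normal equations: solve (X^T X) theta = X^T y directly.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied to X itself, without forming X^T X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Forming X^T X squares the condition number of the problem.
cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)
print(theta_normal, theta_lstsq, cond_X, cond_XtX)
```

On well-conditioned data like this, both routes recover the same parameters. The difference shows up when the columns of $X$ are nearly collinear or when the number of features is large.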

Also, sometimes we do not even have a closed form, in which case a gradient-based approach can be more applicable.

answered 2 days ago

Siong Thye Goh

1,332419

Thank you for replying. However, I am not mathematically literate enough to understand your answer. Is there a simpler answer?
– zafirzarya
2 days ago

I found an answer on MSE that illustrates why computing $X^TX$ is bad. Most approaches that aim at directly solving the normal equation are more expensive than a gradient-based approach. Also, such gradient-based approaches have been adapted to a sampling-based approach, known as stochastic gradient descent, that can handle very big data.
– Siong Thye Goh
2 days ago
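
The sampling-based variant mentioned in the comment can be sketched as mini-batch stochastic gradient descent. Everything below is an assumption chosen for illustration (the function name `sgd`, the batch size, the learning rate, and the synthetic data); it is a minimal sketch, not the commenter's implementation.

```python
import numpy as np

def sgd(X, y, alpha=0.01, epochs=50, batch_size=10, seed=0):
    """Mini-batch SGD for least squares: each step uses only a few rows of X."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)               # reshuffle the data every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            grad = X[idx].T @ residual / len(idx)  # gradient estimated on the batch
            theta -= alpha * grad
    return theta

# Made-up, noise-free data so the true parameters are recoverable.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta
theta_sgd = sgd(X, y)
print(theta_sgd)
```

Each update touches only `batch_size` rows, so the cost per step does not grow with the size of the dataset, which is why this style of method scales to very large data.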
